Update 04 June 2020
This tutorial already came out last summer, but is still fully up-to-date – it just took us a bit longer than expected to release the extension. We hope it helps you with your first queries, more advanced tutorials will follow soon! In the meantime, please feel free to contact us here or on Twitter if you have any questions.
Introduction
A few weeks ago, we announced our new InDatabase extension for RapidMiner. The extension will greatly improve your database work with RapidMiner, introducing features such as transaction safety, universal quoting, and autogenerated ID columns. You will be able to either write your queries using SQL or easily set up a query process using the extension’s operators. For more details on the extension itself and how it can help with managing databases, have a look here.
Today, we would like to show you more of the InDatabase extension in an introductory tutorial demonstrating its functionalities. We’ll do this with a predictive maintenance scenario: Imagine you’re working for a factory with various machines. These machines send sensor data such as voltage, pressure, or vibration, and whether or not one of several errors has occurred. For our predictive maintenance task, a database with several months of data collected from the machine has been compiled for us. This database also holds identifying info like an ID, the machine’s age, the manufacturer and the date and time of the measurement.
For this first of several planned tutorials regarding the InDatabase Extension, we’ll start with something simple: querying the database. In the tutorials to come, we’d like to show you, among other things, more complex queries, how to update the database while enjoying transaction safety and how to add new data coming in from another machine using the autogenerated ID feature.
The First Step: List Tables
First, let’s take a look at what we’re working with: We’ll use the List Tables operator to see how many and which tables are stored in the database. To do so, simply find the operator and add it to your process. Specify the database connection, connect the exa port to the results port of the process panel and execute the process.
The results show: we have two tables, machinery log and machines. Sounds like a good idea to have these tables in one example set to be able to carry out our predictive maintenance tasks.
Next: Using the Query Database Operator
SQL or Query-Builiding Operators?
First delete the List Tables operator, then drag and drop the extension’s Query Database operator on to the process panel. Taking a look at the parameters, you can see that you can define a database connection and enter a custom query using SQL. The parameters below this regard the custom query itself: enable universal quoting to be able to use single quote identifiers (‘) for table names and double quote values (“) for strings in your query, allowing for basic compatibility across database systems. You can also choose to use explicit parameter types and add parameters to a list. Also, you can specify the date format you wish to use. The optimize names feature at the bottom is also available should you choose not to write your own custom query but use the extension’s query building operators: this feature originates in our Jackhammer extension, and applies best practice regarding names to your example set, rendering attribute names to UpperCamelCase.
To demonstrate the query building operators, we will not write a custom query today, and thus only need to specify the database connection. To build a query with operators, we now add a Define Query operator to the process and here, too, specify the database connection. As soon as we connect it to the Query Database operator, the parameters there regarding custom queries will be hidden:
Query Building with Operators: Entering the Subprocess
Now enter the subprocess of the Define Query operator. This is where we’ll add the building blocks of our query. We’re starting with Use Table to specify which table we want to use. As we’d like to join the two tables into one example set, we’ll need two of these. In the parameters, select the table names, machinery_log and machines, respectively. If there are no suggestions when you click the drop down button, hit the refresh button to have them appear.
Following this, drag the Join operator onto the process. In the operator’s parameters you can select the join type (inner, left, or right), which we’ll leave at inner. You can also enter a suffix for the attributes which appear twice in the resulting example set after the join. Default here is _new, but you can change it to anything you like or makes sense for you and your data (for example you might want to change it if your tables already contain a column name ending in _new so as to avoid confusion).
As you can see, there is the little exclamation mark warning symbol above the Join operator and that is because we have not yet set the join conditions. Double click on the operator to enter the subprocess and then find the extensions’ Conditions folder in the operator panel to see the available conditions: the Condition operator is used for the Filter Rows operator, but relevant for now are Equals, And, and Or. We need the Equals operator so that we can perform the join on the ID column found in both tables. Select these columns in the operator’s parameters and make all connections.
We can now execute the process to have a first look at our ExampleSet:
Here we have the joined ExampleSet composed of our two tables. It correctly used the ID column to connect them. Now of course we have that attribute twice, so it would be nice to clean our ExampleSet of the spare one. Also notice that the attribute names are not yet in UpperCamelCase, i.e. we’re not employing best practice here. Lucky us that we can use the extension’s optimize names feature and only have to remember to check the relevant option in the parameters of the Query Database operator.
Finetuning: Filtering Attributes and Optimizing Names
To make the changes, go back to the process and enable optimize names.
Then add a Select Attributes operator behind the Query Database operator, and enter the necessary details. Note here that the attributes will reach the operator with their new names already, but if you have Synchronize Meta Data and Validate Automatically, found under the Process menu, activated, RapidMiner should be aware of this, meaning you can simply select the _new attribute from the dropdown menu and everything should work out fine. If it doesn’t, check you have the correct settings activated and try again, or, as a last solution, manually enter the name of the attribute as it will appear in UpperCamelCase into the field, so for our case this would be MachineIdNew.
We can now hit the play button and see whether the two changes worked out the way we intended to:
Looking good! We’re left with only one ID attribute and all names are in line with best practice. We can now use this ExampleSet to carry out predictive maintenance tasks. In the next tutorial we will show how to use the extension’s operators to construct a more complex query.