Today we would like to introduce to you our new In Database Extension: manage databases easily and efficiently with RapidMiner! The extension is fast, provides reliable transaction safety, and will surely facilitate your database tasks.
Databases: A Short Introduction
For the uninitiated, let’s first discuss some issues related to databases: consistency and completeness. A database can have more than one table – for example, one for customers, one for orders, and one for items in these orders. These tables need to be consistent: there must be a corresponding item in the corresponding table. Orders must be linked to a customer, items must be linked to the order. Otherwise you’ll get a messy database, with entries just floating around, no one knowing what they are there for.
When a new customers places an order, the process that happens in the database is a linear one: an entry is made for the customer, then for his order, then for the items he ordered, all in their respective tables. So far, so good. But what if, during this process, the power goes out, or the internet connection is broken and only half the entries are made? Then you would have a new entry for the new customer, and order number in the second table, but no items on that order – the customer would not get what he bought.
The solution for this are transactions: the whole process becomes one transaction which is either executed fully or not at all. So if someone were to pull the plug of your computer during the writing process, before the customer, the order, and the items on that order were all three entered into the database, the whole thing would be called off: instead of creating a new customer without an order or and order without items, you would simply have nothing at all, which is preferable as it keeps your database clean. The customer will have to repeat their order, instead having it apparently go through, waiting for the items to be shipped, which will never happen, and growing frustrated with your business.
Databases and RapidMiner
This example nicely illustrates the basic ideas of databases and transactions, but an example closer to a real-world scenario for using RapidMiner in connection with databases would be sensor data: In a factory, the machines send various data such as operating time, number of operating cycles, number of items produced, possibly temperature, humidity etc. Now imagine a power failure during the writing of the newest sensor data: we have operating time and items produced, but just before the value for temperature could be entered into the database, the computers shut down. If you were to reboot them and simply start over the process, crucial mistakes might creep into your database: this is because RapidMiner does not work with transactions. This means that when the power went out, the entries that had been written until then stayed in the database. Repeat the process and the entries that were in the database already will be made again, effectively doubling those values. 8 hours operating time become 16, 300 items produced are now 600 – you see how this might have more serious consequences than “just” an untidy database or an order gone missing.
Now, you might say, a power failure is unlikely nowadays and there are backup generators, so this really isn’t much of an issue at all, but there are other reasons why a process might be canceled half-way through, such as internal errors when a value does not match the specified required values.
The In Database Extension: Transaction Safety for RapidMiner
To avoid this issue, RapidMiner needs transactions, a feature we are introducing with the In Database Exentsion. The Write and Update operators work on a transactional basis: either the whole task is carried out or all changes are discarded. Furthermore, there is a Start Transaction operator with a subprocess level. Here, you can build a process which will work as one transaction, all or nothing. Additionally, you can use the Commit Transaction and Rollback Transaction operators for greater transactional control.
Below you can read about the operators in detail. Did we spark your interest? Check back soon and follow us on Twitter for a tutorial on how to use the In Database Extension, and of course to be informed about the release!
The Operators in More Detail
The operators of the extension can be divided into four categories: basic, management, transactions, and query building.
The Basic Operators
Query Database
The Query Database operator is the core of the database extension. It sends a query to the specified database and receives the delivered data. All query building operators from the Query Building folder (see below) must be placed inside the subprocess of this operator as they constitute the query executed by it. These operators make a process-based query possible.
Alternatively, you can enter a custom query in the operator’s parameters using SQL. Here, you have the convenient option of universal quoting: instead of having to adapt your code for the various requirements set by different databases, use single quotation marks (‘) for identifiers such as table names and double quotes (“) for values such as strings. They will automatically be converted into the required sign, allowing for basic compatibility between databases. In order to set parameters, use a question mark (?) in the the query string and enter the parameters with a click on the button “Edit Enumeration”.
Another feature going beyond the basic RapidMiner functions is the possibility to add an autogenerated ID column while writing or updating the database.
Additionally, you can enable optimize names, applying the Make Names Great Again functionality originally found in our Jackhammer Extension and thus converting column names to UpperCamelCase style, following best practice.
Query Database (Batch)
Query Database (Batch)
The Query Databse (Batch) operator splits the received data into batches and allows parallel execution of RapidMiner operations on these batches.
It receives a query object from the query object output port of an Query Database operator. This query is executed and the data is received and processed in batches. The operator contains a subprocess. The operations contained in this subprocess can be executed parallelly to reduce execution time. The size of the batches can be specified via the batch limit parameter.
Write Database and Update Database
The Write and Update operators can be used either on their own or within the subprocess of the Start Transaction operator. They will show transactional behavior also when used on their own, i.e. either finishing the whole task or discarding all changes made in case of an error being thrown up during the writing or updating process.
Execute Statement
The Execute Statement operator allows to execute an arbitrary SQL statement against a database. It either can work standalone or within a transaction in concert with other operators. You can use universal quoting for basic compatibility across databases.
Management Operators
List Table
The List Tables operator returns a list of all tables defined in the given database. It can be used for management tasks or automation within loops.
Drop Table
The Drop Table operator simply drops a table in an SQL database. The table is identified by name in the parameters. It can be used in combination with the List Tables operator’s returned list of tables to delete entire databases. So be careful! Best make sure that you are not operating with an admin account on important data warehouses!
Truncate Table
The Truncate Table operator strips a specified table of all content. That means the table is entirely wiped clear but the structure is left as it is. In contrast to the Drop Table all indices and constraints still remain and will be filled with new data, when new data is inserted.
Transaction Operators
Start Transaction
The Start Transaction operator will begin a transaction on a transactional database server. That means that any error during the processing of the transaction will discard any uncommitted changes. This operator uses an inner subprocess where the processing of the transaction can be performed.
Commit Transaction and Rollback Transaction
These two operators are used within the Start Transaction operator and can be used to gain greater control over the process.
The Commit Transaction operator will commit any uncommitted changes to the database. That means they will be persisted in the database and will become visible to other users. It will automatically start a new transaction afterwards.
The Rollback Transaction operator allows to rollback a currently active transaction without leaving the subprocess of the surrounding Start Transaction operator. This can be used when in need of direct control of a transaction.
Query Building Operators
If you choose not to use a custom SQL query, you have these operators available to build a query process:
Use Table – Select Rows – Filter Rows – Generate Columns – Group by – Union – Join – Rename Column – Limit – Order by
And as conditions: And – Or – Condition – Equals