Advanced Reporting Extension: An Introduction
Today we will talk about our Advanced Reporting Extension for RapidMiner and demonstrate some of its features. With its three operators, it looks tiny in comparison to the bulky Jackhammer Extension, but it adds them in a blind spot of RapidMiner and is designed to free the common data scientist of some of their worries.
The idea is to use the capabilities of RapidMiner to automate any regular reporting task that results in an Excel sheet. Many projects and data science departments are downright drowned in these requests, consuming all resources before you can get to the really fun part of data science. Now you can simply start at the beginning to create a nearly zero overhead reporting, even if you do not have or cannot use real business intelligence tools like tableau or qlik.
How does it work?
Step 1: Create a template in Excel
First, we will create a dummy sheet and add all of the desired layout components, diagrams, texts, and of course areas for data.
We can use any formatting, chart type or conditional coloring that we like, including those nice spark lines. Just one thing is important: We need to reserve space for inserting the data. What will happen later is that we overwrite parts of the content of the table with data from RapidMiner. So if we have more than three employees, we would need to either leave more space between the table and the diagram, or to just put the data into a separate sheet and reference this in the diagram. But if you are used to Excel reporting, you probably know all these tricks.
Insert some dummy values so that you can see the charts in action.
Do not forget to save the file as we will need it later.
Step 2: Create a process in RapidMiner to load the data
RapidMiner is very versatile when it comes to getting the data into the shape you want. It can read and combine many different formats and sources and then aggregate, join, pivot, and process the data into the shape you need it in.
On the right, you can see a process combining data from four different sources with multiple joins and preprocessing steps to match the data. Such a process could just deliver the data we want to put into our nice Worktime sheet.
Of course it could be much simpler and just contain a single SQL query or also be much more complex involving calling of web services, Big Data and analytics on hadoop, some machine learning or whatever. The trick is that we can leverage the entire flexibility of RapidMiner to get the data we want to put into an Excel sheet.
Step 3: Open Report
Once we have the data in the desired format, we add an Open Report (Excel) operator from our extension. You see it on the right hand side in the operator tree. We need to point the operator to two files: The template file we created and saved in Step 1 – you can either use the parameter form template file or the tem input port. The second file can be specified as target file parameter or by using the tar output port.
Why the ports for files? Because it allows you to handle the files conveniently in scenarios where you want to do other things with them later on in the process. You could even create a template file in a RapidMiner process, or less fancy and more realistic: Store the file in the repository of a RapidMiner Server to share among many users. The output file port is most useful if you want to either zip the result or return it as a web service result in a RapidMiner Server
Any data we want to insert into the Excel file, we need to forward to the input ports of the Open Report (Excel) operator. Don’t worry, there another input port will always appear when you have connected the last available one. We will use the data delivered to these ports in the inner sub-process to do the actual insertion.
Step 4: Insert Tabular Data
Entering the inner process of the Open Report (Excel), we can add the Write Data Entry (Excel) operator to insert an ExampleSet into the excel file. We have done so with the first ExampleSet on the screenshot below. The operator allows to select which attributes to use and where to place them. Therefore, you specify the sheet location via its index. Then you point it to a fill range. A range can either be open-ended by specifying the left upper cell of the area, or closed, if followed by a colon and the right lower cell. So, B2 would start in the second column, second row. B2:D4 would allow to fill 2 rows and 2 columns.
For our little employee table from Step 1, we set it to B11:C13. Unless we select fit to range, the process will now fail if our data does not fit into this range.
We will add another operator of this type to output the second table.
Step 5: Insert Data
The only thing missing is the version tag, so that people know what this report was about when they open them at a later time.
Therefore, we first use a Generate Macro operator from RapidMiner’s core functions to create a process variable (or macro as they call it) containing the current date and time. We then add a Write Cell (Excel) operator from the Advanced Reporting Extension and connect the ports. Although there will be no data flowing from the Generate Macro operator to the Write Cell (Excel) operator, the connection ensures that the Generate Macro will be executed first and set the process variable before it is read.
Then we just need to point the Write Cell (Excel) operator to the right fill position , which is F5 in our case. Set the value and type correctly and we are good to go.
A quick note about dates:
There is an unlimited number of different date formats out there. If you want to write a date to excel, you first need to parse the date format to the value it has in RapidMiner. So, if you enter something like 2017-03-29 23:59:59 as value, you should enter “yyyy-MM-dd HH:mm:ss” in the date format parameter of the Write Cell (Excel) operator. Once it knows the date, it will automatically transform it to the correct format of the Excel Template Sheet, where you set it via the Cell Format.
Once the sub-process is finished, the target file will be written and you just need to mail it to someone else and be done with it.
We would like to recommend to just automate about everything right from the beginning. There is no “I just need to do this once”. In 90 % of all cases, you will need to do it twice and then the additional overhead of the automation already would have paid off. So please feel free to download the extension, order a license and ask any questions you might have. In case you are not convinced yet, the free version lets you access the full functions and only limits the number of Write operators within each sub-process to one. Download it here.