Blog

Today we are happy to announce the Advanced Reporting Extension for RapidMiner. With its three operators it looks tiny compared to the bulky Jackhammer Extension, but it fills a blind spot of RapidMiner, and its operators are designed to take some worries off the shoulders of the common data scientist.

The idea is to use the capabilities of RapidMiner to automate any regular reporting task that results in an Excel sheet. Many projects and data science departments simply drown in this kind of request, which consumes all resources before anyone can get to the really fun part of data science. Now you can start right at the beginning and create nearly zero-overhead reporting, even if you don't have or can't use real business intelligence tools like Tableau or Qlik.

How does that work?

Step 1: Create a template in Excel

 

First we create a dummy sheet and add all the desired layout components: diagrams, texts and, of course, areas for the data.

We can use any formatting, chart type or conditional coloring that we like, including the nice sparklines. Just one thing is important: we need to reserve space for inserting the data. Later on, parts of the sheet's content will be overwritten with data from RapidMiner. So if we have more than three employees, we would either need to leave more space between the table and the diagram, or simply put the data into a separate sheet and reference it from the diagram. But if you are used to Excel reporting, you probably know all these tricks...

Insert some dummy values so that you can see the charts in action.

Don't forget to save the file. We will need it later.

 

Step 2: Create a process in RapidMiner to load the data

RapidMiner is very versatile when it comes to getting data into the shape you want. It can read and combine many different formats and sources and then aggregate, join, pivot and process the data into the shape you need.

On the right you see a process combining data from four different sources, with multiple joins and preprocessing steps to match the data. Such a process could deliver exactly the data we want to put into our nice Worktime sheet.

Of course it could be much simpler, containing just a single SQL query, or far more complex, involving web service calls, Big Data analytics on Hadoop, some machine learning or whatever. The trick is that we can leverage the entire flexibility of RapidMiner to get exactly the data we want into an Excel sheet.

Step 3: Open Report

 

Once we have the data in the desired format, we add an Open Report (Excel) operator from our extension. You see it on the right-hand side in the operator tree. We need to point the operator to two files: the template file we created and saved in Step 1, which you can specify either via the template file parameter or the tem input port, and the target file, which can be specified via the target file parameter or the tar output port.

Why are there ports for the files? Because they allow you to handle the files conveniently in scenarios where you want to do something with them later in the process. You could even create a template file in a RapidMiner process, or, less fancy and more realistic, store the file in the repository of a RapidMiner Server to share it among many users. The output file port is most useful if you want to zip the result or return it as the result of a RapidMiner Server Webservice or Web Application.

Any data we want to insert into the Excel file needs to be forwarded to the input ports of the Open Report (Excel) operator. Don't worry: a new input port will appear whenever you connect the last one. In the inner subprocess we will use the data delivered to these ports to do the actual insertion.

Step 4: Insert Tabular Data

 

Once we have entered the inner process of the Open Report (Excel) operator, we can add a Write Data Entry (Excel) operator to insert an ExampleSet into the Excel file. We have done so with the first ExampleSet in the screenshot on the right. The operator lets you select which attributes to use and where to place them. You specify the sheet to insert into by its index and then point the operator to a fill range. A range can either be open ended, given by the upper left cell of the area, or closed, if that cell is followed by a colon and the lower right cell. So B2 would start in the second column, second row, while B2:D4 would allow filling three rows and three columns.
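If you are curious what such a fill operation boils down to under the hood, here is a minimal Java sketch using the Apache POI library, a common choice for manipulating Excel files. This is our own illustration under those assumptions, not the extension's actual code, and the data array is made up:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.apache.poi.ss.usermodel.*;

    public class FillRangeSketch {
        public static void main(String[] args) throws Exception {
            // Open the template created in Step 1; layout and charts stay untouched.
            Workbook workbook = WorkbookFactory.create(new FileInputStream("template.xlsx"));
            Sheet sheet = workbook.getSheetAt(0); // sheet selected by index

            // Hypothetical data for the employee table (rows x columns).
            Object[][] data = { {"Alice", 38.5}, {"Bob", 40.0}, {"Carol", 32.0} };

            // Fill range B11:C13 -> zero-based rows 10..12, columns 1..2.
            int firstRow = 10, firstCol = 1;
            for (int r = 0; r < data.length; r++) {
                Row row = sheet.getRow(firstRow + r);
                if (row == null) row = sheet.createRow(firstRow + r);
                for (int c = 0; c < data[r].length; c++) {
                    Cell cell = row.getCell(firstCol + c);
                    if (cell == null) cell = row.createCell(firstCol + c);
                    if (data[r][c] instanceof Number) {
                        cell.setCellValue(((Number) data[r][c]).doubleValue());
                    } else {
                        cell.setCellValue(data[r][c].toString());
                    }
                }
            }

            // Write the filled workbook to the target file; the charts now show the new values.
            try (FileOutputStream out = new FileOutputStream("target.xlsx")) {
                workbook.write(out);
            }
        }
    }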

For our little employee table from Step 1, we set it to B11:C13. Unless we select fit to range, the process will now fail if our data does not fit into this range.

We will add another operator of this type to output the second table.

Step 5: Insert Single Values

 

 

The only thing still missing is the version tag, so that people know what this report was about when they open it at some later point.

To that end, we first use a Generate Macro operator from RapidMiner's core functionality to create a process variable (or macro, as they call it) containing the current date and time. We then add a Write Cell (Excel) operator from the Advanced Reporting Extension and connect the ports. Although no data flows from the Generate Macro operator to the Write Cell (Excel) operator, the connection ensures that Generate Macro is executed first and sets the process variable before it is read.

Then we just need to point the Write Cell (Excel) operator to the right fill position, F5 in our case, set the value and type correctly, and we are good to go.

A short note on dates: there is a virtually unlimited number of date formats out there. If you want to write a date to Excel, the operator first needs to parse the format the value has in RapidMiner. So if you enter something like 2017-03-29 23:59:59 as the value, you should enter "yyyy-MM-dd HH:mm:ss" in the date format parameter of the Write Cell (Excel) operator. Once the date is parsed, it is automatically rendered in the format of the Excel template sheet, which you control via the Cell Format.
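The pattern shown here follows Java's SimpleDateFormat syntax, which RapidMiner uses elsewhere as well; assuming the extension does the same, you can test a pattern in a few lines of Java before entering it:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DatePatternCheck {
        public static void main(String[] args) throws Exception {
            // Parse the value exactly as entered in the operator's value parameter...
            SimpleDateFormat pattern = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            Date parsed = pattern.parse("2017-03-29 23:59:59");

            // ...Excel then renders the parsed date according to the template's Cell Format;
            // here we just check that the pattern round-trips the value.
            System.out.println(pattern.format(parsed)); // prints 2017-03-29 23:59:59
        }
    }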

Once the subprocess has finished, the target file is written, and you just need to mail it to someone and be done with it.

We recommend automating just about everything right from the beginning. There is no such thing as "I just need to do this once". In 90% of all cases you will need to do it twice, and by then the additional overhead of the automation will already have paid off. So please feel free to download the extension, order a license and ask any questions you might have. In case you are not convinced yet: the free version gives you access to the full functionality and only limits the number of Write operators to one per subprocess.

Download it here.

 

Undoubtedly, Data Science is one of the most important new capabilities in companies and will shape the future over the next decades. But how do you establish Data Science with all its facets, from the craftsmanship of big data architecture through the art of business intelligence to the magic of predictive analytics, in a company? Where do you start? Which steps do you need to take to get there? And what actually is "there" at all?

The Düsseldorf Data Science Meetup invited our Lead Data Scientist Sebastian Land to give some answers. His talk aggregated over ten years of experience in the field and joined it with an in-depth understanding of the technologies involved. Enriched with our practical project experience at Old World Computing, he, a true data scientist at heart, applied his predictive powers to picture possible paths with their chances, but also cautioned about probable pitfalls and the cliffs of costly decisions on this journey into the future.

We thank the organizers for the opportunity to speak at such a great event in a great location, and we hope that the surprisingly many visitors could take home at least some light to shed on their own problems. Thank you for coming!

To preserve the knowledge hopefully gained there, we have made the slides available here.

We just published an update of the Statistics Extension that should make it compatible with RapidMiner 7.3. However, several restrictions of the new RapidMiner security mechanism prevent existing functionality from working: while all operators work as expected, you will not be able to enter the license key in the dialog.

Hence, for now, you need to go to the RapidMiner Settings / Preferences menu, switch to the Licenses tab and paste your email address and license key there, as shown on the right.

 

While our Jackhammer performed very well in our projects, we are relieving it from active duty. For now.

This sounds much sadder than it really is. The simple reality is that it won't be compatible with future versions of RapidMiner in its current form, and we don't want to promise our customers something we cannot live up to. So we will take some time to spit-polish everything in there that will keep working, and then it will be back, hammering away at your data science problems. In the meantime we have discovered so many useful features for large-scale data science projects that we are going to implement that you should watch for its return anyway.

 

 

So, finally we have done it! If you have followed our Twitter posts or some remarks in the RapidMiner Community, you may already have been aware that we were working on something larger. Now, after more than a month of spit-polishing our brand new extension with documentation, tests and an improved user interface, it's done. It's available now!

Wow, this really is a great feeling right now. I guess you would have an easier time sharing my feelings and excitement if you knew what the extension is actually about. So please find below a very short collection of images. A long description of the features can be found here.

Loops to save time and complexity

The new loop operators allow an arbitrary number of input ports. If you need to continuously change one specific object, you can take it from the loop port, which in the first iteration holds a copy of the object on the first input port. Once an iteration has been executed, its result is fed back to the loop port for the next one. No Remember / Recall is needed to continuously aggregate data as above.

The screenshot shows how data is appended to the previous results and then swapped out to disk.
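Conceptually, the loop port behaves like the accumulator in a fold loop. A minimal Java sketch of the same idea, with all names and data made up by us:

    import java.util.List;

    public class LoopPortSketch {
        public static void main(String[] args) {
            List<int[]> batches = List.of(new int[]{1, 2}, new int[]{3, 4}, new int[]{5});

            // The "loop port": starts as a copy of the first input and then
            // carries each iteration's result into the next iteration.
            int accumulated = 0; // stands in for an initially empty ExampleSet
            for (int[] batch : batches) {
                for (int value : batch) {
                    accumulated += value; // "append" this iteration's data to the previous result
                }
            }
            System.out.println(accumulated); // 15: the aggregate over all iterations
        }
    }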

Continuous Tests for your process collection

Here you see a simple process that combines our new Loop Repository operator, which lets you iterate over processes, with the testing operator. It loads each process and the respective stored result and then compares them. If the results do not match, regardless of whether they are data, a performance vector or a model, an exception is thrown and you know that something went wrong in your infrastructure.
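The pattern is the classic regression test, just lifted to the level of whole processes. A minimal Java sketch of the idea, where both helper functions and the repository paths are hypothetical stand-ins:

    import java.util.Objects;

    public class ProcessRegressionTestSketch {
        // Hypothetical stand-ins for executing a stored process and loading its expected result.
        static Object runProcess(String path) { return "result"; }
        static Object loadExpectedResult(String path) { return "result"; }

        public static void main(String[] args) {
            String[] processes = {"//Repo/tests/process1", "//Repo/tests/process2"};
            for (String path : processes) {
                Object actual = runProcess(path);
                Object expected = loadExpectedResult(path + "_expected");
                // Whether data, performance vector or model: any mismatch fails the run.
                if (!Objects.equals(actual, expected)) {
                    throw new IllegalStateException("Regression detected in " + path);
                }
            }
            System.out.println("All processes still produce their stored results.");
        }
    }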

On Disk Memory and Caching

The Storage Statistics Panel gives you an overview of how much memory you are consuming on your hard disk (or better, solid state disk) and how many cache entries are currently stored.

Above you see how tens of thousands of columns are stored, each of them in a data set with around 60,000 rows. And this extra memory comes at a speed penalty of only a factor of three.

The new Cross Validation

The new cross validation looks nearly the same from the inside...

...but allows generating final preprocessing models if they are built during the training phase. The training phase may also use other external input. Finally, the cross validation can output test results for all rows, like an X-Prediction, with no time overhead. But that's not all...

Unlike the default version, it will make full use of your powerful multi-core CPU, like here on my laptop.

The same is true for the loop operators, which can execute iterations in parallel if the results do not depend on each other! A great way to speed up complex operations on larger (but not yet big) data sets.

New Panels

You can not only use your multi-core CPU to speed up some operators, but also run multiple processes at once, right within your Studio. Simply drag them onto the panel or press the play button!

As you can see above, they are shown beside the tasks of the operators. Finished processes let you access their results until you clear the list. You can even access the logs of the processes during runtime!

Free Demo Version

The good news is: everybody who wants to try it can just proceed to the download on our product page here. While the free demo version is restricted in some features, most of them are free to use for everybody! But if you like it, we will be very happy to welcome you as a paying customer...

Of course we are also happy to receive feedback about the features, what's great, what's missing and so on, but first you should familiarize yourself with the extension: here's the list of features.

As the benefit of some of the features may not be obvious to everybody, we will release a series of videos explaining when, why and how to use them. Most of them are really meant for the hard-boiled data scientist in a production environment that not all of us face every day. On the other hand, the extension also features several operators that are very useful for beginners, as they simply make things easier to achieve.

We just released the third update of the Statistics Extension since the initial version. This time we continued our effort to add more functionality in the area of survival curves, due to continued interest from our customer base. We added further operators for generating survival curves following the Fleming-Harrington method, and also an operator for computing a hazard curve, complementing the survival curves. For the latter we chose the Nelson-Aalen method as one of the most widely used ones.
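For readers who have not met these estimators: the Nelson-Aalen cumulative hazard sums d_i / n_i over the event times, and the Fleming-Harrington survival estimate is exp(-H(t)). A minimal Java sketch with made-up numbers (not the extension's code):

    public class HazardSketch {
        public static void main(String[] args) {
            // At each event time: d_i deaths among n_i subjects still at risk.
            int[] deaths = {2, 1, 3};
            int[] atRisk = {10, 8, 6};

            double cumulativeHazard = 0.0; // Nelson-Aalen: H(t) = sum of d_i / n_i
            for (int i = 0; i < deaths.length; i++) {
                cumulativeHazard += (double) deaths[i] / atRisk[i];
                // Fleming-Harrington survival estimate: S(t) = exp(-H(t))
                double survival = Math.exp(-cumulativeHazard);
                System.out.printf("t_%d: H = %.3f, S = %.3f%n", i + 1, cumulativeHazard, survival);
            }
        }
    }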

 

The Nelson-Aalen Cumulative Hazard Curve

The Fleming-Harrington Survival Curve

Outlook

Our development resources will focus on another extension for the next few weeks, but we will soon continue developing this one. If you feel something is missing for your particular application, please don't hesitate to send us some inspiration. Our consultants are not (yet) active in every single field of data science, so we might not know your problems and requirements unless you tell us. We would be happy to understand your problems and will design our roadmap accordingly!

 

Chances are, you are thinking about buying a new laptop right now. Wouldn't it be nice to spend some time over the Easter holidays with a brand new computer? Whenever I unpack a new device, the smell of not-yet-used electronics throws me back into some of the happiest hours of my life. Unfortunately quite a long throw by now, but there's nothing I can do about that, even if priorities have definitely changed over the years. Still, when I see the box of a new computer, I get excited to turn it on...

Unfortunately for us data scientists, we are in constant conflict with the laws of physics, and so far they have always gained the upper hand. The limits they impose forced Intel and the other processor manufacturers to step back from the race to higher clock speeds; instead, they now multiply the number of cores more or less creatively. But since many data science problems are single-threaded when solved in a straightforward, generic way, most of the time we end up waiting for our PC. Even if you just unpacked a shiny new laptop combining the power of eight cores, you may not benefit from them directly.

It can be quite frustrating when your RapidMiner process uses only one or two CPU cores and the fan does not even turn on, because most of your CPU's powerful muscles are not moving at all. In this situation there are only two solutions:

  • You could use the remaining six cores to play one of your favorite computer games, guaranteed to consume any available computational resource, or
  • You can think about your task at hand and identify parts that could be executed in parallel if arranged in a smart way.

Patterns for preparing RapidMiner processes for parallel execution

In principle, two tasks can be executed in parallel if they do not depend on each other's results. In data science there is a very simple and yet very frequently used operation that consists of several independent tasks: the cross-validation. As you might remember from your data science education, each iteration of a cross-validation is independent of the others, and the most commonly used ten-fold cross-validation has ten such iterations, plus potentially an eleventh run of the training phase to derive a final model on the full data set. (If you don't remember, you really should take a look at our training offers. Getting correct results is always more important than getting results faster...) These ten or eleven tasks can all be computed in parallel, using up to 11 CPU cores, something still beyond the usual laptop.
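To illustrate the principle (this is not the extension's actual code), here is a minimal Java sketch that runs the folds of a cross-validation on a thread pool; evaluateFold is a hypothetical stand-in for training and testing one fold:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    public class ParallelFoldsSketch {
        public static void main(String[] args) throws Exception {
            int folds = 10;
            // Leave one core free for the user interface (see below).
            int threads = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
            ExecutorService pool = Executors.newFixedThreadPool(threads);

            List<Future<Double>> results = new ArrayList<>();
            for (int fold = 0; fold < folds; fold++) {
                final int f = fold;
                // Each fold trains on 9/10 of the data and tests on the remaining tenth.
                results.add(pool.submit(() -> evaluateFold(f)));
            }

            double sum = 0;
            for (Future<Double> r : results) sum += r.get(); // wait for all folds
            System.out.println("mean accuracy: " + sum / folds);
            pool.shutdown();
        }

        // Hypothetical placeholder: train and test one fold, return its accuracy.
        static double evaluateFold(int fold) {
            return 0.9; // dummy value for the sketch
        }
    }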

 

We are currently working on an extension that includes an implementation of the cross-validation which, besides some other practical benefits over the standard implementation, executes its iterations in parallel. In contrast to the other extension providing a parallel implementation of a cross-validation, ours is much more robust and can be stacked with other parallel operators. On the right-hand side you see a screenshot of my CPU utilization while running a cross-validated Forward Selection with a cross-validated linear regression learner inside. You can see that all CPU cores are running at nearly full power. As a user, you perceive a speed-up that is nearly linear in the number of CPU cores, drastically improving your work efficiency: where you waited 8 minutes for a result before, you now have it after roughly 1 minute 30 seconds.

There are two reasons why the factor is not 8: for one, there is additional overhead for copying the data 8 times to avoid any interference between the tasks; second, our extension so far leaves one CPU core free for other purposes like drawing the user interface and reacting to mouse clicks. With 7 worker cores, 8 minutes would come down to about 69 seconds even before the copying overhead. But honestly, as a user you won't perceive the difference between a factor of 7 and a factor of 8.

Parallel execution for preprocessing

Besides this core of predictive analytics, where the learning algorithms run, one usually spends a lot of time preprocessing the data. Admittedly the learning algorithms will usually consume most of the CPU cycles, but on the other hand we nowadays have very complex data types. There are text files, and those are just the easiest kind. Especially in industrial applications there are measurement files containing pressure and force, recorded every millisecond. There might even be audio files capturing what a microphone recorded at a resolution of 44,100 Hertz. Such data can quickly grow to hundreds of GBs, and preprocessing becomes an issue in both time and memory.

To remedy this problem, our extension contains advanced operators for separating a task into many subtasks that are executed in parallel. If we stick to the example of the audio files, we might simply generate a list of all the files, which is then separated into batches of a given size. These batches are then executed in parallel, with one thread processing the audio files of each batch.

Of course you need to be careful: in many situations it is quite important to process a group of these files together, because you might need to scale them all by the same factor. Our operators let you set up group identity attributes, so that all files having the same values for these attributes are guaranteed to be processed together.

With that combination of features, it is quite simple to first distribute a given task over subtasks, generate a result for each subtask, and then, as a last step, combine the single results back into one large result.
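A minimal Java sketch of that split / process-in-parallel / merge pattern, with grouping so that files belonging together stay in the same batch; the file names and the processing step are made up:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class GroupedBatchesSketch {
        public static void main(String[] args) {
            // A list of measurement files; the prefix before '_' acts as the group identity.
            List<String> files = List.of("sensorA_1.wav", "sensorA_2.wav",
                                         "sensorB_1.wav", "sensorC_1.wav");

            // Split: files with the same group value are guaranteed to land in one batch.
            Map<String, List<String>> batches = files.stream()
                    .collect(Collectors.groupingBy(f -> f.substring(0, f.indexOf('_'))));

            // Process in parallel: one thread handles all files of a batch together,
            // e.g. so they can all be scaled by the same factor.
            List<String> results = batches.values().parallelStream()
                    .map(GroupedBatchesSketch::processBatch)
                    .collect(Collectors.toList());

            // Merge: combine the single results into one large result.
            System.out.println(String.join("; ", results));
        }

        // Hypothetical stand-in for the real feature extraction on one batch.
        static String processBatch(List<String> batch) {
            return batch.size() + " file(s) processed";
        }
    }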

Of course, in many situations main memory is also a limitation and you cannot load all the data at once. Thanks to the inherent flexibility of RapidMiner processes, you can use our looping operators to avoid loading the entire data set. As in the example above, with the list of files to process you can simply iterate over the files one by one, extract the features and discard the raw data again, or store it in the repository file by file, never loading everything into memory at once.

The Jackhammer Extension - to be expected by the end of this quarter

We are currently completing our work on this extension and polishing the user experience. Following our principles, this extension has been developed to increase the productivity of our consultants and to solve regularly occurring problems, with the benefit that it has been tested in real-world projects all along and is consequently already remarkably stable. But there is still some way to go with the documentation, and especially with the name. While we originally intended it as a collection of useful tools for analyzing data, it showed more and more muscle, and by now the originally intended name of "Analyst's Toolkit" seems quite an understatement. And as RapidMiner itself is a golden hammer that can be (mis-)used to solve nearly every data-related problem, we now think of ours as the Jackhammer Extension: it will dig through big mountains of data easily, without the (really big) investment in real big data infrastructure.

With the newest release, the Statistics Extension is now fully compatible with RapidMiner Server. With the help of the great development team at RapidMiner, we have been able to sort out all issues with the licensing. Please excuse any inconvenience this might have caused, and we wish you a lot of fun integrating the new operators into your Web Applications and Web Services!

 

To learn how to set up the extension in RapidMiner Server, please take a look here.

Following our announcement, we are going to implement the feature requests we received from our customers and clients. Unfortunately, we were quite sidetracked by several projects (and other very cool stuff that we will announce later) during the last months. On the other hand, this gave us the means to hire some new programmers, so that we can now continue our mission to provide more cool stuff around RapidMiner to ease our daily lives as (data) scientists.

As a first result of that grown capacity, we can now present our first own visualization of a statistical property. We introduced a new object into RapidMiner that holds the chart information. Within RapidMiner Studio it automatically presents the curve in the Result Perspective. Within RapidMiner Server it can be published in a Web App and presented to users as a PNG file. We hope that many will find this way of charting useful and would be happy to receive feedback about it.


Of course there is also a way to get the results of the Kaplan-Meier estimation as a standard data table in RapidMiner, so that we not only have the visualization but can also act on the results in the standardized and very powerful manner of RapidMiner processes. We could imagine a fully automated advertisement campaign where the survival curves of the participants are used to check on its success. But many other applications are conceivable.
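For reference, the Kaplan-Meier estimator behind those numbers is the product-limit formula: at each event time the survival probability is multiplied by (1 - deaths / at-risk). A minimal Java sketch with made-up numbers (not the extension's code):

    public class KaplanMeierSketch {
        public static void main(String[] args) {
            // At each event time: d_i deaths among n_i subjects still at risk.
            int[] deaths = {2, 1, 3};
            int[] atRisk = {10, 8, 6};

            double survival = 1.0; // S(t) = product of (1 - d_i / n_i)
            for (int i = 0; i < deaths.length; i++) {
                survival *= 1.0 - (double) deaths[i] / atRisk[i];
                System.out.printf("after event %d: S = %.3f%n", i + 1, survival);
            }
        }
    }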


Unfortunately, the update process is still not as easy as we would like: you still need to download the extension manually here. But we are collaborating with RapidMiner to extend their marketplace, so that we can finally publish our extension there and you can update it like every other one.

After we got the surprising chance to present our extension at the RapidMiner Wisdom in Slovenia, we received quite some interest and feedback. I want to use this opportunity to thank Simon and Ingo from RapidMiner for spontaneously inviting us to give this presentation! Some of the feedback has already made it into program code, and hence we can present not only another testimonial but also a new release!

What is new...

Besides a small workaround for server installations that was released to our customers in the meantime, the update contains two new operators, namely Extract Odds Ratios and Extract Risk Ratios. These compute measures frequently used in clinical studies to characterize the effect of an exposure on a disease.
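Both measures are computed from a 2x2 contingency table of exposure versus disease outcome. A minimal Java sketch with made-up counts (anything beyond the point estimates, such as confidence intervals, is omitted here):

    public class RatioSketch {
        public static void main(String[] args) {
            // 2x2 contingency table:            diseased   healthy
            double a = 20, b = 80;  // exposed:      a          b
            double c = 5,  d = 95;  // unexposed:    c          d

            // Odds ratio: odds of disease among the exposed vs. the unexposed.
            double oddsRatio = (a / b) / (c / d);             // = (a*d) / (b*c) = 4.75
            // Risk ratio: risk of disease among the exposed vs. the unexposed.
            double riskRatio = (a / (a + b)) / (c / (c + d)); // = 4.0

            System.out.println("OR = " + oddsRatio + ", RR = " + riskRatio);
        }
    }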

...and what will come

We are currently intensifying our research into what is used in medical and social studies, to put together a road map for further releases. To support this process, we really welcome comments and feedback about our extension. It does not have to be like the one we got from an audience member during the presentation:

"As a statistician I want to thank you from the depth of my heart for implementing the Split Data by Groups operator"

We can also live with notifications about shortcomings and missing features, especially if they interrupt your workflow and force you to switch tools.