The full list of features is available on the right. The Jackhammer extension covers several different issues that we experienced in our daily project work. All operators integrate neatly with any RapidMiner process, so you can use them in Studio as well as on Server, and in any situation where you find them useful. Below we address the different features and the project phases in which they can be used, and link to pages with additional information about the particular operators and how to use them.
One of the greatest features of RapidMiner is its very flexible process flow. We can have different flows using branches, we can loop over files, and we can even create control tables that are then used to control the process execution. The latter is incredibly useful for projects where you need to combine the same building blocks in different ways in different situations, or where you want to give an end user a certain amount of control over the analysis steps.
However, while many of the existing operators are very helpful, it turned out that they have several shortcomings that you need to work around nearly all the time. We decided that it was about time to reduce the size of the processes by five or six operators per loop, and invented new operators that have the necessary functionality to cover 98% of the use cases by themselves. If you have seen enough Remember / Recall ping-pong with extra operators for initializing before the first loop execution, if you have had enough of working out the exact execution order of the process only to destroy it by accident later, well, now you don't need any of that anymore. If you have never heard of it, chances are you will never have to rack your brain over it...
Once you have seen three processes running in the background on your eight-core machine, you ask yourself: why stop at three CPUs working for me? To avoid frustrating waiting times, we added some nice features: the loops mentioned above all support parallel execution of their inner processes if the process design allows for it. If your process depends on the results of the previous loop iteration, the extension will automatically switch back to serial execution.
In any other situation, the extension will grab as many CPU cores as there are (or as many as you allow it to use) and execute the iterations in parallel. This is, of course, especially valuable for long-running subprocesses. In data science, this is nearly always the learning algorithm. As you usually do the learning within a cross validation to get a reliable estimate of the predictive quality of a model, we have made sure that there is now a parallel version of the cross validation. Compared to the cross validation offered by the free Parallel Extension, ours has some advantages. First, it is reliable. Second, you can stack it, and it will not only work but also share the free CPUs with any background process or other parallel task. Third, there are additional ports that let you validate pre-processing models and pass additional inputs to the training subprocess. (Yes, forget about Remember / Recall!) Last but not least, it also supports the features of the original X-Prediction, so that you can easily get the actual predictions to the outside, for example for post-processing or complex performance calculations.
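To make the idea of a parallel cross validation that also hands out the actual predictions more concrete, here is a minimal Python sketch. It is not the extension's implementation; the function names, the trivial mean learner and the round-robin fold assignment are assumptions made purely for illustration. Each fold is independent of the others, which is what makes parallel execution possible.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def mean_learner(train_rows, test_rows):
    """Hypothetical stand-in for a learning algorithm: predict the training mean."""
    average = mean(label for _, label in train_rows)
    return [average for _ in test_rows]


def k_fold_indices(n_rows, k):
    """Assign row indices to k disjoint test folds in round-robin order."""
    return [list(range(fold, n_rows, k)) for fold in range(k)]


def parallel_cross_validation(rows, learner, k=3, max_workers=None):
    """Run all folds concurrently and return one out-of-fold prediction per row."""

    def run_fold(test_idx):
        held_out = set(test_idx)
        train_rows = [row for i, row in enumerate(rows) if i not in held_out]
        test_rows = [rows[i] for i in test_idx]
        return list(zip(test_idx, learner(train_rows, test_rows)))

    # Threads keep the sketch short; CPU-bound learners in Python would
    # normally use a process pool to occupy several cores.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fold_results = list(pool.map(run_fold, k_fold_indices(len(rows), k)))

    predictions = dict(pair for fold in fold_results for pair in fold)
    return [predictions[i] for i in range(len(rows))]


# Rows are (features, label) pairs; the features are unused by the mean learner.
rows = [((i,), float(i)) for i in range(1, 7)]
out_of_fold = parallel_cross_validation(rows, mean_learner, k=3)
```

Returning the predictions in row order mirrors the idea of getting the actual predictions "to the outside", where they can be post-processed or scored with any performance measure.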
In our projects we rely heavily on WebApps built with RapidMiner Server. They are the perfect connection between the data scientist and the end user, and are often decisive for project success. While they follow a very powerful idea, some of their features are only usable by experts, and if they are not set up properly, they may cause high load on the underlying infrastructure.
If you have ever been shouted at by your database admin for overloading the database with too many requests of the same query, well, our Cache operator will be your salvation. It will save a lot of valuable time for the end user, the data scientist and the troubled database admin. Once executed, the operator stores the results of its subprocess, and whenever it is executed again, it returns the cached result instead of executing the subprocess again. If required, you can also limit the validity of the cache entries by time or by macro values, or reset the cache explicitly. Furthermore, you can access the cache from other processes of your WebApp, provided that you have the necessary user rights to do so.
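The caching behaviour described above can be sketched in a few lines of Python. This is only an illustration of the concept, not the Cache operator itself: the class `SubprocessCache`, its method names and the use of a tuple of macro values as cache key are all hypothetical.

```python
import time


class SubprocessCache:
    """Hypothetical cache: stores a subprocess result per key, with optional expiry."""

    def __init__(self, ttl_seconds=None, clock=time.monotonic):
        self._ttl = ttl_seconds  # None means entries never expire
        self._clock = clock      # injectable clock, handy for testing
        self._entries = {}       # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, or run compute() and store its result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None:
            stored_at, value = entry
            if self._ttl is None or now - stored_at < self._ttl:
                return value  # still valid: skip the expensive subprocess
        value = compute()
        self._entries[key] = (now, value)
        return value

    def reset(self, key=None):
        """Drop one entry, or the whole cache when no key is given."""
        if key is None:
            self._entries.clear()
        else:
            self._entries.pop(key, None)


# Usage: count how often the "database query" really runs.
query_runs = []


def expensive_query():
    query_runs.append(1)
    return len(query_runs)


fake_now = [0.0]
cache = SubprocessCache(ttl_seconds=10, clock=lambda: fake_now[0])
first = cache.get_or_compute(("sales", "2024"), expensive_query)
second = cache.get_or_compute(("sales", "2024"), expensive_query)  # served from cache
```

Including the macro values in the key means that a changed macro value automatically leads to a fresh execution, while the time limit and `reset` cover the other two ways of invalidating an entry.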
Another very useful feature of the WebApps is the tree selection component. Unfortunately, the relational data format that this component needs is rare in real-world projects. You nearly always have to create it manually with a huge set of operators that obscure the real purpose of the process. And if you had an arbitrarily deep tree structure, you were in deep trouble. Until now.
We addressed this issue with a set of operators that work on a special tree-like Graph object. Dedicated operators convert data into trees and trees back into data, select parts of a tree, and so on. With that, it has become really easy to work with generic tree structures and the tree visualization component in a web app. And as you very often want to display parts of the repository, since they might contain previously generated results, there is a special operator that turns a part of the repository into a tree in just one step.
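As an illustration of converting between relational data and a tree, the following Python sketch builds a nested structure from parent/child rows and flattens it again. The row layout `(node_id, parent_id, label)` and the function names are assumptions for this example; they do not describe the extension's Graph object or the exact format the tree component expects.

```python
def build_tree(rows):
    """Build nested nodes from (node_id, parent_id, label) rows; parent_id None marks a root."""
    nodes = {
        node_id: {"id": node_id, "label": label, "children": []}
        for node_id, _, label in rows
    }
    roots = []
    for node_id, parent_id, _ in rows:
        if parent_id is None:
            roots.append(nodes[node_id])
        else:
            nodes[parent_id]["children"].append(nodes[node_id])
    return roots


def flatten_tree(roots):
    """Turn the tree back into (node_id, parent_id, label) rows in depth-first order."""
    rows = []
    # An explicit stack instead of recursion copes with arbitrarily deep trees.
    stack = [(root, None) for root in reversed(roots)]
    while stack:
        node, parent_id = stack.pop()
        rows.append((node["id"], parent_id, node["label"]))
        stack.extend((child, node["id"]) for child in reversed(node["children"]))
    return rows


relational = [
    (1, None, "results"),
    (2, 1, "2023"),
    (3, 1, "2024"),
    (4, 3, "model"),
]
tree = build_tree(relational)
round_trip = flatten_tree(tree)
```

Selecting a part of the tree then amounts to picking a node and passing it to `flatten_tree` as the only root.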
While automated testing is standard in modern software development projects, RapidMiner has relied on external solutions to make sure that processes keep working over time and across infrastructure changes and software updates.
While this obviously cannot be done for processes that depend on data that is always new, we can still do it for all the processes that are part of libraries and used in many projects. If you have followed the best practices of repository design, you will have stored all these processes in a dedicated folder, together with some example input and reference output for each process.
With our new operators you can now easily test whether all these processes still work as expected. As long as you can formulate a test in such a way that the result needs to be the same as before, you can use our new Assert Equality operator to check for that. And this works not only with data sets but with arbitrary RapidMiner objects.
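The testing pattern described here, running a process on stored example input and comparing the result with stored reference output, can be sketched as follows. The sketch is plain Python with hypothetical names (`assert_equality`, `run_regression_tests`, `normalize`); it shows the principle and is not the Assert Equality operator.

```python
def assert_equality(actual, expected, label):
    """Raise an AssertionError naming the test when the result has changed."""
    if actual != expected:
        raise AssertionError(f"{label}: expected {expected!r}, got {actual!r}")


def run_regression_tests(test_cases):
    """Run (name, process, example_input, reference_output) cases; return failed names."""
    failed = []
    for name, process, example_input, reference_output in test_cases:
        try:
            assert_equality(process(example_input), reference_output, name)
        except Exception:
            # A crash counts as a failure just like a changed result.
            failed.append(name)
    return failed


def normalize(values):
    """Hypothetical library process: scale values to the range 0..1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


test_cases = [
    ("normalize", normalize, [2, 4, 6], [0.0, 0.5, 1.0]),
    ("normalize constant input", normalize, [3, 3], [0.0, 0.0]),  # divides by zero
]
failed_tests = run_regression_tests(test_cases)
```

Because the comparison uses plain equality, the same check applies to any kind of result object that defines equality, not only to tables of data.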
There are common tasks in projects that simply need to be done, but RapidMiner supports them only in an elementary way. Again, it is mere convenience not to have to build a process to solve them, but if you are under time pressure, you will no doubt appreciate the additional functionality of our advanced data transformation operators. Sort can sort by any number of attributes, and the set operators use any number of attributes (and, in particular, attributes of any role: no more Set Role just for an intersection) to check for equality and will process any number of data sets.
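What sorting by several attributes and intersecting several data sets on chosen attributes means can be shown with a small Python sketch. The functions `sort_rows` and `intersect` and the dictionary-based rows are assumptions for illustration only and say nothing about how the operators are implemented.

```python
def sort_rows(rows, keys):
    """Sort dict rows by several attributes; keys is a list of (attribute, descending)."""
    result = list(rows)
    # Python's sort is stable, so applying the least significant key first
    # yields a correct multi-attribute ordering.
    for attribute, descending in reversed(keys):
        result.sort(key=lambda row: row[attribute], reverse=descending)
    return result


def intersect(datasets, attributes):
    """Keep rows of the first data set whose key attributes occur in all other data sets."""
    first, *others = datasets

    def key(row):
        return tuple(row[attribute] for attribute in attributes)

    key_sets = [{key(row) for row in dataset} for dataset in others]
    return [row for row in first if all(key(row) in keys for keys in key_sets)]


sales = [
    {"id": 1, "region": "N", "value": 10},
    {"id": 2, "region": "S", "value": 20},
    {"id": 3, "region": "N", "value": 30},
]
customers = [{"id": 1, "region": "N"}, {"id": 3, "region": "N"}, {"id": 2, "region": "N"}]
contracts = [{"id": 3, "region": "N"}, {"id": 1, "region": "N"}]

common = intersect([sales, customers, contracts], ["id", "region"])
ordered = sort_rows(sales, [("region", False), ("value", True)])
```

Here equality is decided on the combination of `id` and `region`, so row 2 drops out because its region differs in `customers`, and the sort orders by region ascending and then by value descending.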
1 RapidMiner is a registered trademark of RapidMiner Inc., 10 Fawcett Street, Cambridge MA 02138, United States of America