Statistics Extension for RapidMiner¹
RapidMiner¹ is one of the best tools for predictive modelling and the automation of data processing. The versatile platform covers all phases of a typical data science project: it lets you gather data from various sources, process it and store it in other target systems. The ability to derive predictive models and apply them during data processing enables you to make decisions automatically and act on them, potentially without any human interaction. If manual interaction is necessary to build trust or for regulatory reasons, the built-in Web Application framework lets you create specialized web applications for end users that hide the underlying complexity of predictive analytics. In many scenarios this reduces the manual workload, improves reaction times and lowers error rates.
However, this modern and sophisticated approach often needs to be teamed up with a more common and well-established one. If users need to be convinced of a new solution, especially one that embraces a completely new way of thinking, it shouldn't abandon the principles and experience they already have, but incorporate, use and complement them. In many situations, basic statistical functionality such as tests for distributions and other methods of classical statistics has been used to accomplish goals that can nowadays be accomplished much more easily with predictive analytics. RapidMiner by itself lacks the functionality to cover both, so a soft migration requires a complex combination of tools. Our extension closes this gap.
The Statistics Extension for RapidMiner provides a full set of new operators covering basic statistical functionality. The operators integrate neatly with any RapidMiner process, so that you can use tests and correlations as well as quantiles, histograms and cross tables in WebApps, Scheduled Processes and any other context. You no longer have to integrate external programs, which increases the maintenance effort and makes processes far too complex.
The full feature list is available on the right. In general, it covers every step: data preprocessing for typical statistical algorithms (for example, grouping the data), the algorithms themselves, and operators that transform their results into something that can be incorporated into reports and WebApps.
Of course, all operators can be used together with standard operators and use standard data tables as input and output. This allows easy integration into existing processes, and several operators, such as Split Data (by groups), come in handy even if you don't want to perform tests or similar analyses.
Areas of Application
As you can see on the right, the operators of this extension are grouped roughly into three categories: a set of operators that are simply tools to prepare data for the other operators or for further processing, a large set of operators for describing a data set, and another set for testing hypotheses on data.
The descriptive operators can be used to easily create a description of a data set, building a bridge from the raw data to the end user, who might view it in a Web Application. Creating a cross table has never been this simple, and thanks to the interactive components in RapidMiner's Web Apps, end users can select for themselves what is counted, creating a self-service description. In the same way, histograms and quantiles can be added where necessary.
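To make concrete what a cross table contains, here is a minimal Python sketch. It is not the extension's operator, only an illustration of the counting involved; the function name `cross_table`, the column names and the example records are all made up for this purpose.

```python
from collections import Counter


def cross_table(rows, row_key, col_key):
    """Count how often each (row value, column value) pair occurs."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_values = sorted({r[row_key] for r in rows})
    col_values = sorted({r[col_key] for r in rows})
    # Counter returns 0 for pairs that never occur, so empty cells are filled.
    return {rv: {cv: counts[(rv, cv)] for cv in col_values} for rv in row_values}


# Hypothetical example records (not taken from any real data set)
customers = [
    {"region": "north", "plan": "basic"},
    {"region": "north", "plan": "premium"},
    {"region": "south", "plan": "basic"},
    {"region": "north", "plan": "basic"},
]

table = cross_table(customers, "region", "plan")
for region, cells in table.items():
    print(region, cells)
```

In a self-service setting, `row_key` and `col_key` would be the two attributes the end user picks in the web application.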
But cross tables, quantiles and histograms are not only useful for presentation to the user. Together with the tests, they can be used in many automated scenarios, for example during data import. In our experience, investing in sanity checks BEFORE the actual import pays off, especially where manual steps are involved in the import. All operators for cross tables, quantiles and histograms deliver their results as standard data sets, so that you can base any subsequent similarity check on them. For example, you could compute the quantiles of each attribute of a reference data set that stores sensor data. Each day a scheduled process imports the newest data, but before doing so, it performs a sanity check by calculating the quantiles of the new data and comparing them with those of the reference data. If a sensor is broken and delivers wrong data, its quantiles will deviate heavily from the reference. The process could then abort the import of the defective data and notify someone of the problem by mail instead.
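The sanity check described above can be sketched in a few lines of Python. This is an illustration of the idea, not a RapidMiner process: the sensor readings, the tolerance of 1.0 and the function names are assumptions chosen for the example, and a real check would pick a tolerance suited to each attribute.

```python
from statistics import quantiles


def quartiles(values):
    """Return the three quartile cut points of a list of numbers."""
    return quantiles(values, n=4, method="inclusive")


def deviates(reference, new_batch, tolerance):
    """True if any quartile of the new batch is further than
    `tolerance` away from the same quartile of the reference data."""
    pairs = zip(quartiles(reference), quartiles(new_batch))
    return any(abs(ref_q - new_q) > tolerance for ref_q, new_q in pairs)


# Hypothetical temperature readings of one sensor
reference_readings = [20.1, 20.4, 19.8, 20.0, 20.3, 19.9, 20.2, 20.1]
healthy_batch = [20.0, 20.2, 19.9, 20.3, 20.1, 20.0]
broken_batch = [20.0, 85.0, 90.2, 88.7, 20.1, 91.5]


def import_allowed(batch, tolerance=1.0):
    """Decide whether the daily import may proceed (tolerance is an assumption)."""
    return not deviates(reference_readings, batch, tolerance)


for name, batch in [("healthy", healthy_batch), ("broken", broken_batch)]:
    action = "import" if import_allowed(batch) else "abort and notify"
    print(name, "->", action)
```

Comparing a handful of quantiles instead of single values keeps the check robust against an occasional outlier while still catching a sensor whose whole distribution has shifted.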
Another interesting application is evaluating the results of predictive modeling projects. In many situations it is very hard to find the right performance measure, one that reflects the real business value and takes the time axis into account. You will nearly always have to settle for an approximation, as the business value cannot be calculated exactly. (Especially not the future business value, which is a predictive analytics problem of its own.) And once you activate your predictive model and act on its recommendations, you change reality, so your original training data is no longer representative of the problem.
So in many situations you will opt for an A/B testing scenario, in which you treat customers differently to see, in reality and ex post, how the model performed. Here the extension's operators for survival and hazard curves can become incredibly useful for describing the two groups over time.
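As an illustration of how a survival curve describes two such groups over time, here is a minimal Python sketch. It uses the Kaplan-Meier estimator, a common choice for survival curves; this is an assumption, since nothing above states which estimator the extension's operators use. The function name and the churn data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] for each time with an event.

    durations: time until the event (e.g. churn) or until observation ended
    observed:  True if the event happened, False if the record is censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone whose duration reaches t is still at risk at time t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical months until churn for the two groups of an A/B test
group_a_months = [2, 3, 3, 5, 8, 8]
group_a_churned = [True, True, False, True, False, False]
group_b_months = [1, 2, 2, 4, 6, 8]
group_b_churned = [True, True, True, True, False, True]

curve_a = kaplan_meier(group_a_months, group_a_churned)
curve_b = kaplan_meier(group_b_months, group_b_churned)
for label, curve in [("A", curve_a), ("B", curve_b)]:
    print(label, [(t, round(s, 3)) for t, s in curve])
```

Customers who had not churned when observation ended are treated as censored rather than dropped, which is what lets the two groups be compared fairly even when they were observed for different lengths of time.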
| Users | 1 Year Subscription | 2 Year Subscription | Perpetual License | Maintenance |
|---|---|---|---|---|
| 1 Named User | 79 € | 139 € | 179 € | 34 € |
| 5 Named Users | 329 € | 569 € | 749 € | 149 € |
| Company License | 750 € | 1300 € | 1700 € | 340 € |
- Named User: A named user is simply an individual human being. If you have a license for one named user, this particular person is allowed to use the extension in RapidMiner Studio and Server across installations, as long as only this person has access to it. The person can create processes in Studio and put them on a Server to publish their results, but nobody without a license may modify processes that contain operators from this extension.
The person may not change over time unless you ask us to exchange the license key and receive our written approval. We will only grant this for good reasons, for example if the person has left your company.
- Company License: A company license allows everybody who is employed by your company to work with our extension. The extension may be installed on every RapidMiner Server within the company. For a consulting company, this means that your consultants may use the extension during projects with your customers, but may not install it on a customer's server.
- Yearly Subscription: A subscription covers license and maintenance, but is only valid for a certain period of time. You need to renew your subscription if you want to continue using the extension.
- Perpetual License: This license is not limited in time and allows you to use the program for as long as you want. It includes maintenance for one year. Once your maintenance contract has ended, you no longer receive product updates and remain on the last version published during your maintenance period.
- Maintenance: Maintenance covers both support and updates. Subscriptions always include maintenance, while perpetual licenses need a separate maintenance contract after the first year.
For the last year, we have been using the Statistics Extension to incorporate many standard statistical procedures [...] into several of our highly automated processes within RapidMiner. This has led to significant process efficiencies by enabling easier automation and eliminating the need to move data between multiple platforms or utilize other tools such as R that require higher levels of technical knowledge.
Brian Tvenstrup - Chief Analytics Officer - Modern Marketing Concepts Inc.
From the perspective of a medical researcher, the Statistics Extension for RapidMiner, as provided by Old World Computing, is exactly what we were waiting for! This extension enables us to stick to one application in the entire data analysis cycle, including the preprocessing statistics requested by journals. There is an open minded interest of the developers to improve the extension by listening to their customers.
Dr. Sven Van Poucke - Anesthesiologist and Emergency Physician - Department of Anesthesiology, Critical Care, Emergency Medicine and Pain Therapy, Ziekenhuis Oost-Limburg Genk, Belgium
We've been very pleased with OldWorldComputing and how they have incorporated our customer feedback into subsequent releases of their tools.
Brian Tvenstrup - Chief Analytics Officer - Modern Marketing Concepts Inc.
¹ RapidMiner is a registered trademark of RapidMiner Inc., 10 Fawcett Street, Cambridge MA 02138, United States of America