Statistics

RapidMiner is one of the best tools for predictive modelling and the automation of data processing. The versatile platform covers all phases of a typical data science project: it lets you gather data from various sources, process it and store it in other target systems. The ability to derive predictive models and apply them during data processing enables you to make decisions automatically and act on them, potentially without any human interaction. If manual interaction is necessary for trust building or regulatory reasons, the built-in Web Application framework lets you create specialized web applications for end users that hide the underlying complexity of predictive analytics. In many scenarios, this reduces the manual workload, improves reaction times and lowers error rates.

However, this modern and sophisticated approach often needs to be paired with a more common and well-established one. If users need to be convinced of a new solution, especially one that embraces a completely new way of thinking, it should not abandon the principles and experience those users already have, but incorporate, use and complement them. In many situations, basic statistical functionality such as distribution tests and other methods of classical statistics has been used to accomplish goals that can nowadays be accomplished much more easily with predictive analytics. RapidMiner by itself, however, does not cover both, so a soft migration requires a complex combination of tools. Fortunately, our statistics extension now provides a remedy.

FEATURES

Descriptive Statistics

  • Extract Cross Table
  • Discretize by Quantiles
  • Extract Quantiles
  • Extract Histogram
  • Correlation Matrix (Pearson)
  • Correlation Matrix (Kendall's tau-b)
  • Correlation Matrix (Spearman)
  • Covariance Matrix
  • Extract Odds Ratios
  • Extract Risk Ratios
  • Extract and Visualize Survival Curves (Kaplan-Meier, Fleming-Harrington)
  • Extract and Visualize Hazard Curves (Nelson-Aalen)

Tests

  • T-Test (against expectation)
  • T-Test
  • Mann-Whitney U-Test
  • Wilcoxon Signed Rank Test
  • One Way ANOVA Test
  • One Way ANOVA Test (Grouping)
  • G-Test
  • Chi Square Test
  • Kolmogorov-Smirnov Test

Tools

  • Matrix to ExampleSet
  • Split Data (by groups)

The Extension

The Statistics Extension for RapidMiner provides a completely new set of operators covering basic statistical functionality. The operators integrate neatly with any RapidMiner process, so you can use tests and correlations as well as quantiles, histograms and cross tables in WebApps, Scheduled Processes and any other context, without having to integrate external programs that increase the maintenance effort and make processes far too complex.

The full feature list is available above. In general, it covers every step: from data preprocessing for typical statistical algorithms (for example, grouping the data), through the algorithms themselves, to operators that transform their results into something that can be incorporated into reports and WebApps.

Of course, all operators can be used together with standard operators and use standard data tables as input and output. This allows easy integration into existing processes, and several operators, such as Split Data (by groups), come in handy even if you don't want to perform tests or similar tasks.

Areas of Application

As you can see above, the operators of this extension are roughly grouped into three categories: one set prepares data for the other operators or for further processing, a large set describes a data set, and another set tests hypotheses on data.

The descriptive operators make it easy to create a description of a data set, building a bridge from the raw data to the end user, who might view it in a Web Application. It has never been this simple to create a cross table, and thanks to the interactive components in RapidMiner's Web Apps, end users can select what is counted themselves, creating a self-service description. In the same way, histograms and quantiles can be added where necessary.
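To make the idea of a cross table concrete, here is a minimal sketch in plain Python. It is not the Extract Cross Table operator; the function name, the attribute names and the sample rows are hypothetical and only illustrate what such a table counts.

```python
from collections import Counter

def cross_table(rows, row_key, col_key):
    """Count how often each combination of two attribute values occurs."""
    counts = Counter((r[row_key], r[col_key]) for r in rows)
    row_values = sorted({rv for rv, _ in counts})
    col_values = sorted({cv for _, cv in counts})
    # Fill in 0 for combinations that never occur so the table is rectangular.
    return {
        rv: {cv: counts.get((rv, cv), 0) for cv in col_values}
        for rv in row_values
    }

# Hypothetical example data: one row per customer.
rows = [
    {"region": "north", "churned": "yes"},
    {"region": "north", "churned": "no"},
    {"region": "south", "churned": "no"},
    {"region": "north", "churned": "no"},
]
table = cross_table(rows, "region", "churned")
for region, counts in table.items():
    print(region, counts)
```

In an interactive setting, the two attribute names would be the values the end user selects.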

But cross tables, quantiles and histograms are not only useful for presenting data to the user. Together with the tests, they work very well in many automated scenarios, for example during data import. Especially where manual steps are involved in the data import, an investment in sanity checks before the actual import is, in our experience, well spent. All operators for cross tables, quantiles and histograms deliver their results as standard data sets, so you can base any subsequent similarity check on them. For example, you could compute the quantiles of each attribute of a reference data set that stores sensor data. Each day a scheduled process imports the newest data, but before doing so it performs a sanity check by calculating the quantiles of the new data and comparing them with those of the reference data. If a sensor is broken and now delivers wrong data, these quantiles will deviate heavily from the reference. The process could then abort the import of the flawed data and instead inform someone of the problem via email.
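The sensor check described above can be sketched in a few lines of plain Python. This is only an illustration of the logic, not a RapidMiner process: the function names, the tolerance and the sample readings are assumptions, and a real process would use the Extract Quantiles operator and a threshold chosen for the data at hand.

```python
from statistics import quantiles

def quantile_drift(reference, incoming, n=4):
    """Largest absolute difference between matching quantile cut points."""
    ref_q = quantiles(reference, n=n, method="inclusive")
    new_q = quantiles(incoming, n=n, method="inclusive")
    return max(abs(r - x) for r, x in zip(ref_q, new_q))

def passes_sanity_check(reference, incoming, tolerance, n=4):
    """True if the incoming quantiles stay within the tolerance of the reference."""
    return quantile_drift(reference, incoming, n) <= tolerance

# Hypothetical temperature readings from one sensor.
reference = [20.1, 20.4, 19.8, 20.0, 20.3, 19.9, 20.2, 20.5]
healthy = [20.0, 20.2, 19.9, 20.4, 20.1, 20.3, 19.7, 20.6]
broken = [20.0, 20.2, 0.0, 0.0, 0.0, 20.3, 0.0, 0.0]  # sensor mostly reports 0

for name, batch in [("healthy", healthy), ("broken", broken)]:
    if passes_sanity_check(reference, batch, tolerance=0.5):
        print(name, "-> import")
    else:
        print(name, "-> abort import and notify someone")
```

Comparing several quantiles rather than only the mean makes the check sensitive to changes in the shape of the distribution, not just its centre.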

Another interesting application is evaluating the results of predictive modelling projects. In many situations, it is very hard to find the right performance measure, one that reflects the real business value and takes the time axis into account. You will nearly always need to settle for an approximation, as the business value cannot be calculated exactly. (This is especially true for the future business value, which is a predictive analytics problem in its own right.) And once you put your predictive model into production and act on its recommendations, you will always face the problem that this changes reality, so your original training data is no longer representative of the problem.

So in many situations you will opt for an A/B testing scenario, where you treat customers differently to see ex post how the model performed in reality. Here the Survival and Hazard Curve operators of the extension can be incredibly useful for describing the two groups over time.
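As a rough illustration of what a survival curve for such a comparison contains, the following sketch computes the Kaplan-Meier estimate in plain Python. It is not the extension's operator; the function name and the two customer groups are hypothetical, with durations in months until churn or until the end of observation.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] for each time with at least one event.

    durations: time until the event or until censoring, per subject
    observed:  True if the event happened, False if the subject was censored
    """
    events, exits = {}, {}
    for t, e in zip(durations, observed):
        exits[t] = exits.get(t, 0) + 1        # everyone leaves the risk set at t
        if e:
            events[t] = events.get(t, 0) + 1  # only real events reduce survival
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical A/B groups: months until churn (True) or still a customer (False).
group_a = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
group_b = kaplan_meier([4, 6, 7, 8, 8], [True, False, True, False, False])
print("A:", group_a)
print("B:", group_b)
```

Plotting both curves over the same time axis shows whether the group treated according to the model retains customers longer than the other group.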

Pricing

Users            1-year subscription  2-year subscription  Perpetual License  Maintenance
1 named user     79.00€ *             139.00€ *            179.00€ *          36.00€ *
5 named users    329.00€ *            569.00€ *            749.00€ *          150.00€ *
Company license  750.00€ *            1,300.00€ *          1,700.00€ *        340.00€ *

* plus VAT