RapidMiner is a golden hammer ...
... but our extension will turn it into a golden jackhammer! It provides several new operators and interface enhancements that facilitate the handling of complex processes in large-scale projects. The extension was designed for scenarios that require a high degree of automation with very complex data structures. It improves overall efficiency and performance.
To this end, it contains operators that
- ease project management and enable fully automated testing and deployment scenarios in production environments with many servers,
- let high-load or low-latency scenarios benefit from the caching mechanisms now available for any part of a process,
- add indexed collections to make processes more generic in situations where you have to perform the same task for thousands of units, users, or machines,
- allow improved file handling even on remote locations over various protocols such as ftp, sftp, ftps ...,
- directly implement commonly needed functions that would otherwise require a combination of many core operators.
... now turn it into a golden jackhammer!
If you are an experienced user of RapidMiner, chances are you have already noticed RapidMiner is a golden hammer. Whatever kind of data science problem you are facing, it's likely there is a way to solve it with RapidMiner. Often enough, the harsh reality in projects shows that a golden hammer is needed nearly everywhere. The only alternative would be to use hundreds of different tools, making your tool belt so heavy you will be limited to crouching through your project.
For especially large construction sites, many hammers are required – or just one golden jackhammer! Many projects nowadays involve an amount of data that does not yet justify the usage and overhead of real big data technologies, but still requires some special treatment to be executed on commodity hardware. Our extension will turn your RapidMiner installation into an easy-to-use golden jackhammer, hammering away a lot of your daily grievances.
Free Trial Period
Below you will find an extensive description of the features and a list of the operators contained in the Jackhammer Extension.
The operators added with version 2.3 are described in further detail here.
Our caching function can be useful in many situations, from development to deployment. Let's imagine you have external resources that take long to load such as large databases being queried by your process. If the server of that database is simultaneously used in other projects, it can take rather long to load the data. Especially during the modelling phase of a project, you will need to load the data very often even though it does not change at all. With the new cache operators, you can simply put the data loading and preprocessing steps into a subprocess. The cache operator will store the results of the subprocess in memory and, if necessary, on a local disk. If the process is re-executed, the cache operator will check whether the cache is still valid or the generating process has been changed. In case changes have been made or a freely selectable time period has expired, the subprocess will be executed again. Otherwise, the cached version is returned immediately and the process can continue without time-consuming computations.
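The validity check described above can be illustrated with a minimal Python sketch. It is not the extension's implementation: the class `SubprocessCache`, the `definition` string standing in for the generating subprocess, and the `max_age_seconds` parameter are assumptions made for illustration, and the sketch keeps results in memory only.

```python
import hashlib
import time
from typing import Any, Callable, Dict, Optional, Tuple


class SubprocessCache:
    """Hypothetical in-memory cache: recompute when the step changed or expired."""

    def __init__(self, max_age_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_age = max_age_seconds  # None means entries never expire
        self._clock = clock              # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[str, float, Any]] = {}

    def get_or_compute(self, name: str, definition: str,
                       compute: Callable[[], Any]) -> Any:
        # Fingerprint the text describing the cached step; a changed
        # definition yields a different fingerprint and invalidates the entry.
        fingerprint = hashlib.sha256(definition.encode("utf-8")).hexdigest()
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None:
            cached_fingerprint, stored_at, value = entry
            unchanged = cached_fingerprint == fingerprint
            fresh = self._max_age is None or (now - stored_at) <= self._max_age
            if unchanged and fresh:
                return value  # cache hit: skip the expensive step
        value = compute()     # cache miss: run the step and remember the result
        self._entries[name] = (fingerprint, now, value)
        return value
```

Calling `get_or_compute` twice with the same name and definition runs the expensive function only once; changing the definition text or letting the time limit pass triggers a fresh run.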
Caching subprocess with slow database query, filtering with external data and preprocessing. Runtime of 3 minutes and 42 seconds reduced to 14 seconds.
While this allows you to concentrate on modelling by eliminating unnecessary waiting times, caching becomes even more valuable in a deployment scenario. Let's assume that a process monitoring machines runs two queries against different databases. One database contains the new signals being sent from the machines, while the other contains stable information about each particular machine. You can avoid the additional processing time of the second query by putting it into a cache. The cache can be parameterized to be sensitive to process variables (macros), so that it stores different results for different machines. Even if the processing time is not important to you personally, the database load is probably still important to your admin.
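To make the idea of a macro-sensitive cache concrete, here is a small Python sketch. The names `MacroSensitiveCache`, `macro_key`, and the macro `machine_id` are hypothetical and chosen for illustration; the sketch only shows how selected process variables can become part of the cache key.

```python
from typing import Any, Callable, Dict, Iterable, Tuple

CacheKey = Tuple[Tuple[str, str], ...]


def macro_key(macros: Dict[str, str], sensitive_to: Iterable[str]) -> CacheKey:
    """Build a cache key from only those macros the cache is sensitive to."""
    return tuple((name, macros[name]) for name in sorted(sensitive_to))


class MacroSensitiveCache:
    """Hypothetical cache that keeps one result per combination of macro values."""

    def __init__(self, sensitive_to: Iterable[str]) -> None:
        self._sensitive_to = tuple(sensitive_to)
        self._store: Dict[CacheKey, Any] = {}

    def get_or_compute(self, macros: Dict[str, str],
                       compute: Callable[[Dict[str, str]], Any]) -> Any:
        key = macro_key(macros, self._sensitive_to)
        if key not in self._store:
            # First request for this machine: run the slow query once.
            self._store[key] = compute(macros)
        return self._store[key]
```

With `sensitive_to=["machine_id"]`, repeated runs for the same machine reuse the stored result, while a different machine ID produces its own entry. Macros that are not listed do not affect the key.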
Please also have a look at our blog for a series of tutorials on the use of the caching operators.
The Jackhammer Extension contains several additional operators for flexible loops. In contrast to the loops contained in RapidMiner, these do not suffer from compatibility problems with older versions. They not only share the same design principles, i.e. have the same ports, but also come with all the ports needed to avoid unnecessarily complex process design. The flexible loops can be configured to skip particular iterations, or to end the process early once a certain condition has been fulfilled. All loops support parallel execution, so that especially large data sets can be processed quickly. Additionally, batch processing allows larger quantities of data to be processed in local memory without having to fall back on external computational capacities.
All our loop operators have external input ports. Everything that arrives here is channeled to the inner input ports and is available as a copy in every iteration. If an object is to be worked on repeatedly within one loop, it can be forwarded to one of the new loop output ports. The object is then available at a loop input port for further processing in the next iteration. As an initialization is usually necessary, for example with an empty data set, the loop input ports use the objects connected to the outer input ports in the first iteration. The final results at the inner loop output ports are forwarded to the outside. The results at the regular inner output ports, on the other hand, are collected across all iterations and made available outside as a collection.
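The port semantics can be mirrored in a few lines of Python. This is only an analogy under stated assumptions: `run_loop`, `external_inputs`, `initial_state`, and `body` are hypothetical names, with `initial_state` playing the role of the object connected to the outer input port and `body` playing the role of the inner subprocess.

```python
import copy
from typing import Any, Callable, Dict, List, Tuple

# body(iteration, inputs, state) -> (new_state, per_iteration_result)
LoopBody = Callable[[int, Dict[str, Any], Any], Tuple[Any, Any]]


def run_loop(iterations: int, external_inputs: Dict[str, Any],
             initial_state: Any, body: LoopBody) -> Tuple[Any, List[Any]]:
    """Run `body` repeatedly, carrying state forward and collecting results."""
    state = initial_state  # used by the loop input in the first iteration
    collected: List[Any] = []
    for iteration in range(iterations):
        # Every iteration receives its own copy of the external inputs.
        inputs = copy.deepcopy(external_inputs)
        state, result = body(iteration, inputs, state)
        collected.append(result)  # regular outputs end up in a collection
    # Only the last loop state leaves the loop, next to the collection.
    return state, collected
```

For example, a body that adds `inputs["step"]` to the state and reports the previous state yields a final state of 6 and the collection `[0, 2, 4]` after three iterations with a step of 2 and an initial state of 0.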
Inner and outer ports of one of the loop operators.
Further, some loops have additional ports for specific purposes. For instance, the Loop Files (Advanced) Operator can receive a data set with file paths. The operator will then iterate only over the file paths, instead of searching a path completely or with filters. All advanced operator versions contain the basic functions of RapidMiner and add a few features, as described above:
- Loop (Advanced): equivalent to the core operator, but with improved ports.
- Loop Repository (Advanced): equivalent to Core Loop Repository, but enables access to the processes of the repositories (and thus, enables meta programming).
- Loop Files (Advanced): equivalent to Core Loop Files, but allows for definition of concrete file paths in an attribute (see above).
- Loop Batches: processes a given data set as batches of a specified size. Allows full parallelization of data preprocessing steps or successive execution of these on parts of the data, if the amount of data processed simultaneously is limited by memory capacity.
- Loop Groups: processes a data set group-wise. A group can be defined by one or more attributes. The values identifying a group can be automatically extracted as a process variable.
- Loop Index: processes every entry of an Indexed Collection of objects (see below). The respective current index is provided as a process variable.
- Loop Remote Files: equivalent to Loop Files (Advanced), but uses the network protocols FTP, SFTP, or FTPS to access the data.
- Read CSV (Batchwise): iteratively reads a specified number of rows from a CSV file and then continues with the following rows. This allows the processing of CSV files irrespective of their size, e.g. when importing a database. Parallel execution is available here as well, if preprocessing is time-consuming.
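The batch-wise reading idea from the list above can be sketched in plain Python with the standard `csv` module. The function name `read_csv_batches` and the `batch_size` parameter are assumptions for illustration and say nothing about how the operator itself is implemented.

```python
import csv
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def read_csv_batches(handle: TextIO, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield the rows of a CSV file in lists of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    reader = csv.DictReader(handle)  # the first line is read as the header
    while True:
        # islice pulls only the next batch_size rows, so the whole file
        # never has to be held in memory at once.
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch
```

Each yielded batch can be preprocessed and written away before the next one is read, which is what keeps memory usage bounded regardless of file size.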
RapidMiner Collections are very useful if a certain task is to be carried out repeatedly. In a control system, for example, we might monitor hundreds of machines in dozens of factories; in a churn application, we might look at thousands of customers.
It is standard procedure to loop over the entities we are analyzing, as it is usually unmanageable to create one process for each entity. Usually, you either store the results for each entity in RapidMiner's repository or, if you output data, you append the resulting collection into one table. Both approaches have drawbacks in very large settings with thousands of entities: If you store too many entries in the repository, it will become slow, and management as well as deployment becomes painful due to a large per-entry overhead. If you combine the data in one large table and just want to access the data for a particular entity, you need to load and filter everything, which is quite inefficient.
Indexed Model for different items in different stores. Included in the Indexed Model is a model with preprocessing and linear regression for every combination of store and item.
Our indexed collections solve both problems: they combine the collection of data with fast availability through the allocation of indices. As with the standard collections, you can simply collect all objects in them, storing only one object in the repository and thus keeping it manageable and fast. Because indices are assigned to the stored data, entries are quickly retrievable and do not have to be loaded completely and then filtered. If the units to be analyzed have identifiers such as factory and machine ID or customer number, you simply get the entry for these values. If loading is eliminated with a Cache operator, access is virtually instantaneous.
The extension also contains a special operator to loop over these collections. The respective index is made available as a process variable, which can be nicely used to report on model performance for every single entity. Furthermore, there are operators to access single objects via an index, which can be useful for the creation of web services or web app deployment.
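Conceptually, an indexed collection behaves like a dictionary keyed by the identifying values. The following Python sketch illustrates that idea only; `IndexedCollection`, `put`, `select`, and `items` are hypothetical names and not the extension's API.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple


class IndexedCollection:
    """Hypothetical collection whose entries are addressed by named key values."""

    def __init__(self, key_names: Iterable[str]) -> None:
        self.key_names = tuple(key_names)
        self._entries: Dict[Tuple[Any, ...], Any] = {}

    def _to_key(self, keys: Dict[str, Any]) -> Tuple[Any, ...]:
        if set(keys) != set(self.key_names):
            raise KeyError(f"expected exactly the keys {self.key_names}")
        return tuple(keys[name] for name in self.key_names)

    def put(self, obj: Any, **keys: Any) -> None:
        self._entries[self._to_key(keys)] = obj

    def select(self, **keys: Any) -> Any:
        # Direct lookup: nothing has to be loaded completely and filtered.
        return self._entries[self._to_key(keys)]

    def items(self) -> Iterator[Tuple[Dict[str, Any], Any]]:
        # Looping exposes the index of each entry next to the object itself.
        for key, obj in self._entries.items():
            yield dict(zip(self.key_names, key)), obj
```

A collection created with `key_names=["store", "item"]` returns the object for one store and item via `select(store=..., item=...)`, and `items()` hands out the index values of each entry, similar to how a loop would expose them as process variables.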
RapidMiner allows you to organize your projects into repositories that you can structure as you like. We at Old World Computing have developed a standardized best practice for repository design as we believe that avoiding unnecessary variability helps focusing on the real problems. Of course, collaboration with colleagues is also much easier when there is a uniform project structure.
Another advantage of this standardized layout is that you can automate several important tasks:
- Automated storage of results – results can be stored in subfolders named after the processes without having to enter the path every time.
- Continuous integration tests for all processes to ensure compatibility across RapidMiner versions or infrastructure updates
- Deployment from a central development server to a star-shaped staging infrastructure possibly with different servers per project.
This extension contains the necessary functions to do exactly these tasks. Especially in large-scale projects, in which a whole team of data scientists is working together, ensuring the reusability of processes as well as monitoring them is paramount. Therefore, the extension also contains specific operators
- providing extended functionalities for handling exceptions. If parts of the infrastructure become temporarily unreachable, the process can be continued at a later point instead of being cancelled.
- that end a process without throwing an error or enable skipping of a specific iteration of a loop.
- for convenient debugging of library processes when errors occur.
- for the definition of a distinct order of execution of processes without introducing an unnecessary subprocess level.
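The first two points, continuing after a temporary outage and skipping a single iteration, follow a pattern that can be sketched in Python. The names `retry_later`, `loop_with_skip`, and `SkipIteration` are hypothetical, and treating `ConnectionError` as the "temporarily unreachable" case is an assumption made for this illustration.

```python
import time
from typing import Callable, Iterable, List, Tuple, Type, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class SkipIteration(Exception):
    """Raised inside a loop body to skip the current item without failing."""


def retry_later(action: Callable[[], T], attempts: int = 3,
                delay_seconds: float = 1.0,
                retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
                sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `action`, waiting and retrying while the resource is unreachable."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # give up only after the last attempt
            sleep(delay_seconds)
    raise AssertionError("unreachable")


def loop_with_skip(items: Iterable[T], body: Callable[[T], R]) -> List[R]:
    """Apply `body` to each item; SkipIteration drops only that item."""
    results: List[R] = []
    for item in items:
        try:
            results.append(body(item))
        except SkipIteration:
            continue
    return results
```

The injectable `sleep` argument is a design choice for testability: a test can pass a list's `append` method instead of really waiting.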
Remote File Handling
RapidMiner already contains operators to handle files. However, in many situations data is distributed across multiple systems. External data providers in particular tend to make files available on servers using FTPS or SFTP. Our extension contains a set of operators that let you access and manage files over various remote file protocols.
Once you have used this feature, you'll never want to go without it again: our functionality for moving operators. In order to make space for one or more operators to be added later on, it moves all operators and notes without changing the process design.
Before using Shift Operators.
After using Shift Operators.
Of course, there are even more operators included in the extension, all of which make life easier in many situations. For instance, in order to generate an empty data set, it had been necessary up to now to first generate random data, only to then delete it again. With Generate Empty Data, we have created an operator for this specific purpose. Generate Sequence renders the misuse of Generate ID to this end unnecessary. Below is a list of all included operators. If you have further questions regarding the extension and its functionalities, please contact us.
List of Operators
- Extract Macro (Advanced)
- Extract Macros from Example
- Extract Macro from Collection
- Extract Macro from Performance
- Execute Process
- Stop Process (Graciously)
- Handle Exception (Advanced)
- Determine Order
- Retrieve Cache
- Clear Cache
- Combine Indexed Objects
- Extend Indexed Object
- Select by Key
- Loop (Advanced)
- Loop Batches (Advanced)
- Loop Groups (Advanced)
- Loop Index
- Loop Remote Files
- Loop Files (Advanced)
- Loop Repository (Advanced)
- Loop Control – Skip Iteration
- Loop Control – Break Iteration
- Time – Loop Time Windows
- Time – Loop Time Window Intervals
- Cross Validation (Advanced)
- Split Validation (Advanced)
- Sliding Window Validation (Advanced)
- Assert Equality
- Validity – Declare Valid Values
- Read Matlab
- Store Result
- Read Object
- Open Process
- Open Remote File
- Write Remote File
- Delete Remote File
- Move Remote File
- Loops – Read CSV (Batchwise)
- Compression – Read GZIP File
- Compression – Write GZIP File
- Cryptography – Encrypt File
- Cryptography – Decrypt File
- Cryptography – Extract Hash
- Generate Sequence
- Generate Sequence Data
- Generate Description Data
- Generate Group Sequences
- Generate Group Indices
- Generate Empty Data
- Generate Data from Macros
- Generate Data from Expressions
- Indexed Model
- Combine Indexed Models
- Extend Indexed Model
- Discretize by Specification Data
- Sort (Advanced)
- Union (Advanced)
- Set Minus (Advanced)
- Intersect (Advanced)
- Lag (Advanced)
- Resample Series
- Resample Multiple Series
- Aggregate Windows
- Aggregate Time Windows
- Define Windows
- Extract Tree from Repository
- Tree to Data (relational)
- Data to Tree (relational)
- Data to Tree (wide)
- Tree to Data (wide)
- Select Descendants
- Select Ancestors
- Rename (Advanced)
- Nominal to Date (advanced)
- Nominal to Polynominal
- Make Names Great Again
- Create Features
- Generation – Generate Concatenation (Advanced)
- Generation – Generate Hash
| No. of users | 1-year subscription | 2-year subscription | Perpetual license | Maintenance |
| --- | --- | --- | --- | --- |
| 1 named user | 249.00€ * | 439.00€ * | 569.00€ * | 114.00€ * |
| 5 named users | 1,029.00€ * | 1,799.00€ * | 2,319.00€ * | 464.00€ * |
| Company license | 2,369.00€ * | 4,139.00€ * | 5,329.00€ * | 1,066.00€ * |
* excl. VAT
Questions, feedback, or ideas regarding our extensions?
Please contact us.