RapidMiner is a golden hammer ...
... and our extension will turn it into a golden jackhammer! It provides several new operators and interface enhancements for experienced users in real-world, possibly large-scale projects. It is built for scenarios that require an even higher degree of automation, or automation involving more complex data structures, and it improves overall effectiveness and performance.
For this, it contains operators that
- ease project management and enable fully automated testing and deployment scenarios in production environments, especially with many servers,
- let high-load or low-latency scenarios benefit from the caching mechanisms now available for any part of a process,
- add indexed collections to make processes more generic in situations where you have to perform the same task for thousands of units/users/machines,
- allow improved file handling even on remote locations over various protocols such as ftp, sftp, ftps ...,
- directly implement commonly needed functions that would otherwise require a combination of many core operators.
... now turn it into a golden jackhammer!
If you are an experienced user of RapidMiner, chances are you have already noticed that RapidMiner is a golden hammer. Whatever kind of Data Science problem you are facing, the probability is high that you can come up with a possible solution in RapidMiner off the top of your head. And harsh project reality shows often enough that a golden hammer is needed nearly everywhere. The only alternative would be to use hundreds of different tools, making your tool belt so heavy that you would be reduced to crawling through your project.
And sometimes you need not just one golden hammer, but many of them. Many standard projects nowadays involve an amount of data that does not yet justify the use and overhead of real Big Data technologies, but still requires some special treatment to be processed on commodity hardware. Our extension will turn your RapidMiner installation into an easy-to-use golden jackhammer, hammering away many of your daily grievances.
RapidMiner allows you to organize your projects into repositories that you can structure as you like. We at Old World Computing have developed a standardized best practice for repository design, as we believe that avoiding unnecessary diversity helps you focus on the real problems. Of course, handing over a standardized project structure to a colleague is also much easier ...
Another advantage of this standardized layout is that you can automate several important tasks:
- Automated storage of results without mixing up the results of different processes or entering paths and macros every time
- Continuous integration tests for all processes to ensure compatibility across RapidMiner versions or infrastructure updates
- Deployment from a central development server to a star-shaped staging infrastructure, possibly with different servers per project
This extension contains the necessary functions to do exactly this. And as the reusability and control of processes are extremely important in large-scale projects, the extension also contains specific operators to handle
- exceptional situations with better control than the built-in functions. It supports delayed retries if parts of an infrastructure may be temporarily unreachable.
- situations where you need to either stop a process entirely or just skip a specific loop iteration.
- convenient debugging of library processes when errors occur.
- undetermined execution order within processes without having to introduce an extra subprocess level.
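To make the idea of delayed retries more tangible, here is a minimal Python sketch. It is not how the extension is implemented (its operators are configured inside a RapidMiner process); the function name `run_with_retries`, its parameters, and the choice of `ConnectionError` as the "temporarily unreachable" signal are assumptions made purely for illustration.

```python
import time


def run_with_retries(task, max_attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call task() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: it only illustrates the concept of a delayed retry.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except ConnectionError as error:  # assumed to mean "temporarily unreachable"
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before the next attempt
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

Wrapping a database query or file download in such a helper means a short outage costs a few seconds of waiting instead of a failed run. Any other kind of error is passed on immediately, because retrying would not help.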
Our caching function can be useful in many situations, from development to deployment. Let's imagine you have external resources that take long to load such as large databases being queried by your process. Especially during the modelling phase of a project, you will need to load the data very often even though it does not change at all. With the new operators, you can simply put a cache operator around the part of the process loading and transforming the data. The cache operator will store the results of the subprocess in memory and, if necessary, on a local disk. If the process is re-executed, the cache operator will check if the cache is still valid or if the generating process has been changed. In case it is still valid and not too old, the cached version is returned immediately and the process continues.
While this allows you to concentrate on the modelling by eliminating unnecessary waiting times, it becomes even more valuable in a deployment scenario. Let's assume that a process monitoring some machines runs two queries against different databases. One database contains the new signals being sent from the machines, while the other one contains static information about each particular machine. You can avoid the additional processing time for the second query by putting it into a cache. The cache can be parameterized to be sensitive to process variables, so that it will store different results for different machines. If the processing time is not important to you, the database load is probably important to your admin.
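The caching behaviour described above can be sketched in a few lines of Python. This is a conceptual illustration under our own assumptions, not the extension's code: the class `SubprocessCache`, its method names, and the use of a hash over the step definition are hypothetical, and the sketch keeps everything in memory, whereas the real operator can also fall back to a local disk.

```python
import hashlib
import time


class SubprocessCache:
    """In-memory cache keyed by the generating step and selected variables."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, result)

    @staticmethod
    def _key(definition, variables):
        # A changed definition or a different variable value gives a new key,
        # so stale or foreign results are never returned.
        payload = repr((definition, sorted(variables.items())))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, definition, variables, compute):
        key = self._key(definition, variables)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] <= self.max_age_seconds:
            return entry[1]  # still valid and not too old
        result = compute()  # the expensive part, e.g. a database query
        self._entries[key] = (now, result)
        return result
```

Calling `get_or_compute("SELECT ...", {"machine": "A"}, load_machine_data)` twice within the maximum age runs the hypothetical `load_machine_data` only once. A different machine, a changed query text, or an expired entry triggers a fresh load.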
And of course, in a web service deployment that can be called thousands of times per minute, you want to eliminate any access to the repository and hence to the underlying database. If you have a scoring process with a large model like a k-NN or SVM, you save a lot of time and database load by placing a single cache operator. As the cache is time-restricted, any update of the model will be reflected after a limited time.
But if you are creating a web application with RapidMiner Server, caching becomes invaluable for improving responsiveness. For that particular use case, it is possible to access a cache filled in one process from other processes as well, while ensuring that user rights are respected.
Collections in RapidMiner are very useful in situations where you repeat a particular task several times. In a monitoring scenario we might have hundreds of machines in dozens of factories, in a churn-related application we might have thousands of customers, ...
It is standard procedure to loop over the entities we are analyzing, as it is usually unmanageable to create one process for each entity. Usually, you either store the results for each entity in RapidMiner's repository or, if you output data, you append the resulting collection into one large table. Both approaches have drawbacks in very large settings with thousands of entities: If you store too many entries in the repository, it will become slow, and management and deployment become a nightmare due to the large per-entry overhead. If you combine the data in one large table and just want to access the data for a particular entity, you need to load and filter everything, which is quite inefficient.
Our indexed collections solve both problems: as with the standard collections, you can simply collect all objects in them, but you need to assign one or more keys to each one. These key values can later be used to retrieve the data again. So if your entities have identifiers like factory and machine ID or customer number, you simply get the entry for these values. This way you only need to store one object, which is much more efficient. If loading is eliminated with a Cache operator, access is instantaneous and you do not need to filter again.
The extension also contains a special operator to loop over these collections with the keys being made available as process variables. This can be nicely used to report on model performance for every single entity.
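For readers who think in code, the concept of an indexed collection can be sketched as follows in Python. The class `IndexedCollection` and its methods are hypothetical names chosen for this illustration; the extension itself exposes the functionality as RapidMiner operators, not as a Python API.

```python
class IndexedCollection:
    """Stores objects under a tuple of key values, e.g. (factory, machine_id)."""

    def __init__(self, key_names):
        self.key_names = tuple(key_names)
        self._items = {}

    def _key_tuple(self, keys):
        # Require exactly the declared keys so that lookups stay unambiguous.
        if set(keys) != set(self.key_names):
            raise KeyError(f"expected keys {self.key_names}, got {tuple(keys)}")
        return tuple(keys[name] for name in self.key_names)

    def put(self, obj, **keys):
        self._items[self._key_tuple(keys)] = obj

    def get(self, **keys):
        # Direct lookup: no loading and filtering of one large table.
        return self._items[self._key_tuple(keys)]

    def items(self):
        """Yield (keys_as_dict, obj) pairs, similar to a loop that exposes
        the keys as process variables."""
        for key_tuple, obj in self._items.items():
            yield dict(zip(self.key_names, key_tuple)), obj
```

A collection created with `IndexedCollection(["factory", "machine_id"])` can then be filled via `put(model, factory="Berlin", machine_id=7)` and queried via `get(factory="Berlin", machine_id=7)`. Iterating over `items()` yields the keys together with each object, which is what you need for a per-entity performance report.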
Remote File Handling
RapidMiner already contains operators to handle files. However, in many situations data is distributed across multiple systems. External data providers in particular tend to make files available on servers using ftps or sftp. Our extension contains a set of operators that let you access and manage files over various remote file protocols.
Ease of Use
The key feature you will never want to give up again is our Shift Operators function. It is small, but once you get used to it, you will never want to live without it: It lets you make space for a set of new operators within your process design, shifting the existing operators and notes according to our process design best practice.
Of course, there are also multiple operators included that make life easier in many situations. A very common pattern is to generate data randomly and then discard all generated data just to get an empty example set. We condensed that into a dedicated operator. We did the same for the common misuse of Generate ID to create a sequence of numbers.
But the most prominent example is the use of process variables in loops: Instead of looping over rows in an example set, then filtering the example set to the specific row and then extracting a set of process variables from it, you can either extract process variables from an example automatically or even make all the filtering and macro extraction automatic by applying our operators and best practices for naming.
If you are interested in testing our extension, please feel free to download it from the marketplace. To access the full functionality, please contact us for a demo key.
And of course we would be extremely happy if you liked the functions and wanted to order a license. Below you will find the prices:
| Users | 1 Year Subscription | 2 Year Subscription | Perpetual License |
| 1 Named User | € 249 | € 439 | € 569 |
| 5 Named Users | € 1029 | € 1799 | € 2319 |
| Company License | € 2369 | € 4139 | € 5329 |
- Named User: A named user is simply an individual human being. That means if you have a license for one named user, this particular person is allowed to use the extension in RapidMiner Studio and Server across installations as long as only they have access to it. That means this person can create processes in Studio and put the processes on a Server to publish their results, but nobody without a license may modify the processes containing operators from this extension.
The person may not change over time, unless you ask for exchanging the license key and get written approval from us. We will only do so for good reason, for example if the person has left your company or the like.
- Company License: A company license allows everybody who is employed by your company to work with our extension. The extension may be installed on every RapidMiner Server within the company. That means that, as a consulting company, your consultants may use the extension during projects with your customers, but may not install the extension on a customer's server.
- Yearly Subscription: A subscription covers license and maintenance, but is only valid for a certain period of time. That means you need to renew your subscription if you want to continue using this extension.
- Perpetual License: This is a license that is not limited in time and will allow you to use the program as long as you want. It includes maintenance for one year. After that, if you do not have an additional maintenance contract, you will no longer be able to install the latest product updates and will remain on the last version published during your maintenance period.
- Maintenance: Maintenance covers both support and product updates. Subscriptions always include maintenance, while perpetual licenses need a separate maintenance contract after the first year.