Part 1: The Cache Operator
Introduction
Today, we would like to present a feature of the versatile Jackhammer Extension for RapidMiner and describe in more detail how to use the caching operators provided by the extension in order to speed up your processes and significantly reduce overhead.
The caching function is useful for any process that suffers from a long run time due to repeated data retrieval. Repeated retrieval is not only a nuisance when designing and testing a process; constant requests also put more stress on a database than is necessary and affect other users and applications on the same database. Most importantly, caching can greatly improve response times for web services, lower resource utilization under high request volumes, and reduce reaction times when deploying a web application with RapidMiner Server.
The Jackhammer Extension comes with more than sixty operators, three of which concern caching: Cache, Clear Cache and Retrieve Cache. In this first tutorial, we show how the Cache operator works and how to integrate it into your process. Later tutorials will cover more advanced functions such as data validity settings, user rights, the Clear Cache operator, and how to take dependencies into account when caching.
In essence, the Cache operator offers a subprocess in which you place the operators that load the relevant data. This subprocess is executed once, and its output is kept in the cache from then on. When you run the process again, it returns the cached data rather than reloading the same data over and over. Because you can put entire preprocessing chains into the subprocess, the Jackhammer Extension makes waiting for the process to load its data a thing of the past.
The subprocess is not limited to loading data. You can also place the training of prediction models, or any other complex process that generates static results, inside it, so that the results are cached and can be used without running those steps each time.
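The run-once behavior described above can be pictured with a few lines of Python. This is only a conceptual sketch, not the extension's implementation: the function names, the in-memory dictionary and the sample rows are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical in-memory store; how the extension really stores data is not covered here.
_cache: Dict[str, Any] = {}


def run_cached(name: str, subprocess: Callable[[], Any]) -> Any:
    """Run `subprocess` only if nothing is cached under `name` yet."""
    if name not in _cache:
        _cache[name] = subprocess()
    return _cache[name]


load_calls = 0


def load_turbine_data() -> list:
    """Stand-in for the operators placed inside the Cache subprocess."""
    global load_calls
    load_calls += 1
    # Made-up rows, used only so the sketch has something to return.
    return [{"day": 1, "power_kwh": 120.0}, {"day": 2, "power_kwh": 95.5}]


first = run_cached("wind turbine", load_turbine_data)
second = run_cached("wind turbine", load_turbine_data)
print(load_calls)  # the loader ran once; the second call was served from the cache
```

The second call never touches the loader, which is the same effect the Cache operator has on the operators inside its subprocess.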
For more details on the Jackhammer Extension for RapidMiner, including a demo version and information on how to purchase a license, visit our product page.
Already know how to integrate the operator into your processes? Jump to the second tutorial about data validity settings and clearing the cache.
To follow along with the tutorial, download this file containing the data, which in our scenario comes from a database: Wind Turbine Data
The Scenario
Let’s say you are working for a large glassmaking company. Glassmaking consumes a lot of power, which gave your boss the idea of putting up the company’s own wind turbine to produce electricity inexpensively and in an environmentally friendly way. Your boss is mighty proud of his purchase and would like all his employees to see how much power the new wind turbine produces. As the company’s chief data scientist, you are given the task of making this information accessible on the internal company website. You start working on it and realize that the wind turbine sends new data to the database only once a day. If the data had to be loaded each time one of your colleagues checked it, this would put stress on the database and slow down the service’s response time, and for no good reason, as the data only updates once a day. With the caching functions of our Jackhammer Extension, the data is loaded only for the first request, after which it is cached and does not need to be retrieved every time. In the following, we will demonstrate how to use the Cache operator.
Step 1
For this first step, we will take a look at the Cache operator itself and its parameters. Search for the Cache operator and add it to your process, then click on it to be able to see the parameter settings.
As you can see, you can set a Cache name (1) of your choosing (which will be relevant in our following tutorial covering the Clear Cache operator) or clear the cache manually (2). You can also tick the box to restrict validity (3) and enter cache dependencies (4); these are functions we will discuss in later tutorials. If you choose not to enter a name now, RapidMiner will simply use the operator’s name, i.e. Cache. Names become important for more advanced functions such as the Clear Cache or Retrieve Cache operators. In this example, we will use “wind turbine”, but for these steps it is not yet necessary to enter a name. Then double-click on the operator to open the subprocess.
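To make the role of the cache name and the manual clear more concrete, here is a small Python sketch. It is an analogy under stated assumptions, not the operator's actual code: the class NamedCache, its methods and the fallback logic are hypothetical and merely mirror the behavior described above, namely that an empty name falls back to the operator's name.

```python
from typing import Any, Callable, Dict, Optional


class NamedCache:
    """Toy cache keyed by name; hypothetical, not the extension's implementation."""

    def __init__(self) -> None:
        self._entries: Dict[str, Any] = {}

    @staticmethod
    def effective_name(cache_name: Optional[str], operator_name: str = "Cache") -> str:
        # Assumed fallback: with no cache name set, the operator's name is used.
        return cache_name if cache_name else operator_name

    def get(self, loader: Callable[[], Any], cache_name: Optional[str] = None) -> Any:
        key = self.effective_name(cache_name)
        if key not in self._entries:
            self._entries[key] = loader()
        return self._entries[key]

    def clear(self, cache_name: Optional[str] = None) -> None:
        # Counterpart of clearing the cache manually: the next get() reloads.
        self._entries.pop(self.effective_name(cache_name), None)


runs = []
cache = NamedCache()
cache.get(lambda: runs.append("load") or "data", cache_name="wind turbine")
cache.get(lambda: runs.append("load") or "data", cache_name="wind turbine")
cache.clear("wind turbine")
cache.get(lambda: runs.append("load") or "data", cache_name="wind turbine")
print(len(runs))  # two loads: the initial one and one after the manual clear
```

Giving each cache its own name keeps entries apart, which is why the name matters once several caches or a separate clearing step are involved.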
Step 2
On the subprocess level, add your database connection. To rebuild this process without a database, download the file above and add a Read CSV operator instead of the Read Database operator. Use the wizard to load the file, then continue as described below.
Step 3
On this level, you can also add your data preprocessing steps. Their final result will be cached as well, meaning you will only have to run the preprocessing once and can save even more time. This is what the subprocess looks like with an added preprocessing step – of course you can use any and as many as you need! Do not forget to make all necessary connections between the operators and to the output ports in order to be able to receive your results.
When you run the process now, it will load the data and output an example set. Run it again and the results will be there virtually instantly! The extension’s process execution performance monitoring feature, which we demonstrated in our previous tutorial, backs up the time-saving effect of caching with numbers:
This screenshot shows the run time in milliseconds for the first execution of the process. The following one illustrates the improvement due to caching the Read Database and Select Attributes operators:
As you can see, the tasks inside the Cache operator are no longer executed, which shaves 344 ms off your run time. While this is of course only a small example process, you can surely imagine the time-saving effect caching has on increasingly complex processes and larger databases!
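If you would like to reproduce this kind of before-and-after comparison outside RapidMiner, the following Python sketch times a cached pipeline twice. Everything in it is assumed for illustration: the 50 ms sleep stands in for a slow database read, the column names are invented, and the resulting timings will not match the 344 ms measured above.

```python
import time
from typing import Callable, List, Optional, Tuple


def read_database() -> List[dict]:
    """Simulated slow retrieval; the sleep stands in for a database round trip."""
    time.sleep(0.05)
    return [{"timestamp": "day 1", "power_kw": 310.0, "sensor_id": 7}]


def select_attributes(rows: List[dict], keep: List[str]) -> List[dict]:
    """Keep only the listed columns, similar in spirit to Select Attributes."""
    return [{key: row[key] for key in keep} for row in rows]


_cached: Optional[List[dict]] = None


def cached_pipeline() -> List[dict]:
    """Load and preprocess on the first call, return the stored result afterwards."""
    global _cached
    if _cached is None:
        _cached = select_attributes(read_database(), ["timestamp", "power_kw"])
    return _cached


def timed_ms(fn: Callable[[], List[dict]]) -> Tuple[List[dict], float]:
    """Return the function's result and its run time in milliseconds."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000


first_result, first_ms = timed_ms(cached_pipeline)
second_result, second_ms = timed_ms(cached_pipeline)
print(f"first run: {first_ms:.1f} ms, second run: {second_ms:.1f} ms")
```

The first run pays for both the simulated retrieval and the preprocessing, while the second only returns the stored result, so the gap grows with the cost of whatever sits inside the cached part.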
Now you know how to use the Cache operator and can integrate it in your own processes. In the following tutorial we will pick up where we left off here to show how to set data validity periods and how to manually clear and reload the cache.