Indexing Objects and Models with the Jackhammer Extension

Part 1: Indexed Collections

Today, as part of our new Experts for Experts Series, we would like to present one of the many useful features of the Jackhammer Extension: indexing, which makes collections in RapidMiner more convenient to handle. We will first discuss Indexed Collections and then turn to Indexed Models.

The idea behind Indexed Collections is simple, yet powerful: building on the existing object collections functionality of RapidMiner, the Jackhammer Extension enables you to add group information to the objects, thereby indexing them. This gives your results a clear structure and makes information readily accessible, without a cumbersome search through folders to find what you are looking for.

Figure: a normal Collection (left) and an Indexed Collection (right)

As the comparison above shows, the results are now available in a much more ordered and structured fashion: instead of clicking your way through many folders that all have the same name, you can access the correct folder in just one step. This not only speeds up finding information, it also makes further processing or modeling steps more efficient and more precise.

We will illustrate this feature with an example: The king of Predictia is interested in knowing beforehand how much rain will fall in the upcoming season, for crop planning purposes. Charged with the task of making predictions, you have obtained centuries’ worth of precipitation reports from every corner of the kingdom (Predictia started recording the weather much earlier than other countries) and entered them into RapidMiner in order to build predictive models on them later. Right now, however, you are drowning in a flood of values with no indication of where or when each one was measured, and it is dawning on you that this data, as plentiful as it may be, will not actually be very useful unless you add this information.

Without Indexed Collections, you can go two ways: either you add attributes like Location and Month to the data entries, or you use the normal (i.e. not indexed) Collections to group the information. Both options, however, have disadvantages. When adding the supplementary information as attributes, you will always have to run through the entire ExampleSet to find information regarding a certain place or time. Furthermore, in later analytical steps, you will have to apply extensive filters to make sense of the data. With normal Collections, you can group your entries, but you have no way of knowing which folder is which: “Folder 1, Folder 2, Folder 3” does not reveal much about the content. You will have to click through all of them and keep checking back against your original data if you would like to use them for your later analysis.

Indexed Collections surpass both of these approaches. Your data is neatly organized into folders for easy access, providing an overview and structure. Because of their clear designations, you can efficiently find and use relevant information for your analysis. What’s more, you now have direct access to the data for your later analyses without having to filter out unwanted items or loop over the whole collection: with the operator Select by Key, you can easily and conveniently access exactly what you need.
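To make the difference between the three approaches concrete, here is a minimal sketch in plain Python. It is only an analogy, not RapidMiner or Jackhammer code: the place names, the measurements and the helper `filter_rows` are invented for illustration, and a dictionary stands in for an Indexed Collection.

```python
# Hypothetical precipitation records: (location, month, millimetres).
# All names and values are made up for this illustration.
records = [
    ("Northport", "July", 12.0),
    ("Northport", "October", 80.5),
    ("Southvale", "July", 3.5),
    ("Southvale", "October", 64.0),
]


def filter_rows(rows, location, month):
    """Attribute approach: every query scans the whole data set."""
    return [mm for loc, mon, mm in rows if loc == location and mon == month]


# Plain collection approach: the values are grouped, but the labels are gone.
plain_collection = [[12.0], [80.5], [3.5], [64.0]]  # which folder is which?

# Indexed approach: the group information is kept as the key of each folder.
indexed_collection = {}
for loc, mon, mm in records:
    indexed_collection.setdefault((loc, mon), []).append(mm)

# Direct access by key, with no filtering and no looping over all folders.
october_north = indexed_collection[("Northport", "October")]
```

The last line plays the role of Select by Key in this analogy: the group information alone is enough to reach the right folder.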

Coming back to our example, we could group the data into collections by month, so that we can access only a specific time period, by region, or both! Indexed Collections can be nested, meaning we can collect data by month and city. Of course, what actually makes sense depends heavily on the data and what you want to do with it. In our case, this solution seems sensible: rain, or precipitation in general, might fall more in one part of the country than another, and in the winter months there might be more of it than in, say, August. By grouping the data this way, you can make sure your later predictions are based on the relevant data: using July data for fall forecasts might lead to a prediction that there will be almost no rain all October, but this probably won’t hold true. Put the data in Indexed Collections to make sure you use only what’s relevant.
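The nesting by month and city can be pictured with nested dictionaries. Again, this is a plain Python sketch under assumptions rather than the extension's actual data structure: the cities, the values and the `mean` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical measurements: (month, city, millimetres).
measurements = [
    ("July", "Northport", 12.0),
    ("July", "Southvale", 3.5),
    ("October", "Northport", 80.5),
    ("October", "Southvale", 64.0),
    ("October", "Northport", 91.0),
]

# Outer index: month. Inner index: city. Each leaf holds the matching values.
nested = defaultdict(lambda: defaultdict(list))
for month, city, mm in measurements:
    nested[month][city].append(mm)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# A fall forecast for Northport only looks at October data for Northport;
# the July values never enter the calculation.
october_north_mean = mean(nested["October"]["Northport"])
```

Whether month or city should be the outer level depends on which question you ask more often; in this sketch the choice is arbitrary.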

In two weeks, we will continue with our example and find out how to make our predictions about Predictia’s precipitation more accurate using Indexed Models.

 

Operators:

Combine Indexed Objects: copies the IOObjects of each input with their respective group information into a single indexedIOObjectsContainer. 

Extend Indexed Objects: extends the provided indexedIOObjectsContainer by the provided IOObject and group information. 

Select by Key: retrieves the IOObject that was assigned to the provided group information.
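For readers who think in code, the three operators can be mimicked by a toy Python class. This is a sketch of the behaviour described above, not the extension's implementation: the class `IndexedContainer`, its method names and the decision to return a new container from `extend` are all assumptions made for illustration.

```python
class IndexedContainer:
    """Toy stand-in for an indexed container: group information -> object."""

    def __init__(self):
        self._items = {}

    def extend(self, key, obj):
        """Mimics Extend Indexed Objects: add one object with its group info.

        Returning a new container instead of modifying this one is an
        assumption of the sketch, not documented behaviour.
        """
        extended = IndexedContainer()
        extended._items = dict(self._items)
        extended._items[key] = obj
        return extended

    def select_by_key(self, key):
        """Mimics Select by Key: return the object stored under the key."""
        return self._items[key]

    def keys(self):
        return list(self._items)


def combine(pairs):
    """Mimics Combine Indexed Objects: gather (key, object) pairs in one container."""
    container = IndexedContainer()
    for key, obj in pairs:
        container = container.extend(key, obj)
    return container


# Hypothetical usage with made-up precipitation values.
rain = combine([
    (("October", "Northport"), [80.5, 91.0]),
    (("October", "Southvale"), [64.0]),
])
rain_with_july = rain.extend(("July", "Northport"), [12.0])
selected = rain_with_july.select_by_key(("October", "Southvale"))
```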