Part 2: Indexed Models
Two weeks ago we discussed Indexed Collections. Indexed Collections organize your data and facilitate your work by making data readily accessible for further analysis. For this next part of our Experts for Experts series we want to talk about another great feature of the Jackhammer Extension utilizing indexing: the Indexed Model. Indexed Models work similarly to Indexed Collections in that they automatically build an index that can increase work efficiency and result accuracy.
The Example Scenario
To explain in more detail, we will draw upon another example: power plants and their output. There are many different kinds of power plants, using different techniques and resources to generate power: older coal or even oil-fired plants, nuclear reactors, hydroelectric power stations using the potential energy of water, as well as solar or wind power plants. Their energy generation capacity differs from method to method: for example, while nuclear reactors can and must generate power continuously, solar panels produce nothing at night, although a solar plant equipped with storage can still feed into the power grid even then.
We will return to the kingdom of Predictia, where our king is also interested in optimizing energy generation in his country. He has provided you with data about the existing power plants and their energy output: there are a few nuclear power plants, several wind farms out at sea, and some solar fields in the sunnier regions down south. There are also two hydroelectric power stations at dams in the mountains and one tidal plant at the coast. Predictia, a forerunner in renewable energy, got rid of its coal-fired power plants long ago, but the data still shows a couple of retired plants.
You create an ExampleSet that contains, besides each power station’s name and yearly power output, attributes such as the type of plant, e.g. coal, nuclear, wind, or solar. As you still have the weather records from your last job, you decide to add those too: the number of sunny days in a year might be important for solar fields, and wind data would give insights into the electricity generated by wind turbines.
A new solar field has been built by the biggest electricity supplier in the country, Royal Energy, and they would like to know how much power this field can be expected to generate in the next year. It has been running for a while, and they have sent you the data.
Now imagine you trained the model for predicting the new field’s electricity generation on all the data the king gave you earlier, including the data from other kinds of power stations, even the old retired coal-fired ones. While you would certainly receive a result, it would not be a very good prediction of how much power the new solar field will generate: how could data from a nuclear reactor say anything about a solar field? How conclusive can records from coal-fired power stations be for it? Even for wind farms, where there is a connection between the two forms of energy because heat from the sun sets the air in motion, the predictions would be very unreliable.
So the logical next step is to construct separate models for the different types of power plants: one for nuclear reactors, one for solar fields, one for wind turbines, and two for the two different kinds of hydroelectric stations. You could go further and increase accuracy by building different models for different geographical locations: data from solar fields in regions with many hours of sunlight a day might not be the best basis for predicting the energy production of a field in a region where it rains a lot. It would also make sense to subdivide the models by time of year, as the days are shorter in winter. There are certainly many more details that could make the models more accurate, and while this example is deliberately kept simple, you can surely imagine the vast number of models you would have to construct in some real-world scenarios if you wanted results that actually make sense. And if you did all this by hand, you would also face the cumbersome task of selecting the correct model for the case in question, again by hand, every time you wanted to apply one.
Indexed Models solve all of this for you. You feed one big example set into the Indexed Model operator and specify one or several attributes you would like to group by (such as “power plant type” in our case, and possibly additionally “region” or “season”) in the parameters. These are the index attributes. Inside the subprocess you can build and train your model just like you would do with a normal model.
The operator constructs one model for each distinct value of the index attribute or, if you selected several index attributes, for each combination of their values. So for our example, we will get one model each for wind farms, solar fields, coal-fired plants, nuclear reactors, the hydroelectric power stations at the dams, and the tidal plant. If you also selected “season” as an index attribute, it would create one model each for solar fields in winter, spring, summer, and fall, and do the same for all other types of power plants.
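To make the grouping concrete, here is a minimal sketch in plain Python. It is not RapidMiner code and does not claim to reflect how the operator is implemented. The attribute names (plant_type, season, output_mwh), the sample rows, and the helper group_by_index are all made up for illustration. It only shows how choosing one or two index attributes changes the number of groups, and therefore the number of models that would be trained.

```python
from collections import defaultdict

# Hypothetical rows; attribute names and values are invented for this sketch.
rows = [
    {"plant_type": "solar", "season": "summer", "output_mwh": 120.0},
    {"plant_type": "solar", "season": "winter", "output_mwh": 40.0},
    {"plant_type": "solar", "season": "summer", "output_mwh": 130.0},
    {"plant_type": "wind", "season": "winter", "output_mwh": 200.0},
    {"plant_type": "nuclear", "season": "summer", "output_mwh": 900.0},
]


def group_by_index(rows, index_attributes):
    """Partition rows by the tuple of their index-attribute values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in index_attributes)
        groups[key].append(row)
    return dict(groups)


# One index attribute: one group (and thus one model) per plant type.
by_type = group_by_index(rows, ["plant_type"])

# Two index attributes: one group per (plant type, season) combination
# that actually occurs in the data.
by_type_and_season = group_by_index(rows, ["plant_type", "season"])

print(sorted(by_type))
print(sorted(by_type_and_season))
```

In this sketch only combinations that occur in the data produce a group, so adding index attributes multiplies the number of models only as far as the data supports it.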
The important thing here is that although the operator constructs many models, you receive only one indexed model. This makes it an incredibly powerful way to deal with cases where you have to predict the same aspect for many different kinds of one thing. When you apply the indexed model, the relevant model is chosen automatically for the data at hand. You only have to deal with one indexed model, you are freed from the hassle of selecting the correct model from possibly hundreds, and your processes stay structured and tidy.
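The idea of many models behind a single object can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Jackhammer Extension’s implementation: the classes IndexedModel and MeanModel, the attribute names, and the numbers are hypothetical, and the mean-based learner merely stands in for whatever you would train inside the subprocess.

```python
from collections import defaultdict
from statistics import mean


class MeanModel:
    """Stand-in learner (an assumption): predicts the mean label of its group."""

    def fit(self, rows, label):
        self.prediction = mean(row[label] for row in rows)
        return self

    def predict(self, row):
        return self.prediction


class IndexedModel:
    """Hypothetical wrapper: one inner model per index key, one object outside."""

    def __init__(self, index_attributes, label, make_model=MeanModel):
        self.index_attributes = index_attributes
        self.label = label
        self.make_model = make_model
        self.models = {}

    def _key(self, row):
        return tuple(row[name] for name in self.index_attributes)

    def fit(self, rows):
        # Split the training data by index key, then train one model per group.
        groups = defaultdict(list)
        for row in rows:
            groups[self._key(row)].append(row)
        self.models = {
            key: self.make_model().fit(group, self.label)
            for key, group in groups.items()
        }
        return self

    def predict(self, row):
        # Look up the inner model that matches this row's index values.
        key = self._key(row)
        if key not in self.models:
            # Design choice of this sketch: fail loudly for unseen index values.
            raise KeyError(f"no model trained for index value {key}")
        return self.models[key].predict(row)


# Made-up training data for illustration only.
training = [
    {"plant_type": "solar", "output_mwh": 100.0},
    {"plant_type": "solar", "output_mwh": 120.0},
    {"plant_type": "nuclear", "output_mwh": 900.0},
]
model = IndexedModel(["plant_type"], "output_mwh").fit(training)

new_field = {"plant_type": "solar"}
print(model.predict(new_field))  # uses only the solar model
```

The caller only ever handles the one model object; the lookup by index key happens inside predict. How the real operator treats index values it has not seen during training is not covered here, so raising an error is purely a choice made for this sketch.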
We can now apply our Indexed Model to the data we received from Royal Energy to predict how much electricity the solar field will generate next year. RapidMiner will automatically use the model for solar fields, so the prediction rests only on data from comparable plants. And because we have models for all other types of power stations, too, we are ready for the future, when the next prediction is to be made, with all the convenience of just one indexed model.