Extracting two or more relational example sets from one JSON

Now that we know how to generate one example set from a JSON string, we will turn to the truly exciting feature of the WebAutomation Extension: extracting relational information by creating two or more relational tables from the same JSON string. Click here to open this tutorial process in RapidMiner.

We will continue the tutorial with the information about three books we also used in the first part. What we have extracted so far are title, subtitle, language and edition as well as publication date and publishing company – the properties of the books array. Taking another look at the JSON, however, we will find two more arrays nested inside this one: authors and keywords. Both of these can have multiple entries, see for example the three authors of the second book.

Reminder: Our JSON string

Simply adding them to our existing example set would be impractical, as books with more than one author would force us to decide if we either want to have only one row per book, but then we can only get the last entry of the nested arrays. That would mean that we would only store Jerome Friedman, ignoring Mr Hastie and Mr Tibshirani’s contribution to the data scientist’s bible “Elements of statistical learning”. The only alternative would be to have three rows for the book, but then we would have multiple copies of the master data with it’s title, subtitle and so on.

To avoid this, we will show how to create a second table show the author’s names and how to link that with the master data of the first table.

For this tutorial, we will focus on creating a second table to relate the authors set to our first example set, but of course it is also possible to create a third table for keywords – we will discuss how to deal with arrays of values in the next part of this tutorial. As a reminder, this is how our process should be looking right now:

Reminder: The basic process

As the tree view shows, the Process Object operator houses the Extract Properties and Commit Row operators necessary to create the example set from the master data. If you are unsure about how to get there, have a look at the first part of this tutorial.

As mentioned before, this extension works by mirroring the JSON structure. Therefore, we need to enter the Process Array books operator, seeing as the authors array is within the books array. This is the right place to configure the extraction of the authors. Again, this is what you should be looking at:

Reminder: Inside the Process Array operator

With the Extract Properties and Commit Row operators we previously generated the rows of our example set. In order to create a second table, relating the author’s names to the books, there are two things we need to do: extract the properties of the authors array and establish a connection appearing in both example sets. We will first go about extracting the properties of the second array.

Extracting a Second Array

To be able to access the authors array within the books array, simply add another Process Array operator to your process.

Adding a nested Process Array operator

Basically, this is simply a rerun of what we did before to create the first example set: enter the Process Array operator and add Extract Properties and Commit Row operators to form a second example set. The tree view makes the similarity even clearer:

Similar structures

In Extract Properties, enter first name and last name as the properties to be extracted and go back to the first Process Array level. Here, we still need to make our port connections: using Multiply, connect the incoming left par (parse specification) port with both the Extract Properties and the nested Process Array operators. Going out from Process Array, make the connection to the second, still unused, par outgoing port. Make sure you connect the output ports on all the higher levels and also between the Process Object and the Parse operator. As you can see in the second screenshot below, the Parse Operator has multiple par ports to receive multiple parse specifications. For each incoming specification, the operator will generate an individual example set.

 

Connecting the ports

Start the process and you should be getting two example sets. These are, however, not yet related: You will see one example set showing the author’s names and a second set with the properties we extracted in the first part of this tutorial, but from that, you will not be able to see which book was written by which author(s). What we need is an ID appearing in both sets.

Establishing a Connection between the Example Sets

To create an ID relating the two example sets to one another, go to the parameter settings of the first Process Array and select “create id attribute” and give it a name of your choice:

Create ID attribute

An ID with an auto incremented number will now be assigned to every object in the array and add it as an attribute in the resulting data sets:

Two relational ExampleSets

As these IDs correspond to the same object in both example sets, you can now compare the two data sets and see which author belongs to which book.

 Usually the JSON should already include an ID value. In this case, you will probably want to use that as the connection between the sets instead of the ID created by the extension. To that end, add another Extract Properties operator before the Multiply operator, and select your ID property as an attribute to be generated. If you had previously added ID as a property to be extracted in the first (now second) Extract Properties operator, make sure to remove it from there. Putting it in front of the Multiply will add it to both example sets, as it is now contained in the parse specification that is fed into the Process Array operator for the authors, as you can see here:

Using the ID included in the JSON

Summary

The JSON Parsing capabilities of the Web Automation extension enable you to adaptably extract the structured information of the JSON format and put it into handy tables that you can directly process with RapidMiner for profile building and further analysis. It does so with a single scan of the file resulting in the incredible speed improvement of factor 450 witnessed by our beta tester compared to the built-in JSON capabilities of RapidMiner.