Extracting Arrays of Scalar Values with the WebAutomation Extension

In the previous posts, we first discussed the basic functions of the WebAutomation Extension and then demonstrated how to extract not just one but multiple relational example sets from a single JSON string. As mentioned there, we have one more feature to show: extracting arrays of scalar values. Click here to open this tutorial process in RapidMiner.

As we will continue with our example data from before, let’s first have another look at the JSON:

Reminder: The JSON string

So far, we have discussed extracting the properties of the books array: title, subtitle, language and so on. We also covered how to extract the information in the nested authors array. As you can see above, both books and authors are arrays of objects. A closer look at the JSON, however, shows one array that we have not yet processed: keywords. Unlike authors, keywords is an array of single string values rather than of nested objects. In the following, we will demonstrate how to extract this information into a third table.
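To make the difference between the two kinds of array concrete, here is a small Python sketch. The sample document is a hypothetical stand-in, not the tutorial's actual JSON: only the property names books, title, authors and keywords come from the text, while the author field `name`, the sample values and the helper `array_kind` are assumptions made for illustration.

```python
import json

# Hypothetical stand-in for the tutorial's JSON. Only the property names
# books, title, authors and keywords are taken from the text; the author
# field "name" and all values are invented for illustration.
SAMPLE = """
{
  "books": [
    {
      "title": "Book A",
      "authors": [{"name": "Author One"}, {"name": "Author Two"}],
      "keywords": ["json", "parsing"]
    }
  ]
}
"""

def array_kind(values):
    """Classify a JSON array as 'objects', 'scalar values' or 'mixed'."""
    if all(isinstance(v, dict) for v in values):
        return "objects"
    if all(not isinstance(v, (dict, list)) for v in values):
        return "scalar values"
    return "mixed"

book = json.loads(SAMPLE)["books"][0]
kinds = {name: array_kind(book[name]) for name in ("authors", "keywords")}
print(kinds)  # {'authors': 'objects', 'keywords': 'scalar values'}
```

The authors entries are objects with named properties, whereas each keywords entry is already a plain value, which is why it needs a different extraction step.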

First, here is a reminder of what the inside of the Process Array operator should look like right now. As discussed before, the structure of the process mirrors the structure of the original JSON, so we will continue to work on the level of the books array.

Reminder: The Process Object sub-process

We will now add another Process Array operator and connect it to Multiply and to the third Parse Specification port on the right. Remember to also make the new connections on all higher levels and between the Process Object and Parse operator so that you receive the new example set.

The new Process Array operator

Click on the new operator to edit its parameters: set the property name to “keywords” and select “scalar values” as the array type:

Process Array parameters

Inside the operator, we will build a sub-process similar to the ones we use to extract the authors and the other properties. The only difference is that instead of the Extract Properties operator, we will now use the Extract Scalar operator provided by the WebAutomation extension. Enter an attribute name (Keywords) and select the matching attribute type, in this case polynominal. Do not forget to add a Commit Row operator to the sub-process so that every array entry is written to a row of its own:

Process Array sub-process
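As a rough analogy in plain Python (not the extension's API), the sub-process behaves like a loop over the array entries: each scalar is stored under the chosen attribute name and the row is then committed. The function name `keyword_rows` and the sample keywords are hypothetical.

```python
def keyword_rows(keywords):
    """Return one row per scalar entry, stored under the attribute 'Keywords'."""
    rows = []
    for value in keywords:
        # Analogue of Extract Scalar: store the entry as a nominal (string) value.
        row = {"Keywords": str(value)}
        # Analogue of Commit Row: every array entry becomes its own row.
        rows.append(row)
    return rows

rows = keyword_rows(["json", "parsing", "web"])
for row in rows:
    print(row)
```

Without the commit step inside the loop, there would be nothing to mark where one row ends and the next begins, which is the role Commit Row plays in the sub-process.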

Running the process, you should now get three individual example sets: one showing the properties of the books array, one with the authors’ names, and a third one with the keywords assigned to the books. The keywords array process is nested within the Process Object operator, which, as you might remember from the previous tutorials, we have set to assign an ID to each JSON object. Thus, the new third example set will also include an ID that matches the IDs in the other example sets, so the three tables can be related to each other. (If your data already includes an ID, go back to our second tutorial to read up on how to use it as the connecting element.)

The corresponding example sets
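The relational idea behind the three example sets can be sketched in plain Python. This is an illustration under assumptions, not the extension's implementation: the sample data and the author field `name` are invented, and sequential integer IDs starting at 1 are assumed because the text does not say how the generated IDs look.

```python
# Hypothetical input; only the property names books/title/authors/keywords
# come from the tutorial text.
books_json = [
    {"title": "Book A",
     "authors": [{"name": "Author One"}],
     "keywords": ["json", "parsing"]},
    {"title": "Book B",
     "authors": [{"name": "Author Two"}, {"name": "Author Three"}],
     "keywords": ["web"]},
]

books, authors, keywords = [], [], []
# Assumption: one generated ID per book object, counted from 1.
for book_id, book in enumerate(books_json, start=1):
    books.append({"ID": book_id, "title": book["title"]})
    for author in book["authors"]:
        authors.append({"ID": book_id, "name": author["name"]})
    for keyword in book["keywords"]:
        keywords.append({"ID": book_id, "Keywords": keyword})

# Relate the keywords table back to the books table through the shared ID.
title_by_id = {row["ID"]: row["title"] for row in books}
keywords_with_titles = [(title_by_id[row["ID"]], row["Keywords"]) for row in keywords]
print(keywords_with_titles)
# [('Book A', 'json'), ('Book A', 'parsing'), ('Book B', 'web')]
```

Because every row in the authors and keywords tables carries the ID of the book it came from, each table can stay flat while the nesting of the original JSON remains recoverable.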

Summary

This concludes our tutorials on JSON parsing with the new WebAutomation Extension. You should now be able to turn nested JSON, including arrays of objects and arrays of scalar values, into related example sets. For further help with the extension, you can also check the tutorials shown in the Help tab of RapidMiner Studio when you select one of the extension’s operators. Also be sure to have a look at the other useful functions, such as the JSON request operators, which fetch the data directly from a web service.