Collaborative CRISP extends the established Cross Industry Standard Process for Data Mining (CRISP-DM) to cover the whole process of introducing data science into an organization as a strategic capability.
CRISP-DM looks at a project entirely from the data scientist's perspective. However, when introducing data science, and especially machine learning, into an organization, it is necessary to take a step back and look at the bigger picture: corporate structure and social aspects are just as important as the purely technical side. Hence, Collaborative CRISP expands the Data Science Loop, adding areas for project management and social integration.
Project Management – Use Case Identification
Identifying and evaluating use cases is a key factor in the successful implementation of data science in your business. The first challenge is to spread the Predictive Mindset among the employees so that they can see the new possibilities that data science will bring. The next step is to select the most promising projects, i.e. those that have the most favorable cost-benefit assessment, require only simple infrastructure, and are most likely to succeed.
During implementation, several use cases should be considered, because it is difficult to draw conclusions and make decisions about the technology’s applicability from a single example. With several use cases, the organization can learn from failures and experience successes first-hand as positive reinforcement, which builds a learning curve.
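The selection step described above can be sketched as a simple ranking. This is only an illustration, not part of Collaborative CRISP itself: the `UseCase` fields, the scoring formula, and the example figures are all assumptions chosen to reflect the three criteria named in the text (cost-benefit, infrastructure simplicity, likelihood of success).

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    # All fields are hypothetical estimates agreed on with the business side.
    name: str
    expected_benefit: float     # estimated benefit, arbitrary currency units
    expected_cost: float        # estimated implementation cost, same units
    infrastructure_effort: int  # 1 (simple) to 5 (complex)
    success_probability: float  # rough estimate between 0 and 1


def score(uc: UseCase) -> float:
    """Risk-adjusted benefit per unit of cost, penalised by infrastructure effort.

    The formula is an assumption for illustration, not a prescribed metric.
    """
    return (uc.success_probability * uc.expected_benefit) / (
        uc.expected_cost * uc.infrastructure_effort
    )


def rank_use_cases(use_cases, top_n=3):
    """Return the top_n use cases, most promising first."""
    return sorted(use_cases, key=score, reverse=True)[:top_n]


candidates = [
    UseCase("churn prediction", 200.0, 50.0, 2, 0.6),
    UseCase("predictive maintenance", 500.0, 200.0, 4, 0.4),
    UseCase("demand forecast", 150.0, 30.0, 1, 0.7),
]
shortlist = rank_use_cases(candidates, top_n=2)
```

Keeping more than one entry in the shortlist mirrors the recommendation to pursue several use cases rather than judging the technology by a single example.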
Project Management – Data Providing
Once a use case has been determined, the collaborative process between the data scientist and the domain experts begins. Together they select the data sets that seem most useful for finding a solution. This produces an overall picture of what potentially relevant information is available. The data sets are assessed with regard to presumed relevance, accessibility, and complexity of analysis. It is the management’s task to ensure that other departments can and do help by providing semantics and access to the data.
Project Management – Infrastructure Identification
Depending on the size and nature of the problem and the identified data, the infrastructure is chosen to support an efficient analysis while minimizing the necessary infrastructural effort. This includes the infrastructure for storing the data as well as for processing it.
For complex problems where the quality of the analysis is unpredictable, the analysis can be divided into several phases, each using a different infrastructure. The phases are divided so that they produce the best possible results at the least cost and effort, and so that the data can easily be migrated to the next, more complex infrastructure.
Data Science Loop – Business Understanding
During this initial phase of the data science part of the project, the data scientist focuses on understanding the project objectives and requirements from a business perspective. What is the economic challenge and what should the solution take into account? What is required to be able to integrate the solution into business processes?
From these questions, the data scientist forms a first outline of how to reach the project’s goals. During this phase, the essential quality criteria for a successful project outcome and the deployment prospects of the project are determined in coordination with representatives of the management.
Data Science Loop – Data Understanding
In this phase, the data scientist gets acquainted with the data provided and, where necessary, uncovers data quality problems. At the same time, the data scientist can test their understanding of the business problems by checking it against the received data. This interpretation of the data often further improves the understanding of the processes involved, on the part of the data scientist and often of other stakeholders as well.
Data Science Loop – Data Preparation
After understanding the data, the data scientist goes through the processing steps necessary to generate suitable situational profiles from the raw data, which can then be used to train predictive models. Data preparation is adapted to the model being applied, which means it is carried out repeatedly. It comprises data and feature selection as well as the transformation, aggregation, and editing of data.
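As a rough illustration of turning raw data into profiles, the following sketch selects a few fields and aggregates raw records into one feature dictionary per entity. The record layout (`entity_id`, `amount`, `duration`), the function name `build_profiles`, and the choice of the mean as aggregate are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import mean


def build_profiles(records, selected_fields):
    """Aggregate raw records into one profile (feature dict) per entity.

    Missing values (None) are dropped before aggregation, which is one
    simple editing strategy among many.
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["entity_id"]].append(rec)

    profiles = {}
    for entity_id, recs in grouped.items():
        profile = {"n_records": len(recs)}
        for field in selected_fields:  # feature selection
            values = [r[field] for r in recs if r.get(field) is not None]
            # aggregation: mean of the available values, None if there are none
            profile[f"{field}_mean"] = mean(values) if values else None
        profiles[entity_id] = profile
    return profiles


# Hypothetical raw records, e.g. one row per transaction or sensor reading.
raw = [
    {"entity_id": "A", "amount": 10.0, "duration": 2.0},
    {"entity_id": "A", "amount": 30.0, "duration": None},
    {"entity_id": "B", "amount": 5.0, "duration": 1.0},
]
profiles = build_profiles(raw, ["amount", "duration"])
```

Because the selected fields are passed in as a parameter, the same function can be rerun with a different selection whenever the modeling phase calls for differently shaped input.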
Data Science Loop – Modeling
In this phase, various modeling techniques are applied and tested, and their parameters are calibrated to optimal values. Some techniques have specific requirements regarding the form of the data, so stepping back to the data preparation phase is often necessary.
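Parameter calibration can be pictured as a search over candidate values. The sketch below does this for a deliberately trivial model, a single threshold on one feature, using an exhaustive search over a hand-picked candidate list; the model, the candidate values, and the sample data are assumptions for illustration, and real projects would typically rely on an established library instead.

```python
def accuracy(threshold, samples):
    """Share of samples where 'value >= threshold' matches the label."""
    hits = sum((value >= threshold) == label for value, label in samples)
    return hits / len(samples)


def calibrate_threshold(samples, candidates):
    """Return the candidate threshold with the highest accuracy.

    If several candidates tie, the first one in the list is returned.
    """
    return max(candidates, key=lambda t: accuracy(t, samples))


# Hypothetical training data: (feature value, label) pairs.
training = [
    (0.1, False), (0.3, False), (0.4, False),
    (0.6, True), (0.7, True), (0.9, True),
]
best = calibrate_threshold(training, [0.2, 0.5, 0.8])
```

Calibrating on the training data alone says little about how the model behaves on new data, which is why the evaluation phase tests it on independent data.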
Data Science Loop – Evaluation
At this stage in the project, a model that appears to be of high quality has been found. Before proceeding to final deployment, it is important to evaluate the model and its construction more thoroughly and to test it on independent data not used during training. This ensures that the model will achieve the expected quality and can indeed solve the initial business problem.
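Testing on independent data can be sketched with a simple hold-out split. The helper names, the 25% test fraction, the fixed seed, and the `required_accuracy` parameter (standing in for a quality criterion agreed on during Business Understanding) are assumptions for this example.

```python
import random


def split_holdout(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(predict, test_set):
    """Accuracy of the predict function on data it was not trained on."""
    hits = sum(predict(x) == y for x, y in test_set)
    return hits / len(test_set)


def meets_quality_criterion(predict, test_set, required_accuracy):
    """Check the model against a previously agreed quality criterion."""
    return evaluate(predict, test_set) >= required_accuracy


# Hypothetical data and an already trained threshold model.
samples = [(i / 10, i >= 5) for i in range(10)]
train_set, test_set = split_holdout(samples)


def predict(x):
    return x >= 0.5


ready_for_deployment = meets_quality_criterion(predict, test_set, 0.9)
```

The training portion is returned only so that a model can be fitted on it; the decision about deployment is based on the held-out portion alone.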
Data Science Loop – Deployment
Deployment integrates the project results into business operations. Without deployment, the project generates no value and must be deemed a failure. Deployment can be as simple as regular scoring or a web service, or as complex as a big data deployment. This phase also covers possible end user interfaces for controlling the underlying data science algorithms.
Social Integration – Gain Acceptance
A solution is helpful only when put into practice. Because employing data science technologies requires the fundamental change in thinking that the Predictive Mindset describes, these technologies are often rejected. In this phase, end users are enabled through training to participate actively in the project, bring forward their own ideas, and learn to trust the technologies; in short, the Predictive Mindset is established. Most importantly, this phase covers the practical integration into the end users’ work processes. To avoid rejection, it is essential to leave the end user in control.
Social Integration – Monitor Acceptance
Deployment does not end the data science project. On the one hand, the predictions can always be improved in many respects, as the Data Science Loop shows. On the other hand, end users gain more and more experience with the technology, which enables them to refine the original use cases. This often leads to adjusted specifications for the integration into work processes and for the quality criteria of the predictions. Therefore, the end users still need a data scientist’s guidance and support, while the data scientist needs their domain expertise to identify new use cases.