Data scientists use models, statistics, and other tools to solve problems associated with big data. Researchers rely on data science methodologies to provide a framework for applying those models and tools as efficiently as the particular context requires. Without a methodology, random or spurious findings can emerge and lead to incorrect conclusions.
Some of the most commonly used data science methodologies are CRISP-DM, KDD, and SEMMA. They are used especially for analytics and data mining projects.
Cross Industry Standard Process for Data Mining (CRISP-DM) is a data mining process model that describes the approaches data miners commonly use. It remains the most widely used methodology for analytics, data mining, and other data science projects. It was conceived in 1996 and consists of six phases.
The first two phases, business understanding and data understanding, involve understanding the business and the various kinds of data it generates. For instance, suppose an e-commerce company hires a data science consultant to improve its sales. The first thing the consultant will do is identify the various data generated by the company's operations.
The next two phases deal with data preparation and modeling. These phases make use of statistics and programming tools to organize the data gathered in the first two phases. Once the data has been prepared and categorized, modeling programs run different analyses on it in various combinations.
The fifth phase evaluates the results of the modeling analyses. Here the results are examined to see whether any significant pattern emerges that could benefit the company.
The sixth and last phase deploys the models and tools developed in the previous phases to address the company's goal of improving its operations and profit.
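To make the order of the six phases concrete, here is a minimal Python sketch of the e-commerce example. It is only an illustration: the order records, the handler functions, and the "average order value per category" model are all hypothetical stand-ins, not part of CRISP-DM itself.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases, in their usual order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def business_understanding(ctx):
    ctx["goal"] = "increase sales"  # hypothetical business goal
    return ctx


def data_understanding(ctx):
    # Hypothetical order records: (category, order value); None marks a missing value.
    ctx["orders"] = [("books", 12.0), ("toys", None), ("books", 18.0), ("toys", 30.0)]
    return ctx


def data_preparation(ctx):
    # Drop records with a missing order value.
    ctx["clean"] = [(cat, val) for cat, val in ctx["orders"] if val is not None]
    return ctx


def modeling(ctx):
    # A deliberately simple "model": average order value per category.
    grouped = {}
    for cat, val in ctx["clean"]:
        grouped.setdefault(cat, []).append(val)
    ctx["avg_by_category"] = {cat: sum(v) / len(v) for cat, v in grouped.items()}
    return ctx


def evaluation(ctx):
    # Look for the pattern of interest: which category has the highest average?
    averages = ctx["avg_by_category"]
    ctx["best_category"] = max(averages, key=averages.get)
    return ctx


def deployment(ctx):
    ctx["recommendation"] = f"promote {ctx['best_category']} to {ctx['goal']}"
    return ctx


HANDLERS = {
    Phase.BUSINESS_UNDERSTANDING: business_understanding,
    Phase.DATA_UNDERSTANDING: data_understanding,
    Phase.DATA_PREPARATION: data_preparation,
    Phase.MODELING: modeling,
    Phase.EVALUATION: evaluation,
    Phase.DEPLOYMENT: deployment,
}


def run_project(handlers, context):
    """Call one handler per phase, in phase order, passing along a shared context."""
    for phase in Phase:  # Enum members iterate in definition order
        context = handlers[phase](context)
    return context


result = run_project(HANDLERS, {})
print(result["recommendation"])
```

In a real project the phases are rarely this linear, and teams often loop back from evaluation to earlier phases, but the sketch shows how each phase builds on the output of the one before it.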
However, CRISP-DM has existed for a long time without being updated to address the rise of modern big data and its complexities. Despite this, it is still useful in data science, especially for business analytics.
SEMMA (Sample, Explore, Modify, Model, Assess) is another methodology, created by the SAS Institute in North Carolina to help implement data mining applications. It works similarly to CRISP-DM: it samples the data to be analyzed, explores the sampled data, uses tools to modify and model the data as needed, and assesses the final results. However, because SEMMA is designed for use within SAS software, data science users may find it difficult to apply on other platforms and software.
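The five SEMMA steps can be sketched outside SAS as well. The Python example below is a rough illustration using only the standard library; the data set (advertising spend versus sales), the function names, and the choice of a least-squares line as the model are assumptions made for this sketch and are not taken from SAS.

```python
import random
import statistics

# Hypothetical data: (advertising spend, sales) pairs, with one missing sales value.
DATA = [(x, 2.0 * x + 1.0) for x in range(1, 21)]
DATA[4] = (5, None)


def sample(rows, k, seed=0):
    """Sample: draw k rows at random (seeded so the run is repeatable)."""
    return random.Random(seed).sample(rows, k)


def explore(rows):
    """Explore: summarize the sample before changing anything."""
    ys = [y for _, y in rows if y is not None]
    return {"n": len(rows), "missing": len(rows) - len(ys), "mean_y": statistics.mean(ys)}


def modify(rows):
    """Modify: here, simply drop rows with a missing target value."""
    return [(x, y) for x, y in rows if y is not None]


def model(rows):
    """Model: fit y = a * x + b by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    a = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b


def assess(rows, a, b):
    """Assess: mean squared error of the fitted line on the given rows."""
    return statistics.mean((y - (a * x + b)) ** 2 for x, y in rows)


train = sample(DATA, k=10)
summary = explore(train)
slope, intercept = model(modify(train))
held_out = modify([row for row in DATA if row not in train])
error = assess(held_out, slope, intercept)
print(summary, slope, intercept, error)
```

Assessing on rows that were not part of the sample is a design choice of this sketch; it gives a fairer picture of the model than scoring it on the data it was fitted to.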
KDD (Knowledge Discovery in Databases) is another popular data mining process, used to analyze large databases and extract knowledge from them.
Together, these three methodologies are said to account for about 60% of the methodologies used by data science practitioners. New methodologies have yet to emerge in large numbers. Beyond these three, practitioners have developed their own methodologies for specific projects. This has led to a list of best practices rather than a single streamlined standard methodology.
Two new methodologies that have emerged to address modern data science issues are PFA (Portable Format for Analytics) and the Decision Model & Notation Standard.
PFA functions as a language for describing analytical models in an interchangeable way. It aims to be independent of tools that are tied to particular platforms. Different analytical and statistical models can be imported into PFA and exported from it. This gives data science users greater freedom to implement tools and models from different domains.
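As a rough illustration of what "a language that describes models" means, a PFA document is plain JSON that declares an input type, an output type, and an action. The Python sketch below builds a very small document of that shape (add 100 to the input) and runs it through a toy evaluator. The toy_evaluate function is hypothetical and written only for this example: it understands nothing but the "+" operation and is in no way a real PFA engine.

```python
import json

# A minimal PFA-style document: take a double, add 100, return a double.
PFA_DOCUMENT = json.dumps({
    "input": "double",
    "output": "double",
    "action": [{"+": ["input", 100]}],
})


def toy_evaluate(document, value):
    """Hypothetical toy evaluator: supports only {"+": [a, b]} expressions.

    A real PFA engine handles types, many functions, and persistent state;
    this sketch only shows that the model travels as data, not as code.
    """
    doc = json.loads(document)
    result = None
    for expression in doc["action"]:
        op, args = next(iter(expression.items()))
        if op != "+":
            raise NotImplementedError(f"toy evaluator does not support {op!r}")
        # The symbol "input" refers to the current input value; other args are literals.
        operands = [value if arg == "input" else arg for arg in args]
        result = sum(operands)
    return result  # the last expression in the action is the output


print(toy_evaluate(PFA_DOCUMENT, 3.0))
```

Because the document is just a JSON string, it could be written by one tool and read by another, which is the kind of platform independence PFA aims for.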
The Decision Model & Notation Standard (DMN) is a standard rather than a methodology. It is promoted as enabling data science users across organizations to access decision models and implement them according to their needs. This removes some of the restrictions found in methodologies such as SEMMA, where models that are not present in the SAS software cannot be imported by users.