Data Mining – A Brief Description
Data mining is the process of analyzing large sets of data to discover useful patterns, links, relationships, and correlations among the data. The extracted information is summarized and used for specialized tasks. Data mining is also known as knowledge discovery in databases (KDD). Computational statistics, machine learning, and artificial intelligence algorithms form the core principles of a data mining process. Various techniques and tools are used to perform data mining on large, unorganized data sets. Verifying the results of a data mining process is crucial to avoid unintended, or even intended, manipulation of the data. The primary users of data mining work in sectors with a strong customer focus, such as retail, finance, communications, FMCG, healthcare, and marketing.
A Typical Data Mining Process
A data mining process will typically involve the following tasks, although the exact steps vary with the data miner's requirements and the task at hand. A minimal code sketch of two of these steps follows the list.
Discovering anomalies in the data that indicate unusual patterns in need of further investigation
Searching for relations among variables within the analyzed data
Clustering, or grouping together records whose attributes are similar to each other
Classifying the clustered data into generalized, predefined categories
Regression, which refers to finding a function that models the data with the least error, so that future values can be estimated
Summarizing the useful data discovered with visualizations and reports
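As a rough illustration, the clustering and summarization steps above might look like the following sketch. It assumes scikit-learn is available, and the synthetic data and parameter choices are purely illustrative.

```python
# A minimal sketch of the clustering and summarization steps above,
# using scikit-learn on synthetic data. All parameters are illustrative.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate an unlabeled data set (a stand-in for raw business data).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Cluster records that share similar attributes.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Summarize each discovered group with simple statistics.
for cluster_id in range(3):
    members = X[labels == cluster_id]
    print(f"Cluster {cluster_id}: {len(members)} records, "
          f"mean = {members.mean(axis=0).round(2)}")
```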
Data Mining Techniques
There are certain standards in the data mining field that are followed by data miners when applying processes to a task or creating predictive models. The Cross-Industry Standard Process for Data Mining (CRISP-DM) is used by a majority of data miners to create models that suit their needs. Another standard used to create algorithms and models for data mining is Sample, Explore, Modify, Model, and Assess (SEMMA).
When data miners create models focused on predictive analysis, they usually follow the Predictive Model Markup Language (PMML) standard, an XML-based format for describing and exchanging predictive models. It is especially common in the business analytics sector, where the large and complex nature of consumption data demands predictive analyses of both high quantity and quality.
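One possible workflow is sketched below: training a model and exporting it to a PMML document. It assumes the third-party sklearn2pmml package is installed (which in turn requires a Java runtime); the model choice and file name are illustrative.

```python
# Hypothetical sketch: exporting a scikit-learn model to PMML.
# Assumes the third-party sklearn2pmml package is installed (it also
# requires a Java runtime); model and file name are illustrative.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# Wrap the model in a PMML-aware pipeline, then fit and export it.
pipeline = PMMLPipeline([("classifier", DecisionTreeClassifier(max_depth=3))])
pipeline.fit(X, y)
sklearn2pmml(pipeline, "model.pmml")  # writes a portable PMML document
```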
In terms of the common data mining algorithms used today, two major divisions can be drawn: classical techniques and next-generation techniques.
1) The classical techniques predate the digital or computer age. They include statistics, data counting and probability, clustering, nearest-neighbor methods and regression analysis for prediction, and histograms for summarizing data.
Statistics here means traditional statistical theory, whose methods can analyze small sets of data to provide useful information to the user. It can be considered an early form of data mining, since it began before computers existed and the work was done manually by statisticians. Data counting and probability also work with small data sets, using mathematical results such as Bayes' theorem. These traditional methods are being adapted into computational statistics; an example is the Bayesian classification used in data mining software to verify the results obtained after a data mining process.
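To make Bayesian classification concrete, here is a minimal sketch assuming scikit-learn's naive Bayes implementation; the data set and train/test split are illustrative.

```python
# A minimal sketch of Bayesian classification, assuming scikit-learn's
# naive Bayes implementation; data and split are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit class-conditional Gaussians and apply Bayes' theorem to classify.
model = GaussianNB().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```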
Prediction analysis was used extensively in the stock market, and clustering and linear regression were used to compare variables in a data set and examine their relationships. In their classical forms, these methods are becoming outdated because they cannot cope with the dynamic and complex data sets generated in the digital era.
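A small sketch of linear regression for prediction follows, with synthetic data standing in for historical figures; the coefficients and sample sizes are illustrative assumptions.

```python
# A small sketch of linear regression for prediction; the synthetic
# data stands in for historical stock or business figures.
import numpy as np
from sklearn.linear_model import LinearRegression

# One explanatory variable x and a noisy linear response y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * x.ravel() + 1.0 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction at x=12:", model.predict([[12.0]])[0])
```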
Histograms provide an easy visual reference for summarized results, but their inability to capture large, dynamic digital data sets limits their application today.
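For completeness, a minimal histogram sketch is shown below, assuming matplotlib; the data are synthetic and the bin count is an arbitrary choice.

```python
# A minimal histogram sketch with matplotlib; the data are synthetic.
import numpy as np
import matplotlib.pyplot as plt

values = np.random.default_rng(0).normal(loc=50, scale=10, size=1000)

plt.hist(values, bins=20, edgecolor="black")
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Summary histogram of a sample variable")
plt.show()
```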
2) The next-generation techniques almost all include a component of computer programming in creating models or algorithms for data mining. They generally either discover new information within large databases or build predictive models. They are distinguished from the classical techniques in that they have mostly been developed in the past two decades, and they are what the news media usually means when it mentions data mining. Decision trees and neural networks are the most frequently mentioned and most widely used next-generation techniques.
Decision trees are predictive models that extract structure from data. As the data is processed, a tree is built: each internal node splits part of the data with a classification question, and each leaf holds a final sub-classification of that branch's records. The result segments the data into a simple, easy-to-use model. Decision trees are especially helpful for targeting and attracting customers; for example, a tree can determine the level a marketing offer must reach before a customer is attracted and chooses to buy a product. A sketch of this idea follows.
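The sketch below assumes scikit-learn; the two features (offer level and past spend) and the tiny data set are invented for illustration only.

```python
# A sketch of a decision tree for a marketing-style question, assuming
# scikit-learn; the features and data are invented for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: discount offered (%), customer's past annual spend ($).
X = np.array([[5, 100], [10, 120], [20, 90], [25, 400],
              [5, 500], [15, 450], [30, 80], [10, 300]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])  # 1 = customer bought

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the branching questions and leaf classifications.
print(export_text(tree, feature_names=["discount_pct", "past_spend"]))
```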
Neural networks are the other technique that has gained fame and popularity among data miners, especially those employed in business analytics. Neural networks function loosely like a biological brain; in practice they are artificial neural networks implemented as computer programs. These programs are sophisticated and enable the model or algorithm to learn from previous experience, which is why they are also termed machine learning algorithms. Although they are powerful predictive modeling techniques for businesses, they have not yet been deployed on a large scale because they are expensive and complex to apply correctly to a data set. Two strategies have been explored with some success to overcome these shortcomings: forming a neural network model for a single well-defined problem, and including expert consultants with domain knowledge when creating the network.
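As a compact illustration of a neural network that learns from labeled examples, the sketch below assumes scikit-learn's MLPClassifier; the architecture, data set, and hyperparameters are illustrative choices, not a recommended configuration.

```python
# A compact sketch of a neural network classifier, assuming
# scikit-learn's MLPClassifier; architecture and data are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Scale inputs, then train a small feed-forward network that learns
# from previous experience (the labeled training examples).
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```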