Data preprocessing methods for data mining have been revisited in recent years in light of the high volume, velocity, and variety of data that demand new high-performance processing. Successful processing and analysis of big data involves a large computational infrastructure and is a challenging, time-demanding task. This work covers the definition, characteristics, and categorization of data preprocessing approaches in big data, and examines the close connection between big data and data preprocessing across all families of methods and big data technologies. It includes developments on different big data frameworks, such as Hadoop, Spark, and Flink, and encourages substantial research efforts in some families of data preprocessing methods and in applications on new big data learning paradigms.
Raw Data
A large amount of raw data surrounds us in our world, data that cannot be directly treated by humans or manual applications, since the current volume of data managed by our systems has surpassed the processing capacity of traditional systems. Technologies such as the World Wide Web, engineering and science applications and networks, business services, and many more generate data at an exponential rate as a result of the development of powerful storage and connection tools. The burgeoning growth of new technologies and services like cloud computing, as well as the reduction in hardware prices, is leading to an ever-growing rate of information on the Internet. Because of this huge data growth, organized knowledge and information cannot be easily obtained, understood, or automatically extracted. These premises have led to the development of data science, or data mining: a well-known discipline that is increasingly present in the current Information Age.
Impact of Emerging Technologies
The huge impact of emerging technologies clearly poses a big challenge for the data analytics community. Big Data can thus be defined as data of very high volume, velocity, and variety that requires new high-performance processing. Even before the advent of the Big Data phenomenon, distributed computing systems were widely used by data scientists.
Therefore, with the intention of easing the learning process, a number of standard, time-consuming algorithms were replaced by distributed versions. Although large-scale processing in Big Data raises many issues, the new platforms try to bring distributed technologies closer to standard users, such as engineers and data scientists, by hiding the technical nuances derived from distributed environments.
Creating and maintaining these distributed-computing platforms requires complex designs. On the other hand, Big Data platforms also require additional algorithms that support relevant tasks, such as big data preprocessing and analytics. Standard algorithms for those tasks must also be re-designed if we want to learn from large-scale datasets. This is not trivial and presents a big challenge for researchers.
MapReduce
MapReduce was arguably the first framework to enable the processing of large-scale datasets. With this revolutionary tool, huge datasets are processed in an automatic and distributed way. By implementing two primitives, Map and Reduce, the user obtains a scalable and distributed tool without worrying about technical nuances such as failure recovery, data partitioning, or job communication.
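The two primitives can be illustrated with the classic word-count example. The sketch below is a minimal, single-process imitation of the MapReduce model, not a real framework: in Hadoop or Spark, the map, shuffle, and reduce phases run distributed across a cluster, with the framework handling partitioning, communication, and failure recovery. All function names here are illustrative choices, not part of any actual API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (key, value) pairs -- here, (word, 1) for each word."""
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key (handled by the framework itself)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the values for one key -- here, summing the counts."""
    return key, sum(values)

def word_count(documents):
    pairs = (p for doc in documents for p in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data needs big tools", "data tools"])
print(counts)  # {'big': 2, 'data': 2, 'needs': 1, 'tools': 2}
```

The user only writes the map and reduce logic; everything between them (the shuffle, plus distribution and fault tolerance in a real deployment) is the framework's responsibility, which is precisely what makes the model attractive to non-specialists.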
Current Scenario
The present scenario in big data preprocessing focuses on the size, variety, and velocity of data, which are huge and continue to increase every day. The Big Data frameworks that can be employed to store, process, and analyze data have changed the context of knowledge discovery from data, especially the processes of data mining and data preprocessing. A healthy collaboration among researchers, practitioners, and data scientists can be expected in the future to guarantee the long-term success of big data preprocessing and to collectively explore new domains.