According to Statista, the online statistics portal, big data refers to large and complex data sets that are difficult to understand or process with traditional analysis tools and applications. The term is less about what is in the data and more about what one can do with it. A data scientist or analyst applies their skills to these large, complex data sets to draw useful conclusions and inferences.
Before a data scientist or analyst can perform big data analysis, however, they must possess three skills that form the pillars of data science: statistics, domain knowledge, and computer science. Domain knowledge tells the researcher or analyst which questions are worth asking in a given context. Computer science skills help in collecting, organizing, and preparing the required data for analysis. Statistics is the means by which a researcher or analyst questions the prepared data to obtain relevant answers. Without statistics, one may not know whether an analysis produced by computational applications is free of bias or other problems.
The American Statistical Association’s community feels that the role of the statistician has been undervalued and sidelined by data science, which appears to lean towards an almost purely computational analysis of big data. In their view, the next generation of statisticians must develop applied statistics tools and applications adapted to the fast-paced digital big data age.
Since statistics is a scientific discipline that emphasizes learning from data, statistical thinking helps people understand the significance of gathering data, analyzing and interpreting it, and finally reporting the results. The rise in the number of analytics jobs that require knowledge of statistics bears this out. The better people grasp statistics and its application in data science, the more value they can obtain from big data sets. Importantly, people analyzing big data need not have an in-depth understanding of statistical concepts; a good grasp of the basics is sufficient.
One popular example cited to highlight the relevance of statistics to big data is sampling error. We are almost never able to access the entire population we want to analyze; instead, we rely on a sample of that population to answer questions about it. Statistical concepts like sampling error clarify whether the results obtained from the sample are what you can expect in the whole population too.
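As a minimal sketch of the idea in Python with NumPy (the lognormal "population" and all numbers here are invented purely for illustration), one can draw many samples from a synthetic population and see that the spread of the sample means closely matches the standard error predicted by theory:

```python
import numpy as np

# Hypothetical illustration: treat one million simulated incomes as the
# "population" we usually cannot observe in full.
rng = np.random.default_rng(seed=42)
population = rng.lognormal(mean=10.0, sigma=0.5, size=1_000_000)

n = 500               # size of each sample we can actually afford to collect
num_samples = 2_000   # repeat the sampling to see how the estimates vary

# Each sample mean differs from the true population mean:
# that gap is the sampling error.
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(num_samples)
])

print(f"True population mean:       {population.mean():,.2f}")
print(f"Spread of sample means:     {sample_means.std(ddof=1):,.2f}")
# Theory predicts the spread is close to sigma / sqrt(n).
print(f"Theoretical standard error: {population.std(ddof=1) / np.sqrt(n):,.2f}")
```

The smaller the spread of the sample means relative to what the question demands, the more confidently results from a single sample can be generalized to the whole population.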
Another concept in statistical analysis considered among the most fundamental is confounding: spurious correlations that emerge during data analysis. These can arise from the way the data was measured or from false extrapolations from a sample’s results. Whenever new results are obtained through computational applications, it is advisable to test them with statistical analysis tools to check for confounding.
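The sketch below (Python with NumPy; the temperature/ice-cream/sunburn story is a made-up stand-in for any hidden driver) simulates a confounder z that causes both x and y. The two variables correlate strongly even though neither influences the other, and the correlation vanishes once z is adjusted for:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000

# Hypothetical confounder: e.g., outside temperature driving both
# ice-cream sales (x) and sunburn cases (y). Note that x and y never
# influence each other directly.
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

# Naive analysis: x and y look strongly related.
print(f"corr(x, y)     = {np.corrcoef(x, y)[0, 1]:.3f}")   # roughly 0.85

# Adjust for z by regressing it out of both variables, then correlating
# the residuals (a simple partial correlation).
x_resid = x - np.polyval(np.polyfit(z, x, 1), z)
y_resid = y - np.polyval(np.polyfit(z, y, 1), z)
print(f"corr(x, y | z) = {np.corrcoef(x_resid, y_resid)[0, 1]:.3f}")  # near 0
```

Regressing the confounder out of both variables and correlating the residuals is only one simple way to adjust; the point is that the raw correlation alone would have been misleading.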
Another tempting but mistaken use of applied statistics in big data is to become attached to a particular tool or application and force every problem and data set through it. This approach is fraught with danger because no single application or tool can reliably assess and analyze all data sets.
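One way to see the danger, sketched below in Python with NumPy on synthetic data invented for illustration: Pearson correlation, applied as the one tool for every data set, reports almost no association even when the data has a strong but nonlinear relationship.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.uniform(-3, 3, size=5_000)

# A strong but nonlinear relationship: y depends almost entirely on x.
y = x**2 + rng.normal(scale=0.1, size=x.size)

# The "one tool" habit: Pearson correlation reports almost nothing,
# because it only measures linear association.
print(f"Pearson corr: {np.corrcoef(x, y)[0, 1]:.3f}")   # close to 0.0

# A second look, e.g. binning (or simply plotting the data),
# reveals the structure at once.
for lo, hi in [(-3, -1), (-1, 1), (1, 3)]:
    mask = (x >= lo) & (x < hi)
    print(f"mean(y) for x in [{lo}, {hi}): {y[mask].mean():.2f}")
```

A different tool, or even a quick plot, would have exposed the structure immediately; no single summary statistic is safe to apply blindly everywhere.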
In conclusion, while statistics plays a pivotal role in big data analysis, it alone is not what big data is all about. Domain knowledge and computer science skills matter in equal measure for understanding big data effectively.