Big data is defined as data that cannot be processed by conventional means and techniques because of its large size and complexity. The "big" in Big Data can mean petabytes or even exabytes of data. It is also estimated that nearly 90% of the data we have today was generated in the last two years. This suggests that the exponential growth in data volume will continue, and will accelerate further with the advent of the IoT (Internet of Things).
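To make the growth claim concrete, here is a minimal sketch of the arithmetic behind it. It assumes the 90% estimate is taken literally and that growth is a constant yearly multiple. The function name is hypothetical and chosen only for illustration.

```python
def implied_annual_growth(share_recent: float, years: float) -> float:
    """Yearly growth factor implied by the share of data created
    during the most recent `years` years (assumes steady growth)."""
    # If share_recent of today's data is new, (1 - share_recent) of it
    # already existed, so the total grew by 1 / (1 - share_recent).
    total_growth = 1.0 / (1.0 - share_recent)
    # Spread that total growth evenly over the period.
    return total_growth ** (1.0 / years)


factor = implied_annual_growth(0.90, 2)
print(f"Implied growth: about {factor:.2f}x per year")
```

Under these assumptions the total volume grows tenfold in two years, which works out to a little over three times per year.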
Big Data is mainly used for Data Analytics, which yields meaningful insights into a company's operations. It helps in business –
Infrastructure is the cornerstone of any Big Data architecture: before Data Scientists can analyze the data, it needs to be stored and processed. Gartner defined big data in terms of three Vs: volume, velocity, and variety. Source: gartner.com
Many companies fail to realize the full benefits of their Data Analytics initiatives because they lack a suitable, scalable architecture. The success of a Big Data infrastructure depends on its ability to
To handle high volumes, data storage should be elastic and scalable. You must be able to add storage modules without disrupting operations. Cloud-based storage is a good idea for most businesses because it reduces upfront investment. There are no physical systems on site, which means savings in space and power consumption. It also shifts part of the data security burden to the provider. You need smart tools to enable virtualization (to add capacity quickly when required) and to carry out data compression. You also need an object-based storage architecture to handle a large number of files.
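The two storage ideas above, compression and object-based storage, can be sketched together. This is a toy in-memory illustration, not a real storage product: the `ObjectStore` class and its methods are hypothetical, and using a content hash as the key is an assumption made to keep the example short (real object stores typically let the client choose the key).

```python
import hashlib
import zlib


class ObjectStore:
    """Toy object store: a flat key -> compressed bytes mapping (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # Flat namespace: the object is addressed by a key, not a directory path.
        key = hashlib.sha256(data).hexdigest()
        # Compress before storing to reduce the space used.
        self._objects[key] = zlib.compress(data)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self._objects[key])

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._objects.values())


store = ObjectStore()
payload = b"sensor=42;status=ok\n" * 1000  # repetitive data compresses well
key = store.put(payload)
print(len(payload), "bytes in,", store.stored_bytes(), "bytes stored")
```

Because objects live in a flat key space rather than a nested file hierarchy, adding more of them does not deepen any directory tree, which is one reason the approach suits very large numbers of files.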
Big Data infrastructure needs to process and deliver large amounts of data in real time and at high speed. That means latency (the delay before a response) must be kept low. To deliver on this promise, the infrastructure needs massive processing power and high-speed connectivity. It also needs high IOPS (input/output operations per second), which can be delivered by server virtualization and the use of flash memory.
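The relationship between IOPS and delivered throughput is simple arithmetic: throughput equals IOPS multiplied by the size of each operation. The sketch below shows the calculation in both directions. The function names and the sample figures are hypothetical and are not measurements of any real device.

```python
import math


def throughput_mb_per_s(iops: int, block_size_kb: int) -> float:
    """Throughput in MB/s for a given IOPS rate and block size."""
    return iops * block_size_kb / 1024


def min_iops(target_mb_per_s: float, block_size_kb: int) -> int:
    """Smallest IOPS rate that sustains the target throughput."""
    return math.ceil(target_mb_per_s * 1024 / block_size_kb)


# Hypothetical figures, for illustration only.
print(throughput_mb_per_s(10_000, 4))  # 10,000 IOPS at 4 KB per operation
print(min_iops(500, 4))                # IOPS needed for 500 MB/s at 4 KB
```

Small block sizes make the IOPS requirement climb quickly, which is why workloads with many small reads and writes benefit most from flash memory.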
Big Data infrastructure must support the comparison of disparate data sets. It is important to be able to cross-reference data sets from different platforms. Hadoop has become an important part of Big Data infrastructure plans. It is an open-source framework for storing and analyzing large volumes of data. The main advantage of Hadoop is its cost and time effectiveness. Firstly, it is free since it is open source, and secondly, it can run on cheap commodity hardware. It saves time because it splits the work into many smaller data sets and processes them simultaneously. However, open-source software has its own drawbacks, and therefore many companies offer premium packages with better security and support.
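The idea of processing many smaller data sets simultaneously can be illustrated with a map-and-reduce style word count. This is a minimal single-machine sketch using Python's standard library, not Hadoop's actual API. The function names are hypothetical, and threads stand in for the separate machines a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n):
    """Split the input into at most n roughly equal chunks."""
    size = max(1, -(-len(lines) // n))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def map_count(chunk):
    """Map step: count words within one chunk only."""
    return Counter(word for line in chunk for word in line.split())


def word_count(lines, workers=4):
    chunks = split_into_chunks(lines, workers)
    # Each chunk is counted independently, so the work can run side by side.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_count, chunks))
    # Reduce step: merge the partial counts into one result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


lines = ["big data big insight", "data at scale", "big scale"]
counts = word_count(lines, workers=2)
print(counts.most_common(3))
```

The key property is that no chunk depends on any other during the map step, which is what allows the work to be spread across many inexpensive machines.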
Another important component of Big Data infrastructure is NoSQL (Not only SQL). Unlike their relational predecessors, NoSQL databases can work with dynamic and semi-structured data at higher speeds, which makes them well suited to a Big Data environment.
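To show what semi-structured data means in practice, the sketch below stores records whose fields differ from one record to the next, which a fixed relational schema would not accept without changes. It uses plain Python dictionaries rather than any real NoSQL database. The sample records and the `find` helper are hypothetical.

```python
import json

# Hypothetical event records: each one carries a different set of fields.
raw_records = [
    '{"id": 1, "type": "click", "page": "/home"}',
    '{"id": 2, "type": "purchase", "amount": 19.99, "items": ["book"]}',
    '{"id": 3, "type": "click", "page": "/cart", "referrer": "email"}',
]

documents = [json.loads(raw) for raw in raw_records]


def find(docs, **criteria):
    """Return documents whose fields match all criteria.

    A missing field simply fails to match, so no schema is needed up front.
    """
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


clicks = find(documents, type="click")
print([doc["id"] for doc in clicks])
```

New fields such as `referrer` can appear at any time without a schema migration, which is the flexibility the paragraph above describes.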
Data security plays an important role in any Big Data infrastructure plan. Data Analytics is no longer the prerogative of IT and Data Scientists alone. Data is increasingly accessed by line managers for their own analysis. As more people access the data, security must address both data integrity and the protection of sensitive information.
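The two concerns named above, integrity and protection of sensitive information, can each be illustrated in a few lines. This is a simplified sketch, not a complete security design: the field names, the sample record, and both helper functions are hypothetical.

```python
import hashlib
import json

# Hypothetical list of fields that analysts should not see in clear text.
SENSITIVE_FIELDS = {"email", "card_number"}


def checksum(record: dict) -> str:
    """Integrity: a digest that changes if any field of the record changes."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def masked_view(record: dict) -> dict:
    """Protection: a copy of the record with sensitive fields hidden."""
    return {
        field: ("***" if field in SENSITIVE_FIELDS else value)
        for field, value in record.items()
    }


record = {"customer": "C-1001", "email": "a@example.com", "total": 42.5}
stored_digest = checksum(record)

view = masked_view(record)            # what a line manager would see
tampered = dict(record, total=4250.0)  # simulated unauthorized change

print(view)
print(checksum(tampered) == stored_digest)
```

Storing the digest alongside the record lets any later reader detect a modification, while the masked view lets more people analyze the data without exposing the sensitive fields.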
In conclusion, the success or failure of any Big Data initiative depends on the right investment in appropriate technologies for data collection, data storage, data analysis, and data visualization/output.