What is Big Data?
With the invention of the internet, the world suddenly propelled itself into a future in which an incredible amount of user-generated data is collected, processed, and stored every day. Industries, businesses, and governments rely heavily on this data to function at full efficiency or, in some cases, to function at all. This essay explores what Big Data is, how it is relevant to us, and what its components are.
Everything from the items in your shopping cart to endless search keywords to the latitudes and longitudes of known geographical locations is stored in databases so that the internet can function the way we, in the modern age, have grown accustomed to. So what happens to these terabytes upon terabytes of data? How does it affect the end-user? How is it processed once it reaches its destination? What do companies even do with such data? These questions knock on the door of a very specific technology, not new to humankind but of immense significance: Big Data.
Structured, Semi-Structured and Unstructured Data
Setting aside the highly technical definitions of a complex topic such as Big Data, let us instead think of it as the process of computerizing the task of making sense of the incredible amount of data being generated and collected. An organization can then use the transformed data to predict the outcomes of actions it is deliberating, typically by feeding that data to machine learning algorithms.
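Collecting the data is only the first step; discovering useful patterns in it is commonly called data mining. The data comes in three types: structured, semi-structured, and unstructured. Structured data follows a fixed schema, such as rows in a relational table. Semi-structured data carries its own labels but no rigid schema, as in JSON or XML. Unstructured data, such as free text, images, and video, has no predefined organization. Engineers have developed a sizeable body of mathematics and computer engineering to manage the collected information efficiently. All over the world, Big Data has become increasingly common in data management architectures as the volume of generated data keeps growing.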
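To make the three types concrete, here is a minimal Python sketch that reads one small sample of each. The order table, the JSON records, and the review sentence are invented purely for illustration, and the keyword matching at the end is only a stand-in for the far heavier analysis that real unstructured data needs.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row follows the same schema.
structured_csv = "order_id,item,price\n1,keyboard,49.90\n2,mouse,19.50\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))
total = sum(float(r["price"]) for r in rows)

# Semi-structured: self-describing keys, but fields may vary per record.
semi_structured = '[{"user": "ana", "tags": ["sale"]}, {"user": "raj"}]'
records = json.loads(semi_structured)
tag_counts = [len(rec.get("tags", [])) for rec in records]

# Unstructured: free text with no schema; structure must be extracted.
unstructured = "Loved the keyboard, but the mouse arrived late."
words = re.findall(r"[a-z]+", unstructured.lower())
mentions = [w for w in words if w in {"keyboard", "mouse"}]

print(round(total, 2), tag_counts, mentions)
```

Notice how the amount of work needed to get an answer grows as the structure loosens: the table can be summed directly, the JSON needs a default for the missing field, and the text has to be tokenized before anything can be counted.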
We characterize Big Data according to three Vs: Volume, Variety, and Velocity. Let us look at each briefly.
Volume is the quantity of generated and stored data. The more data a prediction model has at its disposal, the more accurate its predictions usually are. The size of the data thus largely determines its value, in terms of how much insight it can offer into a particular subject.
Variety refers to the type and nature of the data, for example whether it is text, an image, audio, or video.
Velocity is the speed at which data is generated at the source (generally the end-user), processed, and stored. With modern demands, data often needs to be handled as it is generated, i.e., in real time, which is far more complicated than it sounds. The difficulty lies in the sheer volume of data that needs processing every second, arriving from many users simultaneously, which makes for an extremely challenging task even for a very powerful computer. Despite the challenges, real-time data processing for Big Data is well established today.
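One common way to handle data as it arrives, rather than storing everything first, is to keep only a sliding window of recent events. The sketch below is a simplified single-machine illustration in Python, not a production streaming system. The class name `SlidingWindowCounter`, the 10-second window, and the sample timestamps are all hypothetical, and the code assumes events arrive in time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.timestamps = deque()

    def add(self, timestamp):
        """Record one event and return the current count in the window."""
        self.timestamps.append(timestamp)
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        return len(self.timestamps)


counter = SlidingWindowCounter(window_seconds=10)
event_times = [0, 2, 5, 11, 12, 30]  # seconds, assumed to arrive in order
counts = [counter.add(t) for t in event_times]
print(counts)  # [1, 2, 3, 3, 3, 1]
```

Because old events are discarded as new ones arrive, memory use depends on the event rate within the window rather than on the total amount of data ever seen, which is what makes this style of processing workable at high velocity.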
These three fundamental characteristics are not the complete story, however. As the technology has progressed, more have been added, namely Veracity, Exhaustive, Fine-Grained and Uniquely Lexical, Relational, Extensional, Scalability, Value, and Variability. Since they do not serve an immediate purpose for a basic understanding of Big Data, these terms are outside this article’s scope.
At this juncture, you might already have a faint idea of what Big Data is and of the inevitable challenges that might hurt its effectiveness, as with any new or radical technology. So let us discuss these challenges, along with the ways engineers design solutions around them.
Challenges – Big Data
The biggest challenge is handling the inflow of a tremendous amount of data efficiently and in real time at any given moment. It is such a demanding task that it can easily overwhelm a single server or even a server cluster. We therefore achieve the required performance by having hundreds or thousands of servers or server clusters work collaboratively to process the data quickly, using technologies like Apache Spark or Hadoop.
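The idea of many servers each processing a slice of the data and then combining their results can be shown in miniature. The Python sketch below imitates, on a single machine, the map and reduce steps popularized by Hadoop MapReduce. It does not use any Spark or Hadoop API; the function names `map_partition` and `merge_counts` and the sample sentences are hypothetical, and each list in `partitions` stands in for the data held by one server.

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count words within one partition, independently of others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Each inner list stands in for the slice of data stored on one server.
partitions = [
    ["big data is big", "data moves fast"],
    ["fast data is useful"],
]

partials = [map_partition(p) for p in partitions]
word_counts = reduce(merge_counts, partials, Counter())
print(word_counts["data"], word_counts["big"], word_counts["fast"])  # 3 2 2
```

Since `map_partition` never looks outside its own partition, the map step could run on as many machines as there are partitions. Only the much smaller partial counts need to travel over the network to be merged.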
Servers capable of such performance and storage are generally niche enterprise hardware, which is anything but cheap. This leads us to the second downside of implementing Big Data: cost. Purchasing adequately powerful equipment and sufficiently large storage, often spread across many locations around the world, is naturally an extremely capital-intensive affair. Engineers address this through public cloud computing and storage. Here, generally, a single organization purchases the necessary hardware for all the computing and storage. Other organizations can then buy processing time and storage from that provider as needed, which brings down the cost for every party involved.
Getting all this data to the analysts and scientists is also a challenge due to the distributed nature of the data. To address this, leading researchers are working to build data catalogues that incorporate metadata management and data lineage functions.
- Techniques for analyzing data, such as machine learning and natural language processing.
- Business intelligence, cloud computing, and databases used to process and store collected data.
- Finally, data visualization through charts, graphs, etc.
Real-time or near-real-time handling of data is an essential feature of any Big Data system, so latency in the connection is a significant issue. We therefore always attempt to reduce latency wherever possible and to whatever extent is practical.
The big players in cloud computing services for Big Data applications include Amazon EMR, Microsoft Azure HDInsight, and Google Cloud Dataproc. These services often use specialized file systems and databases for their specific use cases, including the Hadoop Distributed File System (HDFS), Amazon Simple Storage Service (S3), NoSQL databases, and relational databases.
Organizations that deploy the servers and storage on-premises, however, generally make extensive use of Apache Spark and Hadoop. Other open-source technologies used in conjunction include YARN, MapReduce, Kafka, HBase, and SQL-on-Hadoop query engines like Drill, Hive, Impala, and Presto.
How is Big Data relevant to us? Let us answer that question from a business perspective. The data you generate through your day-to-day activities over the internet is processed using Big Data technologies. Such information might include the products you purchase, the pages you like, the websites you visit, and so on. All this data is nothing but information about your likes and dislikes. Companies can use this information to push targeted ads to your device about things you have purchased in the past or even shown some interest in. Companies might also use it to provide you with personalized customer service based on accurate details about your specific problem.
All of this combined makes for a user experience tailored to your taste from the ground up, across marketing, sales, and customer support. It also goes a long way in helping companies understand what their users want better than ever before, giving them valuable insights into how market trends change over time, strengthening their hold on the market, and bolstering their profitability.
Medical researchers can also use real-time analysis to better identify disease risk factors and give doctors more information to work with, so that a diagnosis can be made for each patient on an individual basis. Finance companies also use Big Data to monitor market conditions in real time, allowing them to make smarter, better-informed financial decisions or provide advice.
Governments use intelligence from news, social media sites, etc., to gauge the effect of their governance on public sentiment at a never-before-seen pace, allowing them to make better decisions in line with general opinion. Big Data also finds use in emergency response, crime prevention, and smart city initiatives. Manufacturing and transport companies have likewise been increasing their use of this technology to manage supply chains and logistics and to optimize delivery routes.
In conclusion, the technology serves to improve the lives of everyday citizens by helping companies tailor their services on a per-user basis, improving your experience of interacting with services and products and making your life easier and more comfortable. It is also of great utility in giving companies an edge over others in a modern market of breakneck competition. Market trends are also better monitored using Big Data technology, allowing for swift action when required. The benefits it offers in health, crisis and risk management, and outbreak monitoring help governments, doctors, and other concerned authorities respond quickly to mitigate a problem. Governments now get real-time information about public sentiment, bolstering their capability to make better decisions.
None of this discounts the various, often downright wicked and criminal, ways in which the power and intelligence that Big Data provides is misused, through privacy breaches, leakage of sensitive information, and a multitude of other means. Thus, it is safe to say that Big Data, though incredibly powerful in the right hands, still has a long way to go in terms of the laws regulating it and the security features preventing its misuse.