Big Data for Non Geeks Only

In the spring of 1999 (14 years ago), I had the enviable opportunity to work on the largest database in the world. As a database programmer and administrator, it was a dream come true. It was for a software company in Redmond, Washington, and the project was the WWMDB, the Worldwide Web Marketing Database. The database was a terabyte in size, running over fiber optics, and it was the latest and greatest at the time.

Fast forward to the present, and you can go to your local computer store or online retailer and buy a one-terabyte portable USB hard drive for $89. Times certainly have changed. Moore’s law just keeps going and going, like a great game of Battlefield. If you are unfamiliar with Moore’s law, it essentially states that computer processing power doubles every 18 to 24 months. Nowadays it seems almost trite to talk about a terabyte; we now talk in terms of petabytes, exabytes, and zettabytes. It all seems so crazy. And over the last two years a term has emerged that carries just as much ambiguity as magic and unicorns. What is that term? Big Data.

Just what the heck is Big Data?

As a database programmer and administrator, I found myself deeply intrigued.

Big data essentially means massive amounts of data, both structured and unstructured, so huge that it becomes difficult to process with traditional relational databases and software methods.

In other words, data so large that our conventional tools of the past simply cannot manage it in a quick and timely fashion.

So just how big is big data?

Typically, big data is anything greater than or equal to an exabyte, though petabytes can also be considered big data, at least from my perspective. I mean, come on folks, a petabyte worth of data? Is that not freaking big?

Here is a nice breakdown of the sizes and examples (with a quick back-of-the-envelope calculation after the list to put them in perspective):

Petabyte (1 000 000 000 000 000 bytes)

  • 1 Petabyte: 5 years of EOS data (at 46 Mbps)
  • 2 Petabytes: All US academic research libraries
  • 20 Petabytes: Production of hard-disk drives in 1995
  • 200 Petabytes: All printed material OR Production of digital magnetic tape in 1995

Exabyte (1 000 000 000 000 000 000 bytes)

  • 5 Exabytes: All words ever spoken by human beings.
  • From Wikipedia:
    • The world’s technological capacity to store information grew from 2.6 (optimally compressed) exabytes in 1986 to 15.8 in 1993, over 54.5 in 2000, and to 295 (optimally compressed) exabytes in 2007. This is equivalent to less than one 730-MB CD-ROM per person in 1986 (539 MB per person), roughly 4 CD-ROM per person of 1993, 12 CD-ROM per person in the year 2000, and almost 61 CD-ROM per person in 2007. Piling up the imagined 404 billion CD-ROM from 2007 would create a stack from the earth to the moon and a quarter of this distance beyond (with 1.2 mm thickness per CD).
    • The world’s technological capacity to receive information through one-way broadcast networks was 432 exabytes of (optimally compressed) information in 1986, 715 (optimally compressed) exabytes in 1993, 1,200 (optimally compressed) exabytes in 2000, and 1,900 in 2007.
    • According to the CSIRO, in the next decade, astronomers expect to be processing 10 petabytes of data every hour from the Square Kilometre Array (SKA) telescope.[11] The array is thus expected to generate approximately one exabyte every four days of operation. According to IBM, the new SKA telescope initiative will generate over an exabyte of data every day. IBM is designing hardware to process this information.

Zettabyte (1 000 000 000 000 000 000 000 bytes)

  • From Wikipedia:
    • The world’s technological capacity to receive information through one-way broadcast networks was 0.432 zettabytes of (optimally compressed) information in 1986, 0.715 in 1993, 1.2 in 2000, and 1.9 (optimally compressed) zettabytes in 2007 (this is the informational equivalent to every person on earth receiving 174 newspapers per day).[9][10]
    • According to International Data Corporation, the total amount of global data is expected to grow to 2.7 zettabytes during 2012. This is 48% up from 2011.[11]
    • Mark Liberman calculated the storage requirements for all human speech ever spoken at 42 zettabytes if digitized as 16 kHz 16-bit audio. This was done in response to a popular expression that states “all words ever spoken by human beings” could be stored in approximately 5 exabytes of data (see exabyte for details). Liberman did “freely confess that maybe the authors [of the exabyte estimate] were thinking about text.”[12]
    • Research from the University of Southern California reports that in 2007, humankind successfully sent 1.9 zettabytes of information through broadcast technology such as televisions and GPS.[13]
    • Research from the University of California, San Diego reports that in 2008, Americans consumed 3.6 zettabytes of information.
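
To put all those zeros in perspective, here is a quick back-of-the-envelope calculation in Python. It uses the decimal byte counts from the list above and the $89 one-terabyte portable drive mentioned earlier; the drive price and the simple division are just for illustration, not a storage plan.

# How many $89 one-terabyte drives would it take to hold each of these?
TERABYTE = 10**12  # 1 000 000 000 000 bytes (decimal, as above)

SIZES = {
    "Petabyte": 10**15,
    "Exabyte": 10**18,
    "Zettabyte": 10**21,
}

for name, size in SIZES.items():
    drives = size // TERABYTE
    cost = drives * 89  # assuming $89 per one-terabyte drive, as mentioned earlier
    print(f"1 {name} = {drives:,} one-terabyte drives (roughly ${cost:,})")

Run it and a petabyte works out to a thousand drives, an exabyte to a million, and a zettabyte to a billion drives, which is around $89 billion in portable hard drives alone.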

There are several factors that determine whether data actually qualifies as Big Data.

The three main factors that define big data in general are:

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information.

  • Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
  • Convert 350 billion annual meter readings to better predict power consumption

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value (see the small sketch after this list).

  • Scrutinize 5 million trade events created each day to identify potential fraud
  • Analyze 500 million daily call detail records in real-time to predict customer churn faster

Variety: Big data is any type of data – structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.

  • Monitor hundreds of live video feeds from surveillance cameras to target points of interest
  • Exploit the 80% data growth in images, video and documents to improve customer satisfaction
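
For the non-geeks, the velocity point is easier to see in a few lines of code. The sketch below is plain Python with a made-up trade format, threshold, and alert() function; it simply checks each trade the moment it arrives instead of waiting for an end-of-day batch report. Real systems do this across millions of events per day, but the idea is the same.

# Hypothetical example: flag suspicious trades as they stream in.
FRAUD_THRESHOLD = 10_000  # made-up dollar limit for a single trade

def alert(event):
    # stand-in for whatever a real system would do (page someone, block the trade)
    print(f"possible fraud: {event}")

def handle_trade(event):
    # one trade record arriving from a stream (for example, a message queue)
    if event["amount"] > FRAUD_THRESHOLD:
        alert(event)  # act right now, while the trade can still be stopped

# simulate a tiny stream of incoming trades
incoming = [
    {"id": 1, "amount": 250},
    {"id": 2, "amount": 18_500},
    {"id": 3, "amount": 99},
]
for trade in incoming:
    handle_trade(trade)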

Here is an example of an open source big data solution; it is called Hadoop:

http://hadoop.apache.org/

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

The project includes these modules:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
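
Those “simple programming models” mostly mean MapReduce: you write a small map step and a small reduce step, and Hadoop takes care of spreading the work across the cluster and shuffling the intermediate results. Below is a minimal word-count sketch in the Hadoop Streaming style, written in plain Python over stdin and stdout. Treat it as an illustration rather than a tuned production job; the file names in the comments are placeholders.

# wordcount.py -- the classic MapReduce word count, Hadoop Streaming style.
# Hadoop runs many copies of the mapper and reducer across the cluster and
# sorts the mapper output by key before it reaches the reducer.
import sys

def mapper():
    # emit "word<TAB>1" for every word on every input line
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # input arrives sorted by word, so each word's lines come in one run
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

You would hand both halves to the Hadoop Streaming jar with something like: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" (the jar name and the input/output paths here are placeholders for whatever your cluster uses).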


In conclusion, it seems big data is here to stay, just like the terms Big Brother, Big Bird, Big Daddy Kane, The Notorious B.I.G., and the Big Bang. Whether or not it gains traction and overall market acceptance is a whole other story, one that I will leave to the brilliant marketers and big corporations. At least for now, we have an understanding of this vague and convoluted subject that has become all the buzz.