Just how much data qualifies as big data? This question is bound to elicit different responses, depending on whom you ask. The answers are also likely to vary depending on when the question is asked. Currently, any data set over a few terabytes is classified as big data. This is typically the size at which a data set is large enough to start spanning multiple storage units. It’s also the size at which traditional RDBMS techniques start showing the first signs of stress. Even a couple of years back, a terabyte of personal data may have seemed quite large. However, local hard drives and backup drives are now commonly available at this size. In the next couple of years, it wouldn’t be surprising if your default hard drive were over a few terabytes in capacity. We are living in an age of rampant data growth. Our digital camera outputs, blogs, daily social networking updates, tweets, electronic documents, scanned content, music files, and videos are growing at a rapid pace. We are consuming a lot of data and producing it too.

It’s difficult to assess the true size of digitized data or the size of the Internet, but a few studies, estimates, and data points reveal that it’s immensely large, in the range of a zettabyte or more. In an ongoing study titled “The Digital Universe Decade – Are you ready?”, IDC, on behalf of EMC, presents a view into the current state of digital data and its growth. The report claims that the total size of digital data created and replicated will grow to 35 zettabytes by 2020. The report also claims that the amount of data being produced is outgrowing the amount of available storage.

A few other data points worth considering are as follows:

» A 2009 paper in Communications of the ACM titled “MapReduce: simplified data processing on large clusters” revealed that Google processes 24 petabytes of data per day.

» A 2009 post from Facebook about its photo storage system, “Needle in a haystack: efficient storage of billions of photos,” put the total size of photos on Facebook at 1.5 petabytes. The same post mentioned that around 60 billion images were stored on Facebook.

» The Internet Archive FAQs say that 2 petabytes of data are stored in the Internet Archive. They also say that the data is growing at the rate of 20 terabytes per month.

» The movie Avatar took up 1 petabyte of storage space for the rendering of 3D CGI effects. (“Believe it or not: Avatar takes 1 petabyte of storage space, equivalent to a 32-year-long MP3”)
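To put these data points on a common scale, the figures can be converted to bytes and compared directly. The following is a minimal sketch, not part of any of the cited sources. It assumes decimal (SI) units, where a terabyte is 10^12 bytes, and it assumes the Internet Archive keeps growing by a fixed amount each month. The function and constant names are illustrative only.

```python
# Decimal (SI) storage units in bytes. This is an assumption: the cited
# sources do not say whether they use decimal or binary units.
TB = 10**12
PB = 10**15
ZB = 10**21

# Figures quoted in the data points above.
GOOGLE_BYTES_PER_DAY = 24 * PB
FACEBOOK_PHOTO_BYTES = 1500 * TB        # 1.5 petabytes
FACEBOOK_IMAGE_COUNT = 60 * 10**9       # 60 billion images
ARCHIVE_BYTES = 2 * PB
ARCHIVE_GROWTH_PER_MONTH = 20 * TB
DIGITAL_UNIVERSE_2020 = 35 * ZB


def average_item_size(total_bytes, item_count):
    """Average size in bytes of one stored item."""
    return total_bytes / item_count


def months_to_double(current_bytes, growth_per_month):
    """Months until a store doubles, assuming fixed (linear) monthly growth."""
    return current_bytes / growth_per_month


def years_to_process(total_bytes, bytes_per_day):
    """Years needed to process a data set at a constant daily rate."""
    return total_bytes / bytes_per_day / 365.25


if __name__ == "__main__":
    avg = average_item_size(FACEBOOK_PHOTO_BYTES, FACEBOOK_IMAGE_COUNT)
    print(f"Average Facebook image: {avg / 1000:.0f} KB")

    months = months_to_double(ARCHIVE_BYTES, ARCHIVE_GROWTH_PER_MONTH)
    print(f"Internet Archive doubles in about {months:.0f} months")

    years = years_to_process(DIGITAL_UNIVERSE_2020, GOOGLE_BYTES_PER_DAY)
    print(f"35 ZB at 24 PB per day: about {years:,.0f} years")
```

Under these assumptions, the Facebook figures work out to roughly 25 KB per stored image, the Internet Archive would take about 100 months to double, and a single pass over 35 zettabytes at Google's quoted daily rate would take close to four thousand years. The last number is only a back-of-the-envelope comparison, but it shows how far apart a petabyte and a zettabyte really are.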

As the size of data grows and sources of data creation become increasingly diverse, the following challenges will be further amplified:

» Efficiently storing and accessing large amounts of data is difficult. The additional demands of fault tolerance and backups make things even more complicated.

» Manipulating large data sets involves running immensely parallel processes. Gracefully recovering from any failures during such a run and providing results in a reasonably short period of time is complex.

» Managing the continuously evolving schema and metadata for semi-structured and unstructured data, generated by diverse sources, is a convoluted problem.

Therefore, the ways and means of storing and retrieving large amounts of data need newer approaches beyond our current methods. NoSQL and related big-data solutions are a first step forward in that direction. Hand in hand with data growth is the growth of scale.

