It’s difficult to assess the true size of digitized data or the size of the Internet, but a few studies, estimates, and data points suggest it is immensely large, in the range of a zettabyte or more. In an ongoing study titled “The Digital Universe Decade – Are You Ready?” (http://emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm), IDC, on behalf of EMC, presents a view of the current state of digital data and its growth. The report claims that the total amount of digital data created and replicated will grow to 35 zettabytes by 2020. It also claims that the amount of data now being produced is outgrowing the amount of available storage.
A few other data points worth considering are as follows:
» A 2008 paper in Communications of the ACM, “MapReduce: simplified data processing on large clusters” (http://portal.acm.org/citation.cfm?id=1327452.1327492&coll=GUIDE&dl=&idx=J79&part=magazine&WantType=Magazines&title=Communications%20of%20the%20ACM), revealed that Google processes 24 petabytes of data per day.
» A 2009 post from Facebook about its photo storage system, “Needle in a haystack: efficient storage of billions of photos” (http://facebook.com/note.php?note_id=76191543919), put the total size of photos on Facebook at 1.5 petabytes. The same post mentioned that around 60 billion images were stored on Facebook.
» The Internet Archive FAQs (archive.org/about/faqs.php) state that 2 petabytes of data are stored in the Internet Archive, and that the collection is growing at a rate of 20 terabytes per month.
» Rendering the 3D CGI effects for the movie Avatar took up 1 petabyte of storage space. (“Believe it or not: Avatar takes 1 petabyte of storage space, equivalent to a 32-year-long MP3,” http://thenextweb.com/2010/01/01/avatar-takes-1-petabyte-storagespace-equivalent-32-year-long-mp3/.)
As the size of data grows and the sources of data creation become increasingly diverse, the following challenges are amplified:
» Efficiently storing and accessing large amounts of data is difficult. The additional demands of fault tolerance and backups make things even more complicated.
» Manipulating large data sets involves running massively parallel processes. Gracefully recovering from any failures during such a run and providing results in a reasonably short period of time is complex; a sketch after this list illustrates the pattern involved.
» Managing the continuously evolving schema and metadata of semi-structured and unstructured data generated by diverse sources is a hard problem; a second sketch, after the next paragraph, illustrates why.
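To make the parallel-processing challenge concrete, here is a minimal, single-machine sketch of the map/shuffle/reduce pattern popularized by the MapReduce paper cited earlier. It is illustrative only: the function names are invented, and a simple retry loop stands in for the fault tolerance a real cluster scheduler provides.

from collections import defaultdict

def run_with_retries(task, retries=3):
    # Crude stand-in for a cluster scheduler that reschedules failed tasks.
    for attempt in range(retries):
        try:
            return task()
        except Exception:
            if attempt == retries - 1:
                raise

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each record yields (key, value) pairs. On a real cluster,
    # chunks of records would be mapped on separate worker machines.
    grouped = defaultdict(list)
    for record in records:
        for key, value in run_with_retries(lambda: list(map_fn(record))):
            grouped[key].append(value)  # shuffle: group values by key
    # Reduce phase: combine all the values that share a key.
    return {key: run_with_retries(lambda: reduce_fn(key, values))
            for key, values in grouped.items()}

# Word count, the canonical example from the MapReduce paper.
docs = ["big data needs new tools", "new data new challenges"]
counts = map_reduce(
    docs,
    map_fn=lambda doc: ((word, 1) for word in doc.split()),
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts)  # {'big': 1, 'data': 2, 'needs': 1, 'new': 3, ...}

A framework such as Hadoop applies this same pattern across thousands of machines, which is where recovering gracefully from individual failures becomes essential rather than optional.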
Therefore, the ways and means of storing and retrieving large amounts of data need approaches beyond our current methods. NoSQL and related big-data solutions are a first step in that direction. Hand in hand with data growth is the growth of scale.
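As a small illustration of the evolving-schema challenge mentioned in the list above, the sketch below uses plain Python and JSON (no particular database API; the records and field names are invented) to show the kind of semi-structured data that document-oriented NoSQL stores accept without a fixed table layout.

import json

# Records from diverse sources: each carries a different set of fields,
# and the "web" source has evolved to include a new ab_test field.
events = [
    {"source": "web", "url": "/home", "user": "alice"},
    {"source": "sensor", "temp_c": 21.4, "device_id": "t-17"},
    {"source": "web", "url": "/cart", "user": "bob", "ab_test": "v2"},
]

# A document-oriented store persists each record with its own shape;
# a rigid relational schema would need a migration (or ever more NULL
# columns) every time a source added or changed a field.
for event in events:
    print(json.dumps(event))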