Tuesday, October 23, 2012

Summary of use cases for using Hadoop in the enterprise


  • Need to analyze / summarize / query / store unstructured or semi-structured data. Examples:
    • logs
    • sensor data
    • emails
    • blogs
    • web content
    • DOCs / PDFs
    • images
    • videos
  • Ability to support multiple data sources that produce very disparate and unstructured data
  • Data is generated at a very high, continuous and unpredictable rate (say, 1 TB per day or per cycle)
  • Data to be analyzed is massively distributed, e.g. logs
    • It is not possible to intercept the data at a single, known source as it is generated
  • Using traditional ETL batch processes to summarize data is too time-consuming, impractical or expensive
    • Moving all the data to one storage area network (SAN) or ETL server becomes infeasible at big data volumes.
    • Even if you can move the data, processing it is slow, limited by SAN bandwidth, and often fails to meet batch processing windows.
  • There is a need to run analytics on raw data
    • Queries that will be run on the raw data cannot be determined in advance, and hence the criteria / parameters for summarizing the data are not known upfront
  • A huge amount of data needs to be retained on cheap commodity hardware
    • Using the expensive storage typically used by an RDBMS is not feasible
  • To be continued

1 comment:

Vishal Pathak said...

This topic is gaining importance as more and more organisations try to source data from beyond the RDBMS (transaction processing systems). External sources like employee forums, social media, blogs, etc. are becoming a major input to data mining processes for spotting trends.

Waiting for Part 2 of this article to understand the challenges and the technical solutions to overcome them.