Idea Fountain: Summary of use cases for Hadoop in the enterprise

Tuesday, October 23, 2012

Need to analyze / summarize / query / store unstructured or semi-structured data. Examples:

logs
sensor data
emails
blogs
web content
DOCs / PDFs
images
videos

Need to support multiple data sources that produce very disparate, unstructured data
Data is generated at a very high, continuous and unpredictable rate (say, 1 TB per day or per cycle)
Data to be analyzed is massively distributed, e.g. logs spread across many machines

It is not possible to intercept the data at a single, known source as it is generated
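When the data is spread across many machines, the usual answer is to summarize it where it lives and ship only the small summaries. A minimal sketch of that idea in plain Python, not Hadoop's API: the node names, log lines and the functions `summarize_locally` and `merge_summaries` are all made up for illustration.

```python
from collections import Counter
from functools import reduce

# Hypothetical logs, each list living on a different machine.
NODE_LOGS = {
    "node-a": ["ERROR disk full", "INFO job started", "ERROR disk full"],
    "node-b": ["INFO job started", "WARN slow response"],
    "node-c": ["ERROR timeout", ""],
}

def summarize_locally(lines):
    """'Map' step: runs next to the data and counts lines per log level."""
    return Counter(line.split()[0] for line in lines if line.strip())

def merge_summaries(left, right):
    """'Reduce' step: combines two small summaries into one."""
    return left + right

# Only the per-node Counters travel over the network, not the raw logs.
partials = [summarize_locally(lines) for lines in NODE_LOGS.values()]
totals = reduce(merge_summaries, partials, Counter())
print(dict(totals))
```

The raw lines never leave their node in this sketch; only a handful of counters do, which is the property that matters once the logs are too large to centralize.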

Using traditional ETL batch processes to summarize data is too time-consuming, impractical, or expensive

Moving all the data to one storage area network (SAN) or ETL server becomes infeasible at big data volumes.
Even if you can move the data, processing it is slow, limited by SAN bandwidth, and often fails to meet batch processing windows.
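A rough back-of-the-envelope calculation shows why the batch window gets missed. This is only a sketch: the 1 Gbit/s link speed and the 50% efficiency figure are assumptions picked for illustration, and `transfer_hours` is a made-up helper.

```python
def transfer_hours(data_bytes, link_bits_per_second, efficiency=1.0):
    """Hours needed to push data_bytes through a link.

    efficiency is the assumed usable fraction of the nominal bandwidth
    (protocol overhead, contention with other traffic, and so on).
    """
    seconds = (data_bytes * 8) / (link_bits_per_second * efficiency)
    return seconds / 3600.0

ONE_TB = 10**12        # 1 TB in bytes (decimal)
ONE_GBIT = 10**9       # assumed link speed: 1 Gbit/s

ideal = transfer_hours(ONE_TB, ONE_GBIT)
shared = transfer_hours(ONE_TB, ONE_GBIT, efficiency=0.5)
print(f"ideal: {ideal:.1f} h, at 50% efficiency: {shared:.1f} h")
```

Under these assumptions, one day's 1 TB takes about 2.2 hours just to move on an ideal link and about 4.4 hours on a half-utilized one, before any processing starts.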

There is a need to run analytics on raw data

Queries that will be run on the raw data cannot be determined in advance, so the criteria / parameters for summarizing the data are not known upfront
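One way to picture this: keep the raw lines as they are, and let each query bring its own parsing logic at read time. A minimal sketch in plain Python, assuming a made-up log format; `RAW_LOGS`, `ad_hoc_count`, `level_of` and `user_of_error` are hypothetical names, not part of any Hadoop API.

```python
from collections import Counter

# Hypothetical raw log lines, stored exactly as they were produced.
RAW_LOGS = [
    "2012-10-23 10:01:07 ERROR payment timeout user=42",
    "2012-10-23 10:01:09 INFO login ok user=17",
    "2012-10-23 10:02:11 ERROR payment declined user=42",
    "malformed line",
]

def ad_hoc_count(lines, key_fn):
    """Count lines by whatever key the query extracts; skip lines it rejects."""
    counts = Counter()
    for line in lines:
        key = key_fn(line)
        if key is not None:
            counts[key] += 1
    return counts

def level_of(line):
    """Query 1: group by log level (third whitespace-separated field)."""
    parts = line.split()
    return parts[2] if len(parts) >= 3 else None

def user_of_error(line):
    """Query 2, thought of later: which users hit errors?"""
    if " ERROR " not in line:
        return None
    for token in line.split():
        if token.startswith("user="):
            return token[len("user="):]
    return None

print(ad_hoc_count(RAW_LOGS, level_of))
print(ad_hoc_count(RAW_LOGS, user_of_error))
```

The second query needed no reload or new summary table, because nothing was thrown away when the data was stored. A pre-aggregated table built only for the first query could not have answered it.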

Huge amount of data needs to be retained on cheap commodity hardware

Using the expensive storage typically used by an RDBMS is not feasible at this scale

To be continued

1 comment:

Vishal Pathak said...: This topic is gaining importance as more and more organisations try to source data from beyond the RDBMS (transaction processing systems). External sources like employee forums, social media, blogs, etc. are becoming major inputs to data mining processes for spotting trends.

Waiting for Part 2 of this article to understand the challenges and the technical solutions to overcome them.

October 23, 2012 at 10:18:00 AM GMT+5:30
