Recently I heard someone talk about “moving content into Hadoop”. Although I did not question their motive further, it made me wonder seriously about “effective solutions” on Hadoop for day-to-day business problems. Hadoop is not a magic wand that wipes away every trouble in the business world and brings back sanity, revenue, opportunities, or anything else you can think of. The first thing to understand is what Hadoop is: it has HDFS for data storage and the MapReduce framework for batch analysis of the data stored on HDFS. The data stored in Hadoop need not be structured; Hadoop works well with unstructured, quasi-structured, and semi-structured data from different data sources. Hadoop is at its best when the volume of data runs into petabytes.
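To make the MapReduce idea a little more concrete, here is a minimal sketch in plain Python that imitates the map, shuffle, and reduce steps on a tiny in-memory word count. It does not use Hadoop or HDFS at all; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are made up for illustration, and a real job would run the same steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input standing in for lines of a file on HDFS.
sample_lines = ["hadoop stores data", "hadoop analyzes data in batch"]
counts = word_count(sample_lines)
print(counts)
```

The point of the sketch is the shape of the computation: every input line is read, and the answer only exists once the whole batch has been processed.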
That said, Hadoop can work well with structured data too. At the same time, Hadoop cannot be considered a replacement for a conventional RDBMS. An RDBMS is ACID compliant to preserve data integrity, and it is optimized to capture and analyze transactions such as online shopping orders, ATM transactions, patient records, and other real-world entities. These transactions often require low latency and fast retrieval, which batch-oriented Hadoop does not offer, so Hadoop cannot be used in these business scenarios.
Hadoop stores data in files and does not index them. To retrieve anything, a MapReduce job has to go through all the data, which takes time. Moving “content” from a content management system such as SharePoint or Documentum to Hadoop for reasons like the “affordable” storage of HDFS does not make sense either: Hadoop has no ECM or WCM capabilities.
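The cost of having no index can be illustrated with a small sketch, again in plain Python rather than Hadoop itself. It compares a full scan, which is roughly what a job over unindexed files has to do, with a lookup through a simple in-memory index of the kind a database maintains. The record layout, the `customer_id` field, and all function names are assumptions invented for this example.

```python
def full_scan(records, customer_id):
    """Examine every record, as a job over unindexed files must."""
    examined = 0
    matches = []
    for record in records:
        examined += 1
        if record["customer_id"] == customer_id:
            matches.append(record)
    return matches, examined


def build_index(records):
    """Build a customer_id -> list of positions index once, up front."""
    index = {}
    for position, record in enumerate(records):
        index.setdefault(record["customer_id"], []).append(position)
    return index


def indexed_lookup(records, index, customer_id):
    """Touch only the records the index points at."""
    positions = index.get(customer_id, [])
    return [records[p] for p in positions], len(positions)


# Hypothetical data set: 100,000 records spread over 1,000 customers.
records = [{"customer_id": i % 1000, "amount": i} for i in range(100_000)]

scan_matches, scan_examined = full_scan(records, 42)
index = build_index(records)
index_matches, index_examined = indexed_lookup(records, index, 42)

print(scan_examined, index_examined)
```

Both approaches return the same records, but the scan has to read the whole data set to find them while the indexed lookup reads only the matching ones. That gap is what makes a scan-everything model a poor fit for retrieving a single document or transaction on demand.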
Hadoop works when the data is too big for an RDBMS, that is, when the database is reaching its technical limits. There are solutions that take advantage of HDFS to store structured data and leverage its relatively inexpensive storage. They usually work by moving the data from the RDBMS to Hadoop for batch analysis. Shifting data back and forth between an RDBMS and Hadoop can be overkill if the storage needs are not huge.
The business scenarios for Hadoop are those where a high volume of data (in petabytes) has to be stored now and analyzed and queried at length later. HDFS and MapReduce come to the rescue for such high-volume storage and batch analysis. Use cases such as pattern recognition, sentiment analysis, recommendation engines, and index building are ideal candidates for Hadoop.
Hadoop cannot replace existing systems such as RDBMS, ecommerce, or content management systems. Instead, Hadoop should be integrated with the existing LOB systems to augment their data management and storage capabilities. Connect existing LOB systems with Hadoop using tools like Sqoop (to move data between an RDBMS and Hadoop) and Flume (to stream system logs into Hadoop in near real time), irrespective of data volume, to gain meaningful insights.