In the traditional Hadoop model, HDFS (the “storage”) and MapReduce (the “compute”) are combined on the same nodes. If this model is translated directly into a VM, it limits the ability to scale up and down, because the lifecycle of the VM is tightly coupled to the data. When this kind of VM is powered off, the data it holds is lost. Scaling out also requires rebalancing data across the expanded cluster. Hence this model is not very elastic.
Separating compute from storage in a virtual Hadoop cluster achieves elasticity and improves resource utilization. It is simple to configure HDFS storage that is always available, alongside a compute layer with a variable number of TaskTracker nodes that can be expanded or shrunk on demand. Data-compute separation also makes multi-tenancy possible on the virtualized Hadoop cluster, so each virtual compute cluster can enjoy performance, security, and configuration isolation.
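To make the difference concrete, here is a minimal toy model in Python. It is only a sketch, not Hadoop code: the class and function names (`CombinedNode`, `SeparatedCluster`, `available_blocks`, `scale_to`) are hypothetical, and it assumes a replication factor of 1 so the coupling between VM and data is easy to see. Real HDFS replicates each block (three copies by default), which softens but does not remove the problem.

```python
from dataclasses import dataclass, field


@dataclass
class CombinedNode:
    """Hypothetical VM running both a DataNode and a TaskTracker."""
    name: str
    blocks: set = field(default_factory=set)


def available_blocks(nodes):
    """Blocks still readable from the combined nodes that are powered on."""
    return set().union(*(n.blocks for n in nodes))


@dataclass
class SeparatedCluster:
    """Hypothetical shared storage layer plus a pool of stateless compute VMs."""
    storage_blocks: set
    task_trackers: list = field(default_factory=list)

    def scale_to(self, count):
        # Compute VMs hold no HDFS blocks, so resizing never touches the data.
        self.task_trackers = [f"tasktracker-{i}" for i in range(count)]

    def available_blocks(self):
        return set(self.storage_blocks)


# Combined model (assumed replication factor 1): powering off a VM removes its blocks.
combined = [
    CombinedNode("vm-a", {"b1", "b2"}),
    CombinedNode("vm-b", {"b3", "b4"}),
]
after_power_off = available_blocks(combined[:1])  # vm-b is powered off
print(sorted(after_power_off))  # ['b1', 'b2']

# Separated model: the compute layer grows and shrinks, the data stays put.
cluster = SeparatedCluster(storage_blocks={"b1", "b2", "b3", "b4"})
cluster.scale_to(4)
cluster.scale_to(1)
print(len(cluster.task_trackers), len(cluster.available_blocks()))  # 1 4
```

In the combined case, shrinking the cluster changes which blocks are readable. In the separated case, `scale_to` only changes the list of compute nodes, which is the property that makes the compute layer elastic.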
EMC brings two solutions, Isilon for the storage layer and vSphere for topology awareness:
- EMC Isilon scale-out NAS for virtualized Hadoop Cluster Shared Data Service
- VMware vSphere Big Data Extensions for Virtualized Hadoop Cluster Topology Awareness
For more details and step-by-step installation notes, check out the EMC Hadoop Starter Kit. The Hadoop Starter Kit (HSK) is intended to simplify deployment of all Hadoop distributions and to reduce the time and cost of deployment.