Hadoop has a poor out-of-the-box programming model. Applications often degrade into spaghetti code: scripts that call Hadoop command-line applications. Spring aims to simplify Hadoop applications by leveraging several Spring ecosystem projects. Spring for Apache Hadoop provides a consistent programming model and a declarative configuration model. The projects that Spring for Apache Hadoop draws on for various use cases are –
Spring Batch – Reuses the same Spring Batch infrastructure and knowledge to manage Hadoop workflows. A step in a workflow can be any Hadoop job type or an HDFS script. Batch analytics is the best use case for this project.
Spring Integration – An implementation of the enterprise integration patterns. It has been maturing since 2007 and is released under the Apache 2.0 license. It separates integration concerns from processing logic by handling message reception and method invocation. The use case for this project is real-time analytics, such as consuming Twitter search results or syslog events, transforming them, and writing them into HDFS.
Spring Data – Helper libraries for data access technologies and stores such as JPA, JDBC, Redis, MongoDB, Neo4j and GemFire, covering tasks such as incrementing counters and creating continuous queries
Spring Framework – Covers dependency injection (DI), aspect-oriented programming (AOP), web, messaging and scheduling
Spring XD – eXtreme Data, or "y = mx + b", is the new open-source umbrella project that spans all of these use cases and delivers a pluggable module system
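To make the Spring Integration point about separating integration concerns from processing logic more concrete, here is a minimal plain-Java sketch. It does not use the Spring Integration API: `EventLogic`, `InboundAdapter` and `IntegrationSketch` are hypothetical names invented for illustration, and an in-memory list stands in for an HDFS writer. The idea it shows is only that the processing method knows nothing about how messages arrive or where results go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;
import java.util.function.Function;

/** Processing logic: a plain method with no knowledge of the message source. */
final class EventLogic {
    static String normalize(String rawEvent) {
        return rawEvent.trim().toLowerCase(Locale.ROOT);
    }
}

/** Hypothetical adapter (not a Spring class): owns reception and method invocation. */
final class InboundAdapter {
    private final Function<String, String> handler;
    private final Consumer<String> sink;

    InboundAdapter(Function<String, String> handler, Consumer<String> sink) {
        this.handler = handler;
        this.sink = sink;
    }

    /** Called once per incoming message; invokes the handler and forwards the result. */
    void onMessage(String payload) {
        sink.accept(handler.apply(payload));
    }
}

final class IntegrationSketch {
    static List<String> run() {
        List<String> written = new ArrayList<>(); // stand-in for an HDFS writer
        InboundAdapter adapter = new InboundAdapter(EventLogic::normalize, written::add);
        adapter.onMessage("  Syslog: DISK FULL ");
        adapter.onMessage("Syslog: Login OK");
        return written;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because `EventLogic.normalize` is an ordinary method, it can be unit-tested without any messaging infrastructure, which is the benefit the separation is meant to deliver.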
Spring for Apache Hadoop helps to improve developer productivity by –
- creating well-formed applications
- simplifying the HDFS and FsShell APIs with JVM scripting
- providing runner classes for small workflows for MapReduce/Pig/Hive/Cascading
- providing helper “Template” classes for Pig/Hive/HBase
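The "Template" idea in the last bullet can be sketched in plain Java. This is an illustration of the general pattern only, not the Spring for Apache Hadoop API: `FakeClient`, `ClientTemplate` and `TemplateSketch` are hypothetical names, and the canned query result is invented. The assumption is that a template's job is to own the open/close boilerplate so that the caller supplies only the work to be done.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a client connection (e.g. to Hive or HBase); not a real API. */
final class FakeClient implements AutoCloseable {
    static int openCount = 0; // number of clients currently open

    FakeClient() {
        openCount++;
    }

    /** Pretends to run a query and returns a canned row. */
    List<String> query(String q) {
        List<String> rows = new ArrayList<>();
        rows.add("result of: " + q);
        return rows;
    }

    @Override
    public void close() {
        openCount--;
    }
}

/** Hypothetical template: handles resource setup and cleanup around a callback. */
final class ClientTemplate {
    <T> T execute(Function<FakeClient, T> action) {
        // try-with-resources closes the client even if the action throws
        try (FakeClient client = new FakeClient()) {
            return action.apply(client);
        }
    }
}

final class TemplateSketch {
    static List<String> run() {
        ClientTemplate template = new ClientTemplate();
        return template.execute(client -> client.query("show tables"));
    }

    public static void main(String[] args) {
        System.out.println(run());
        System.out.println("open clients: " + FakeClient.openCount);
    }
}
```

The caller never opens or closes a client directly, so cleanup cannot be forgotten and the calling code stays a single line.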
For experimenting further –