eXtreme Transaction Processing

Massimo Pezzini, a distinguished Gartner analyst, coined the term eXtreme Transaction Processing (XTP) in 2007. XTP refers to a class of applications that must collect, correlate, and operate on large volumes of data to deliver meaningful insights into business use cases. The data processed by XTP applications arrives as large numbers of events and changes frequently. When conventional OLTP/ETL systems become bottlenecks because they cannot provide high-speed performance or scale elastically on demand, XTP applications help solve business problems such as online trading, risk assessment, and fraud detection.

For high-speed performance, in-memory data grids (IMDGs) are used to overcome both disk I/O and network I/O bottlenecks. Data is distributed across nodes and accessed with microsecond latency, and unnecessary network hops are avoided by optimizing data distribution and replication. An IMDG is conceptually similar to Hadoop's MapReduce, but it operates in real time: it can distribute transactions, queries, and procedures across the grid with low-latency response times, providing both elastic scaling on demand and high-speed performance. Both NoSQL and NewSQL models can be used with IMDGs. NoSQL grids provide APIs and functions to programmatically access key-values, objects, and more, while NewSQL grids use SQL on structured data – that is, data that requires XTP with ACID guarantees.
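
As a rough illustration, here is a minimal sketch of programmatic key-value access against an IMDG, using the GemFire Java client API as one concrete example; the locator address, region name, and key are placeholders invented for this sketch.

    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.client.ClientCache;
    import com.gemstone.gemfire.cache.client.ClientCacheFactory;
    import com.gemstone.gemfire.cache.client.ClientRegionShortcut;

    public class ImdgKeyValueSketch {
        public static void main(String[] args) {
            // Connect to the grid through a locator (host and port are placeholders).
            ClientCache cache = new ClientCacheFactory()
                    .addPoolLocator("localhost", 10334)
                    .create();

            // A PROXY region holds no local state; every get/put goes to the grid.
            Region<String, Double> positions = cache
                    .<String, Double>createClientRegionFactory(ClientRegionShortcut.PROXY)
                    .create("positions");

            positions.put("ACCT-42", 1250.75);          // write, partitioned/replicated by the grid
            Double exposure = positions.get("ACCT-42"); // low-latency in-memory read
            System.out.println("Current exposure: " + exposure);

            cache.close();
        }
    }

The same keys stay addressable from any client, which is what lets the grid scale out while keeping reads and writes in memory.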

Pivotal GemFire is an in-memory NoSQL data grid that can handle XTP with high throughput, high scalability, and low latency. It also comes with Spring integration and simplified APIs for greater ease of development. GemFire’s event-driven architecture and continuous querying can also forward selected events to other systems, such as complex event processing platforms.
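
The continuous-querying side might look roughly like the sketch below, based on GemFire's CQ API; the /trades region, the riskScore field, and the listener body are assumptions made up for illustration.

    import com.gemstone.gemfire.cache.client.ClientCache;
    import com.gemstone.gemfire.cache.client.ClientCacheFactory;
    import com.gemstone.gemfire.cache.query.CqAttributesFactory;
    import com.gemstone.gemfire.cache.query.CqEvent;
    import com.gemstone.gemfire.cache.query.CqQuery;
    import com.gemstone.gemfire.cache.query.QueryService;
    import com.gemstone.gemfire.cache.util.CqListenerAdapter;

    public class RiskAlertCqSketch {
        public static void main(String[] args) throws Exception {
            // Subscriptions must be enabled on the client pool for CQ events to flow.
            ClientCache cache = new ClientCacheFactory()
                    .addPoolLocator("localhost", 10334)
                    .setPoolSubscriptionEnabled(true)
                    .create();

            QueryService queryService = cache.getQueryService();

            CqAttributesFactory cqf = new CqAttributesFactory();
            cqf.addCqListener(new CqListenerAdapter() {
                @Override
                public void onEvent(CqEvent event) {
                    // Forward the matching event downstream (CEP engine, alerting, ...).
                    System.out.println("High-risk trade: " + event.getKey() + " -> " + event.getNewValue());
                }
            });

            // Only events whose new value satisfies the predicate are pushed to this client.
            CqQuery cq = queryService.newCq("high-risk-trades",
                    "SELECT * FROM /trades t WHERE t.riskScore > 0.8", cqf.create());
            cq.execute();
        }
    }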

Pivotal SQLFire is an in-memory NewSQL data grid that handles XTP with high throughput, low latency, high scalability, and a consistent view of data. SQLFire also supports global WAN connectivity and provides the option of replicating data to remote clusters for disaster recovery.
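
Because SQLFire speaks SQL over JDBC, working with it can be as plain as the hedged sketch below; the jdbc:sqlfire:// URL form, the port, and the PARTITION BY/REDUNDANCY clauses reflect my reading of the product docs, and the trades table is invented.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SqlFireSketch {
        public static void main(String[] args) throws Exception {
            // Thin-client URL and port are assumptions; adjust for your cluster.
            try (Connection conn = DriverManager.getConnection("jdbc:sqlfire://localhost:1527");
                 Statement stmt = conn.createStatement()) {

                // A partitioned, redundant in-memory table (DDL clauses per SQLFire's dialect).
                stmt.execute("CREATE TABLE trades (id INT PRIMARY KEY, symbol VARCHAR(10), qty INT) "
                           + "PARTITION BY COLUMN (symbol) REDUNDANCY 1");

                stmt.execute("INSERT INTO trades VALUES (1, 'VMW', 500)");

                try (ResultSet rs = stmt.executeQuery(
                        "SELECT symbol, SUM(qty) FROM trades GROUP BY symbol")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getInt(2));
                    }
                }
            }
        }
    }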


When Big Data meets Fast Data

A 2001 Gartner research article by Doug Laney, titled “3-D Data Management: Controlling Data Volume, Velocity, and Variety,” now serves as the standard construct for big data. Fast data is most closely tied to data velocity – the rate at which data changes and at which data sets moving at different speeds must be linked. Big data models historical trends and patterns, allowing businesses to find opportunities that are not obvious; fast data lets businesses apply those models in real time, influencing results with insights as the data is generated. Fast data is becoming increasingly crucial for modern businesses in their constant pursuit of a competitive edge. Common use cases include regulatory reporting, fraud detection, and surveillance.

Pivotal’s core components help business enterprises tackle both big data and fast data use cases, with GemFire for big data and SQLFire for fast data. GemFire can perform map-reduce-style jobs on huge datasets; its scatter-gather semantics give it the ability to analyze big data. Pivotal SQLFire helps businesses move compute right into the data fabric, which can yield as much as a 75x speed-up for simple operations such as pricing and risk, and it can also shorten the time to detect patterns and anti-patterns for use cases like compliance and fraud detection. GemFire WAN gateways give businesses local access to global analysis in the form of micro-cubes stored in “edge caches,” so they can be sliced and diced locally. When big data works together with the eXtreme Transaction Processing (XTP) of fast data, it paves the way for new business models that are robust and accurate.
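
To make the “move compute into the data fabric” idea concrete, here is a rough sketch using GemFire's function-execution API; the pricePositions function id, the positions region, and the argument are invented, and the server-side function itself is assumed to be deployed already.

    import java.util.List;

    import com.gemstone.gemfire.cache.Region;
    import com.gemstone.gemfire.cache.client.ClientCache;
    import com.gemstone.gemfire.cache.client.ClientCacheFactory;
    import com.gemstone.gemfire.cache.client.ClientRegionShortcut;
    import com.gemstone.gemfire.cache.execute.Execution;
    import com.gemstone.gemfire.cache.execute.FunctionService;
    import com.gemstone.gemfire.cache.execute.ResultCollector;

    public class PricingScatterGatherSketch {
        public static void main(String[] args) {
            ClientCache cache = new ClientCacheFactory()
                    .addPoolLocator("localhost", 10334)
                    .create();
            Region<String, Object> positions = cache
                    .<String, Object>createClientRegionFactory(ClientRegionShortcut.PROXY)
                    .create("positions");

            // Scatter: the (assumed) "pricePositions" function runs in parallel on every
            // member that hosts a slice of the partitioned region, next to its data.
            Execution execution = FunctionService.onRegion(positions).withArgs("EOD-2013-06-30");
            ResultCollector<?, ?> collector = execution.execute("pricePositions");

            // Gather: each member returns only its partial result, not the raw data.
            List<?> partials = (List<?>) collector.getResult();
            System.out.println("Partial results from " + partials.size() + " members");

            cache.close();
        }
    }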


Hadoop Invades My Desk

I gave the Hadoop elephant an Indian makeover in red and gold – acrylics on paper.

Hadoop Desk

 


Spring for Apache Hadoop

Hadoop has a poor out-of-the-box programming model; applications often turn into spaghetti code in the form of scripts calling Hadoop command-line tools. Spring aims to simplify Hadoop applications by leveraging several Spring ecosystem projects. Spring for Apache Hadoop provides a consistent programming and declarative configuration model. The projects available for various Hadoop use cases are –

Spring Batch – Reuses the same batch infrastructure and knowledge to manage Hadoop workflows. The steps in a workflow can be any Hadoop job type or an HDFS script. Batch analytics is the best use case for this project (a minimal sketch of driving a Hadoop job from a Batch step follows this list).

Spring Integration – An implementation of the enterprise integration patterns, mature since 2007 and Apache 2.0 licensed. It separates integration concerns from processing logic by handling message reception and method invocation. The typical use case is real-time analytics, such as consuming Twitter search results or syslog events, transforming them, and writing them into HDFS.

Spring Data – Helper libraries for various data stores and access styles – from incrementing counters to continuous queries – across JPA, JDBC, Redis, MongoDB, Neo4j, and GemFire.

Spring Framework – Covers DI, AOP, web, messaging, scheduling

Spring XD – eXtreme Data, or y = mx + b, is the new open-source umbrella project that spans all of these use cases and delivers a pluggable module system.
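
To give a flavor of the Spring Batch case mentioned above, here is a hedged sketch of a Tasklet that submits a word-count MapReduce job through the plain Hadoop API rather than Spring for Apache Hadoop's own job-runner support; the cluster address and HDFS paths are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
    import org.springframework.batch.core.StepContribution;
    import org.springframework.batch.core.scope.context.ChunkContext;
    import org.springframework.batch.core.step.tasklet.Tasklet;
    import org.springframework.batch.repeat.RepeatStatus;

    public class WordCountTasklet implements Tasklet {

        @Override
        public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address

            // Standard Hadoop word count using library mapper/reducer classes.
            Job job = Job.getInstance(conf, "word-count");
            job.setJarByClass(WordCountTasklet.class);
            job.setMapperClass(TokenCounterMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/data/in"));
            FileOutputFormat.setOutputPath(job, new Path("/data/out"));

            // Block until the MapReduce job finishes; fail the Batch step if it fails.
            if (!job.waitForCompletion(true)) {
                throw new IllegalStateException("word-count MapReduce job failed");
            }
            return RepeatStatus.FINISHED;
        }
    }

The tasklet would then be wired into a step and job with the usual Spring Batch configuration; Spring for Apache Hadoop's declarative job support is the more idiomatic route, but this shows the moving parts.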

Spring for Apache Hadoop helps to improve developer productivity by –

  • creating well-formed applications
  • simplifying the HDFS and FsShell APIs with JVM scripting (see the FsShell sketch after this list)
  • providing runner classes for small MapReduce/Pig/Hive/Cascading workflows
  • providing helper “Template” classes for Pig/Hive/HBase
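
For example, the FsShell helper mirrors the hadoop fs command line as plain Java calls, which is roughly what the second bullet refers to; the paths, the cluster address, and the exact return types in this sketch are my assumptions, not verified against a specific release.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.springframework.data.hadoop.fs.FsShell;

    public class FsShellSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address

            // FsShell exposes the familiar `hadoop fs` verbs as methods, so scripts
            // full of command-line calls can become ordinary, testable Java code.
            FsShell shell = new FsShell(conf);
            shell.mkdir("/data/incoming");
            shell.copyFromLocal("/tmp/events.log", "/data/incoming/events.log");

            for (FileStatus status : shell.ls("/data/incoming")) {
                System.out.println(status.getPath() + " " + status.getLen() + " bytes");
            }
        }
    }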

For experimenting further –


Pivotal Platform – where Cloud meets Big Data

There is a major shift toward cloud computing that involves both infrastructure transformation and the transformation of application development and usage, and these shifts require new methods. The Pivotal platform helps organizations with both: it enables them to build a new class of applications, leverage big and fast data, and do all of this with the power of cloud independence. The strategic areas of the Pivotal platform include the following:

Data Fabric

Pivotal Data Fabric combines the Greenplum Database and Hadoop with the fast-data technologies of GemFire and SQLFire from VMware to deliver the most comprehensive big- and fast-data platform in the industry. The components of the data fabric are –

  • Pivotal HD – Hadoop distribution from Pivotal that includes HAWQ to deliver the world’s most powerful Hadoop data infrastructure
  • Pivotal Greenplum Database – best in class MPP database to deliver business value through analytics on structured data
  • Pivotal Data Computing Appliance – multifunction analytics platform that supports structured and unstructured data analytics, ETL, BI, machine learning, and data visualization
  • Pivotal Chorus – analytic productivity platform built to let teams search, explore, visualize, and import data from anywhere in the organization
  • Pivotal Performance and Management – real-time application monitoring component that provides a view of all database transactions
  • Pivotal Analytics – end-to-end analytics platform
  • Pivotal GemFire – provides elastic in-memory data management
  • Pivotal SQLFire – in-memory distributed SQL database

App/Cloud Fabric

Modern applications are built and deployed differently: they are social and accessed primarily from mobile devices, written using modern programming frameworks, and deployed on virtual and cloud infrastructure. Cloud independence is also needed to ensure users are not locked into a single provider, while still offering simple deployment options behind the firewall or on public clouds. The Pivotal cloud application platform delivers on these requirements with the following components –

  • Cloud Foundry – Platform-as-a-Service (PaaS) that enables developers to quickly deploy, scale, and manage new applications
  • Pivotal tc Server – lightweight application server optimized for virtual environments
  • Pivotal Web Server – HTTP server and load-balancing component based on Apache
  • Pivotal RabbitMQ – protocol-based messaging middleware groomed for cloud computing
  • Spring Framework – enterprise application developer framework for bridging legacy applications to new platforms

Expert Services

Pivotal provides a number of services to help organizations better leverage big data and build new applications. They are quick and focused offerings that deliver business value in weeks. These services are –

  • Pivotal Labs – quickly create and deploy new applications
  • Pivotal Data Science Labs – comprehensive data science practice to accelerate analytics projects

HAWQ Soars Higher

HAWQ is a modern distributed, parallel query processor on top of HDFS that gives enterprises the best of both worlds: high-performance SQL query processing and scalable open storage. Because the data is stored directly on HDFS, all the usual Hadoop features remain available. With HAWQ, SQL can scale on Hadoop to petabyte-range datasets. HAWQ natively reads data from and writes data to HDFS. It offers true SQL capabilities, including SQL-standards compliance, ACID compliance, and a cost-based query optimizer. Users can connect to HAWQ from most popular programming languages, with ODBC and JDBC support, and HAWQ supports both columnar and row-oriented storage.
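
Since HAWQ derives from Greenplum and is generally reachable through PostgreSQL-compatible drivers, a standard PostgreSQL JDBC driver is one plausible way to query it from Java; the host, port, database, credentials, and sales table in this sketch are all invented.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class HawqQuerySketch {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders; adjust for your HAWQ master.
            String url = "jdbc:postgresql://hawq-master:5432/analytics";

            try (Connection conn = DriverManager.getConnection(url, "gpadmin", "secret");
                 PreparedStatement ps = conn.prepareStatement(
                         "SELECT product, SUM(amount) FROM sales WHERE sale_date >= ? GROUP BY product")) {

                ps.setDate(1, java.sql.Date.valueOf("2013-01-01"));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getBigDecimal(2));
                    }
                }
            }
        }
    }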

HAWQ can tolerate disk-level and node-level failures. It can coexist with MapReduce, HBase, and other database technologies common in the Hadoop ecosystem. It supports traditional OLAP as well as advanced machine learning capabilities such as supervised and unsupervised learning, inference, and regression.

HAWQ’s industry-leading performance is achieved with dynamic pipelining, a parallel data-flow framework that orchestrates query execution. HAWQ breaks complex queries into smaller tasks and dispatches them to query-processing segments, which work together to deliver a single result set. HAWQ was benchmarked against Hive and Impala deployed on the Pivotal Analytics Workbench (AWB), and the results clearly showed this performance in action. Here are the performance results for five real-world queries on HAWQ, Hive, and Impala.

Performance results for five real-world queries on HAWQ, Hive, and Impala

Read the white paper – Pivotal HD: HAWQ, A True SQL Engine for Hadoop – for more details.


Delta of Hadoop Distributions

Greenplum introduced its first Hadoop distribution, GPHD (Greenplum Hadoop Distribution), in 2011, removing the need to build out a Hadoop cluster from scratch. In February this year, Pivotal (Greenplum) announced its first product, Pivotal HD, which expands the capabilities of Hadoop into an enterprise data platform. A quick check on the differences follows…

GPHD includes –

  • Installation and Configuration Manager (ICM) – cluster installation, upgrade, and expansion tools.
  • GP Command Center – visual interface for cluster health, system metrics, and job monitoring.
  • Hadoop Virtualization Extension (HVE) – enhances Hadoop to support virtual node awareness and enables greater cluster elasticity.
  • GP Data Loader – parallel loading infrastructure that supports “line speed” data loading into HDFS.
  • Isilon Integration – extensively tested at scale with guidelines for compute-heavy, storage-heavy, and balanced configurations.

Pivotal HD adds the following to GPHD –

  • Advanced Database Services (HAWQ) – high-performance, “True SQL” query interface running within the Hadoop cluster.
  • Extensions Framework (GPXF) – support for HAWQ interfaces on external data providers (HBase, Avro, etc.).
  • Advanced Analytics Functions (MADLib) – ability to access parallelized machine-learning and data-mining functions at scale.
  • YARN – Pivotal HD is built on the Hadoop 2.0.2 core, so it ships both the old MapReduce 1.0 runtime and the new YARN (also known as MapReduce 2.0) runtime.

Core components and versions are listed below –

HD Components and Versions


Introduction to Pivotal HD

Pivotal HD is a full Apache Hadoop distribution with Pivotal add-ons and native integration with the Greenplum database, bringing together NoSQL and SQL access layers over multi-structured data stored in HDFS. The distribution offers the world’s first true SQL processing for enterprise-ready Hadoop. Backed by a host of EMC technologies, Pivotal HD is virtualization- and cloud-ready with VMware and Isilon. It comes in three flavors.

Pivotal HD Single Node VM lets developers, data professionals, and data scientists experiment with real-world data, perform advanced analytics, and rapidly reveal insights from big data sets with ease on their laptops. It is a preconfigured installation of Hadoop with HAWQ’s SQL query-processing power and speed.

Pivotal HD Community Edition comprises the Apache Hadoop stack (HDFS, MapReduce, Hive, Mahout, Pig, HBase for NoSQL, Zookeeper, Sqoop, and Flume) with a 50-node limit.

Pivotal HD Enterprise is a commercially supported distribution of the Apache Hadoop stack, tested at scale on Pivotal’s 1,000-node Pivotal Analytics Workbench. It comes with essential add-ons for corporate customers, including support for the Spring Framework and a number of projects in its ecosystem, such as the processing framework Spring Batch and Spring for Apache Hadoop, which simplify Hadoop application development for users of the enterprise Java framework.

Download Pivotal HD and get started!

Pivotal HD


Hadoop Install on Windows Server 2012

My installation notes for Cygwin and Hadoop on Windows Server 2012 –

https://github.com/mercyp/Hadoop


Arrow of Time in Big Data – Understanding the Interconnectedness

“Arrow of time” is a term coined by the British astronomer Arthur Eddington to describe how time flows inexorably in one direction – the “asymmetry” of time. We experience this arrow in our everyday lives: certain conditions, developments, or processes lead to future events that cannot be undone, such as ice cubes melting in a drink at room temperature or eggs being cooked into an omelet – the water cannot be turned back into ice cubes, nor the omelet back into eggs.

Cause precedes effect – the causal event occurs before the event it affects. For example, dropping a glass of wine is the cause, while the glass subsequently shattering and spilling the wine is the effect. Time’s arrow is bound by this causality. Our perception of cause and effect in the dropped-glass example is a consequence of the second law of thermodynamics. Causing something to happen – controlling the future – creates correlations between the cause and the effect, and these correlations can only be generated as we move forward in time; moving backward in time would render the scenario nonsensical. In the subjective arrow of time, we know the past but cannot change it, while we have no direct knowledge of the future yet can predict it in part, because it can still be modified by our actions and decisions.

Special, non-generic initial conditions of the universe are used to explain the time-directed nature of the dynamics we see around us, and many cosmological theories use physical conditions and processes to set up the initial conditions of the standard big bang that created our physical universe. Today, our existence is further shaped by our own digital universe: electromagnetic waves and signals, electronic circuits, copper wires and fiber channels, and the need to achieve results quickly, efficiently, and accurately led to a data big bang and, subsequently, the growth of the digital universe.

As organizations go about their business and scientific research runs its experiments, more applications are created every day, and they generate tremendous amounts of data. In a very positive socio-economic transformation, the rapid adoption of GPS-enabled, media-rich smart mobile devices that integrate well with social networking sites has paved the way to a new way of living – instant and spontaneous exchange of information among individuals across the world. Data is pouring out at such a pace that, by the end of this decade, the volume of data will be 50 times what we have today.

Data comes from a variety of sources, from business applications and scientific research to personal data, and the velocity at which it is generated is constant and instantaneous. This volume and variety are fuelled by the properties of the cloud – affordability, agility, and extensibility. The data thus created only makes sense when small, somehow-related sets are aggregated into a larger set of fragments so that patterns can be identified and decisions and actions defined to influence future effects – and that, in essence, is the definition of big data.

Big data is any data whose attributes challenge the constraints of business needs or system capability. Take diversity of generation, for example – from automated generation of data, such as weather-prediction imagery, to manual entry, such as tweets and blogs. Data generated this way is updated at an amazing rate, iteratively and incrementally, and as we move forward in time, data about data is created, calculated, or inferred. Regardless of the size, speed, or source of the data, big data is about making sense out of chaos: finding meaning in data that is constantly changing, finding relationships in how data is created, and understanding the interconnectedness that unlocks the value of big data – modeling events that affect the future.

IDC’s 2011 digital universe study states that, like our physical universe, the digital universe is something to behold – 1.8 trillion gigabytes in 500 quadrillion “files” – and more than doubling every two years. That is nearly as many bits of information in the digital universe as there are stars in our physical universe.

Timeline

Entropy is a measure of the disorder of a system. A well-organized system, like ice cubes or unbroken eggs, has low entropy, whereas a disorganized system, like melted ice or broken eggs, has high entropy. Left to its own devices, entropy goes up as time passes; it is the only quantity in the physical sciences that requires a particular direction for time. As we go “forward” in time, the second law of thermodynamics says that the entropy of an isolated system will increase. Measuring entropy is therefore one way of distinguishing the past from the future. The perceptual arrow of time is the continuous movement from the known past to the unknown future. Expecting or predicting that unknown future is what makes a system move toward it – like setting a goal for a system’s output, or, in our case, the hopes, dreams, and desires that make us move toward the future.

As our digital universe grows over time, the variety, velocity, and volume of data increase, which often leads to disorder such as gaps in governance, compliance, and security. We rely on governance standards, regulatory standards, and privacy laws that try to anticipate the future state of the system.

Entropy

Traditional BI fails when data sets become progressively more diverse and more granular while the required results are real-time and iterative. This kind of analysis requires organizations to capture exhaustive information from a specific moment in time before entropy takes its effect on the data. Conventional relational database systems do not support unstructured, high-volume, fast-changing data – big data. It requires a new generation of technologies, tools, and analytic methods to extract value from our digital universe. Big data approaches are essential when organizations want to engage in predictive modeling, advanced statistical techniques, natural language processing, image analysis, or mathematical optimization.

If we want big data to transform business and extract real value from the chaos of our digital universe, it is extremely important to find the interconnectedness as we move forward in time. This helps a business make the best decision at the right time and deliver it to the right people to execute. This post is an attempt to apply the concept of the arrow of time to big data – to understand the interconnectedness in the digital universe and derive real business value from big data at the enterprise level.

References

http://www.emc.com/microsites/cio/articles/big-data-big-opportunities/LCIA-BigData-Opportunities-Value.pdf
