Measuring Databases

Measuring the first V of big data – Volume – is critical and essential. Here are some examples from the technologies I've worked on.

SQL Server

sp_helpdb 'database_name' – returns the size of the data and log files of a database

sp_spaceused 'table_name' – returns the number of rows and the space used by the data and indexes of a table
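
To see which tables consume the most space in a database in one pass, the allocation metadata can be queried directly – a minimal sketch, assuming the login has VIEW DATABASE STATE permission (counts are in 8 KB pages):

SELECT t.name AS table_name,
       SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb   -- 8 KB pages converted to MB
FROM sys.dm_db_partition_stats AS ps
JOIN sys.tables AS t ON t.object_id = ps.object_id
GROUP BY t.name
ORDER BY reserved_mb DESC;   -- largest tables first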

GreenPlum / PostgreSQL

# SELECT sodddatname, sodddatsize FROM gp_toolkit.gp_size_of_database; – returns the size of all databases using the gp_toolkit schema

# SELECT pg_database_size('database_name'); – returns the size of a database in bytes

# SELECT pg_size_pretty(pg_database_size('database_name')); – returns the size of a database in a human-readable unit (kB, MB, GB)

# SELECT pg_size_pretty(pg_relation_size('schema.tablename')); – returns the size of a non-partitioned table excluding indexes
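
If index and TOAST space should be counted as well, pg_total_relation_size can be used the same way (the schema and table names are placeholders):

# SELECT pg_size_pretty(pg_total_relation_size('schema.tablename')); – returns the size of a table including its indexes and TOAST data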

Hive / HDFS

sudo -u hdfs hadoop fs -du /user/hive/warehouse/ | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
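
The command above lists each directory under the Hive warehouse with its size converted from bytes to GB. To check a single table's footprint, a summarized du works too – a sketch assuming the default warehouse layout, where database_name and table_name are placeholders (older Hadoop releases use -dus instead of -du -s):

sudo -u hdfs hadoop fs -du -s /user/hive/warehouse/database_name.db/table_name – returns the total bytes used by that table's directory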


Hosting Big Data

Rackspace recently introduced new Big Data hosting options – customize your configuration for managing a big data platform, run Hadoop on the public cloud, or configure your own private cloud. Rackspace eliminates the complex process of building and maintaining a big data environment, making it easier to evaluate Hadoop and advanced analytics in a proof of concept or proof of value. Once the concept or value is proven, the solution can be implemented with the right Hadoop distribution and architecture on a public or private cloud.

The Managed Big Data Platform provides the flexibility to optimize the configuration – for example, scale compute or storage independently with EMC Isilon, the EMC VNX series, or the VMAX series (all models), reduce operational costs, and integrate with custom solutions. EMC Isilon separates compute and storage resources and provides a level of redundancy and snapshotting that is not currently available in traditional bare-metal hosting. Isilon can be used as a data repository or data lake, parsing out some of the data services to Hadoop. For persistent Hadoop workloads, Isilon keeps data local to boost performance. When disaster recovery is required, Isilon provides snapshot and replication technology for maximum data protection. Here is the architecture for the Rackspace Managed Big Data Platform on EMC Isilon:

[Figure: Rackspace Managed Big Data Platform on EMC Isilon – reference architecture]


PivotalR

The PivotalR package is an R front-end to PostgreSQL and the Pivotal (Greenplum) database, and a wrapper for the open-source machine learning library MADlib. It also interacts with Pivotal HD/HAWQ for big data analytics by providing a data.frame-like interface to operations on tables and views in the database. Hence R users do not need to learn SQL when they work on objects in the database.

The package enables R users to operate on data sets that would not fit into R's memory and lets them use R scripts to leverage MPP databases as well as in-database analytics libraries. It also minimizes the data transferred between R and the database: the data stays in the database, and when the user enters R commands, the package translates them into SQL queries and sends them to the database for parallel execution. After execution, the computed result is returned to R, combining the analytical power of the database with the graphical capabilities of R.
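
A minimal sketch of that workflow in R, assuming a reachable Greenplum/PostgreSQL database with MADlib installed – the connection arguments and the public.abalone table are placeholders:

library(PivotalR)
cid <- db.connect(host = "localhost", dbname = "madlib_test", port = 5432)  # open a database connection
x <- db.data.frame("public.abalone", conn.id = cid)   # data.frame-like wrapper; no rows are pulled into R
dim(x)                                                # translated to SQL and executed in the database
fit <- madlib.lm(rings ~ length + diameter, data = x) # MADlib linear regression, run in-database
db.disconnect(cid)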

PivotalR provides the core R infrastructure and over 50 analytical functions in R that leverage in-database execution. These include

  • Data Connectivity – db.connect, db.disconnect, db.Rquery
  • Data Exploration – db.data.frame, subsets
  • R language features – dim, names, min, max, nrow, ncol, summary, etc.
  • Reorganization Functions – merge, by (group-by), samples
  • Transformations – as.factor, null replacement
  • Algorithms – linear regression and logistic regression wrappers for MADlib



Difference between MapReduce 1.0 and MapReduce 2.0

Apache Hadoop, introduced in 2005, has a core MapReduce processing engine to support distributed processing of large-scale data workloads. Several years later, major changes were made to that core so that the Hadoop framework supports not just MapReduce but other distributed processing models as well.

MapReduce 1.0

In a typical Hadoop cluster, racks are interconnected via core switches, and core switches connect to top-of-rack switches. Enterprises using Hadoop should consider 10GbE, bonded Ethernet, and redundant top-of-rack switches to mitigate risk in the event of failure. A file is broken into 64MB chunks by default and distributed across DataNodes. Each chunk has a default replication factor of 3, meaning there will be 3 copies of the data at any given time. Hadoop is "rack aware": HDFS places replicated chunks on nodes in different racks, the JobTracker assigns tasks to the nodes closest to the data, and rack awareness helps the NameNode determine the 'closest' chunk to a client during reads. The administrator supplies a script which tells Hadoop which rack a node is in, for example: /enterprisedatacenter/rack2.
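
A minimal sketch of such a rack-topology script, wired in through the topology.script.file.name property in core-site.xml (net.topology.script.file.name on Hadoop 2); the subnets and rack names below are placeholders:

#!/bin/bash
# rack-topology.sh – Hadoop passes a list of node IPs/hostnames as arguments
# and expects one rack path per argument on stdout
for node in "$@"; do
  case "$node" in
    192.168.1.*) echo "/enterprisedatacenter/rack1" ;;
    192.168.2.*) echo "/enterprisedatacenter/rack2" ;;
    *)           echo "/default-rack" ;;
  esac
done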

[Figure: MapReduce 1.0 architecture]

Limitations of MapReduce 1.0 – Hadoop can scale only to about 4,000 nodes. Beyond that limit, it exhibits unpredictable behavior such as cascading failures and serious deterioration of the overall cluster. Another issue is multi-tenancy – it is impossible to run frameworks other than MapReduce 1.0 on the cluster.

MapReduce 2.0

MapReduce 2.0 has two components – YARN, which provides cluster resource management, and MapReduce itself.

[Figure: MapReduce 2.0 (YARN) architecture]

In MapReduce 2.0, the JobTracker is divided into three services:

  • ResourceManager, a persistent YARN service that receives and runs applications on the cluster (a MapReduce job is an application)
  • JobHistoryServer, which provides information about completed jobs
  • ApplicationMaster, which manages each MapReduce job and is terminated when the job completes

Also, the TaskTracker has been replaced with the NodeManager, a YARN service that manages resources and deployment on a node. The NodeManager is responsible for launching containers, each of which runs a map or reduce task (or the ApplicationMaster itself).
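
On a running Hadoop 2 cluster these services can be observed from the command line – a quick sketch (output formats vary by distribution):

yarn application -list – lists applications known to the ResourceManager
yarn node -list – lists NodeManagers and the number of running containers on each node
mapred job -list – lists MapReduce jobs, which now run as YARN applications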

This new architecture breaks up the JobTracker model: a new ResourceManager manages resource usage across applications, while ApplicationMasters take responsibility for managing the execution of individual jobs. The change removes a bottleneck and lets Hadoop clusters scale to configurations larger than 4,000 nodes. It also allows simultaneous execution of a variety of programming models such as graph processing, iterative processing, machine learning, and general cluster computing, including the traditional MapReduce.


Self-Service Data Access – Pivotal DD

Enterprise data resides in heterogeneous systems and comes in different data types. IT is challenged to consolidate that data at the right time, and it is often difficult even to know which data sources must be accessed. Pivotal DD gives users on-demand access to data in these heterogeneous systems: it helps analysts and data scientists discover data in any data source on a self-service basis and provision data from multiple sources onto sandboxes in Hadoop or MPP databases. Pivotal DD includes native, high-speed adapters and resource monitoring for multiple platforms, including Pivotal HD and HAWQ, Pivotal Greenplum Database (GPDB), Apache Hadoop, IBM Netezza, Oracle, and SQL Server. Pivotal DD can also connect with any database through JDBC and with most distributed file systems such as NFS. It is installed as a middleware service on a commodity Linux cluster with the following component services:

  • Metadata and Security
  • Data Discovery and Search
  • Workflow Design
  • Resource Management

For further reading, read the technical brief here.


Virtualizing Hadoop

In the traditional Hadoop model, HDFS (the "storage") and MapReduce (the "compute") are combined. If this model is translated directly into a VM, it limits the ability to scale up and down because the lifecycle of the VM is tightly coupled to the data: when such a VM is powered off, its data is lost. Scaling out also requires rebalancing data to expand the cluster. Hence this model is not very elastic.

Separating compute from storage in a virtual Hadoop cluster achieves elasticity and improves resource utilization. It is simple to keep the HDFS storage layer always available while the compute layer runs a variable number of TaskTracker nodes that can be grown or shrunk on demand. Data-compute separation also enables multi-tenancy on the virtualized Hadoop cluster, so each virtual compute cluster gets performance, security, and configuration isolation.

EMC brings two pieces to this solution – Isilon for the storage layer and vSphere for topology awareness.

For more details and step-by-step installation notes, check out the EMC Hadoop Starter Kit (HSK), which is intended to simplify Hadoop deployments across distributions and reduce the time and cost of deployment.


Run Splunk with EMC

Splunk is a powerful data analytics platform that collects, indexes, and analyzes data from virtually any source, including application and machine-generated data, in a searchable repository from which it can generate meaningful insights. Organizations use it to quickly analyze and gain insight into application problems, security, compliance, and web analytics issues. Splunk works especially well for enterprises that must store and process large volumes of data, provided it is paired with a scalable, highly available, cost-effective infrastructure.

EMC provides that infrastructure for Splunk –

  • Isilon delivers a complete scale-out network-attached storage system. The EMC Isilon OneFS operating system combines the file system, volume manager, and data protection into a single unified software layer, creating one intelligent file system that spans all nodes in an Isilon cluster.
  • High availability and scalability are required at the compute level to enable management and growth of any Splunk deployment. VMware vSphere provides dynamic compute scalability and high availability, allowing seamless scale-out and complete management of any Splunk deployment.

For more, read the EMC Reference Architecture for Splunk.


Prepping for Data Scientist Associate

I come from a content management background, handling terabytes of content. The content lifecycle starts with capture/creation, moves through versioning, managing, and publishing, and ends with archival and retention. Content is subject to information rights, compliance, governance, and retention, either at the organization level or on the worldwide web. Soon I moved to content management on the cloud, where establishing trust in managing content both on cloud and on premise is essential. Being part of EMC, it is natural to cover all three: cloud, big data, and trust. So I started exploring big data. Here is my little journey…


Best place to start – to understand big data concepts and business drivers:

  • Big Ideas by EMC TV – Patricia Florissi, EMC VP and Global Sales CTO, walks through a creative video explaining the concepts of big data, Hadoop, and associated solutions. This is good for getting acquainted with the terminology and concepts.
  • InFocus Blogs – EMC leaders April Reeve, Bill Schmarzo, David Dietrich, Frank Coleman, Laddie Suk, and Scott Burgess bring great insights in their posts related to big data. These posts bring an industry perspective and leadership thinking on big data.

Get Organized – It is easy to get lost in all the reading and web surfing. I prefer a disciplined, structured approach to learning, and that's where EMC Education comes to the rescue.

  • Business Transformation Course – Introduced later for data-savvy business leaders who can identify opportunities to solve business problems using advanced analytics.
  • Data Science and Big Data Analytics Course – My personal favorite; it covers data science, advanced analytics, the big data project lifecycle, and available solutions from EMC in depth.

Dive Deeper – If you already have a statistical or analytical background, you can skip this step. Since I worked with content management for a long time, I wanted to brush up on the analytics basics I learned in college. Coursera MOOCs are very helpful for getting the basics straight.

Dig More – After getting the fundamentals straight, it is easy to learn the associated solutions:

  • Greenplum Unified Analytics Platform – EMC Education offers a comprehensive approach to learning the massively parallel processing (MPP) architecture for big data analytics. I took the Greenplum Architecture and Administration course.
  • Hadoop – I took this training through the EMC Academic Alliance, have been practicing hands-on with Greenplum Hadoop, and have now started working with Pivotal HD.

The EMC Education Services Data Scientist course and the Greenplum Analytics Labs help you get started on your big data projects. Hope this helps! If you have any questions, I'll be more than happy to answer them.


EMC Kazeon – Dark Data Explorer

Dark matter in astronomy and cosmology is a hypothetical type of matter that accounts for a large part of the total mass in the universe. It neither emits nor absorbs light or other electromagnetic radiation, so it cannot be observed with any telescope; its existence is instead inferred from its gravitational effects on visible matter. Dark data is similar to dark matter. Gartner defines it as "the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes. Similar to dark matter in physics, dark data often comprises most organizations' universe of information assets." Organizations across all industries are provisionally collecting and storing a burgeoning amount of data for compliance and retention. When this existing, underutilized "dark data" – emails, multimedia, and other enterprise content – is explored, it potentially represents the most immediate opportunity to transform businesses. EMC Kazeon can bring this big data intelligence to organizations. Kazeon is a "drop-in" appliance, meaning it is drop, set, and go – there is no infrastructure to set up. It can be applied to file shares, laptops, desktops, email, SharePoint, or Documentum. When organizations use it on a continual basis to keep file monitoring in place, Kazeon reduces the costs and risks associated with dark data.


Hadoopable?

Recently I heard someone talk about "moving content into Hadoop" – although I did not question their motive further, it made me think seriously about what counts as an "effective solution" on Hadoop for day-to-day business problems. Hadoop is not a magic wand that wipes away every trouble in the business world and brings back sanity, revenue, opportunity, or anything else you can think of. The first thing to understand is Hadoop itself – it has HDFS for data storage and the MapReduce framework to perform batch analysis on the data stored in HDFS. The data stored in Hadoop need not be structured – it works well with unstructured, quasi-structured, and semi-structured data from different sources. Hadoop produces excellent results when the volume of data is in petabytes.

That said, Hadoop can work with structured data as well. At the same time, Hadoop cannot be considered a replacement for a conventional RDBMS. RDBMSs are ACID compliant to preserve data integrity and are optimized to capture and analyze transactions like online shopping, ATM withdrawals, patient records, and other real-world entities. These transactions often require low latency and fast retrieval, and Hadoop is not suited for such scenarios.

Hadoop stores data in files and does not index them. To retrieve anything, a MapReduce job has to go through all the data, and that takes time. Moving "content" from content management systems such as SharePoint or Documentum to Hadoop for reasons like the "affordable" storage of HDFS does not make sense either: Hadoop has no ECM or WCM capabilities.

Hadoop works when the data is too big for an RDBMS – when it reaches technical limits. There are solutions that take advantage of HDFS to store structured data and leverage its relatively inexpensive storage; they usually work by moving the data from the RDBMS to Hadoop for batch analysis. But shifting data back and forth between the RDBMS and Hadoop can be overkill if the storage needs are not huge.

Business scenarios for Hadoop are those where data arrives in high volume (petabytes) and is to be analyzed and queried at length later. HDFS and MapReduce come to the rescue for such high-volume storage and batch analysis. Use cases such as pattern recognition, sentiment analysis, recommendation engines, and index building are ideal for Hadoop.

Hadoop cannot replace existing systems like RDBMSs, ecommerce platforms, or content management systems. Instead, Hadoop should be integrated with the existing LOB systems to augment their data management and storage capabilities. Connect existing LOB systems with Hadoop using tools like Sqoop (to pull or push data between an RDBMS and Hadoop) and Flume (to stream system logs into Hadoop in near real time), irrespective of data volume, to gain meaningful insights.
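
For the RDBMS side, a minimal Sqoop import sketch looks like this (the JDBC URL, credentials, table, and target directory are placeholders):

sqoop import --connect jdbc:mysql://dbhost/sales --username etl_user -P --table orders --target-dir /data/sales/orders -m 4 – copies the orders table into HDFS using 4 parallel map tasks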
