Cruising Data Lakes at Supersonic Speeds

Traditional, or second-platform, workloads typically go to file shares on NAS, HPC on SAN, or backup/archive on tape, and they work over SMB, NFS, or FTP protocols. Emerging workloads such as Hadoop, referred to as the third platform, push organizations toward DAS, mobile, or object storage, driven by HTTP, HDFS, REST, and Swift protocols. Existing unstructured-data storage architectures are limited in their ability to handle large PB-scale datasets. Islands of storage silos exist and keep growing. This becomes a management challenge and can result in inefficient, unbalanced, underutilized storage capacity, and even storage hotspots. Administrators are left with the difficult task of shuffling data around to compensate for these inefficiencies.

The EMC Isilon core building blocks that serve both classes of workloads include the OneFS operating system, the scale-out NAS architecture, the scale-out data lake, and enterprise-grade software features. The EMC Isilon scale-out data lake stores, manages, protects, and secures unstructured data, and enables reporting and analytics on it, for both traditional and emerging workloads. Leveraging the scale-out data lake, organizations can consolidate multiple, disparate islands of storage into a single cohesive and unified data repository. The scale-out data lake is the key enabler for driving business value in enterprise environments through its multi-protocol, multi-access, single namespace and data repository. Without a data lake, interoperability would require an expensive and time-consuming sequence of operations on data across multiple silos, or even costly and inefficient data duplication.

In the latest updates, made generally available last week, EMC Isilon introduced a new version of OneFS, 7.1.1, and two new platforms, the S210 and the X410. OneFS 7.1.1 includes the new SmartFlash, a flash-based cache that lets enterprises get to their data more quickly and can scale up to one petabyte of cache in a single cluster, delivering 100% flash efficiency and reduced latency for second- and third-platform workloads. The Isilon S-Series combines strong IOPS performance with high efficiency in an ultra-low-overhead scale-out NAS package that is ideal for random-access and file-based applications. The new Isilon S210, aimed at highly transactional workloads in industries such as media and entertainment and financial services, runs up to 3.75 million IOPS per cluster and provides flexible configuration and deployment options. The Isilon X-Series, the most flexible and comprehensive storage product line, strikes the right balance between large capacity and high performance. The Isilon X410 offers a 70% increase in throughput at 33% lower $/MB/s, and up to 200 GB/s per cluster is achievable. Its versatility easily supports Hadoop analytics, high-performance computing, and enterprise file applications.

Posted in Big Data, Hadoop

Comparing vCHS, AWS, and Azure

Deployment Model
  vCHS:  Virtual Private Cloud; Dedicated Cloud
  AWS:   Public Cloud
  Azure: Public Cloud

Service Model
  vCHS:  IaaS; PaaS; DaaS; DRaaS
  AWS:   IaaS
  Azure: IaaS; PaaS

Platform
  vCHS:  vSphere; vCloud Director; vCloud Networking & Security
  AWS:   Elastic Compute Cloud
  Azure: Windows Server 2008 R2 & Windows Server 2012 Hyper-V

Data Center Locations
  vCHS:  3 US data centers, with 2 more in planning
  AWS:   8 data centers located in the US, Europe, and Asia
  Azure: 8 data centers located in the US, Europe, and Asia

Subscription Options
  vCHS:  Virtual Private Cloud; Dedicated Cloud; Hourly Rate; Monthly / Fixed Rate
  AWS:   Reserved Instances; Spot Instances; Free Plan
  Azure: Hourly Rate

Base Plan
  vCHS:  Virtual Private Cloud – $0.03 per hour; Dedicated Cloud – $0.13 per hour
  AWS:   $0.11 per hour
  Azure: $0.02 per hour

High Availability
  vCHS:  Virtual Private Cloud – 99.90%; Dedicated Cloud – 99.95%
  AWS:   99.95%
  Azure: 99.95% (for customers with two or more instances of the same role)

Guest Operating Systems
  vCHS:  Supports 67 OS
  AWS:   Supports 33 OS
  Azure: Supports 7 OS

Security and Compliance
  vCHS:  ISO 27001; SOC 2 Type 1; SOC 1 (SSAE 16 / ISAE 3402) Type 2; SOC 2 Type 2; FedRAMP / FISMA; HIPAA infrastructure; PCI DSS 2.0; CSA
  AWS:   HIPAA; SOC 1 / SSAE 16 / ISAE 3402; SOC 2; SOC 3; PCI DSS Level 1; ISO 27001; FedRAMP(SM); DIACAP and FISMA; ITAR; FIPS 140-2; CSA; MPAA
  Azure: ISO/IEC 27001:2005; SOC 1 / SSAE 16 / ISAE 3402 (formerly SAS 70); SOC 2; CSA; FedRAMP(SM); PCI DSS Level 1; G-Cloud Accredited Level 2; HIPAA; FERPA

Hadoop
  vCHS:  Pivotal CF 1.2
  AWS:   EMR – Elastic MapReduce
  Azure: HDInsight
Posted in Cloud

E = mc2 and The Cosmic Dance

The first law of thermodynamics, or the law of conservation of energy, states that “energy can neither be created nor destroyed but can change from one form to another”. Energy takes different forms: heat, electrical, chemical, gravitational, motion, and so on. For example, an apple falling from a tree converts the gravitational potential energy given to it by the earth into motion, or kinetic energy; when it hits the ground it comes to rest, and the fallen apple attracts everything from ants and gnomes to humans. The theory of relativity says the amount of energy contained in a particle is equal to the particle’s mass, m, times c2, the square of the speed of light; thus

E = mc2

Once mass is seen as a form of energy, it is no longer indestructible but can be transformed into other forms of energy. This happens when subatomic particles collide: particles can be destroyed, and the energy contained in their masses can be used to form new particles. This creation and destruction of material particles is the stunning consequence of mass-energy equivalence. Particles form dynamic patterns in the four dimensions of space-time, making everything connected at the subatomic level.
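To get a feel for the magnitude involved, here is a quick back-of-the-envelope calculation of my own (one gram of mass, SI units, taking 1 kiloton of TNT as 4.184e12 J):

```python
# E = m * c^2 for one gram of mass
c = 299_792_458          # speed of light in m/s (exact by definition)
m = 0.001                # mass in kg (one gram)
E = m * c ** 2           # energy in joules
print(f"E = {E:.3e} J")                           # about 8.988e13 J
print(f"  = {E / 4.184e12:.1f} kilotons of TNT")  # roughly 21 kt
```

A single gram of matter, fully converted, carries roughly the energy of a 21-kiloton nuclear explosion.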

That much I learnt in school. Being raised a Hindu, and adding my own nerdy inquisitiveness, I know Lord Shiva as the dancing Natarajar; his “ananda thandavam”, or “dance of bliss”, is the creation of the universe. His upper left hand holding a small drum symbolizes “creation”; the fire in his upper right hand symbolizes “destruction”; the second right hand showing the “abhaya mudra” symbolizes “protection”; the second left hand pointing to his feet represents “salvation and grace”; and all of this happens within the ring of fire around him, which represents the continuous cycle holding all four of these aspects of the universe. He dances on a small demon, “muyalakan”, who represents “ignorance”. Altogether the iconography says that if you place your “ignorance” at God’s feet, He will protect you from the continuous cycle of creation and destruction, leading you to salvation.

While researching for my painting, although it is nothing but a classical Tanjore one, the revelations were mind-boggling. Sharada Srinivasan, in her paper “Shiva as ‘cosmic dancer’: on Pallava origins for the Nataraja bronze”, shows with archaeometallurgical, iconographic, and literary evidence that the iconography originated in Tamil Nadu, India, between the 7th and mid-9th centuries CE, first in stone sculptures and later in timeless bronzes. I also stumbled upon Fritjof Capra’s “The Tao of Physics”, where the parallels between modern physics and the cosmic dance of Shiva are beautifully drawn. To quote:

For the modern physicists, then, Shiva’s dance is the dance of subatomic matter. As in Hindu mythology, it is a continual dance of creation and destruction involving the whole cosmos; the basis of all existence and of all natural phenomena. Hundreds of years ago, Indian artists created visual images of dancing Shivas in a beautiful series of bronzes. In our time, physicists have used the most advanced technology to portray the patterns of the cosmic dance. The metaphor of the cosmic dance thus unifies ancient mythology, religious art and modern physics.

The Indian government acknowledged the insightful significance of the metaphor of Shiva’s dance for the cosmic dance of subatomic particles, observed and analyzed by CERN’s physicists, by gifting CERN a 2m-tall statue of Natarajar in 2004. Here is my own, to remind my future family and me of our deep and profound cultural and heritage roots in Tamil Nadu and India.

Natarajar

Posted in General, Philosophy

What can you do with EMC Enterprise Hybrid Cloud?

Enterprises and service providers are moving toward virtual data centers, cloud architectures, or the new Software-Defined Data Center to gain agility, efficiency, and cost control. The move to these dynamic, virtualized cloud architectures creates new challenges for the management teams responsible for ensuring service availability, performance, and compliance. Until now, they have had to choose between the speed and agility of public cloud services and the control and security of private cloud infrastructure.

Over the past few years, IT organizations have been working with both private and public cloud offerings. Bringing together the advantages of the two has been difficult, from the lack of interoperability across different platforms to the poor visibility caused by separate processes and tools. The EMC Hybrid Cloud solution brings the best of both worlds by brokering services between private and public clouds, and it provides visibility and control over where business applications are hosted.

EMC Hybrid Cloud is not a product but a solution: a pre-integrated, complete Federation stack of products that can be installed and implemented in as little as a few days by following step-by-step instructions published by EMC. The Hybrid Cloud Reference Architecture was developed through extensive interoperability testing and EMC’s expertise in hybrid cloud deployments to speed adoption of a hybrid cloud. These defined standards provide enterprises with a fully integrated cloud infrastructure by identifying their specific use cases and applying them to the predefined solution.

EMC Enterprise Hybrid Cloud Overview

Enterprise Hybrid Cloud enables a smooth transformation for enterprises to the 3rd Platform. The standard use cases are:

Self-Service Storage: Self-service access to storage resources enables rapid turnaround for enterprises. It should also provide multiprotocol support for file, block, and object storage. Solution features include VMware vCloud Automation Center, EMC ViPR, and EMC and/or 3rd-party storage.

Automated Provisioning: Users are offered a portal for requesting cloud resources. This request can be an individual VM, or a complete application stack that leverages multiple VMs, e.g. database, application, and web servers. Solution features include VMware vCloud Automation Center, VMware vCenter Orchestrator, and EMC vCO Workflows.

Secure Multi-tenancy: The business case for multi-tenancy is that end users in the IaaS cloud can request virtual machines that are isolated from those of other cloud users, which enables support for specific compliance, security, and regulatory needs. Solution features include vCloud Automation Center, vCloud Networking and Security, and VMware NSX.

Data Availability and Protection: Efficient data availability and protection provides operational savings. Users can set their own data-protection policies and report on the protection status of their own data, and those policies can be enforced in line with corporate or regulatory backup and recovery requirements. Solution features include VMware vCAC, VMware vCenter Orchestrator, EMC Avamar, and EMC Data Domain.

Automated Monitoring: Automated monitoring ensures not only that capacity, performance, and health are monitored, but also that alerts are issued intelligently. It thus reduces the noise of a monitoring system to what matters for maintaining uptime and service-level agreements. Solution features include VMware vCenter Operations, EMC ViPR Storage Analytics, and VMware Log Insight.

Transparent Pricing: Financial transparency of service cost is essential to reduce the waste of virtualized resources. Resource- and group-level chargeback services should bill users for what they use, and reports on usage and cost metrics help IT plan well. Solution features include VMware IT Business Management and vCenter Operations Manager.

Posted in Hybrid Cloud

Journey to the Third Platform – ECS

The recent trends in IT that drive everything today are mobile, cloud, big data, and social. Gartner calls this the “nexus of forces” that “is transforming the way people and businesses relate to technology”. These trends underpin a shift to what IDC calls the third platform. The first platform was built around the mainframe, with millions of users and thousands of applications. The second platform is what we saw last decade with client/server and distributed computing, which grew the scale to hundreds of millions of users and tens of thousands of apps. The third platform is a new era that we believe will be driven by mobile devices, and it is an order of magnitude larger again: billions of users and hundreds of millions of applications.

3rd Platform

Organizations, amid the complexity of today’s IT environment, want to lower operational costs, increase revenue, and reduce risk. Organizations that can tap into third-platform users and applications can create a converged vision that supports business goals and leads innovation. Overwhelming data growth outstrips the ability of traditional storage systems and drives the need for a new breed of solutions that strike a balance between the simplicity and cost-effectiveness of the public cloud and the control and reliability of traditional enterprise IT.

The Elastic Cloud Storage (ECS) Appliance, powered by EMC ViPR, provides a complete hyper-scale storage infrastructure designed to meet the requirements of 3rd Platform users and applications. The ECS Appliance ships as rack-based configurations of 4 or 8 nodes, with up to 60 6 TB disks per node, for a maximum capacity of 2.9 PB. ECS offers a total cost of ownership (TCO) between 9 and 28 percent lower than public cloud alternatives such as Amazon Web Services and Google Compute Engine, while eliminating the compliance issues.
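As a back-of-the-envelope check on those capacity numbers (decimal units assumed):

```python
# Raw capacity of the largest ECS rack configuration described above
nodes = 8
disks_per_node = 60
tb_per_disk = 6
raw_tb = nodes * disks_per_node * tb_per_disk
print(raw_tb, "TB raw =", raw_tb / 1000, "PB")  # 2880 TB = 2.88 PB, i.e. the quoted ~2.9 PB
```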

ECS, along with ViPR 2.0, provides universal accessibility with support for block, object, and HDFS protocols; file-based access will be added in a future release, along with support for new ViPR services as they become available. It features superior economics, geo-efficient protection, multi-tenancy, detailed metering, a self-service portal, and billing integration. ECS delivers all of this software-defined functionality in an easy-to-buy, easy-to-deploy hardware appliance.

Posted in ViPR

Fun with Financials

When there is a need for quick and comprehensive financial data in R, quantmod comes to the rescue. It was originally envisioned as a rapid prototyping environment to facilitate quantitative modeling, testing, and trading. quantmod allows R to read data from CSV files, spreadsheets, databases, datasets of statistical packages, and the web (Google, Yahoo, FRED, and others).

Here is a quick look on how precious metals performed in one year.

> library(quantmod)
> getMetals(c('XPT', 'XAU', 'XAG'), from=Sys.Date()-365)
> layout(matrix(1:3, nrow=3))
> chartSeries(XPTUSD, layout=NULL, TA=NULL)
> chartSeries(XAUUSD, layout=NULL, TA=NULL)
> chartSeries(XAGUSD, layout=NULL, TA=NULL)
> layout(1)

Precious Metals
Analyzing financial results and stock quotes of EMC –

> getFinancials("EMC")
> viewFinancials(EMC.f)

EMC Financials

> getSymbols("EMC")
> chartSeries(EMC)
> addBBands()
> reChart(subset="2014", theme="white", type="candles")
EMC Quote charts (full history and 2014)

A quick look at FRED (Federal Reserve Economic Data, from the Federal Reserve Bank of St. Louis) for the Consumer Price Index:
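The series behind that chart can be fetched directly with getSymbols; a minimal sketch in the same console style, assuming FRED's standard series id CPIAUCSL (CPI for All Urban Consumers):

```r
> getSymbols("CPIAUCSL", src = "FRED")
> chartSeries(CPIAUCSL, theme = "white")
```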

CPI graph (FRED Consumer Price Index)
Posted in R

Introducing Pivotal HD 2.0

Pivotal HD 2.0 is a commercial distribution of Apache Hadoop 2.2. Alongside, it brings an in-memory SQL database to Hadoop through seamless integration with Pivotal GemFire XD, a SQL-compliant, in-memory database designed for real-time analytics in Big Data applications. GemFire XD facilitates real-time Big Data analytics on Hadoop and is explicitly designed for data environments with high demands for scalability and availability. Pivotal HD 2.0 also expands analytic use cases through integration with GraphLab for graph analytics, as well as enhancements to HAWQ such as support for MADlib, R, Python, Java, and Parquet.

Organizations can treat business data lakes, manageable sets of data, as the way to quickly gain value in the Big Data world. When there is a need to quickly derive insights from real-time transactions, the most recent data can be treated as a business data lake and maintained in-memory for quick response and query analysis. Pivotal HD makes this data immediately available for SQL analysis in-memory or in HDFS, completely eliminating the need for ETL. With business data lakes as the foundation of the architecture of Pivotal HD 2.0, HAWQ, and GemFire XD, the platform is best suited for organizations looking to take advantage of real-time data analytics.

GemFire XD is an ANSI SQL-compliant database with high-availability features that can run over WANs, and it can coexist with existing databases. In another major enhancement, the HAWQ SQL-on-Hadoop query engine, which is based on the Greenplum database, can now apply the more than 50 in-database algorithms of the MADlib machine learning library. It also supports automatic translation of R, Python, and Java-based queries and applications, as well as GraphLab, an open-source framework containing a set of tools and algorithms that allow data scientists and analysts to gain deeper insight into their data.

Josh Klahr, Vice President of Product Management at Pivotal, says: “When it comes to Hadoop, other approaches in the market have left customers with a mishmash of un-integrated products and processes. Pivotal HD 2.0 is the first platform to fully integrate proven enterprise in-memory technology, Pivotal GemFire XD, with advanced services on Hadoop 2.2 that provide native support for a comprehensive data science toolset. Data driven businesses now have the capabilities they need to gain a massive head start toward developing analytics and applications for more intelligent and innovative products and services.”

Read more about Pivotal HD 2.0 on Pivotal Blog.

Posted in Hadoop Distribution

EMC Isilon and RainStor for Big Data Management

Big Data means petabytes of data that organizations can readily mine to discover patterns and trends. Although Hadoop provides a comparatively inexpensive way to manage massive amounts of data, a Hadoop cluster becomes difficult to manage as it grows.

EMC Isilon scale-out network-attached storage (NAS) has integrated support for Hadoop analytics. Isilon is the only scale-out NAS platform natively integrated with the Hadoop Distributed File System (HDFS). Using HDFS as an over-the-wire protocol, organizations can deploy a powerful, efficient, and flexible Big Data storage and analytics ecosystem.
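Because Isilon presents HDFS as an over-the-wire protocol, pointing a Hadoop compute cluster at Isilon is essentially a client-side configuration change. A minimal, illustrative sketch (the hostname is a placeholder; the actual SmartConnect zone name and port depend on your environment):

```xml
<!-- core-site.xml on the Hadoop compute nodes (illustrative) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- placeholder SmartConnect zone name for the Isilon cluster -->
    <value>hdfs://isilon.example.com:8020</value>
  </property>
</configuration>
```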

RainStor’s unique enterprise database, with its data compression and security, enables organizations to query information via a MapReduce or SQL interface. RainStor stores data efficiently using extreme data compression, instead of archiving massive amounts of data and reloading it whenever Big Data applications need it, thereby enabling organizations to run Hadoop analytics efficiently on larger datasets while reducing infrastructure costs.

When RainStor is deployed alongside EMC Isilon scale-out NAS, it provides an efficient way to keep massive amounts of data active without having to invest in additional infrastructure as the amount of data in the Hadoop cluster grows. With Isilon, compute and storage for Hadoop workloads are decoupled, enabling organizations to balance CPU and storage capacity optimally as data volumes and query counts grow.

High Level Architecture

The RainStor – EMC Isilon solution can be used to archive huge amounts of structured data, or as a tape-archive replacement for compliance requirements. The data can be accessed via Hive, Pig, MapReduce, or SQL. The solution also provides industry-leading storage efficiency, with zero data replication, 20-40x data compression, and a NAS utilization rate above 80%. RainStor compression increases effective I/O bandwidth, and scalability is ensured by adding compute or storage separately as required. Multi-protocol support covers HDFS, NFS, FTP, and HTTP.
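Taking the brief's figures at face value, the effective-capacity math works out as follows (the 100 TB raw figure is a hypothetical illustration of my own, not from the brief):

```python
raw_tb = 100            # hypothetical raw Isilon capacity
utilization = 0.80      # the >80% NAS utilization rate quoted above
for ratio in (20, 40):  # the 20-40x RainStor compression range
    effective_tb = raw_tb * utilization * ratio
    print(f"{ratio}x compression -> ~{effective_tb:.0f} TB of source data")
```

In other words, 100 TB of raw NAS could hold on the order of 1,600 to 3,200 TB of source data under these assumptions.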

For more details, download RainStor for EMC Isilon Solution Brief.

Posted in Hadoop

ViPR Data Services

The Big Data era places new demands on data storage. Storage must support varying data types, all of which need to be stored securely for long periods of time and be available for analysis. An increasing focus on data unification means the storage infrastructure for big data has to serve structured, semi-structured, and unstructured data types. A growing emphasis on in-place analytics requires compute workloads, such as Hadoop MapReduce operations, to run right where the data lives. The compliance market is fraught with challenges stemming from regulatory requirements, and data storage is not immune to them, particularly in how data gets stored for the long term.

ViPR aggregates multi-vendor, heterogeneous storage into a unified storage platform that can be leveraged as a logical scale-out layer. This layer can serve as the underlying infrastructure for hosting a range of data services to support collecting, managing, and utilizing unstructured content at massive scale. ViPR Data Services are implemented as software with a simple, lightweight, low-touch, scale-out design. Data services are storage abstractions that reflect the combination of:

  • Data type (file, object, or block),
  • Access protocols (iSCSI, NFS, REST, etc.),
  • Protection characteristics (snapshots, replication, etc.), and
  • Durability and availability.

In ViPR, file, block, object, and HDFS are all data services; object and HDFS are available today, with more to follow. Data services can provide different semantic views of the same data, so the same content can be manipulated as a file or as an object without moving it to a different platform that offers that semantic.

ViPR

ViPR’s most direct benefit is its ability to automate storage management and provisioning, making storage available as a self-service, consumable resource within the software-defined data center. ViPR also transforms data services delivery in large enterprises: with storage arrays and storage services defined in software and managed by policy, organizations can deploy unique data services that take their existing infrastructure on its journey to the cloud, extending the use cases for their data and adding value to their storage investments.

For more details download the white paper – Unleash the Value of Data with EMC ViPR Data Services

Posted in ViPR

Enterprise Infrastructure for Hadoop

Hadoop sandboxes rely on commodity hardware with direct-attached storage (DAS). These implementations make it difficult to scale storage independently, since Hadoop keeps three or more copies of data on the internal drives of the server units. Other challenges include data replication, data visibility, lack of multi-tenancy, overuse of IT resources for technology upgrades, and more. Commodity servers with DAS do not account for the data management process.
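The three-or-more-copies behavior is HDFS's own block replication; on a DAS cluster it is controlled by a single setting in hdfs-site.xml, so every block written consumes that many times its size in raw disk:

```xml
<!-- hdfs-site.xml (illustrative): number of copies HDFS keeps of each block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```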

Decoupling compute and storage via shared storage helps with scale-out and provides equivalent or better performance. A Vblock-based model avoids creating shadow IT and data silos by making it easier for enterprise IT to enhance the existing environment, run advanced analytics, and develop real-time insights. Vblock, among the most innovative converged infrastructure in the industry, features compute and network technology from Cisco, storage and data protection from EMC, and server virtualization and virtualization management from VMware.

According to a recent IDC report, organizations around the world spent over $3.3 billion on converged systems in 2012, and IDC forecast this spending to increase by 20% in 2013 and again in 2014. IDC calculates that Vblock Systems infrastructure delivered a return on investment of 294% over a three-year period and 435% over a five-year period, compared to traditional infrastructure. Reasons for the better ROI include simplified operations, faster deployments, improved agility, cost savings, extended services, and improved user and customer satisfaction.

The converged infrastructure model of Vblock allows enterprises to easily develop and maintain analytics-specific services, Hadoop processing, or other applications leveraging the data in one place, without having to move and replicate it across multiple servers and storage systems. Vblock standardization also makes it possible for enterprises to reduce deployment risk and eliminate the time and cost of testing across different configurations.

For more information, download the solution brief – Transform Your Business: Big Data and Analytics with VCE and EMC

Posted in Hadoop