I am trying to gather all my data science presentations and consolidate them here.

Posted in Big Data

Handy Tricks

Some tricks from my latest trials

  1. Changing Keyboard Layout: You might download one of the prebuilt Linux VMs for one of two reasons: either you are in a time crunch or you are just too lazy to create your own. Whatever the reason, you may then find strange characters coming out of your keyboard instead of the ones you wanted. That’s because the VM was built with a different locale. To get your own layout back, try one of these:

# loadkeys us

      Or, for a permanent change, edit the keyboard configuration file:

vi /etc/sysconfig/keyboard

      and change the keyboard layout setting in that file. Be sure to set the layout you want; I wanted “US English” here. After saving the file, a system reboot did the magic of bringing back my keys.

  2. EPEL Installation Error: I was trying to install the Extra Packages for Enterprise Linux (EPEL) with the following command:

sudo yum install epel-release

      I got a “Cannot retrieve metalink for repository” error. The issue is with “https”. To work around it, upgrade the CA certificates with the EPEL repository disabled:

sudo yum upgrade ca-certificates --disablerepo=epel

      Once the certificates are upgraded, the yum install of EPEL works normally.

  3. To list redhat-lsb Cores: One of the prerequisites for installing my application is the redhat-lsb core package. I wanted to list the available and installed redhat-lsb versions:

yum --showduplicates list redhat-lsb

Posted in Code Snippet

Reading Text Files in R

It is pretty straightforward to read CSV files in R: simply give the location of the data file.

trial1 = read.csv("data\\sample1.csv")

When importing text files in R, you get the same parameters to play with, such as header, which indicates whether the file includes a header row. The strip.white parameter lets you indicate whether white space should be stripped from unquoted character fields. In the following example, the NA value is set to “EMPTY” with the na.strings parameter.

trial2 = read.table("data\\sample2.txt", header=TRUE, sep="/", strip.white=TRUE, na.strings="EMPTY")
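
To see what these parameters do without needing a file on disk, here is a minimal sketch that feeds read.table some inline text through its text argument. The data and the name demo_table are made up for illustration; only the parameters come from the example above.

```r
# Hypothetical inline data standing in for a "/"-separated text file.
raw <- "name/city/score
Ann / Pune / 10
Bob / EMPTY / 7"

# Same parameters as in the post: header row, "/" separator,
# white space stripped, and "EMPTY" treated as a missing value.
demo_table <- read.table(text = raw, header = TRUE, sep = "/",
                         strip.white = TRUE, na.strings = "EMPTY",
                         stringsAsFactors = FALSE)

str(demo_table)            # inspect the column types
is.na(demo_table$city)     # the EMPTY field is read as NA
```

Because strip.white removes the padding around each field before the comparison, the padded EMPTY entry still matches na.strings.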

One other variation is read.csv2. This function is used when the file contains data from a locale that uses a comma as the decimal point and a semicolon as the separator.

trial3 = read.csv2("data\\sample3.csv", header=TRUE, strip.white=TRUE)
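
As a quick check of the comma-as-decimal behaviour, the sketch below reads a few inline lines with read.csv2. The data and the name demo_csv2 are hypothetical and stand in for a real file.

```r
# Hypothetical inline data: semicolon separator, comma as decimal point.
raw <- "item;price
apple;1,50
pear;2,25"

# read.csv2 defaults to sep = ";" and dec = ",".
demo_csv2 <- read.csv2(text = raw, header = TRUE, strip.white = TRUE)

str(demo_csv2)          # price should come through as a numeric column
sum(demo_csv2$price)    # arithmetic works because the values are numbers
```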

Quick tips to remember when prepping your datasheet are:

  • Use short names for the files
  • Do not use blank spaces in names, values or fields
  • Avoid using names with special characters
  • Indicate missing values with NA
  • Delete comments in Excel files to avoid extra columns or NAs
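
A couple of these tips can be checked quickly in R. The sketch below is only an illustration with made-up inline data (the column names and values are hypothetical): it shows how read.csv renames a header that contains a blank space, and how a literal NA is read as a missing value.

```r
# Hypothetical inline data standing in for a CSV file.
raw <- "first name,score
Ann,NA
Bob,7"

# check.names = TRUE (the default) rewrites invalid column names.
tips <- read.csv(text = raw, header = TRUE)

names(tips)          # the blank in "first name" is replaced by a dot
is.na(tips$score)    # the NA entry is read as a missing value
```

Keeping blanks out of the header in the first place means the names in R match the names in the file.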
Posted in R

Decision Matrix for Big Data Tools and Technologies



EHC Use Case – Hadoop as a Service

Hadoop can handle extremely large, unstructured data sets efficiently and at affordable cost, which makes it a valuable technology for enterprises across a number of applications and fields. Market analysis forecasts that the market for Hadoop MapReduce will grow at a compound annual growth rate (CAGR) of 58%, reaching $2.2 billion in 2018. At the same time, Hadoop has created operational challenges that include deployment difficulties, poor utilization of storage and processors, inefficient data loading, and the lack of multi-tenancy. For enterprises working on analytics frameworks built on Hadoop, an on-premise solution was the best option. When a solution can create performance and security isolation between different tenants and provide resource containment for different service levels, it becomes possible to bring Hadoop to the cloud as Hadoop as a Service (HaaS/HDaaS). This could eliminate the time, resources, and cost required to build and maintain a complex Hadoop installation on premise. Allied Market Research states that the HaaS market is expected to reach $16.1 billion by 2020, registering a CAGR of 70.8% from 2014 to 2020.

EMC Hybrid Cloud (EHC) HaaS provides a multi-tenant, self-service portal that leverages the EMC Federation: EMC II storage and data protection, Pivotal Big Data Suite, and VMware cloud management and virtualization solutions. EHC HaaS is a solution stack made up of EHC IaaS, integrated with VMware Big Data Extensions (BDE) and Pivotal Hadoop (PHD). It is possible to deploy or extend a Hadoop cluster within minutes using the vCloud Automation Center (vCAC) portal. Automation of Hadoop clusters is achieved with custom workflows created in vCenter Orchestrator (vCO). These workflows are configured from within vCAC to present enterprises with a self-service portal that includes a catalog of pre-configured Hadoop deployment use cases.

EMC Hybrid Cloud - Hadoop as a Service


VMware vSphere Big Data Extensions (BDE) is the commercial version of Serengeti, an open source project by VMware to deploy and manage Hadoop and big data clusters in a vCenter Server managed environment. BDE runs on top of Serengeti, which includes a virtual appliance containing the Serengeti Management Server and a template server. BDE provides the GUI for managing Hadoop clusters. It includes basic Apache Hadoop, and it is also very easy to add commercial Hadoop distributions such as Pivotal Hadoop (PHD), Cloudera Hadoop, Hortonworks Hadoop, or MapR Hadoop. This solution uses the Pivotal Hadoop distribution integrated with the EMC Hybrid Cloud IaaS stack to create Hadoop as a Service.

The following video demonstrates EMC Hybrid Cloud used to deploy Hadoop-as-a-Service (HaaS), the underpinnings of a Virtual Data Lake.

Posted in Big Data, Cloud, Hadoop, Hybrid Cloud

Standardizing Shadow IT

When employees face restrictions in their work environment, they may turn to workarounds, hacks, quick fixes, or any backdoor they find necessary to perform their business functions effectively. These solutions are an important source of innovation, but they lack what the organization requires for control, reporting, documentation, security, and reliability, and hence bring uncertain and significant risk. This is called “Shadow IT”, which also goes by the names “Stealth IT” and “Rogue IT”, and it describes solutions that are not specified and deployed by the IT department.

Examples of these Shadow IT solutions include online messaging, webmail, cloud storage, and external cloud computing platforms. The IDC 2013 US Cloud Security Survey says that 72% of organizations saw at least one incident of unauthorized use of cloud computing, and 45% of IT organizations had at least one instance of unauthorized IP upload to a cloud service. The reasons for Shadow IT consumption are:

  • Lack of internal process clarity
  • Lack of control over provisioning of services
  • Ignorance of general industry standards and security best practices
  • Expanding business that requires BYOD, Internet of Things, and Big Data
  • Quicker response time
  • IT budget of the organization

Shadow IT provides business units with speed and efficiency at lower cost and thus becomes a breeding ground for innovation. While it paves the way for innovation in the organization, there are risks associated with business units being driven towards Shadow IT solutions. Potential risks include:

  • Data loss / leaks
  • Intellectual property and applications moving out of organization’s firewall and across geographies
  • Security vulnerabilities
  • Lack of regulatory standards and governance
  • Legal liabilities
  • Creation of “silos” and thereby a lack of interoperability with other data/applications

It is not easy to shut down or ignore Shadow IT, as users will find innovative hacks around security restrictions whenever they consider those hacks “necessary to run the business”. Organizations with highly valuable or sensitive data or intellectual property are the logical targets of economically or strategically motivated attackers. Vertical industries such as government, banking, financial services, energy, defense, retail, technology, manufacturing, and healthcare are ideal targets for these attacks. At the same time, Shadow IT brings agility and lower operational costs that CIOs cannot overlook. It is quite possible to regulate Shadow IT and make it legitimate:

  • To avoid public cloud storage, Syncplicity comes to the rescue as the leader in enterprise file sync and sharing
  • Build your own private cloud with core EMC Cloud Portfolio (VMAX, VCE, VNX) and offer IT as a Service to business units
  • If public cloud offerings cannot be ignored, control and regulate with EMC Hybrid Cloud
Posted in Cloud

Wedding Reception Venue thru R

After finding connections between Jay and me that go back to ancient times, I moved on to learning geocoding and map visualization in R. Here is how I learnt it.

I started by loading the needed libraries:

> library(ggplot2)
> library(maps)
> library(scales)
> library(ggmap)

Get the geocode of the related cities:

> geocode("Bangalore")
       lon     lat
1 77.59456 12.9716
> geocode("Philadelphia")
        lon      lat
1 -75.16379 39.95233

Plot the map of the world:

> ds <- map_data("world")
> class(ds)
[1] "data.frame"
> str(ds)
'data.frame':  25553 obs. of 6 variables:
$ long     : num -133 -132 -132 -132 -130 ...
$ lat     : num 58.4 57.2 57 56.7 56.1 ...
$ group   : num 1 1 1 1 1 1 1 1 1 1 ...
$ order   : int 1 2 3 4 5 6 7 8 9 10 ...
$ region   : chr "Canada" "Canada" "Canada" "Canada" ...
$ subregion: chr NA NA NA NA ...
> head(ds)
       long      lat group order region subregion
1 -133.3664 58.42416     1     1 Canada      <NA>
2 -132.2681 57.16308     1     2 Canada      <NA>
3 -132.0498 56.98610     1     3 Canada      <NA>
4 -131.8797 56.74001     1     4 Canada      <NA>
5 -130.2492 56.09945     1     5 Canada      <NA>
6 -130.0131 55.91169     1     6 Canada      <NA>
> p <- ggplot(ds, aes(x=long, y=lat, group=group))
> p <- p + geom_polygon()
> p
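
As a small follow-on sketch (not part of the original session), the two geocodes returned earlier can be dropped onto the world map without another call to the geocoding service. The cities data frame and its column names are my own illustration; the coordinates are copied from the geocode output above. The plot is only drawn if ggplot2 and maps are installed.

```r
# Coordinates copied from the geocode() output earlier in this post.
cities <- data.frame(
  name = c("Bangalore", "Philadelphia"),
  lon  = c(77.59456, -75.16379),
  lat  = c(12.9716, 39.95233)
)

# Plot only if the packages used in this post are available.
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("maps", quietly = TRUE)) {
  library(ggplot2)
  world <- map_data("world")
  p <- ggplot(world, aes(x = long, y = lat, group = group)) +
    geom_polygon(fill = "grey80") +
    # inherit.aes = FALSE because cities has no long/group columns
    geom_point(data = cities, aes(x = lon, y = lat),
               inherit.aes = FALSE, color = "red", size = 3) +
    geom_text(data = cities, aes(x = lon, y = lat, label = name),
              inherit.aes = FALSE, vjust = -1)
  print(p)
}
```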


Plot the national boundaries:

> p <- ggplot(ds, aes(x=long, y=lat, group=group, fill=region))
> p <- p + geom_polygon()
> p <- p + theme(legend.position = "none")
> p


Bring USA into focus:

> map <- get_map(location = as.numeric(geocode("USA")), zoom=4, source="google")
> p <- ggmap(map)
> p


Highlight Philadelphia, PA:

> map <- get_map(location="Philadelphia, PA", zoom=14, maptype="roadmap", source="google")
> p <- ggmap(map)
> p


Voila, the reception venue :)

> addr <- "Liberty Bell, Philadelphia, PA 19106, United States"
> loc <- as.numeric(geocode(addr))
> lbl <- data.frame(lon=loc[1], lat=loc[2], text=addr)
> map <- get_map(location=loc, zoom=15, maptype="hybrid", source="google")
> p <- ggmap(map)
> p <- p + geom_point(data=lbl, aes(x=lon, y=lat), alpha=I(0.5), size=I(5), color="red")
> p <- p + geom_text(data=lbl, aes(x=lon, y=lat, label=text), size=5, color="blue", hjust=0.5, vjust=5)
> p
Liberty Bell Center


Can’t wait to set the date… :)

Posted in R