My Data Experiments

SEPTA Regional Rail OTP 2016

Posted on April 25, 2017 by Mercy Beckham

Exploring SEPTA Regional Rail on-time performance (OTP) in 2016 from the dataset here. Overall, weekdays are busier than weekends. Trains originating in the south also face more delays than those from the north. The heatmap of delays showed an interesting patch of red that made me google what happened in the first week of November 2016. SEPTA strike! Here is the story…

SEPTA OTP 2016

Posted in Visualization | Tagged philadelphia, SEPTA, Tableau, train timings

Philadelphia Crime Story

Posted on April 18, 2017 by Mercy Beckham

OpenDataPhilly publishes datasets of crimes in Philadelphia from 2006 to the present. Here is an exploration of that data. In general, there is a downward trend in overall crime rates. There also seem to be seasonal peaks and declines – for example, crime trends lower during winter and surges in summer. Here is the story…

Posted in Visualization | Tagged crimes, data visualization, OpenDataPhilly, philadelphia

Hadoop Ecosystem – A Quick Glance

Posted on April 11, 2017 by Mercy Beckham

What do Pig, Kangaroo, Eagle, and Phoenix have in common? Hadoop! We have some interesting technologies with curious names in the Hadoop ecosystem. Azkaban is bloody wicked. H2O and Sparkling Water compete in the same space. Rethink, Couch, Dynamo, and Gemfire would make you think you just got out of a positive affirmations seminar. Leaving the bad jokes aside, the Hadoop ecosystem has been growing. Here is a quick glance with my little tweaks –

Posted in Hadoop | Tagged Hadoop

Presentations

Posted on March 1, 2017 by Mercy Beckham

Trying to gather all my data science presentations and consolidate them here.

Posted in Big Data

Handy Tricks

Posted on June 29, 2016 by Mercy Beckham

Some tricks from my latest trials

Changing Keyboard Layout: You might download one of the prebuilt Linux VMs for two reasons – either you are in a time crunch or just too lazy to create your own. Whatever the reason, you now find strange characters coming out of your key input instead of the ones you wanted. That’s because the VM was built with a different locale. To get your own layout back, try one of these –

# loadkeys us

Or for permanent changes

vi /etc/sysconfig/keyboard

and change

LAYOUT="us"

Set the layout to the one you want; I wanted “US English” here. After I saved the file, a system reboot did the magic of bringing back my keys.

EPEL Installation Error: I was trying to install the Extra Packages for Enterprise Linux (EPEL) with the following command –

sudo yum install epel-release

I got a “Cannot retrieve metalink for repository” error. The issue is with “https”. To overcome it, upgrade the CA certificates with the EPEL repository disabled –

sudo yum upgrade ca-certificates --disablerepo=epel

Once the certificates are upgraded, yum install of EPEL will work normally.

To list redhat-lsb Cores: One of the prerequisites for installing my application is redhat-lsb core. I wanted to list the available and installed redhat-lsb packages –

yum --showduplicates list redhat-lsb

Posted in Code Snippet | Tagged Linux, VM

Reading Text Files in R

Posted on February 2, 2016 by Mercy Beckham

It is pretty straightforward to read CSV files in R – simply give the location of the data file –

trial1 = read.csv("data\\sample1.csv")

When importing text files in R with read.table, you get the same parameters to play around with, such as whether the file includes a header. The strip.white parameter lets you indicate whether you want white space stripped from unquoted character fields. In the following example, the NA value is set to “EMPTY” with the na.strings parameter.

trial2 = read.table("data\\sample2.txt", header=TRUE, sep="/", strip.white=TRUE, na.strings="EMPTY")
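
To try these parameters without a data file at hand, here is a minimal, self-contained sketch. The temporary file and its contents are made up for illustration; they are not the sample2.txt used above.

```r
# Hypothetical sample data: "/"-separated, padded with spaces,
# with "EMPTY" marking a missing value
tmp <- tempfile(fileext = ".txt")
writeLines(c("name / city / score",
             "Ann / Philadelphia / 10",
             "Bob / EMPTY / 7"), tmp)

demo2 <- read.table(tmp, header = TRUE, sep = "/",
                    strip.white = TRUE, na.strings = "EMPTY",
                    stringsAsFactors = FALSE)

demo2$city         # strip.white removed the padding around the city names
is.na(demo2$city)  # the "EMPTY" entry was read as NA
```

Setting strip.white=FALSE in the same call is a quick way to see what the padding does to the character fields.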

One other variation is read.csv2. This function is used when the file contains data from a locale that uses a comma as the decimal point and a semicolon as the field separator.

trial3 = read.csv2("data\\sample3.csv", header=TRUE, strip.white=TRUE)
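
As a quick illustration of the difference, the sketch below writes a small semicolon-separated file with decimal commas and reads it back. The file contents are invented for the example.

```r
# Hypothetical data using ";" as separator and "," as decimal point
tmp <- tempfile(fileext = ".csv")
writeLines(c("item;price",
             "apple;1,5",
             "pear;2,25"), tmp)

demo3 <- read.csv2(tmp, header = TRUE, strip.white = TRUE)

str(demo3)        # price is parsed as numeric, not as text
sum(demo3$price)  # arithmetic works because "1,5" was read as 1.5
```

Reading the same file with read.csv instead would not split the columns on the semicolon, which is usually the first hint that read.csv2 is the function you need.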

Quick tips to remember when prepping your datasheet are:

Use short names for the files
Do not use blank spaces in names, values or fields
Avoid using names with special characters
Indicate missing values with NA
Delete comments in Excel files to avoid extra columns or NAs
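
The advice on names can be checked with base R's make.names, which shows how R rewrites a column header into a syntactically valid name (read.csv does this by default through check.names=TRUE). The headers below are hypothetical.

```r
# Hypothetical column headers as they might appear in a spreadsheet
headers <- c("unit price", "cost$", "2016 sales", "region")

# make.names returns the syntactically valid names R would substitute
fixed <- make.names(headers)

# Side-by-side view of what was typed and what R would use
data.frame(original = headers, as_read_by_R = fixed)
```

Headers that come back unchanged, like region, are the ones that are safe to use as-is.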

Posted in R | Tagged R tips

Decision Matrix for Big Data Tools and Technologies

Posted on February 11, 2015 by Mercy Beckham

Posted in Big Data, Hadoop | Tagged Analytics, Big Data, Decision Matrix, Hadoop

EHC Use Case – Hadoop as a Service

Posted on September 29, 2014 by Mercy Beckham

Hadoop can handle extremely large, unstructured data sets efficiently and at affordable cost, which makes it a valuable technology for enterprises across a number of applications and fields. Market analysis forecasts that the market for Hadoop MapReduce will grow at a compound annual growth rate (CAGR) of 58%, reaching $2.2 billion in 2018. At the same time, Hadoop has created operational challenges that include deployment difficulties, poor utilization of storage and processors, inefficient data loading, and the lack of multi-tenancy. For enterprises working on analytics frameworks built on Hadoop, an on-premise solution was the best option. When a solution can create performance or security isolation between different tenants and provide resource containment for different service levels, it brings Hadoop to the cloud – Hadoop as a Service (HaaS/HDaaS). This could eliminate the time, resources, and cost required to build and maintain a complex Hadoop installation on premise. Allied Market Research states that HaaS is expected to reach $16.1 billion by 2020, registering a CAGR of 70.8% from 2014 to 2020.

EMC Hybrid Cloud (EHC) HaaS provides a multi-tenant, self-service portal that leverages the EMC Federation – EMC II storage and data protection, Pivotal Big Data Suite, and VMware cloud management and virtualization solutions. EHC HaaS is a solution stack made up of EHC IaaS, integrated with VMware Big Data Extensions (BDE) and Pivotal Hadoop (PHD). It is possible to deploy or extend a Hadoop cluster within minutes using the vCloud Automation Center (vCAC) portal. Automation of Hadoop clusters is achieved by using custom workflows created with vCenter Orchestrator (vCO). These workflows are configured from within vCAC to present enterprises with a self-service portal that includes a catalog of pre-configured Hadoop deployment use cases.

EMC Hybrid Cloud – Hadoop as a Service

VMware vSphere Big Data Extensions (BDE) is the commercial version of Serengeti, an open source project by VMware to deploy and manage Hadoop and big data clusters in a vCenter Server managed environment. BDE runs on top of Serengeti, which includes a virtual appliance containing the Serengeti Management Server and a Template Server. BDE provides the GUI for managing Hadoop clusters. It includes the basic Apache Hadoop, but it is also very easy to add commercial Hadoop distributions such as Pivotal Hadoop (PHD), Cloudera Hadoop, Hortonworks Hadoop, or MapR Hadoop. This solution uses the Pivotal Hadoop Distribution integrated with the EMC Hybrid Cloud IaaS stack to create Hadoop as a Service.

The following video demonstrates EMC Hybrid Cloud used to deploy Hadoop-as-a-Service (HaaS), the underpinnings of a Virtual Data Lake.

Posted in Big Data, Cloud, Hadoop, Hybrid Cloud | Tagged Big Data Extensions, EMC Federation, EMC Hybrid Cloud, Hadoop as a Service, Pivotal, Pivotal HD, VMware

Standardizing Shadow IT

Posted on August 2, 2014 by Mercy Beckham

When employees face restrictions in their work environment, they may turn to workarounds, hacks, quick fixes, or any backdoor entries they find necessary to perform their business functions effectively. These solutions are an important source of innovation, but they fail to meet the organization’s requirements for control, reporting, documentation, security, and reliability, and hence bring uncertain and significant risk. They are called “Shadow IT”, which also goes by the names “Stealth IT” and “Rogue IT”, to describe solutions that are not specified and deployed by the IT department.

Examples of these Shadow IT solutions range from online messaging and webmail to cloud storage and external cloud computing platforms. The IDC 2013 US Cloud Security Survey says that 72% of organizations saw at least one incident of unauthorized use of cloud computing and 45% of IT organizations had at least one instance of unauthorized IP upload to a cloud service. The reasons for Shadow IT consumption are:

Lack of internal process clarity
Lack of control over provisioning of services
Ignorance of general industry standards and security best practices
Expanding business that requires BYOD, Internet of Things, and Big Data
Quicker response time
IT budget of the organization

Shadow IT provides the business units with speed and efficiency at lower cost and thus becomes a breeding ground for innovation. While it paves the way for innovation in the organization, there are risks when business units are driven towards Shadow IT solutions. Potential risks include:

Data loss / leaks
Intellectual property and applications moving out of organization’s firewall and across geographies
Security vulnerabilities
Lack of regulatory standards and governance
Legal liabilities
Creation of “silos” and the resulting lack of interoperability with other data/applications

It is not easy to shut down or ignore Shadow IT, as users will find innovative hacks around security restrictions when they consider those hacks “necessary to run the business”. Organizations with highly valuable or sensitive data or intellectual property are the logical targets of economically or strategically motivated attackers. Vertical industries such as government, banking, financial services, energy, defense, retail, technology, manufacturing, healthcare, and others are ideal targets for these attacks. At the same time, Shadow IT brings agility and lower operational costs that cannot be overlooked by CIOs. It is quite possible to regulate Shadow IT and make it legit.

To avoid public cloud storage, Syncplicity comes to the rescue as the leader in enterprise file sync and sharing
Build your own private cloud with core EMC Cloud Portfolio (VMAX, VCE, VNX) and offer IT as a Service to business units
If public cloud offerings cannot be ignored, control and regulate with EMC Hybrid Cloud

Posted in Cloud | Tagged Cloud, EMC Hybrid Cloud, Shadow IT, Syncplicity

Wedding Reception Venue thru R

Posted on July 25, 2014 by Mercy Beckham

After finding related facts about Jay and me that go back to ancient times, I moved on to learning geocoding and map visualization in R. Here is how I learnt –

Start by loading the needed libraries:

> library(ggplot2)
> library(maps)
> library(scales)
> library(ggmap)

Get the geocode of the related cities:

> geocode("Bangalore")
lon     lat
1 77.59456 12.9716
> geocode("Philadelphia")
lon     lat
1 -75.16379 39.95233

Plot the map of the world:

> ds <- map_data("world")
> class(ds)
[1] "data.frame"
> str(ds)
'data.frame':  25553 obs. of 6 variables:
$ long     : num -133 -132 -132 -132 -130 ...
$ lat     : num 58.4 57.2 57 56.7 56.1 ...
$ group   : num 1 1 1 1 1 1 1 1 1 1 ...
$ order   : int 1 2 3 4 5 6 7 8 9 10 ...
$ region   : chr "Canada" "Canada" "Canada" "Canada" ...
$ subregion: chr NA NA NA NA ...
> head(ds)
long     lat group order region subregion
1 -133.3664 58.42416     1     1 Canada     <NA>
2 -132.2681 57.16308     1     2 Canada     <NA>
3 -132.0498 56.98610     1     3 Canada     <NA>
4 -131.8797 56.74001     1     4 Canada     <NA>
5 -130.2492 56.09945     1     5 Canada     <NA>
6 -130.0131 55.91169     1     6 Canada     <NA>
> p <- ggplot(ds, aes(x=long, y=lat, group=group))
> p <- p + geom_polygon()
> p

World

Plot the national boundaries:

> p <- ggplot(ds, aes(x=long, y=lat, group=group, fill=region))
> p <- p + geom_polygon()
> p <- p + theme(legend.position = "none")
> p

Countries

Bring USA into focus:

> map <- get_map(location = as.numeric(geocode("USA")), zoom=4, source="google")
> p <- ggmap(map)
> p

USA

Highlight Philadelphia, PA:

> map <- get_map(location="Philadelphia, PA", zoom=14, maptype="roadmap", source="google")
> p <- ggmap(map)
> p

Philly

Voila, the reception venue :)

> addr <- "Liberty Bell, Philadelphia, PA 19106, United States"
> loc <- as.numeric(geocode(addr))
> lbl <- data.frame(lon=loc[1], lat=loc[2], text=addr)
> map <- get_map(location=loc, zoom=15, maptype="hybrid", source="google")
> p <- ggmap(map)
> p <- p + geom_point(data=lbl, aes(x=lon, y=lat), alpha=I(0.5), size=I(5), color="red")
> p <- p + geom_text(data=lbl, aes(x=lon, y=lat, label=text), size=5, color="blue", hjust=0.5, vjust=5)
> p

Liberty Bell Center

Can’t wait to set the date… :)

Posted in R | Tagged Maps, R, Visualization | 4 Comments
