Primum Non Nocere

I want to discuss one of the lesser-known studies in the world. The study, named the “Mushroom Trial”, was spearheaded by my loving mother, and the subjects were my immediate family. She loves to try new recipes, and in the late 90s she found a new “vegetable” to cook with, commonly called mushrooms. We were all thrilled to try something new, although I voiced my concern: “but that is fungus”. All of us enjoyed the meal, though I didn’t have my usual portion because of my fungus apprehensions. Twenty minutes later, I had intense stomach cramps and started throwing up. My mom was annoyed by the unwarranted trip to the ER for non-stop projectile vomiting over something as innocent and pure as farm-fresh, organic ingredients and love. She assumed that I had been jumping around and climbing trees in the backyard after my meal. The next time she cooked mushrooms, she asked me to have a generous helping. This time the results were more ominous: I started vomiting bile and broke out in pronounced rashes. I hadn’t been jumping around this time, so in the ER she assumed that the kind of mushrooms she had cooked could be the reason. Not wanting to upset my mother, I grew numb to my apprehension while growing increasingly wary of the physical distress. Eventually the ER doctor ended the mushroom trial with “can we just say she is allergic to all kinds of mushrooms and save everyone time?” Overall, her trial went like this –

Mushroom Study

Now, let us analyze the data from this study:

Informed Consent

The most essential part of data collection or a study is informed consent. An Institutional Review Board (IRB) requires informed consent for research that involves human subjects, unless you get a waiver or alteration due to the sensitive nature of the study. There are two parts here: “inform” and “consent”. You have to inform the human subjects and get their consent. In this trial, mom did inform everyone about the mushrooms, but she did not necessarily get their consent. That raises another interesting question: although it was one study, there were five individual experiments within it. So do we need informed consent only at the beginning, or do we need to renew it as we change the parameters of the study? Informed consent is more than a form or a piece of paper; it is a process rooted in respect for individuals. Participation in any research is voluntary.

Data Inputs

There are several issues with the data inputs. The sample was poor: only 5 people out of a big family tree, and biased toward the immediate family. Beyond that, the study also promotes a historical bias: “mom knows better”. My mom never intentionally tried to harm me. Towards the end, I did raise the objection that it made me sick. Unlike in the United States, food allergies are not that prevalent in India, to the point that even the doctor took time to arrive at the conclusion. The dataset is also incomplete, in that I am not sure whether I had the same “but that is fungus” apprehension after the first time; I just went along with my mom, trying not to upset her. In a way, it is incorrect to conclude “allergic to all mushrooms”. The data is also outdated: it comes from my childhood, and some allergies can fade as a child grows.

Algorithmic Bias

The problem with this study is the assumption that correlation implies causation. My mom's two explanations rested on familiar associations: that strenuous physical activity right after a meal makes you sick, and that only a particular kind within a food group causes an adverse reaction. Both were completely wrong about the true cause. When personalization or recommendation algorithms are built on incorrect assumptions and bad data inputs, they can narrow the range of possibilities and create tunnel vision.

Data Privacy

Aside from talking about connections in a philosophical way, we are more connected than ever through the Internet of Things. In this connected world, where information is literally at your fingertips, what has become of privacy? I chose to publish this study and dataset on my website, which exposed my immediate family. My family is very private; I am the one with the most social media presence. I made the dataset public. Have I violated their privacy? Not exactly, by the letter of the law. Are there ethical violations? There is very little chance that my parents will read this and get upset about it. There is a possibility that my siblings could read it (usually I ask them to read my posts). If they read it and raise concerns, what are the possible actions? Should I say “my blog, my stories”, “offer them some form of compensation”, or “take down this post”?

Technological advancements such as big data open doors to endless possibilities everywhere. As always, with great power comes great responsibility. Data laws are still in the early stages of evolution. Ethics can get polarizing and controversial, as with any issue in this country. Study after study that I read shows how we are charting unexplored waters as the real world gets increasingly complex. That reminds me of the principal precept in medicine and bioethics: “First, do no harm!” The data world could start from there too. I wanted to share my musings on the one area where we should keep our focus, ethics, as our digital universe explodes.

Posted in Big Data, Conceptual

Philadelphia Crime Story

OpenDataPhilly has datasets of crimes in Philadelphia from 2006 onward. Here is an exploration of those crimes. In general, there is a downward trend in overall crime rates. There seem to be seasonal peaks and declines: for example, crime seems to trend low during winter and surge in summer. Here is the story…

Posted in Visualization

Hadoop Ecosystem – A Quick Glance

What do Pig, Kangaroo, Eagle, and Phoenix have in common? Hadoop! We have some interesting technologies with curious names in the Hadoop ecosystem. Azkaban is bloody wicked. H2O and Sparkling Water compete in the same space. Rethink, Couch, Dynamo, and Gemfire would make you think you just got out of a positive affirmations seminar. Leaving the bad jokes aside, the Hadoop ecosystem has been growing. Here is a quick glance with my little tweaks –

Posted in Hadoop

Presentations

Trying to pick out all my data science presentations and consolidate them here.

Posted in Big Data

Handy Tricks

Some tricks from my latest trials

  1. Changing Keyboard Layout: You might download one of the prebuilt Linux VMs for one of two reasons: either you are in a time crunch or you are just too lazy to create your own. Whatever the reason, you now find strange characters coming out of your key input instead of the ones you wanted. That’s because the VM was built with a different keyboard layout. To get your own, try one of these –

# loadkeys us

      Or for permanent changes

vi /etc/sysconfig/keyboard

      and change

LAYOUT="us"

      Remember to set the layout you want; I wanted “US English” here. After saving the file, a system reboot did the magic of bringing back my keys.

  2. EPEL Installation Error: I was trying to install the Extra Packages for Enterprise Linux (EPEL) with the following command

sudo yum install epel-release

      I got a “Cannot retrieve metalink for repository” error. The issue is with “https”. To overcome it, try this –

sudo yum upgrade ca-certificates --disablerepo=epel

      Once the certificates are upgraded, yum install of EPEL will work normally.

  3. To list redhat-lsb Cores: One of the prerequisites for installing my application is redhat-lsb-core. I wanted to list the available and installed redhat-lsb packages –

yum --showduplicates list redhat-lsb
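For the first trick, the edit to the keyboard file can also be scripted instead of done by hand in vi. What follows is only a minimal sketch, assuming the file holds a single LAYOUT="..." line as shown above and that the layout name is a plain word such as us. The set_layout function and the scratch file are made up for illustration and are not part of any tool; the sketch works on a temporary copy, so nothing under /etc is touched.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): rewrite the LAYOUT line of a
# keyboard config file. Usage: set_layout FILE LAYOUT
set_layout() {
    file=$1
    layout=$2
    tmp=$(mktemp) || return 1
    # Replace the whole LAYOUT= line; every other line passes through unchanged.
    if sed "s/^LAYOUT=.*/LAYOUT=\"$layout\"/" "$file" > "$tmp"; then
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Demonstrate on a scratch file with made-up contents,
# not on /etc/sysconfig/keyboard itself.
scratch=$(mktemp)
printf '# sample contents\nLAYOUT="de"\n' > "$scratch"
set_layout "$scratch" us
cat "$scratch"
```

On a real system the same call would need root privileges and, as noted above, a reboot afterwards.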

Posted in Code Snippet

Reading Text Files in R

It is pretty straightforward to read CSV files in R: simply give the path to the data file, as in

trial1 = read.csv("data\\sample1.csv")

When importing text files in R, you get some default parameters to play around with, such as whether the file includes a header. The strip.white parameter lets you indicate whether you want white space stripped from unquoted character fields. In the following example, the NA value is set to “EMPTY” with the na.strings parameter.

trial2 = read.table("data\\sample2.txt", header=TRUE, sep="/", strip.white=TRUE, na.strings="EMPTY")

One other variation is read.csv2. This function is used when the file contains data in a locale that uses a comma as the decimal point and a semicolon as the separator.

trial3 = read.csv2("data\\sample3.csv", header=TRUE, strip.white=TRUE)
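To make the read.csv2 layout concrete, here is a small shell sketch that writes a file in that shape and checks how it splits. The column names and values are made up purely for illustration, and the file is only a stand-in for a real sample3.csv, written to a temporary directory.

```shell
#!/bin/sh
# Write a tiny semicolon-separated file with decimal commas.
# All names and values here are invented for the example.
dir=$(mktemp -d)
cat > "$dir/sample3.csv" <<'EOF'
city;temperature;rainfall
Pune;24,5;12,0
Oslo;3,2;48,7
EOF

# Splitting on ';' gives three fields on every line, even though the
# numbers contain commas. Splitting on ',' would cut the numbers apart.
awk -F';' '{ print NF }' "$dir/sample3.csv"
```

A file like this is the case where read.csv2 fits; read.csv would instead treat each comma as a field boundary.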

Quick tips to remember when prepping your datasheet are:

  • Use short names for the files
  • Do not use blank spaces in names, values or fields
  • Avoid using names with special characters
  • Indicate missing values with NA
  • Delete comments in Excel files to avoid extra columns or NAs
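The naming tips above can be checked from the shell before a file ever reaches R. This is just a sketch under one assumption of mine: that a "clean" name means only letters, digits, dots, underscores and hyphens. The is_clean_name function and the example file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed only for names made of letters, digits,
# dot, underscore and hyphen (an assumed reading of the tips above).
is_clean_name() {
    case $1 in
        ''|*[!A-Za-z0-9._-]*) return 1 ;;  # empty, or has a space/special character
        *) return 0 ;;
    esac
}

# Made-up file names: the first passes, the other two are flagged.
for name in sample1.csv "my data.csv" "results(final).csv"; do
    if is_clean_name "$name"; then
        echo "ok:     $name"
    else
        echo "rename: $name"
    fi
done
```

The same pattern could be applied to column headers, though that would mean reading the first line of the file rather than its name.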

Posted in R

Decision Matrix for Big Data Tools and Technologies

DM
