A Data Science Project- Part 4: Chi-Square Test of Independence

In the last article, we discussed the ANOVA test, which gave us insight into how a response variable is distributed across the groups of an independent variable. Today, we will learn how to check the relationship between two categorical variables.
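
To make the idea concrete, here is a minimal sketch of the chi-square statistic for a test of independence, written in plain Python. The 2x2 table, its labels and the function name `chi_square_statistic` are made up for illustration; they are not taken from the Gapminder data or from the SAS code used in this series. The sketch computes the plain Pearson statistic without a continuity correction and compares it with the standard 5% critical value for 1 degree of freedom (3.841).

```python
# Hypothetical 2x2 contingency table (made-up counts, for illustration only).
# Rows: income group (low, high); columns: life expectancy group (lower, higher).
observed = [
    [30, 10],
    [10, 30],
]


def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


stat, dof = chi_square_statistic(observed)

# Critical value of the chi-square distribution for 1 df at the 5% level.
CRITICAL_VALUE_DF1_ALPHA_05 = 3.841

print(f"chi-square = {stat:.2f}, df = {dof}")
if stat > CRITICAL_VALUE_DF1_ALPHA_05:
    print("Reject the hypothesis that the two variables are independent.")
else:
    print("Fail to reject the hypothesis of independence.")
```

In a real analysis you would probably rely on a statistics package, which also reports a p-value; for example, `scipy.stats.chi2_contingency` does this in Python and applies a continuity correction to 2x2 tables by default.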


A Data Science Project- Part 3: Hypothesis testing and ANOVA

This post continues the A Data Science Project series. We use the ANOVA test whenever we need to check whether two or more groups differ from each other. For example, let’s say there are four races in a school: White, Hispanic, Black and Indian. The school management wants to know whether the marks scored by these groups are statistically different. If they are, the management would like to do something about it.
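
As a rough illustration of what the test computes, the sketch below calculates the one-way ANOVA F statistic by hand in plain Python. The marks, the group labels and the function name `one_way_anova_f` are hypothetical and exist only for this example. A large F value means the group means differ by more than the spread inside the groups would suggest.

```python
from statistics import mean

# Hypothetical marks for three groups of students (made-up numbers).
groups = {
    "group_a": [61.0, 62.0, 63.0],
    "group_b": [62.0, 63.0, 64.0],
    "group_c": [65.0, 66.0, 67.0],
}


def one_way_anova_f(samples):
    """Return (F statistic, df between groups, df within groups)."""
    all_values = [x for sample in samples for x in sample]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)

    df_between = len(samples) - 1
    df_within = len(all_values) - len(samples)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_anova_f(list(groups.values()))
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

To turn F into a p-value you need the F distribution, which a statistics package provides; `scipy.stats.f_oneway` is one option in Python.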



A Data Science Project- Part 2: Making Sense of Data

I hope you have gone through part 1 and part 2. Today I will show you not only how to explore data through visualization but also, most importantly, how to interpret what you see. Don’t miss the summary at the end of every post.

We will create bar charts, histograms, scatter plots, box plots, etc. We will also check the association between two variables. Let’s start.



A Data Science Project- Part 1(a)

In the earlier post, we discussed our hypothesis and the different variables impacting life expectancy. In this post, we will start digging into the data. I suggest you go through the codebook (see the earlier post). It is important to understand the variables before doing any analysis. I have chosen SAS as the data analysis language, but you are free to choose another language such as R or Python. The process and logic will be the same; only the syntax will be different. In this part, we will explore all the variables and check their frequency and distribution. For numeric variables, you can draw histograms and box plots and compute quantiles; for categorical variables, you can draw bar plots and build frequency tables.
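
Since the logic is the same in any language, here is a small sketch of this kind of exploration in plain Python instead of SAS. The values and the `income_group` variable are made up for illustration and are not rows from the Gapminder dataset. It builds a frequency table for a categorical variable and computes the mean, median and quartiles for a numeric one.

```python
from collections import Counter
from statistics import mean, median, quantiles

# Hypothetical values, made up for illustration (not real Gapminder rows).
life_expectancy = [52.0, 58.5, 63.0, 68.0, 71.5, 74.0, 76.5, 79.0, 81.0]
income_group = ["low", "low", "low", "low", "middle", "middle", "middle", "high", "high"]

# Categorical variable: frequency table with counts and percentages.
frequency = Counter(income_group)
for category, count in frequency.most_common():
    share = 100 * count / len(income_group)
    print(f"{category:<8} {count:>3} {share:6.1f}%")

# Numeric variable: centre and spread.
q1, q2, q3 = quantiles(life_expectancy, n=4)  # quartile cut points
print(f"mean   = {mean(life_expectancy):.2f}")
print(f"median = {median(life_expectancy):.2f}")
print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
print(f"min = {min(life_expectancy)}, max = {max(life_expectancy)}")
```

The five numbers printed at the end (min, Q1, median, Q3, max) are exactly what a box plot draws, so this is a quick way to check a plot against the raw values.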

A Data Science Project-Introduction: How can we have better life expectancy!

Hi, future data science buddies!

I have been asked many times, "How can I become a data scientist?" or "What should I do to become a data scientist?" Here I have written about some of the technical skills that are required to be a data scientist.


But the above article was lacking one thing, and Monica Rogati has summed it up perfectly.


This advice from Monica Rogati is a must-read for all beginners who want to make it into data science. In one sentence, she says:

Do a project you care about. Make it good and share it.

Now the question is: how do you choose a data science project?

So I thought I could give you some ideas about projects and show you how to start a project and finish it. I am starting a series of blog posts in which I will show you, step by step, the journey of a data science project.

I am sharing one of my projects, which I undertook during a course. This is a basic project; read this article to understand the approach that you can follow in any project. You will figure out why hypothesis generation is necessary. You will also learn why people say every good data analysis starts with a question. You will understand what types of questions data science can answer. You can choose any project topic as long as you have relevant data available.

My research project topic was how life expectancy is related to social, economic and environmental factors.

The dataset I downloaded is from Gapminder: http://www.gapminder.org/data/. Details of the data are given in the codebook.

This is gapminder data’s codebook link

After looking through the codebook for the Gapminder study, I decided that I am particularly interested in life expectancy, so I chose the variable lifeexpectancy from the codebook.
I want to study how life expectancy behaves with respect to social, economic and environmental factors. I identified the variables that are related to these factors:

  • alcconsumption (alcohol consumption)
  • co2emissions (cumulative CO2 emission)
  • oilperperson (oil consumption)
  • incomeperperson

I performed a literature review to see what work has already been done on this topic so that I could bring out some new research. I took references from the research below:

  • Life expectancy of people with intellectual disability: a 35-year follow-up study
    K. Patja, M. Iivanainen, H. Vesala, H. Oksanen & I. Ruoppila
  • Income distribution and life expectancy
    R. G. Wilkinson
  • Changes in life expectancy in Russia in the mid-1990s
    Vladimir Shkolnikov, PhD, Prof Martin McKee, Prof David A Leon, PhD
  • Drinking Pattern and Mortality: The Italian Risk Factor and Life Expectancy Pooling Project

After going through this literature, I generated my hypothesis: people who have better income, lower alcohol consumption and better environmental conditions have better life expectancy. In other words, I expect life expectancy to rise with income and with environmental quality, and to fall as alcohol consumption rises.

We would like to find out the association between life expectancy and social, economic and environmental factors. We would like to address several questions, for example:

  1. Do alcohol consumption and oil consumption determine life expectancy?
  2. Does better income mean better life expectancy?
  3. Is there a causal relationship?
  4. If there is more CO2 emission, would it mean lower life expectancy?
The variables that we will consider for the hypothesis are:

  1. incomeperperson (Gross Domestic Product per capita in constant 2000 US$)
  2. alcconsumption (alcohol consumption per adult (age 15+), litres): recorded and estimated average alcohol consumption, adult (15+) per capita consumption in litres of pure alcohol.
  3. co2emissions (cumulative CO2 emission (metric tons)): total amount of CO2 emission in metric tons since 1751.
  4. lifeexpectancy (life expectancy at birth (years)): the average number of years a newborn child would live if current mortality patterns were to stay the same.
  5. oilperperson (oil consumption per capita (tonnes per year and person))
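
With variables like these, a first check of the hypothesis could be a Pearson correlation between one factor and life expectancy. The sketch below is plain Python and only an illustration: the five data points and the function name `pearson_r` are made up, and they are not values from the Gapminder dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / sqrt(variation_x * variation_y)


# Hypothetical values for five countries (made up for illustration).
incomeperperson = [500.0, 1500.0, 4000.0, 12000.0, 30000.0]
lifeexpectancy = [55.0, 62.0, 68.0, 75.0, 80.0]

r = pearson_r(incomeperperson, lifeexpectancy)
print(f"r = {r:.3f}")  # a value close to +1 means a strong positive linear association
```

Keep in mind that a correlation coefficient only measures linear association. It cannot tell us whether one variable causes the other, which is why the causal question is listed separately above.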

This will help us answer important questions such as: How can a country have better life expectancy without compromising on development projects that cause a lot of CO2 emissions? Where should a country focus its agenda to achieve better life expectancy? So let's dive into the data…