Hi, future data science buddies!
I have been asked many times how to become a data scientist, or what one should do to become one. Here I have written about some technical skills that are required to be a data scientist.
But that article was lacking one thing, and Monica Rogati has summed it up perfectly.
https://www.quora.com/How-can-I-become-a-data-scientist-1/answer/Monica-Rogati
This advice from Monica Rogati is a must-read for all beginners who want to make it into data science. In one sentence, she says:
Do a project you care about. Make it good and share it.
Now the question is: how do you choose a data science project?
So I thought I could give you some ideas about projects, and about how to start a project and finish it. I am starting a series of blog posts in which I will show you, step by step, the journey of a data science project.
I am sharing one of my projects, which I undertook during a course. It is a basic project, but reading this article will help you understand the approach you will follow in any project. You will see why hypothesis generation is necessary, why people say every good data analysis starts with a question, and what types of questions data science can answer. You can choose any project topic as long as relevant data is available.
My research project topic was how life expectancy is related to social, economic, and environmental factors.
I downloaded the dataset from Gapminder (http://www.gapminder.org/data/). Details of the data are given in the codebook.
This is the link to the Gapminder codebook:
https://www.dropbox.com/s/lfzmkvnwb5r84ff/GapMinder%20Codebook%20.pdf?dl=0
After looking through the codebook for the Gapminder study, I decided that I am particularly interested in life expectancy.
So I chose the variable lifeexpectancy from the codebook.
I want to study how life expectancy behaves with respect to social, economic, and environmental factors. I picked out the variables that are related to these factors:
- alcconsumption (alcohol consumption)
- co2emissions (cumulative CO2 emission)
- oilperperson (oil consumption)
- incomeperperson (income per person)
I performed a literature review to see what work has already been done on this topic, so that I could bring out some new research. I took references from the research below:
- Life expectancy of people with intellectual disability: a 35-year follow-up study
K. Patja, M. Iivanainen, H. Vesala, H. Oksanen & I. Ruoppila
- Income distribution and life expectancy
R G Wilkinson
- Changes in life expectancy in Russia in the mid-1990s
Vladimir Shkolnikov, PhD, Prof Martin McKee, Prof David A Leon, PhD
- Drinking Pattern and Mortality: The Italian Risk Factor and Life Expectancy Pooling Project
After going through this literature, I generated my hypothesis: I believe that people with better income, lower alcohol consumption, and better environmental conditions have better life expectancy. In other words, I expect life expectancy to rise with income, and to fall as alcohol consumption rises and environmental conditions worsen.
We would like to find out what the association is between life expectancy and social, economic, and environmental factors. We would like to address several questions, such as:
- Do alcohol consumption and oil consumption determine life expectancy?
- Does better income mean better life expectancy?
- Is there a causal relationship?
- Does more CO2 emission mean lower life expectancy?
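One way to make the questions about income, alcohol, oil and CO2 concrete is to compute a correlation coefficient between each factor and life expectancy. The sketch below is only an illustration: it assumes the data is already in a pandas DataFrame that uses the codebook's column names, and the helper name `correlation_with_life_expectancy` is made up for this example. Keep in mind that a correlation by itself cannot answer the causality question.

```python
import pandas as pd

# Column names as they appear in the codebook.
FACTORS = ["incomeperperson", "alcconsumption", "co2emissions", "oilperperson"]


def correlation_with_life_expectancy(df, factors=FACTORS, target="lifeexpectancy"):
    """Return the Pearson correlation of each factor with the target column.

    Hypothetical helper for illustration only.
    """
    result = {}
    for factor in factors:
        # Keep only the rows where both values are present.
        pair = df[[factor, target]].dropna()
        # Series.corr uses the Pearson method by default.
        result[factor] = pair[factor].corr(pair[target])
    return pd.Series(result, name="pearson_r")
```

A positive value means the factor tends to rise together with life expectancy, and a negative value means it tends to move in the opposite direction. Calling `correlation_with_life_expectancy(df)` on a DataFrame with these five columns gives one number per factor.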
The variables that we will consider for the hypothesis are:
- incomeperperson (Gross Domestic Product per capita in constant 2000 US$)
- alcconsumption (alcohol consumption per adult (age 15+), litres. Recorded and estimated average alcohol consumption, adult (15+) per capita consumption in litres pure alcohol)
- co2emissions (cumulative CO2 emission (metric tons). Total amount of CO2 emission in metric tons since 1751)
- lifeexpectancy (life expectancy at birth (years). The average number of years a newborn child would live if current mortality patterns were to stay the same)
- oilperperson (oil consumption per capita (tonnes per year and person))
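Here is a minimal sketch of how these variables could be pulled out of the data with pandas. It assumes the Gapminder data has been saved as a CSV file whose header uses the codebook's variable names; the function name `load_variables` and the file name `gapminder.csv` mentioned afterwards are placeholders of mine, not something from the course. Values that cannot be read as numbers are turned into missing values, so that later calculations do not fail on them.

```python
import pandas as pd

# The five variables chosen from the codebook.
VARIABLES = [
    "incomeperperson",
    "alcconsumption",
    "co2emissions",
    "lifeexpectancy",
    "oilperperson",
]


def load_variables(csv_source, columns=VARIABLES):
    """Read only the chosen columns and convert them to numbers.

    csv_source can be a file path or any file-like object.
    Hypothetical helper for illustration only.
    """
    # usecols skips every column we are not interested in.
    data = pd.read_csv(csv_source, usecols=columns)
    # Blank or non-numeric entries become NaN instead of raising an error.
    data = data.apply(pd.to_numeric, errors="coerce")
    # read_csv keeps the file's column order, so reorder explicitly.
    return data[columns]
```

With the file in the working directory, `load_variables("gapminder.csv")` would return a DataFrame with one row per country and one numeric column per variable.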
This will help us answer important questions, such as: how can a country achieve better life expectancy without compromising on its development projects, which cause a lot of CO2 emission? Where should a country focus its agenda to improve life expectancy? So let's dive into the data…