This post is a continuation of the A Data Science Project series. In this post, we will use the ANOVA test, which applies whenever we need to check whether two or more groups differ from each other. For example, say a school has students of four races: White, Hispanic, Black and Indian. The school management wants to know whether the marks scored by these groups are statistically different. If they are, management would like to do something about it.
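To make the idea concrete, here is a minimal sketch of the one-way ANOVA F statistic in plain Python. The series itself uses SAS, so this is only an illustration; the function name `one_way_anova_f` and the marks below are made up for the example and are not from the project data.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)                         # number of groups
    n_total = sum(len(g) for g in groups)   # total number of observations
    grand_mean = mean(x for g in groups for x in g)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical marks for the four groups in the school example
marks = {
    "White": [72, 68, 75, 70],
    "Hispanic": [65, 70, 66, 69],
    "Black": [71, 74, 69, 73],
    "Indian": [78, 74, 80, 76],
}

f_stat, df_b, df_w = one_way_anova_f(list(marks.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the group means are spread out more than the within-group noise would suggest. To turn F into a p-value you need the F distribution; if SciPy is available, `scipy.stats.f_oneway` computes both in one call.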
I hope you have gone through part 1 and part 2. Today I will show you not only how to explore data through visualization but also, most importantly, how to interpret the plots. Don’t miss the summary at the end of every post.
We will create bar charts, histograms, scatter plots, box plots, etc. We will also check the association between two variables. Let’s start.
In the previous article (https://d4datascience.wordpress.com/2016/11/10/a-data-science-project-part-1/), we did basic data analysis such as calculating means, frequency tables and summaries. Now we will derive new variables. Why?
Derived variables help us understand more about the original variables. For example, we derived the variable ip (from the incomeperperson variable), which helps us see how many people fall in the lower income class or the higher income class. Similarly, the new variables le and ac were derived from other variables.
How do we decide the values of the cutoff points?
Either the business people answer this, or you have to explore the data yourself to decide how to divide the values into buckets.
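As a rough illustration of deriving a bucketed variable like ip, here is a small Python sketch. The cutoff values, labels and the function name `income_class` are assumptions made up for this example; the real cutoffs should come from the business side or from exploring the data, as described above.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical cutoff points for income per person (not from the project)
CUTOFFS = [1000, 10000]
LABELS = ["low", "middle", "high"]  # one more label than cutoffs


def income_class(income_per_person):
    """Map a numeric income to a bucket label; keep missing values missing."""
    if income_per_person is None:
        return None
    # bisect_right counts how many cutoffs are <= the value,
    # which is exactly the index of the bucket the value falls into
    return LABELS[bisect_right(CUTOFFS, income_per_person)]


# Made-up incomeperperson values, including one missing value
incomes = [250.0, 1800.5, 32000.0, None, 9500.0, 700.0]
ip = [income_class(x) for x in incomes]

# Frequency of each derived class
print(Counter(ip))
```

Keeping the cutoffs in one list makes it easy to try different bucket boundaries and compare the resulting frequency counts.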
If you have any questions, let me know in the comments section.
In the earlier post, we discussed our hypothesis and the different variables impacting life expectancy. In this post, we will start digging into the data. I suggest you go through the codebook (check the earlier post). It is important to understand the variables before doing any analysis. I have chosen SAS as the data analysis language, but you are free to choose another language such as R or Python. The process and logic will be the same; just the syntax will be different. In this part, we will explore all the variables and check their frequency and distribution. For numeric variables, you can use a histogram, a boxplot and quantiles to understand them; for categorical variables, you can use a barplot and frequency tables.
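If you follow along in Python rather than SAS, the two kinds of summaries could be sketched like this using only the standard library. The helper names `frequency_table` and `numeric_summary` and the sample values are made up for illustration.

```python
from collections import Counter
from statistics import mean, quantiles


def frequency_table(values):
    """For a categorical variable: {category: (count, percent)}."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.most_common()}


def numeric_summary(values):
    """For a numeric variable: min, quartiles, max and mean."""
    # quantiles(..., n=4) returns the three quartile cut points
    q1, median, q3 = quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": mean(values),
    }


# Made-up sample values, not taken from the project data set
income_class = ["low", "high", "low", "middle", "low"]
life_expectancy = [52.1, 61.4, 68.0, 71.3, 74.9, 79.2, 81.5]

print(frequency_table(income_class))
print(numeric_summary(life_expectancy))
```

The quartiles give the same information a boxplot draws, and the frequency table is what a barplot visualizes, so these numbers are a quick check before plotting.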