Statistics in Data Science:
The following steps in data science all rely on statistics:
Data collection - When we have to conduct experiments to collect data, we need statistics to design randomized controlled experiments.
Processing data - Standardization and normalization use statistical quantities such as the mean and standard deviation.
Describing data - Summary statistics condense the data into a few numbers.
Modelling data - Statistical modelling or algorithmic modelling.
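As a minimal sketch of the data processing step, the snippet below assumes that "standardization" means z-scoring (mean 0, standard deviation 1) and "normalization" means min-max scaling to [0, 1]. The function names and the `heights_cm` data are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score standardization: rescale to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Min-max normalization: rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical data
z_scores = standardize(heights_cm)
scaled = normalize(heights_cm)
print(z_scores)
print(scaled)
```

Both functions assume the values are not all identical, since otherwise the divisor would be zero.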
What is statistics?
Statistics is the process of collecting data, describing it, and drawing inferences from it. In statistics we are usually interested in studying a large collection of people or objects. However, it is often too expensive to study the whole population, or other constraints make it impossible. As a solution, we select a sample from the population and conduct the study on the sample.
Population: A population is the total collection of all the objects we are interested in studying.
Sample: A sample is a subgroup of the population that we study to draw inferences about the population.
What are the inferences/parameters we are trying to draw?
For example:
The proportion of citizens in favor of a candidate.
The average mileage of the cars manufactured by a company.
The variance in the yield of the farms in a state.
When we compute these quantities on a sample of the population instead of on the whole population, the result is a statistic.
Parameter: A parameter is any numeric property of the entire population.
Statistic: A statistic is any numeric property of a sample of the population; it is used as an estimate of the corresponding population parameter.
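The difference between a parameter and a statistic can be sketched in a few lines. The population here is simulated (hypothetical car mileages drawn from a normal distribution), and the sample size of 100 is an arbitrary assumption.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: mileage (km per litre) of 10,000 cars
population = [random.gauss(18.0, 2.0) for _ in range(10_000)]
parameter = mean(population)  # numeric property of the whole population

sample = random.sample(population, 100)
statistic = mean(sample)  # numeric property of the sample, used as an estimate

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

In practice the parameter is unknown, which is why the statistic is needed; it is computed here only because the population is simulated.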
How to select a sample?
A sample should be representative of the population; only then will the estimate be useful.
A few sampling strategies:
Simple random sampling
Stratified sampling
Cluster sampling
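The three strategies listed above can be sketched with the standard library. The population of twelve people tagged by region and all function names are hypothetical; this is one simple reading of each strategy, not a complete treatment.

```python
import random

random.seed(1)

# Hypothetical population: 12 people, each tagged with a region
population = [(f"person{i}", region)
              for i, region in enumerate(["north", "south", "east"] * 4)]


def region_of(person):
    return person[1]


def group_by(items, key):
    """Group items into a dict keyed by key(item)."""
    groups = {}
    for item in items:
        groups.setdefault(key(item), []).append(item)
    return groups


def simple_random_sample(items, k):
    """Every subset of size k is equally likely."""
    return random.sample(items, k)


def stratified_sample(items, key, k_per_stratum):
    """Draw a random sample from every stratum."""
    picked = []
    for members in group_by(items, key).values():
        picked.extend(random.sample(members, k_per_stratum))
    return picked


def cluster_sample(items, key, n_clusters):
    """Pick whole clusters at random and keep every member of the chosen ones."""
    clusters = group_by(items, key)
    chosen = random.sample(list(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


srs = simple_random_sample(population, 3)
strat = stratified_sample(population, region_of, 1)
clus = cluster_sample(population, region_of, 1)
print(srs, strat, clus, sep="\n")
```

Stratified sampling guarantees every region is represented, while cluster sampling surveys only the chosen regions but does so completely.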
How to design an experiment to collect data?
Suppose we want to study whether consuming walnuts daily helps decrease cholesterol. While studying the effect of one variable (walnuts) on another (cholesterol), we should ensure that we nullify the effect of lurking variables (exercise, smoking).
As a solution, we conduct a randomized controlled experiment, in which we split the participants at random into two groups. One group is given just a placebo and the other is given the walnuts. Because the assignment is random, the lurking variables are spread, on average, equally across both groups, so they cannot explain a difference between the groups.
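The random assignment step of such an experiment might look like the sketch below. The participant IDs and the group size of 20 are hypothetical.

```python
import random

random.seed(42)

participants = [f"participant{i}" for i in range(20)]  # hypothetical IDs

# Shuffle a copy, then split it: chance alone decides who gets which treatment
shuffled = participants[:]
random.shuffle(shuffled)
half = len(shuffled) // 2
treatment_group = shuffled[:half]  # receives walnuts
control_group = shuffled[half:]    # receives the placebo

print("treatment:", treatment_group)
print("control:  ", control_group)
```

Since neither the researcher nor the participant chooses the group, habits such as exercise or smoking have no systematic way to end up concentrated in one group.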
How to describe and summarize data?
Data is usually stored in a tabular format. In this format it is difficult to answer even simple questions. So we plot the data as a graph or draw its distribution, from which we can visually get an idea of the answers we are looking for. Apart from this, we can also summarize the data with the mean, median, mode, variance and standard deviation.
So in descriptive statistics, we will learn different plots and measures, such as:
(Relative) Frequency charts
(Relative) Frequency polygons
Histograms
Stem and leaf plots
Box plots
Scatter plots
Measures of centrality and spread
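The measures of centrality and spread mentioned above are available in Python's standard `statistics` module. A minimal sketch, using hypothetical exam marks:

```python
from collections import Counter
from statistics import mean, median, mode, stdev, variance

marks = [55, 60, 60, 65, 70, 75, 90]  # hypothetical data

summary = {
    "mean": mean(marks),
    "median": median(marks),
    "mode": mode(marks),
    "variance": variance(marks),  # sample variance (divides by n - 1)
    "std_dev": stdev(marks),      # sample standard deviation
}

# Relative frequency of each distinct value, as used in a relative frequency chart
relative_frequency = {value: count / len(marks)
                      for value, count in Counter(marks).items()}

print(summary)
print(relative_frequency)
```

Note that `variance` and `stdev` compute the sample versions; `pvariance` and `pstdev` are the population versions.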
Why do we need probability theory?
A sampling strategy is said to be truly random (unbiased) only if every element in the population has an equal chance of becoming part of the sample.
So what do we mean by 'chance'?
The branch of mathematics that deals with chances and probabilities is called probability theory.
It lets us ask: if we observe a trend in a small sample, what is the chance that it will also show up in other samples or in the entire population?
For example, if I calculate the mean of a small sample, what is the chance that the mean of the entire population is close to this mean?
How do we give guarantees for the estimate made from the sample?
Suppose we have a population of 10 and a sample size of two. In this case, we can choose a sample of two from the 10 in 10*9/2 = 45 different ways (10*9 = 90 if the order of selection matters). Now suppose we draw 5 such samples of two each and find the mean of each sample. We get 5 means, one for each sample. Let us look at the distribution of these means. From the distribution of sample means we can build an interval and state how confident we are (say, 95%) that the mean of the entire population falls inside it. The minimum and maximum of just five means give only a rough picture of that interval, not an exact 95% guarantee.
To do this we will learn:
Point estimates
Distributions of sampling statistics
Interval estimates
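For a population as small as 10, the full distribution of sample means can be enumerated directly. The sketch below assumes unordered samples drawn without replacement, which gives C(10, 2) = 45 samples; the population values 1 to 10 are hypothetical.

```python
from itertools import combinations
from statistics import mean

population = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical values
population_mean = mean(population)

# Every unordered sample of size two, drawn without replacement
samples = list(combinations(population, 2))
sample_means = [mean(s) for s in samples]

# Averaged over all possible samples, the sample mean equals the population mean
mean_of_means = mean(sample_means)

print("number of samples:", len(samples))
print("population mean:  ", population_mean)
print("mean of means:    ", mean_of_means)
print("range of means:   ", min(sample_means), "to", max(sample_means))
```

Any single sample mean can be far from the population mean, but the sample means as a whole are centred on it. That property is what point and interval estimates build on.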
What is a hypothesis and how do we test it?
A hypothesis is an assumption whose truth we want to test. For example: Bumrah's mean bowling speed is more than 90 mph.
To do this, we will learn:
Hypothesis testing:
One population, two populations and multiple populations
Z-tests, t-tests and analysis of variance (ANOVA)
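As a rough sketch of a one-population test, the snippet below runs a one-sided z-test on the bowling speed example. The speeds are made-up numbers, not real measurements, and the population standard deviation `known_sigma` is assumed to be known, which is what makes a z-test (rather than a t-test) applicable.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical bowling speeds in mph; not real measurements
speeds = [91.2, 89.5, 92.1, 90.8, 93.0, 88.9, 91.7, 92.4, 90.1, 91.3]
claimed_mean = 90.0  # H0: mean speed is 90 mph; H1: mean speed is greater
known_sigma = 1.5    # assumed known population standard deviation

sample_mean = mean(speeds)
standard_error = known_sigma / sqrt(len(speeds))
z = (sample_mean - claimed_mean) / standard_error

# One-sided p-value: probability of a z at least this large if H0 is true
p_value = 1 - NormalDist().cdf(z)
reject_h0 = p_value < 0.05  # 5% significance level, an arbitrary choice

print(f"z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

When the population standard deviation is not known and has to be estimated from a small sample, a t-test is the usual replacement.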
How to model the relationship between variables?
Suppose we want to model the relationship between two variables (decrease in cholesterol and number of days of treatment). For this we use statistical modelling, starting with a simple relationship such as a linear equation:
y = m x + c, where the parameters m and c are estimated from the data sample.
As m and c are estimated from the data sample and not from the population, we have to deal with uncertainty. For example, are we 99% sure that the estimated m and c are close to the real m and c of the entire population?
To do this, we will learn:
Linear regression
Estimating parameters
Estimating confidence bands
Measuring goodness of fit.
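Estimating m and c for y = m x + c, and measuring goodness of fit, can be sketched with ordinary least squares. The data points are hypothetical, and R squared is used here as one common goodness-of-fit measure; the source does not say which measure it has in mind.

```python
from statistics import mean

# Hypothetical data: days of treatment (x) and decrease in cholesterol (y)
days = [1, 2, 3, 4, 5]
decrease = [2.1, 3.9, 6.2, 7.8, 10.0]

x_bar, y_bar = mean(days), mean(decrease)

# Least-squares estimates of the slope m and intercept c
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(days, decrease))
s_xx = sum((x - x_bar) ** 2 for x in days)
m = s_xy / s_xx
c = y_bar - m * x_bar

# Goodness of fit: share of the variation in y explained by the line
predicted = [m * x + c for x in days]
ss_res = sum((y - p) ** 2 for y, p in zip(decrease, predicted))
ss_tot = sum((y - y_bar) ** 2 for y in decrease)
r_squared = 1 - ss_res / ss_tot

print(f"m = {m:.3f}, c = {c:.3f}, R^2 = {r_squared:.4f}")
```

A different sample would give slightly different values of m and c, which is the uncertainty that confidence bands are meant to capture.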
How well does the model fit the data?
Consider the following hypothesis.
In cricket, the five ways of getting dismissed are equally likely. Under this model, each way has a probability of 1/5 = 0.2.
Now I estimate the probabilities from a data sample. Say I take the last 100 matches played and estimate the probability of each way of getting dismissed as its observed relative frequency.
The question I am now interested in is: are the variations between the model and the sample significant, or are they due to random chance?
To do this we will learn:
Chi-square test
Determine the goodness of fit
Determine if two variables are independent
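The goodness-of-fit question for the cricket example can be sketched as a chi-square statistic computed by hand. The dismissal labels and observed counts below are hypothetical, since the source data is not shown. The critical value 9.488 is the standard table value for 4 degrees of freedom at the 5% significance level.

```python
# Hypothetical observed counts for 100 dismissals; not real match data
observed = {
    "caught": 55,
    "bowled": 20,
    "lbw": 15,
    "run out": 6,
    "stumped": 4,
}

total = sum(observed.values())
expected_each = total / len(observed)  # equally likely model: 1/5 of the total each

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((count - expected_each) ** 2 / expected_each
                 for count in observed.values())

degrees_of_freedom = len(observed) - 1
critical_value = 9.488  # chi-square table, 4 degrees of freedom, 5% level
reject_equal_likelihood = chi_square > critical_value

print(f"chi-square = {chi_square:.1f} with {degrees_of_freedom} degrees of freedom")
print("reject the 'equally likely' model:", reject_equal_likelihood)
```

With these made-up counts the statistic is far above the critical value, so the differences from the equally likely model would be too large to attribute to random chance.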
We will be covering the below topics in the next chapters.