Monday, February 1, 2021

Statistics

Statistics in Data Science:

The following steps in data science rely on statistics:

Data collection - if we have to conduct experiments and collect data, we need statistics to design randomized control experiments.
Processing data - for standardization and normalization.
Describing data - summary statistics.
Modelling data - statistical modelling or algorithmic modelling.
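As a quick illustration of the "processing data" step, here is a minimal sketch of standardization (z-score) and min-max normalization; the values are made up for illustration:

```python
# Standardization (z-score) and min-max normalization of a feature column.
values = [2.0, 4.0, 6.0, 8.0]

mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / len(values)
std = variance ** 0.5

# Standardization: rescale to mean 0 and standard deviation 1.
standardized = [(v - mean) / std for v in values]

# Normalization: rescale to the range [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
```

Libraries like scikit-learn provide the same transformations ready-made, but the arithmetic above is all that is happening underneath.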

What is statistics?

Statistics is the process of collecting, describing and drawing inferences from data. In statistics we are usually interested in studying a large collection of people or objects. But it is expensive to study the whole population, or we may not be able to study it at all due to constraints. As a solution, we select a sample from the population and conduct the study on it.

Population: the total collection of all the objects we are interested in studying.
Sample: a subgroup of the population that we study in order to draw inferences about the population.

What are the inferences/parameters we are trying to draw?
For example:
The proportion of citizens in favor of a candidate.
The average mileage of the cars manufactured by a company.
The variance in the yield of the farms in a state.

When we compute these quantities on a sample of the population, they are called statistics.

Parameter: any numeric property of the entire population.
Statistic: any numeric property of a sample of the population, used as an estimate of the corresponding parameter of the population.
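The parameter/statistic distinction can be made concrete with a small simulation; the population below is synthetic, generated only for illustration:

```python
import random

random.seed(42)

# The full population (in practice we usually cannot observe all of it).
population = [random.gauss(50, 10) for _ in range(10_000)]
parameter = sum(population) / len(population)      # population mean

# A sample drawn from the population.
sample = random.sample(population, 100)
statistic = sum(sample) / len(sample)              # sample mean: estimates the parameter
```

The sample mean will not equal the population mean exactly, but it will typically be close; quantifying how close is what the later sections on estimation cover.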

How to select a sample?

A sample should be representative of the population; only then will the estimate be useful.
A few sampling strategies:
Simple random sampling
Stratified sampling
Cluster sampling
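The first two strategies can be sketched in a few lines; the toy population and city split here are invented for illustration:

```python
import random

random.seed(0)
# Toy population: 700 people from city A, 300 from city B.
population = [{"id": i, "city": "A" if i < 700 else "B"} for i in range(1000)]

# Simple random sampling: every unit has an equal chance of selection.
srs = random.sample(population, 100)

# Stratified sampling: sample each stratum in proportion to its size,
# so the sample preserves the 70/30 city split exactly.
city_a = [p for p in population if p["city"] == "A"]
city_b = [p for p in population if p["city"] == "B"]
stratified = random.sample(city_a, 70) + random.sample(city_b, 30)
```

Stratified sampling guarantees the strata proportions; simple random sampling only matches them on average.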

How to design an experiment to collect data?

Suppose we want to study whether consuming walnuts daily helps decrease cholesterol. While studying the effect of one variable (walnuts) on another (cholesterol), we should nullify the effect of lurking variables (exercise, smoking, etc.).
As a solution, we conduct randomized control experiments, where we take two groups: one is given just a placebo and the other is given the walnuts. Random assignment nullifies the lurking variables on average, since both groups are equally likely to contain them.
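The random assignment step itself is simple; the participant IDs below are hypothetical:

```python
import random

random.seed(1)
participants = list(range(20))      # 20 hypothetical participant IDs

# Random assignment spreads lurking variables evenly across groups on average.
random.shuffle(participants)
treatment = participants[:10]       # receives walnuts
control = participants[10:]         # receives the placebo
```

Because each participant is equally likely to land in either group, exercisers, smokers, and so on end up roughly balanced between the two.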

How to describe and summarize data?

Usually data is stored in tabular format, where it is difficult to answer even simple questions. So we plot the data as a graph or draw its distribution, from which we can visually get an idea of the answers we are looking for. Apart from this, we can also summarize data with the mean, median, mode, variance and standard deviation.
So in descriptive statistics, we will learn different plots and measures like
(Relative) Frequency charts
(Relative) Frequency polygons
Histograms
Stem and leaf plots
Box plots
Scatter plots
Measures of centrality and spread
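The measures of centrality and spread are all available in Python's standard library; the dataset below is a made-up example:

```python
import statistics

# Summarize a small dataset with measures of centrality and spread.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)           # balance point of the data
median = statistics.median(data)       # middle value, robust to outliers
mode = statistics.mode(data)           # most frequent value
variance = statistics.pvariance(data)  # average squared deviation from the mean
std_dev = statistics.pstdev(data)      # square root of the variance
```

Note that `pvariance`/`pstdev` treat the data as the whole population; `variance`/`stdev` are the sample versions with the n-1 denominator.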

Why do we need probability theory?

A sampling strategy is said to be truly random (unbiased) only if every element in the population has an equal chance of becoming part of the sample.

So what do we mean by 'Chance'?

The branch of mathematics that deals with chance and probabilities is called probability theory.
If we observe a trend in a small sample, what is the chance that it will appear in other samples or in the entire population?

For example, if I calculate the mean of a small sample, what is the chance that the mean of the entire population is close to this sample mean?

How do we give guarantees for the estimate made from the sample?

Suppose we have a population of 10 and a sample size of two. We can choose two elements from 10 in 45 different ways (10 choose 2), or 90 if order matters. Now take 5 such samples of two and find the mean of each; we get 5 sample means, one per sample. The distribution of these sample means is called the sampling distribution of the mean. From it we can construct an interval estimate: an interval in which we are, say, 95% confident the mean of the entire population lies.
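The idea scales up naturally: draw many samples, compute each sample's mean, and look at the spread of those means. A minimal simulation (with a synthetic population invented for illustration):

```python
import random
import statistics

random.seed(7)
population = [random.uniform(0, 100) for _ in range(10_000)]
pop_mean = statistics.mean(population)

# Repeatedly draw samples of 50 and record each sample's mean.
sample_means = sorted(statistics.mean(random.sample(population, 50))
                      for _ in range(1_000))

# The middle 95% of the sample means gives a rough interval estimate.
lower, upper = sample_means[25], sample_means[974]
```

In this simulation the population mean falls inside the interval, which is exactly the kind of guarantee interval estimates formalize.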

To do this we will learn: 
Point estimates
Distributions of sampling statistics
Interval estimates

What is hypothesis and how do we test it? 

A hypothesis is an assumption we want to test. Example: Bumrah's mean bowling speed is more than 90 mph.

To do this, we will learn:
Hypothesis testing:
One population, two populations and multiple populations
Z - Tests, T - Tests and Analysis of variance (ANOVA)
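As a preview, here is a one-sample, one-sided t-test for the bowling-speed hypothesis above; the speeds are hypothetical numbers, not real match data:

```python
import math
import statistics

# Hypothetical sample of bowling speeds in mph (not real data).
speeds = [91.2, 89.8, 92.5, 90.7, 93.1, 88.9, 91.8, 92.2]

n = len(speeds)
x_bar = statistics.mean(speeds)
s = statistics.stdev(speeds)        # sample standard deviation

# H0: mu = 90  vs  H1: mu > 90  (one-sided, one-sample t-test)
t_stat = (x_bar - 90) / (s / math.sqrt(n))

# Reject H0 at the 5% level if t_stat exceeds the critical value
# t(0.05, df = 7) ~ 1.895.
reject_h0 = t_stat > 1.895
```

With this (made-up) sample, the t-statistic exceeds the critical value, so we would reject the hypothesis that the mean speed is only 90 mph.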

How to model relationship between the variables?

Suppose we want to model the relationship between two variables (say, decrease in cholesterol and number of days of treatment); we use statistical modelling, i.e. a simple relationship like a linear equation.

            y = m x + c, where the parameters m and c are estimated from the data sample.

As m and c are estimated from a data sample and not from the whole population, we have to deal with uncertainty, i.e. are we 99% sure that the estimated m and c are close to the real m and c of the entire population?
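The least-squares estimates of m and c have a closed form; here is a sketch with illustrative numbers (not a real cholesterol dataset):

```python
# Least-squares estimates for y = m*x + c (toy data: days of treatment
# vs. decrease in cholesterol; the numbers are illustrative).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
c = y_bar - m * x_bar        # intercept: the line passes through (x_bar, y_bar)
```

Different samples would give slightly different m and c, which is why confidence bands around these estimates matter.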

To do this, we will learn:
Linear regression 
Estimating parameters
Estimating confidence bands 
Measuring goodness of fit.

How well does the model fit the data?

Consider the below hypothesis.

In cricket, suppose the five ways of getting dismissed are equally likely. Under this model, each dismissal type has probability 1/5 = 0.2.

Now I try to estimate these probabilities from a data sample. Say I take the last 100 matches played and compute the observed proportion of each dismissal type. These observed proportions will typically differ somewhat from 0.2.
Now I am interested in "Are the variations observed in the sample significant or is it due to random chance?"

To do this we will learn,
Chi Square test
Determine the goodness of fit
Determine if two variables are independent

We will cover these topics in the coming chapters.

Engineering Data Science systems

Systems thinking:

Engineering aspects of Data Science:

This is a way of thinking about systems as a whole rather than focusing on one particular issue. Suppose we have to sort data: we could use bubble sort, but it doesn't take advantage of CPU caches. Instead we can use quicksort, whose memory access pattern is cache-friendly and therefore more efficient. Hence while building a system, always look at the larger picture rather than one particular issue.

In data science, systems thinking needs domain knowledge, hacking skills, and math and stats.

Roles involved in Data Science:

To have a systems perspective of data science, one needs business knowledge, programming, statistics and communication. A data analyst knows business knowledge, statistics and communication. A research analyst knows communication, statistics and programming.

Processes involved in Data Science:

Engineering systems of data science involves two parts,
  • Process 
  • Programming.

So to become a data engineer, a person should know not just programming but also the process.

Process :

Process has two components.

1.  Flow of steps (What are the steps I take?)

2. Agile improvement. (How to improve each step in an agile way?)

One such process followed in data science is CRISP-DM.

CRISP-DM:

1. Business understanding. One needs to understand the context one is working in and should be able to specify the problems we are trying to solve.

2. Data understanding. What data do I have? Will it solve the problem at hand?

3. Data preparation.

4. Data modelling. How do I model? How do I analyze the hypothesis?

5. Evaluate the model.

6. Deploy the system.

All the steps in the process are iterative and repeated many times until we achieve the expected output. Data science follows the MVP approach, where we first build a simple but complete system and then build on it iteratively. This is agile improvement.

Programming tools:

No-code environments like IBM Watson and Amazon Lex: paid interfaces that anyone can use to analyze data.

Spreadsheets and BI tools like Microsoft Excel, Google Sheets, Power BI and Tableau.

Programming languages and environments like Weka, MATLAB, Mathematica, Python and R.

High performance stacks like Hadoop and Spark.

Why Python?

1. Python is beginner friendly

2. Python is increasingly the popular choice for data science

3. Python is good for production and planning

4. Availability of open source libraries which are used in Data science

5. Python is useful beyond data science, e.g. as a scripting language, for web applications, and for programming IoT devices.

Disadvantages

Python is an interpreted language, which can make it slower.