Wednesday, February 3, 2021

Descriptive statistics

 Types of data:

There are two types of data.

  •  Qualitative data
  •  Quantitative data

Qualitative data (categorical data):

                    Qualitative or categorical attributes describe the object under consideration using a finite set of discrete classes. These can be divided further into 

  • Nominal data: Nominal attributes are qualitative attributes whose values have no natural order.
  • Ordinal data: Ordinal attributes are qualitative attributes whose values have a natural order.

Quantitative data:

                    Quantitative attributes have numeric values and are used to count or measure certain properties of the population. These can be divided further into
  • Discrete data: Discrete attributes are quantitative attributes which can take on only a countable set of numerical values (typically integers).
  • Continuous data: Continuous attributes are quantitative attributes which can take on fractional values (real numbers).

Why bother about data types? Because the type of statistical analysis depends on the type of the variable. 

Suppose we ask the questions below about a qualitative attribute:

What is the average color of all the shirts in my catalogue? - Doesn't make sense
What is the average nationality of all the students in the class? - Doesn't make sense
What is the frequency of the color red in my catalogue? - Right question
Regression analysis (between two numeric variables) - Doesn't make sense for qualitative attributes
Analysis of variance (ANOVA) - Right tool for qualitative attributes
Chi-square test - Right tool for qualitative attributes

We can ask the below questions for quantitative discrete attributes,
What is the average value in the dataset?
What is the spread of the data?
What is the frequency of the given value?
Regression analysis.

We can ask the below questions for quantitative continuous attributes,
What is the average value in the dataset?
What is the spread of the data?
Regression analysis.

But,
What is the frequency of a given value? - This won't make sense, as continuous attributes take fractional values that may rarely repeat exactly.

Numbers:

Whole numbers: 0, 1, 2, ... (no fractions, no negatives)
Integers: ..., -3, -2, -1, 0, 1, 2, 3, ... (no fractions)
Rational numbers: 1/2, 1/3, 5/2, ... (ratio of two integers)
Irrational numbers: cannot be expressed as a ratio of two integers (pi, square root of 2)
Real numbers: rational + irrational


Whole numbers are a subset of integers, which are a subset of rational numbers, which are a subset of real numbers.

Irrational numbers are also a subset of real numbers.

How to describe Qualitative data?

Generally, the values of categorical/qualitative data keep repeating in the data. 
For example,
  • How many red color shirts in the catalogue? 
  • How many times does LWB appear? 
  • How many kharif crops are there in the data?
This led to the term Frequency.

Primary question here is ,

What is the frequency of different categories?

The total number of times a value appears in the data is called its frequency. Frequency can be described by a frequency table with two columns: one for the values and the other for their frequency of appearance, as shown below.

Number of centuries scored by Sachin against country.
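A frequency table like this can be built in a couple of lines. A minimal sketch with made-up category data (the crop names below are purely illustrative):

```python
from collections import Counter

# Hypothetical categorical data: the crop grown on each farm in a survey
crops = ["rice", "wheat", "rice", "maize", "rice", "wheat"]

# Frequency table: category -> number of times it appears in the data
freq = Counter(crops)
for category, count in freq.most_common():
    print(category, count)
# rice appears 3 times, wheat 2 times, maize once
```

`Counter.most_common()` already returns the categories sorted by count, which is exactly the sorted ordering recommended for frequency plots later in these notes.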

Frequency plots:

A better and more efficient way to represent frequency is a frequency plot, as shown below,

Frequency plot of no. of centuries scored by Sachin against each country.

Here, the categorical attribute runs along the horizontal axis and its frequency is mapped along the Y axis. The height of each bar is proportional to the count. 

Suppose I plot the frequency of crops grown in India and want to know the 7th most grown crop in the country; this is difficult if the chart is not sorted. So sort the values by their counts (along the Y axis) for better visualization. 

Sorted along the frequency gives a better visualisation.

Frequency plots- Long tailed distribution:

As shown in the above graph, a long tailed distribution has,
  • A large number of tall bars at the beginning.
  • A large number of short bars at the end.
  • Very common in real world scenarios.

Frequency plots - Uniform distribution:


Frequency plot of rolling a die. As each number has an equal chance of being rolled, all the bars will be of almost equal height. This is a uniform distribution.

Relative frequency tables:

What percentage of farms grow groundnut? To calculate the percentage I need to know the total number of farms, which is difficult to figure out from the frequency plot. Hence we use relative frequency tables, as they are easier to interpret than absolute frequencies.



Relative frequency plots make percentage questions easy to answer in visual form.
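Converting a frequency table into a relative frequency table just means dividing each count by the total. A short sketch with made-up crop data:

```python
from collections import Counter

# Hypothetical data: the crop grown on each of 8 farms
crops = ["rice", "wheat", "rice", "maize", "rice", "wheat", "groundnut", "rice"]
freq = Counter(crops)
total = sum(freq.values())

# Relative frequency: each count as a fraction of the total
rel_freq = {category: count / total for category, count in freq.items()}
print(rel_freq["rice"])  # 4 of 8 farms grow rice -> 0.5
```

The relative frequencies always sum to 1, which is what makes two datasets of different sizes directly comparable.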

Grouped frequency plots:

Until now, we had one data set and were plotting the frequencies for that set. Suppose I want to compare two data sets. If I have farming data for two or three consecutive years and want to find whether the farming pattern has changed over the years, I go for grouped frequency plots as shown below.


Grouped relative frequency charts:

From the above grouped frequency bar charts, it might look like the number of rice farms has decreased over the years. But we cannot be sure, as the total number of farms could have decreased. So it is better to use grouped relative frequency charts.

From the above chart, we can see that the share of rice farms has in fact increased.


How to describe Quantitative data?

Does frequency make sense for quantitative data? 
Let's consider Sachin's ODI matches. The number of runs scored in each match is a discrete quantitative attribute.

Suppose we want to answer the following questions,
  • What would the histogram of Sachin's scores look like?
  • Where would the tallest bar be?
  • Would there be some regions along the x-axis with a bar height of 0?
Let's plot the histogram of Sachin's scores.


Surprisingly, the tall bars are at the beginning, indicating he has many low scores. There are blank regions along the x-axis.

Issues
  • Too many unique values along the x-axis.
  • How many times was he dismissed in his 90s or in single digits? It is difficult to answer these questions. And we are not keen to know how many times he scored exactly 40.
Solution: we group the values into bins: 0-9, 10-19, 20-29, ...



From the above chart, we can say that he scored in the 90s around 18 times.
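Grouping scores into bins of width 10 is a one-liner with integer division. The scores below are invented for illustration, not Sachin's actual data:

```python
from collections import Counter

# Hypothetical match scores (discrete quantitative data)
scores = [0, 4, 12, 18, 25, 37, 45, 52, 67, 91, 93, 98, 100, 114]

bin_width = 10
# Map each score to the left edge of its bin: 0-9 -> 0, 10-19 -> 10, ...
binned = Counter((s // bin_width) * bin_width for s in scores)

for left in sorted(binned):
    print(f"{left}-{left + bin_width - 1}: {binned[left]}")
# e.g. the 90-99 bin holds 91, 93 and 98, so its count is 3
```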

How to choose the right bin size?

If we choose a bin size of 5, will there be too many bins? 
If we choose a bin size of 20, we lose granularity.
So both extremes are bad. Choose something that shows sufficient detail without too many bins.
Bin size therefore depends on the range of the data. If the range is large, too small a bin size will not be helpful; and if the range is small, too big a bin size will hide the important details.

So the ideal bin size is one which reveals meaningful patterns (neither hiding details nor showing too many).

Suppose we want to plot Sachin's strike rate, which is continuous quantitative data. Plotting every unique value would be cumbersome to read. So here too we use bins of the right size to surface interesting information and patterns.

What about class boundaries?
A bin of 0 - 2 will contain all values between 0 and 2. But what about the value 2 itself?

Left-end inclusion convention: a class interval includes its left end boundary but not its right end boundary.
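The convention can be demonstrated in one line: integer division maps each value to a half-open bin [k·width, (k+1)·width), so a boundary value always lands in the bin to its right.

```python
def bin_index(value, width):
    # Left-end inclusion: [0, width) is bin 0, [width, 2*width) is bin 1, ...
    return int(value // width)

# With bins of width 2, the boundary value 2 falls in [2, 4), not [0, 2)
print(bin_index(1.9, 2))  # 0 -> bin [0, 2)
print(bin_index(2.0, 2))  # 1 -> bin [2, 4)
```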

Summary:

Relative frequency histograms:

Suppose we want to know the percentage of matches in which Sachin scored fewer than 20 runs. This is difficult to answer from the histogram. So, as with qualitative data, we use relative frequency histograms. 



Here we calculate the relative frequency table and then plot the relative frequency histogram.

Relative frequency histograms also aid in comparing two charts, since they are in percentages and give the real picture. Suppose I want to compare Sachin and Virat Kohli: in what percentage of matches did Sachin and Virat score fewer than 10 runs?





Frequency Polygons:

If we want to compare more datasets,
  • We have to individually check a different chart for each dataset, which is cumbersome.
  • We can overlap charts, but it is hard to distinguish the individual histograms.
  • We can draw grouped bar charts, but we lose the overall trend.
So we go for frequency polygons.


Here the trend is still intact, and we can compare more datasets too by grouping. 



Again, frequency polygons are easier to compare and give us the real picture. 



Frequency polygons for continuous data:

We can similarly compare continuous data using frequency polygons.


Cumulative frequency polygons:

Suppose I want to know in how many matches Sachin scored fewer than 60 runs. From the frequency polygons, I have to add up subsequent bars to find the answer. Instead I can use cumulative frequency polygons. 



Again, if I want the answer in percentages, I can use a cumulative relative frequency polygon.
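Cumulative (relative) frequencies are just running totals over the binned counts. A sketch with invented per-bin counts:

```python
from itertools import accumulate

# Hypothetical counts of matches per score bin: 0-19, 20-39, 40-59, 60-79, 80-99
counts = [120, 90, 60, 40, 30]
total = sum(counts)

# Cumulative frequency: running total of counts up to each bin
cum = list(accumulate(counts))
# Cumulative relative frequency: running total as a fraction of all matches
cum_rel = [c / total for c in cum]

print(cum[2])       # matches with score below 60: 120 + 90 + 60 = 270
print(cum_rel[-1])  # the last value is always 1.0
```

Reading "fewer than 60 runs" off the cumulative curve is now a single lookup instead of adding bars by hand.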

Again, comparing becomes easier with cumulative relative frequency polygons.


Stem and Leaf plots:



  • These are similar to histograms but carry more detail in the bars, with the first digit of each value as the stem and the remaining digits as leaves. 
  • As the numbers get larger, we take two or three digits as stems. This is similar to choosing a bin size.
  • It is not useful if the dataset is large; it is efficient for small datasets, where it is easy to see the patterns.
  • It is good for comparing two datasets side by side.
  • It can be used for continuous data by rounding the numbers.
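A basic stem-and-leaf plot for two-digit values can be built by splitting each number into its tens digit (stem) and units digit (leaf). The values below are made up:

```python
from collections import defaultdict

# Hypothetical two-digit values; tens digit is the stem, units digit the leaf
values = [12, 15, 18, 23, 23, 27, 31, 34, 42, 45, 45, 48]

stems = defaultdict(list)
for v in sorted(values):
    stems[v // 10].append(v % 10)

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
# 1 | 2 5 8
# 2 | 3 3 7
# 3 | 1 4
# 4 | 2 5 5 8
```

Each row doubles as a bar (its length is the bin's frequency) while still showing the individual data values.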



Monday, February 1, 2021

Statistics

Statistics in Data science:

The following are the steps involved in data science,

Data collection - If we have to conduct experiments and collect data, statistics is needed to conduct randomized controlled experiments.
Processing data - For standardization and normalization.
Describing data - Summary statistics needs statistics.
Modelling data - Statistical modelling or algorithmic modelling.

What is statistics?

Statistics is the process of collecting, describing and drawing inferences from data. In statistics we are usually interested in studying a large collection of people or objects. But it is expensive to study the whole population, or we may not be able to study it due to constraints. As a solution, we select a sample from the population and conduct the study on it.

Population: A population is the total collection of all the objects we are interested in studying.
Sample: A subgroup of the population that we study to draw inferences about the population.

What are the inferences/parameters we are trying to draw?
Say,
The proportion of citizens in favor of a candidate.
The average mileage of the cars manufactured by a company.
The variance in the yield of the farms in a state.

When we compute these on a sample of the population, each is a statistic.

Parameter: Any numeric property of the entire population.
Statistic: Any numeric property of a sample of the population, used as an estimate for the corresponding parameter of the population.

How to select a sample?

A sample should be representative of the population; only then will the estimate be useful.
A few sampling strategies:
Simple random sampling
Stratified sampling
Cluster sampling
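Simple random sampling, the first strategy above, can be sketched with the standard library: every unit in the population has an equal chance of being selected, and selection is without replacement. The population here is just numbered units for illustration:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 1000 numbered units
population = list(range(1000))

# Simple random sampling: each unit is equally likely to be chosen,
# and random.sample draws without replacement
sample = random.sample(population, k=50)

print(len(sample))       # 50
print(len(set(sample)))  # 50 distinct units (no repeats)
```

Stratified and cluster sampling build on this: the population is first split into groups, and simple random sampling is then applied within (stratified) or across (cluster) those groups.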

How to design an experiment to collect data?

Suppose we want to study whether consuming walnuts daily helps decrease cholesterol. While studying the effect of one variable (walnuts) on another (cholesterol), we should ensure we nullify the effect of lurking variables (exercise, smoking).
As a solution, we conduct randomized controlled experiments, where we take two samples: one is given just a placebo and the other is given the walnuts. This nullifies the lurking variables, as both samples contain them.

How to describe and summarize data?

Usually data is stored in tabular format, where it is difficult to answer even simple questions. So we plot the data as a graph or draw its distribution, from which we can visually get an idea of the answers we are trying to figure out. Apart from this, we can also summarize data with the mean, median, mode, variance and standard deviation.
So in descriptive statistics, we will learn different plots and measures like 
(Relative) Frequency charts
(Relative) Frequency polygons
Histograms
Stem and leaf plots
Box plots
Scatter plots
Measures of centrality and spread

Why do we need probability theory?

A sampling strategy is said to be truly random (unbiased) only if every element in the population has an equal chance of becoming part of the sample. 

So what do we mean by 'Chance'?

The branch of mathematics that deals with chances and probabilities is called probability theory.
If we observe a trend in a small sample, what is the chance that it will appear in other samples, or in the entire population? 

For example, if I calculate the mean of a small sample, what is the chance that the mean of the entire population is close to this mean?

How do we give guarantees for the estimate made from the sample?

Suppose we have a population of 10 and a sample size of two. We can draw an ordered pair from 10 in 10 × 9 = 90 ways (45 distinct pairs if order doesn't matter). Now take 5 such samples of two and find the mean of each; we get 5 sample means, one per sample. From the distribution of these means we can build an interval estimate: for example, an interval in which we are roughly 95% confident the mean of the entire population falls.
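The idea of a sampling distribution of means can be simulated directly: draw many small samples, record each sample's mean, and observe that those means cluster around the true population mean. The population values below are invented for illustration:

```python
import random
import statistics

random.seed(0)  # fixed seed for reproducibility

# Hypothetical population of 10 values
population = [4, 8, 15, 16, 23, 42, 7, 11, 19, 30]

# Draw many samples of size 2 and record each sample's mean
sample_means = []
for _ in range(1000):
    sample = random.sample(population, k=2)
    sample_means.append(statistics.mean(sample))

# The sample means cluster around the true population mean
print(statistics.mean(population))    # 17.5
print(statistics.mean(sample_means))  # close to 17.5
```

This is exactly the object ("distribution of sampling statistics") that point and interval estimates are built from.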

To do this we will learn: 
Point estimates
Distributions of sampling statistics
Interval estimates

What is hypothesis and how do we test it? 

A hypothesis is an assumption we want to test for truth. Example: Bumrah's mean bowling speed is more than 90 mph. 

To do this, we will learn:
Hypothesis testing:
One population, two populations and multiple populations
Z - Tests, T - Tests and Analysis of variance (ANOVA)

How to model relationship between the variables?

Suppose we want to model the relationship between variables (decrease in cholesterol and number of days of treatment); we use statistical modelling, i.e. a simple relationship like a linear equation.

            y = m x + c , where parameters m and c will be estimated from the data sample.

As m and c are estimated from the data sample and not from the population, we have to deal with uncertainty: are we, say, 99% sure that the estimated m and c are close to the real m and c of the entire population?
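Estimating m and c from a sample uses the least-squares formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A sketch with invented treatment data:

```python
# Hypothetical sample: days of treatment (x) vs decrease in cholesterol (y)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares estimates for y = m*x + c:
# m = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2), c = y_bar - m * x_bar
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
    sum((xi - mean_x) ** 2 for xi in x)
c = mean_y - m * mean_x

print(round(m, 2), round(c, 2))  # 1.99 0.05
```

A different sample would give slightly different m and c; quantifying that variation is what confidence bands are for.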

To do this, we will learn:
Linear regression 
Estimating parameters
Estimating confidence bands 
Measuring goodness of fit.

How well does the model fit the data:

Consider the below hypothesis.

In cricket, the five ways of getting dismissed are equally likely. The model would look as below.


Now I try to estimate the probabilities from a data sample. Say I take the last 100 matches played and model them as below,


Now I am interested in "Are the variations observed in the sample significant or is it due to random chance?"

To do this we will learn,
Chi Square test
Determine the goodness of fit
Determine if two variables are independent
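The chi-square goodness-of-fit statistic compares observed counts against the counts the model expects, summing (O - E)² / E over the categories. The dismissal counts below are made up; the uniform model expects 100 / 5 = 20 per mode:

```python
# Hypothetical observed dismissals over 100 wickets across five modes
observed = [30, 25, 20, 15, 10]
expected = [20, 20, 20, 20, 20]  # "equally likely" model: 100 / 5 each

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_sq)  # (100 + 25 + 0 + 25 + 100) / 20 = 12.5
```

A large statistic relative to the chi-square distribution (with categories minus one degrees of freedom) suggests the deviations are not just random chance.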

We will be covering the below topics in the next chapters.

 



Engineering Data Science systems

Systems thinking:

Engineering aspects of Data Science:

This is a way of thinking of systems as a whole rather than focusing on one particular issue. Suppose we have to sort data. We could use the bubble sort algorithm, but it doesn't take advantage of the caches. Instead we can use quicksort, which exploits the cache and is thus more efficient. Hence, while building a system, always see the larger picture rather than the particular issue.

In data science, systems thinking needs domain knowledge, hacking skills, and math and stats.

Roles involved in Data Science:

To have a systems perspective on data science, one needs business knowledge, programming, statistics and communication. A data analyst knows business knowledge, statistics and communication. A research analyst knows communication, statistics and programming.

Processes involved in Data Science:

Engineering systems of data science involves two parts,
  • Process 
  • Programming.

So to become a data engineer, a person should know not just programming but also the process.

Process :

Process has two components.

1.  Flow of steps (What are the steps I take?)

2. Agile improvement. (How to improve each step in an agile way?)

One such process followed in data science is CRISP-DM.

CRISP-DM:

1. Business understanding. One needs to understand the context one is working on and should be able to specify the problems we are trying to solve.

2. Data understanding. What data do I have? will this solve the problem I have?

3. Data preparation.

4. Data modelling. How do I model? How do I analyze the hypothesis?

5. Evaluate the model.

6. Deploy the system.

All the steps in the process are iterative and repeated many times until we achieve the expected output. Data science follows the MVP approach, where we build a simple complete system and then add to it iteratively. This is agile improvement.

Programming tools:

No code environments like IBM Watson, Amazon Lex. Paid interfaces that anyone can use to analyze data.

Spreadsheets and BI tools like Microsoft Excel, Google sheets, power BI, Tableau.

Programming languages and environments like Weka, MATLAB, Mathematica, Python and R.

High performance stacks like Hadoop and Spark.

Why Python?

1. Python is beginner friendly

2. Python is increasingly the popular choice for data science

3. Python is good for production and planning

4. Availability of open source libraries which are used in Data science

5. Python is useful beyond data science, e.g. as a scripting language, for web applications, and for programming IoT devices.

Disadvantages

Python is an interpreted language, which can make it slower.