Wednesday, February 3, 2021

Descriptive statistics

 Types of data:

There are two types of data.

  •  Qualitative data
  •  Quantitative data

Qualitative data (categorical data):

Qualitative or categorical attributes describe the object under consideration using a finite set of discrete classes. They can be divided further into:

  • Nominal data: Nominal attributes are qualitative attributes whose values have no natural order (for example, shirt color).
  • Ordinal data: Ordinal attributes are qualitative attributes whose values have a natural order (for example, shirt sizes S, M, L).

Quantitative data:

Quantitative attributes have numeric values and are used to count or measure certain properties of the population. They can be divided further into:
  • Discrete data: Discrete attributes are quantitative attributes that can take only separate, countable numerical values (typically integers).
  • Continuous data: Continuous attributes are quantitative attributes that can take any value in a range, including fractional values (real numbers).

Why bother about data types? Because the appropriate statistical analysis depends on the type of the variable.

Suppose we ask the questions below about a qualitative attribute:

What is the average color of all the shirts in my catalogue? - Doesn't make sense
What is the average nationality of all the students in the class? - Doesn't make sense
What is the frequency of the color red in my catalogue? - Right question
Regression analysis (between two numeric variables) - Doesn't make sense for qualitative attributes
Analysis of variance (ANOVA) - Right tool when a qualitative attribute defines the groups whose numeric means are compared
Chi-square test - Right tool for qualitative attributes
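The difference between a sensible and a senseless question can be seen directly in code. The sketch below uses a made-up list of shirt colors (not real catalogue data): counting and finding the most frequent category work, while asking for an arithmetic mean of labels fails.

```python
import statistics

# Hypothetical catalogue colors (made-up data, for illustration only).
shirt_colors = ["red", "blue", "red", "green", "red", "blue"]

# Counting how often a category occurs is meaningful for qualitative data.
red_count = shirt_colors.count("red")

# The most frequent category (the mode) is also meaningful.
most_common_color = statistics.mode(shirt_colors)

# An arithmetic mean is not defined for labels, so Python refuses to compute it.
try:
    average_color = statistics.mean(shirt_colors)
except TypeError:
    average_color = None

print(red_count, most_common_color, average_color)
```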

We can ask the questions below about quantitative discrete attributes:
What is the average value in the dataset?
What is the spread of the data?
What is the frequency of a given value?
Regression analysis.

We can ask the questions below about quantitative continuous attributes:
What is the average value in the dataset?
What is the spread of the data?
Regression analysis.

But,
What is the frequency of a given value? - This won't make sense, as continuous attributes have fractional values that may not repeat often.

Numbers:

Whole numbers: 0, 1, 2, ... (no fractions, no negatives)
Integers: ..., -3, -2, -1, 0, 1, 2, 3, ... (no fractions)
Rational numbers: 1/2, 1/3, 5/2, ... (ratio of two integers with a non-zero denominator)
Irrational numbers: Cannot be expressed as a ratio of two integers (pi, square root of 2)
Real numbers: Rational + Irrational


Whole numbers are a subset of the integers, which are a subset of the rational numbers, which are a subset of the real numbers.

Irrational numbers are a subset of the real numbers.

How to describe Qualitative data?

Generally, the values of categorical/qualitative data keep repeating in the data.
For example,
  • How many red shirts are in the catalogue?
  • How many times does LWB appear?
  • How many kharif crops are there in the data?
This leads to the notion of frequency.

The primary question here is:

What is the frequency of different categories?

The total number of times a value appears in the data is called its frequency. Frequencies can be summarized in a frequency table with two columns, one for the values and one for their frequency of appearance, as shown below.

Number of centuries scored by Sachin against country.
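A frequency table of this kind can be built in a few lines. The sketch below is only an illustration: the crop list is made up, and `collections.Counter` is one convenient way to count, not the only one.

```python
from collections import Counter

# Hypothetical observations of a categorical attribute (made-up data).
crops = ["rice", "wheat", "rice", "maize", "rice", "wheat", "groundnut"]

# Counter maps each distinct value to the number of times it appears.
frequency = Counter(crops)

# Print a two-column frequency table: the value, then its frequency.
print(f"{'Crop':<12}{'Frequency':>9}")
for crop, count in frequency.items():
    print(f"{crop:<12}{count:>9}")
```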

Frequency plots:

A better and more efficient way to represent frequencies is a frequency plot, as shown below.

Frequency plot of no. of centuries scored by Sachin against each country.

Here, the horizontal axis shows the categorical attribute and its frequency is mapped along the Y axis. The height of each bar is proportional to the count.

Suppose I plot the frequency of crops grown in India and want to know the 7th most grown crop in the country. This is difficult if the chart is not sorted. So sort the values by their counts (the bar heights along the Y axis) for better visualization.

Sorted along the frequency gives a better visualisation.
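Sorting by count is what makes rank questions easy. The sketch below assumes made-up farm counts per crop (not real agricultural data) and uses `Counter.most_common()` to sort them, then draws a rough text bar chart.

```python
from collections import Counter

# Hypothetical number of farms per crop (made-up numbers).
farms_per_crop = Counter({"rice": 40, "wheat": 35, "maize": 12,
                          "groundnut": 8, "cotton": 20, "sugarcane": 5})

# most_common() returns (value, count) pairs sorted from highest to lowest count.
ranked = farms_per_crop.most_common()

# With a sorted list, the n-th most grown crop is a simple index lookup.
third_most_grown = ranked[2][0]

# A text bar chart: one '#' per 5 farms, tallest bar first.
for crop, count in ranked:
    print(f"{crop:<10} {'#' * (count // 5)} {count}")
```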

Frequency plots - Long tailed distribution:

As shown in the above graph, a long tailed distribution has:
  • A small number of tall bars at the beginning.
  • A large number of short bars at the end.
Such distributions are very common in real world scenarios.

Frequency plots - Uniform distribution:


Frequency plot of rolling a die: as each number has an equal chance of being rolled, all the bars will be of almost equal height. This is a uniform distribution.

Relative frequency tables:

What percentage of farms grow groundnut? To calculate the percentage I need to know the total number of farms, which is difficult to read off the frequency plot. Hence we use relative frequency tables, where the relative frequency of a value is its count divided by the total count. These are easier to interpret than absolute frequencies.



Relative frequency plots make it easy to answer percentage questions visually.
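As a minimal sketch, a relative frequency table is just each count divided by the total. The farm counts below are made up for illustration.

```python
# Hypothetical number of farms per crop (made-up numbers).
farms_per_crop = {"rice": 40, "wheat": 35, "cotton": 20, "groundnut": 5}

total_farms = sum(farms_per_crop.values())

# Relative frequency in percent: count of the category / total count * 100.
relative_percent = {crop: 100 * count / total_farms
                    for crop, count in farms_per_crop.items()}

groundnut_percent = relative_percent["groundnut"]

for crop, percent in relative_percent.items():
    print(f"{crop:<10} {percent:5.1f}%")
```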

Grouped frequency plots:

Until now, we had one dataset and plotted the frequencies for that set. Suppose I want to compare two datasets. If I have farming data for two or three consecutive years and want to find out whether the farming pattern has changed over the years, I use grouped frequency plots as shown below.


Grouped relative frequency charts:

From the above grouped frequency bar charts, it might look like the number of rice farms has decreased over the years. But we cannot be sure what that means, as the total number of farms could also have decreased. So it is better to use grouped relative frequency charts.

So from the above chart, we can see that the share of rice farms has in fact increased.
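The effect is easy to reproduce with numbers. In the sketch below, the yearly farm counts are invented so that the absolute number of rice farms falls while its share rises; `relative_frequencies` is a hypothetical helper name.

```python
# Hypothetical farm counts for two years (made-up numbers chosen for illustration).
farms_by_year = {
    "year 1": {"rice": 400, "wheat": 350, "maize": 250},
    "year 2": {"rice": 360, "wheat": 260, "maize": 180},
}

def relative_frequencies(counts):
    """Convert absolute counts into percentages of the yearly total."""
    total = sum(counts.values())
    return {crop: 100 * n / total for crop, n in counts.items()}

shares = {year: relative_frequencies(counts)
          for year, counts in farms_by_year.items()}

# Grouped view: one row per crop, one column per year.
for crop in farms_by_year["year 1"]:
    row = "  ".join(f"{year}: {shares[year][crop]:5.1f}%" for year in shares)
    print(f"{crop:<6} {row}")
```

Here the rice count drops from 400 to 360, yet its share grows from 40% to 45% because the total number of farms shrank from 1000 to 800.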


How to describe Quantitative data?

Does frequency make sense for quantitative data?
Let's consider Sachin's ODI matches. The number of runs scored in each match is a discrete quantitative attribute.

Suppose we want to answer the following questions:
  • What would the histogram of Sachin's scores look like?
  • Where would the tallest bar be?
  • Would there be some regions along the x-axis with a bar height of 0?
Let's plot the histogram of Sachin's scores.


Surprisingly, the tall bars are at the beginning, indicating that low scores are the most frequent. There are blank regions along the x-axis.

Issues
  • Too many unique values along the x-axis.
  • How many times was he dismissed in his 90s or in single digits? It is difficult to answer such questions, and we are rarely interested in how many times he scored exactly 40.
Solution: we will group the values into bins: 0-9, 10-19, 20-29, ...



From the above chart, we can say that he has scored in the 90s around 18 times.
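Grouping scores into bins of width 10 can be sketched as follows. The scores are made up (they are not Sachin's real record), and `bin_label` is a hypothetical helper that assumes non-negative integer scores.

```python
from collections import Counter

# Hypothetical runs per match (made-up scores, for illustration only).
scores = [0, 4, 7, 12, 18, 23, 35, 41, 48, 52, 67, 93, 96, 110]

BIN_WIDTH = 10

def bin_label(score, width=BIN_WIDTH):
    """Return the label of the bin a score falls into, e.g. 93 -> '90-99'."""
    lower = (score // width) * width
    return f"{lower}-{lower + width - 1}"

# Count how many scores fall into each bin.
histogram = Counter(bin_label(s) for s in scores)

# Questions such as "how many scores in the 90s?" become a single lookup.
nineties = histogram["90-99"]
single_digits = histogram["0-9"]

# A Counter returns 0 for bins with no scores: the blank regions of the chart.
seventies = histogram["70-79"]
```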

How to choose the right bin size?

If we choose a bin size of 5, there will be too many bins.
If we choose a bin size of 20, we will lose granularity.
So both extremes are bad. Choose a bin size that shows sufficient detail without producing too many bins.
The bin size also depends on the range of the data. If the range is large, too small a bin size will not be helpful. If the range is small, too big a bin size will hide all the important details.

So the ideal bin size is one that reveals meaningful patterns (it neither hides important details nor shows too many).
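One way to get a feel for this trade-off is to bin the same data with several widths and compare. The sketch below uses made-up scores and a hypothetical `histogram` helper that assumes non-negative integers.

```python
# Hypothetical runs per match (made-up scores, for illustration only).
scores = [0, 4, 7, 12, 18, 23, 35, 41, 48, 52, 67, 93, 96, 110]

def histogram(values, width):
    """Count values per bin; bin k covers k*width up to (k+1)*width - 1."""
    n_bins = max(values) // width + 1
    counts = [0] * n_bins
    for v in values:
        counts[v // width] += 1
    return counts

# The number of bins is roughly (range of the data) / (bin width).
for width in (5, 10, 20):
    counts = histogram(scores, width)
    print(f"width {width:>2}: {len(counts):>2} bins -> {counts}")
```

With these 14 values, width 5 spreads the data over 23 mostly empty bins, while width 20 compresses it into 6 bins.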

Suppose we want to plot Sachin's strike rate, which is continuous quantitative data. A plot with one bar for every unique value would be cumbersome to read. So here too we use bins of a suitable size to reveal interesting information and patterns.

What about class boundaries?
A bin of 0 - 2 contains all values between 0 and 2. But what about 2 itself? Does it belong to 0 - 2 or to 2 - 4?

Left end inclusion convention: A class interval includes its left end boundary but not its right end boundary. So 2 belongs to the bin 2 - 4, written [2, 4).
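The left end inclusion convention corresponds to rounding down when computing the bin index. The helpers `bin_index` and `bin_interval` below are hypothetical names used only for this sketch. With floating point widths that cannot be represented exactly, values on a boundary may land in a neighbouring bin.

```python
import math

def bin_index(value, width):
    """Index k of the left-closed bin [k*width, (k+1)*width) containing value."""
    return math.floor(value / width)

def bin_interval(value, width):
    """Return the (left, right) boundaries of the bin containing value."""
    k = bin_index(value, width)
    return (k * width, (k + 1) * width)

# 1.99 is still inside [0, 2).
just_below = bin_interval(1.99, 2)

# 2 is excluded from [0, 2) and included in [2, 4).
on_boundary = bin_interval(2, 2)

print(just_below, on_boundary)
```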

Summary:

Relative frequency histograms:

Suppose we want to know the percentage of matches in which Sachin scored less than 20 runs. It is difficult to answer from the histograms. So, as with qualitative data, we will use relative frequency histograms.



We first calculate the relative frequency table and then plot the relative frequency histogram.
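The two steps can be sketched as follows, again with made-up scores rather than real match data. Each bin is keyed by its lower edge, so a question like "what percentage of scores is below 20?" is a sum over the first bins.

```python
from collections import Counter

# Hypothetical runs per match (20 made-up scores, for illustration only).
scores = [0, 4, 7, 12, 18, 23, 35, 41, 48, 52,
          67, 93, 96, 110, 5, 15, 28, 33, 74, 81]

BIN_WIDTH = 10

# Step 1: frequency table, keyed by the lower edge of each bin.
counts = Counter((s // BIN_WIDTH) * BIN_WIDTH for s in scores)

# Step 2: relative frequency table, in percent of all matches.
relative = {lower: 100 * n / len(scores) for lower, n in sorted(counts.items())}

# Percentage of matches with a score below 20: add the bins 0-9 and 10-19.
percent_below_20 = sum(p for lower, p in relative.items() if lower < 20)

# Text version of the relative frequency histogram (one '#' per 5 percent).
for lower, percent in relative.items():
    print(f"{lower:>3}-{lower + BIN_WIDTH - 1:<3} {'#' * int(percent // 5)} {percent:.0f}%")
```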

Relative frequency histograms also aid in comparing two charts, as the values are percentages and give the real picture. Suppose I want to compare Sachin and Virat Kohli: in what percent of matches did Sachin and Virat each score less than 10 runs?





Frequency Polygons:

If we want to compare more datasets:
  • Checking a separate chart for each dataset is cumbersome.
  • We can overlap the charts, but it is hard to distinguish the individual histograms.
  • We can draw grouped bar charts, but we lose the overall trend.
So we use frequency polygons, which join the tops of the histogram bars at the bin midpoints with straight lines.


Here the trend is still intact, and we can compare more datasets by plotting their polygons on the same chart.



Again frequency polygons are easier to compare and will give us the real picture. 
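The points of a frequency polygon are just (bin midpoint, frequency) pairs, one per bin. The sketch below computes them for two made-up players using relative frequencies so that the two polygons are comparable. `polygon_points` is a hypothetical helper, and the scores are not real data.

```python
def polygon_points(values, width, n_bins):
    """Return (bin midpoint, relative frequency in %) for each bin."""
    counts = [0] * n_bins
    for v in values:
        # Values beyond the last bin are counted in the last bin.
        counts[min(v // width, n_bins - 1)] += 1
    return [(k * width + width / 2, 100 * c / len(values))
            for k, c in enumerate(counts)]

# Hypothetical scores for two players (made-up data).
player_a = [3, 8, 15, 22, 27, 41, 45, 58, 63, 90]
player_b = [12, 18, 25, 33, 37, 44, 49, 52, 71, 88]

points_a = polygon_points(player_a, width=20, n_bins=5)
points_b = polygon_points(player_b, width=20, n_bins=5)

# Both polygons share the same x values, so they can be drawn on one chart.
for (mid, pa), (_, pb) in zip(points_a, points_b):
    print(f"midpoint {mid:>4.0f}: A {pa:4.0f}%   B {pb:4.0f}%")
```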



Frequency polygons for continuous data:

We can similarly compare continuous data using frequency polygons.


Cumulative frequency polygons:

Suppose I want to know in how many matches Sachin scored less than 60 runs. From the frequency polygon, I have to add up the frequencies of all the bins below 60 to find the answer. Instead, I can use a cumulative frequency polygon, where each point is the running total of the frequencies up to that bin.



Again, if I want the answer as a percentage, I can use a cumulative relative frequency polygon.

Again, comparing datasets becomes easier with cumulative relative frequency polygons.
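Cumulative frequencies are running totals of the bin counts. A minimal sketch, assuming made-up counts per 20-run bin, can use `itertools.accumulate` from the standard library:

```python
from itertools import accumulate

# Hypothetical histogram: number of matches per 20-run bin (made-up counts).
bin_upper_edges = [20, 40, 60, 80, 100]   # bins cover 0-19, 20-39, 40-59, ...
matches_per_bin = [30, 25, 20, 15, 10]

# Cumulative frequency: matches with a score below each upper edge.
cumulative = list(accumulate(matches_per_bin))

# Cumulative relative frequency: the same values as a percentage of all matches.
total = cumulative[-1]
cumulative_percent = [100 * c / total for c in cumulative]

# "In how many matches was the score below 60?" is now a single lookup.
matches_below_60 = cumulative[bin_upper_edges.index(60)]

for edge, c, p in zip(bin_upper_edges, cumulative, cumulative_percent):
    print(f"below {edge:>3}: {c:>3} matches ({p:.0f}%)")
```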


Stem and Leaf plots:



  • These are similar to histograms but show more detail in the bars, with the leading digit(s) of each value as the stem and the last digit as the leaf.
  • As the numbers get larger, we take two or three digits as stems. This is much like choosing the bin size.
  • They are not useful for large datasets, but they work well for small datasets, where the patterns are easy to see.
  • They are good for comparing two datasets side by side.
  • They can be used for continuous data by rounding the numbers.
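A stem and leaf plot can be sketched in a few lines. The version below assumes non-negative integers, takes the last digit as the leaf and everything before it as the stem, and uses made-up scores. `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=10):
    """Group values by stem (value // leaf_unit) with leaves (value % leaf_unit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // leaf_unit].append(v % leaf_unit)
    return dict(plot)

# Hypothetical scores (made-up data, for illustration only).
scores = [7, 12, 18, 23, 25, 29, 31, 45, 47, 52]

plot = stem_and_leaf(scores)

# Print every stem in range, including stems with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Unlike a histogram, every original value can be read back from the plot: stem 2 with leaves 3, 5, 9 stands for 23, 25 and 29.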