Sunday, December 26, 2021

The abundant unused love

Right now, there is so much love in my heart. So much that it can outrun all the prevailing hatred. So much that it can light up the whole world with joy. So much that it can consume the whole world.

It just flows and touches everyone around me, especially when I am high. It's magical how even the people I otherwise hate look like lost children who need some tender love.

Does alcohol bring out the compassion that I try to hide in my depths, just so I can pretend to be brave?
If so, I wonder what psychedelics will do to me.

Aren't we all born this way? Full of love, joy and compassion? Where, on the way, do we lose this ability? Where do we learn to hate our fellow beings? Maybe in the process of protecting ourselves, we turn a blind eye to others' suffering. Are we scared of being seen as vulnerable? But isn't being vulnerable a beautiful human experience that allows us to make deep connections? Is it possible to let go of our self? Is unconditional love possible if we try to be selfless? Can we sustain being selfless even after knowing that we are being taken advantage of? I don't know. All I know is I want to get there.

I want to be selfless, just so I can give unconditional love to everyone. Maybe a glimpse of that will bring change in suffering souls and together we can heal the world. A world full of love and compassion. It's a utopian thought but not an impossible one.

There is love in everyone's heart. But most of it goes unused. This new year, let's try to be better by letting go of our ego a tiny little bit at a time. Let's forgive others for their mistakes. Let's be kind. Let's put the abundant unused love to good use.

Merry Christmas and a happy new year!

Friday, November 26, 2021

The beauty of Impermanence.

Margazhi is a Tamil month. It's considered auspicious and temples are busy throughout the month. We used to wake up at around 5:30 am to devotional songs piercing our ears from the speakers of a temple near our home. My cousins and I would get up, brush our teeth and rush to the temple to get "sakkara pongal", a never-to-be-missed, delicious food given as prasatham. My great-grandparents typically started their day with prayers and would draw three lines of ash on our foreheads too, murmuring some blessings. Apart from that, I don't remember being religious or devotional when I was young. I grew up mostly agnostic about God. But my mother's illness made me feel helpless and so I prayed, wishing He would save her miraculously, as religions so often tell us. Unfortunately, the miracle didn't happen and I lost the last bit of faith in God.

The incident shook my reality and kind of threw me off from everything I believed in. Not just God and religion but other things like rituals, festivals, culture, marriage, morality and values. After a period of grieving, I realized life doesn't have any intrinsic meaning and I might die anytime. Hence, I turned into a hedonist and an atheist, a staunch one at that. The only thing I believed in was "Happiness and compassion". I did everything that I had only wished to do until then. Indeed it was refreshing, life felt exciting and I was happy. But it didn't last long. I was soon bored. Then started bouts of depression and anxiety, on and off. I had to distract myself constantly with one thing or the other to keep life going. But it wasn't easy. At some point, there was nothing that would bring me happiness or joy. I started questioning the purpose of my life and everything felt pointless. I was desperate but nothing changed for years. I felt something was missing in my life and contentment was a rare, rare feeling, even though I had every comfort. But I looked around me and I can say no one was happy with what they were doing, how their life was or how their relationships were. There was suffering everywhere and it only made me think, why endure this? We don't stop there, we bring lives into the world, only to see them go through the same. But most of my friends carried on, saying it's just their fate. Ahh. I wished I believed in fate or in God or had faith in something like astrology, you know, just to feel a little secure.

Google and Reddit told me I was going through an existential crisis and so I thought spirituality might offer some insight. The religion I was born into always spoke about supernatural powers and it felt like bedtime stories for adults, and other religions were not very different. To be honest, I have felt religions didn't do any good for humanity; in fact they only made it worse by encouraging fanatics and fascists. So I dismissed anything to do with religion or rituals. In the core of my heart, I know I can't walk back down that path.

So the next stop was philosophy. Most philosophies agreed life was miserable and meaningless and we can't do much about it. Accepting this somehow brought me some relief. I turned into a Nihilist, which I believe I had been long before I knew the term. If this is how it's going to be, I thought, then I should learn to deal with it. I read about how the mind works and took classes on human psychology. It got better. Forming habits and making routines helped a lot. I thought, "Of course, life is pointless but let's at least try to be healthy and make it painless".

If I have to be honest, my days still varied from "what's the point" to "hmmm, not bad". But I was never content and happy. So my search continued. Meditation was slapped in our faces by every self-help page and book. Meditation is still sold as a way to get motivated, achieve our goals and stay focused, which I was not very keen on. But I still meditated, as it was advocated as an antidepressant. In between all this, I got in touch with old friends who were into meditation and Buddhism, who helped me see the real deal of meditation. This was an important turning point.

I started reading about Buddhism. To my surprise, it was the closest religion to rationality one can ever find. There are no supernatural stories, nor stories to terrorize you into following it or to keep you in check.

Western science is advanced, I agree, but it is mostly anatomical. It doesn't have much to say about why we feel the way we feel or why some people are the "too good to be true" kind while some are crazy enough to shoot at strangers. Science is way behind when it comes to psychology. In the East, on the other hand, there were not many anatomical studies (we have Siddha and Ayurveda; unfortunately they didn't keep up with modern science due to constraints) but the focus on mind and spirituality really flourished.

Buddhism speaks about the nature of the world (Dhamma): that all life is suffering, that it is caused by desire and ignorance, and that it can be ended by following a certain path. The Buddha detailed the problem of life and what causes it, said there is a solution to it, and charted a pathway to it. He spoke about envy, desire, loneliness, boredom, loss and attachment as the major causes of suffering. I was awed. 2500 years ago, he spoke about how the mind works in such detail that science is still struggling to figure it out.

I also found that the real purpose of meditation is not just to be more focused or less depressed, although we could still reap these benefits. Meditation is not just avoiding thoughts. Meditation is focusing on an object without getting distracted, but not just that, though we start there. It's about being present in the moment, but not just that. It's about being present in the moment and skillfully investigating the nature of our mind. Meditation is a tool to see a different perspective of reality: that every moment, everything is constantly undergoing change. We all know this as a matter of fact, but to see it and experience it, we need to meditate and calm our mind enough to notice. Not just for an hour or two. Not by starving or becoming a monk, but by being mindful all the time, every waking moment. By understanding impermanence, we can respond rather than react to situations. I believe this could solve a lot of problems.

I am still an amateur in meditation and spirituality but I do relate to a lot of things said in Buddhism and I kind of feel this is the path for me. Maybe it's just that the other side looks green now, and I am not sure what it will hold for me when I arrive. I do wonder, if that side is real and green, then is the reality we are living in right now a dream? But one can only find out by walking the path. Honestly hoping the other side turns out to be green.




Tuesday, November 23, 2021

Surprise! Surprise!!

Ladies and Gentlemen, looks like not everyone gets depressed. Looks like not everyone will understand what it is like to be depressed. Ahh! Now I see why the conversation I had with my friend about anxiety and depression felt like different species shouting at each other in their own languages.

I have always felt like life was pointless and there is no meaning to enduring all the suffering in the name of living. I have always been a Nihilist. No Gods, godmen or anything for that matter helped me change that view of mine. I have had my good, vibrant days but those days feel like eons ago and have been rare recently. The joy of living has drained from me. I had been assuming that all adults are naturally downers like me. To my surprise, I found out that's not the case. Looks like most people are enjoying their lives. Of course, there are a few ups and downs here and there, but overall they are driven by goals, ambitions, love and family. Even during hardships, these people are doing better than I do on my best day. :P

Oh, I envy these people. How nice would it be to just be joyful without any effort? How fun would it be to just be positive and driven all the time? I want to meet one such happy soul and bask in their joy and happiness just to get the hang of it. To be honest, I have met such people and have only been judgmental of them. They appeared fake or superficial to me, without any depth. Maybe I should be more open and welcoming and stop smirking at them so much ;).

So I am trying my best, checking off everything on the self-help list of "how to be happy and joyful in 30 days". Meditation? Walking? Nature? Travel? You name it, I have done it. I am definitely seeing some change. But the effort I have to put in, unlike these naturals, makes me go kind of "what the". But, but, but, we can't give up on ourselves, can we? I hope I will reach that place someday. See you all naturals soon.

Monday, November 22, 2021

Who are we?

Who are we? We will probably answer with our name, along with our parents' names or the education we got or our work identity or our religion or nationality. But really, who are we without these tags? Nothing? Everything? We always identify ourselves with the body we have, the relationships we have, the emotions we have, the likes and aversions we have. We are what we have. If so, don't we keep changing all the time? We get old, relationships change, our emotions change and our likes and aversions too change over time. If so, which us is the real us?

Even though everything around us, including ourselves, keeps changing, the one thing we have with us all the time is our mind. Our mind is where we live and where everything and everyone is given meaning. So the better answer to who we are would be "our own mind". Is it not?

Sunday, November 21, 2021

Earliest memory

Hey people. I wish to know the earliest memory you have of yourself. I mean the memory of your youngest self. Please comment.

Monday, November 15, 2021

Note to self

To stay away from depression:

Do's:

Get exercise and get sunlight
Plan and make goals
Clean your room
Learn a new skill
Read
Meditate
Journal
List the things we are grateful for

Don'ts:

No more than 10 mins of fb
No other social media
No phone scrolling
No more than one movie on weekends

To get the happy hormones:

Exercise for dopamine.
Achieve small goals and appreciate others for dopamine.
Meet/make friends, help and be kind to family and friends for serotonin.
Hug your family and friends for oxytocin.
Meditate for all the happy hormones and tranquility.

Tuesday, November 9, 2021

Road not taken

Am I living in a dreamland? While the whole world is striving to make money and achieve goals, I feel I am just daydreaming and living a life far from reality. Even though I don't want to participate in the rat race, the question "am I doing enough?" lingers on my mind. If survival of the fittest is true, am I not doing just the opposite? Not fighting enough. Not trying my best. With this attitude, I am afraid of what ideas I will pass on to my daughter.

Yet I console myself saying I have taken the road less travelled. With this lethargy, I am not sure if I will make it to the destination. Even if I make it, will I like what I find there? What if I am proven wrong? What if society was right all along? I can only find out when I arrive. I wish I could keep up the drive. I wish I didn't spiral down every other week. I wish someone would tell me I am not just daydreaming. I wish I could meet someone who has all the answers.

All I have now is me and I have to keep trying. I hope I will have all the answers one day and be the hope someone is looking for.

Wednesday, October 27, 2021

What is reality?

I was on this trip. I saw a monkey feeding its infant. There was another monkey around who kissed the infant and threw its arms around it. After a while, they dispersed and went about their things, like climbing trees, looking for food, playing with random things and hanging around and chilling with other monkeys. It made me wonder, what would we humans be doing now, had we not developed our cognitive skills? Maybe we would still be doing the same. A realization hit me: everything that's been taught to us from a young age, everything we believe, everything we do and want to do are all social constructs. Right from language, money, time, society, government and god to what is right and what is not, what is respected and what is cringey. Every single thing was an idea born in some brilliant mind, either to solve a problem or for survival. It's astonishing to think how we (at least the majority) don't question any of these and believe almost everything that was told to us. I know we are changing and question a few of these constructs now. But what change will it bring to the human mind when it sees none of these are real? Not just one person, what if a large population stops believing in money, in saving for the future, in the respect that money brings? How differently will they live after that? Will we live like monks? Isn't that again a social construct?

I wonder if constructs give us some kind of blueprint for our otherwise senseless lives. If not for these constructs, will our lives turn chaotic, and might we go back to being barbaric? How will we build or project our images?

Thinking about this makes me wonder: if this is all man-made, what is reality?

Wednesday, February 3, 2021

Descriptive statistics

 Types of data:

There are two types of data.

  •  Qualitative data
  •  Quantitative data

Qualitative data (categorical data):

Qualitative or categorical attributes are those which describe the object under consideration using a finite set of discrete classes. They can be divided further into

  • Nominal data: Nominal attributes are qualitative attributes whose values have no natural order (for example, shirt color or nationality).
  • Ordinal data: Ordinal attributes are qualitative attributes whose values have a natural order.

Quantitative data:

Quantitative attributes are those which have numeric values and which are used to count or measure certain properties of the population. They can be divided further into
  • Discrete data: Discrete attributes are quantitative attributes which can take on only separate, countable numerical values (typically integers).
  • Continuous data: Continuous attributes are quantitative attributes which can take on fractional values (real numbers).

Why bother about data types: Because the type of statistical analysis depends on the type of the variable. 

Suppose we ask the questions below for a qualitative attribute:

What is the average color of all the shirts in my catalogue? - Doesn't make sense
What is the average nationality of all the students in the class? - Doesn't make sense
What is the frequency of the color red in my catalogue? - Right question
Regression analysis (between two numeric attributes) - Doesn't make sense for qualitative attributes
Analysis of variance (ANOVA) - Right tool for comparing a numeric measure across the categories of a qualitative attribute
Chi-square test - Right tool for qualitative attributes

We can ask the below questions for quantitative discrete attributes,
What is the average value in the dataset?
What is the spread of the data?
What is the frequency of the given value?
Regression analysis.

We can ask the below questions for quantitative continuous attributes,
What is the average value in the dataset?
What is the spread of the data?
Regression analysis.

But,
What is the frequency of the given value? - This won't make sense, as continuous attributes take fractional values and an exact value rarely repeats.
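To make the point above concrete, here is a small sketch in Python. All the values are made up purely for illustration, and the variable names are my own. It only shows which summary fits which type of attribute.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical values, invented only for illustration.
shirt_colors = ["red", "blue", "red", "green", "red", "blue"]  # qualitative, nominal
wickets_per_match = [0, 2, 1, 3, 1, 1]                          # quantitative, discrete
strike_rates = [85.3, 92.71, 78.04, 101.5]                      # quantitative, continuous

# Qualitative: counting is meaningful, averaging is not.
red_count = Counter(shirt_colors)["red"]

# Quantitative: average and spread are meaningful.
avg_wickets = mean(wickets_per_match)
spread_wickets = pstdev(wickets_per_match)

# Continuous: exact values rarely repeat, so per-value frequency says little.
max_repeat = max(Counter(strike_rates).values())

print(red_count, avg_wickets, spread_wickets, max_repeat)
```

Asking for mean(shirt_colors) would simply raise an error, which is the code's way of saying the question doesn't make sense.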

Numbers:

Whole numbers: 0, 1, 2, ... (No fractions, no negatives)
Integers: ..., -3, -2, -1, 0, 1, 2, 3, ... (No fractions)
Rational numbers: 1/2, 1/3, 5/2, ... (Ratio of two integers)
Irrational numbers: Cannot be expressed as a ratio of two integers (pi, square root of 2)
Real numbers: Rational + Irrational


Whole numbers are a subset of integers, which are a subset of rational numbers, which are a subset of real numbers.

Irrational numbers are also a subset of real numbers.

How to describe Qualitative data?

Generally, the values of categorical/qualitative data keep repeating in the data.
For example,
  • How many red color shirts in the catalogue? 
  • How many times does LWB appear? 
  • How many kharif crops are there in the data?
This led to the term Frequency.

The primary question here is:

What is the frequency of different categories?

The total number of times a value appears in the data is called its frequency. Frequencies can be described by a frequency table with two columns, one for the values and the other for the frequency of appearance, as shown below.

Number of centuries scored by Sachin against country.

Frequency plots:

A better and more efficient way to represent frequency is to use a frequency plot, as shown below.

Frequency plot of no. of centuries scored by Sachin against each country.

Here, the horizontal axis is the categorical attribute and its frequency is mapped along the Y axis. The height of each bar is proportional to the count.

Suppose I draw the frequency plot of crops grown in India and I want to know the 7th most grown crop in the country. It gets difficult if the chart is not sorted. So sort the values by their counts (along the Y axis) for better visualization.

Sorted along the frequency gives a better visualisation.
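A quick sketch of building a frequency table and sorting it by count, using Python's built-in Counter. The crop list is hypothetical (one label per farm), not real data.

```python
from collections import Counter

# Hypothetical crop labels, one per farm; not real data.
crops = ["rice", "wheat", "rice", "maize", "rice", "wheat", "groundnut"]

# Frequency table: value -> number of times it appears.
frequency = Counter(crops)

# most_common() returns (value, count) pairs sorted by count, highest first.
sorted_table = frequency.most_common()

for crop, count in sorted_table:
    print(f"{crop:<10} {count}")

# With a sorted table, "the n-th most grown crop" is just a lookup.
second_most_grown = sorted_table[1][0]
```

The same sorted pairs can be handed to any plotting tool to draw the bars in descending order.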

Frequency plots- Long tailed distribution:

As shown in the above graph, a long tailed distribution has,
  • A small number of tall bars at the beginning.
  • A large number of short bars at the end.
  • It is very common in real world scenarios.

Frequency plots - Uniform distribution:


Frequency plot of rolling a die. As each number has an equal chance of being rolled, all the bars will be of almost equal height. This is a uniform distribution.

Relative frequency tables:

What percentage of farms grow groundnut? To calculate the percentage I need to know the total number of farms, which is difficult to figure out from the frequency plot. Hence we use relative frequency tables, as they are easier to interpret than absolute frequencies.



Relative frequency plots make it easy to answer percentage questions in visual form.
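Relative frequency is just each count divided by the total. A minimal sketch with made-up farm data (the numbers are assumptions, chosen only to keep the arithmetic simple):

```python
from collections import Counter

# Hypothetical data: one crop label per farm, 8 farms in total.
crops = ["rice"] * 4 + ["wheat"] * 2 + ["maize", "groundnut"]

frequency = Counter(crops)
total_farms = sum(frequency.values())

# Relative frequency = count of the category / total count.
relative = {crop: count / total_farms for crop, count in frequency.items()}

groundnut_percent = 100 * relative["groundnut"]
print(f"{groundnut_percent}% of farms grow groundnut")
```

The relative frequencies always add up to 1 (or 100%), which is what makes them comparable across datasets of different sizes.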

Grouped frequency plots:

Till now, we had one data set and were plotting the frequencies for that set. Suppose I want to compare two data sets. If I have farming data for two or three consecutive years and I want to find out if the farming pattern has changed over the years, I go for grouped frequency plots as shown below.


Grouped relative frequency charts:

From the above grouped frequency bar charts, it might look like the number of rice farms has decreased over the years. But we can't be sure, as the total number of farms could have decreased too. So it's better to use grouped relative frequency charts.

So from the above chart, we can understand that the share of rice farms has in fact increased.
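Here is a small sketch of why the relative version matters when comparing years. The counts are invented for illustration: the number of rice farms goes down, yet their share goes up because the total number of farms shrank.

```python
# Hypothetical farm counts per crop for two years; not real data.
counts_by_year = {
    "year 1": {"rice": 50, "wheat": 30, "maize": 20},
    "year 2": {"rice": 45, "wheat": 20, "maize": 15},
}

def relative_frequencies(counts):
    """Convert absolute counts into shares of the total."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

relative_by_year = {
    year: relative_frequencies(counts) for year, counts in counts_by_year.items()
}

for year, shares in relative_by_year.items():
    print(year, {crop: f"{100 * share:.2f}%" for crop, share in shares.items()})
```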


How to describe Quantitative data?

Does frequency make sense for quantitative data?
Let's consider Sachin's ODI matches. The number of runs scored in each match is a discrete quantitative attribute.

Suppose we want to answer the following questions,
  • What would the histogram of Sachin's scores look like?
  • Where would the tallest bar be?
  • Would there be some regions along the x-axis which would have a bar height of 0?
Let's plot the histogram of Sachin's scores.


Surprisingly, the tall bars are at the beginning, indicating that low scores are the most frequent. There are blank regions along the x-axis.

Issues
  • Too many unique values along the x-axis.
  • How many times was he dismissed in his 90s or in single digits? It's difficult to answer these questions. And we are not really keen on how many times he scored exactly 40.
Solution: we will group the values into bins. 0-9, 10-19, 20-29...



We can say that he has scored in 90s around 18 times from the above chart.
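A minimal sketch of grouping scores into bins of 10 runs. The scores below are made up for illustration and are not Sachin's actual scores; bin_label is a helper name I introduced.

```python
from collections import Counter

# Hypothetical match scores; not actual career data.
scores = [0, 4, 12, 18, 37, 41, 45, 93, 98, 100]
bin_size = 10

def bin_label(score, size):
    """Return the label of the bin a score falls into, e.g. 93 -> '90-99'."""
    start = (score // size) * size
    return f"{start}-{start + size - 1}"

# Count how many scores fall into each bin.
histogram = Counter(bin_label(score, bin_size) for score in scores)

print(histogram["90-99"])  # number of scores in the 90s
```

Changing bin_size to 5 or 20 is an easy way to see the trade-off between too many bins and too little detail.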

How to choose the right bin size?

If we choose a bin size of 5, there may be too many bins.
If we choose a bin size of 20, we will lose granularity.
So both extremes are bad. Choose something that gives sufficient detail without too many bins.
The bin size also depends on the range of the data. If the range is large, too small a bin size will not be helpful. And if the range is small, too big a bin size will hide all the important details.

So the ideal bin size is one which reveals meaningful patterns (neither hides details nor reveals too many of them).

Suppose we want to plot the strike rate of Sachin, which is continuous quantitative data. A plot with a bar for every unique value will be cumbersome to read. So here too we use bins of a suitable size to get interesting information and patterns.

What about class boundaries?
A bin of 0 - 2 will hold all the values between 0 and 2. But what about the value 2 itself?

Left end inclusion convention: A class interval includes its left end boundary but not its right end boundary. So the value 2 goes into the bin 2 - 4 and not into 0 - 2.
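The left end inclusion convention is easy to express in code. A small sketch, where bin_index is a hypothetical helper and the bins start at 0 with a width of 2:

```python
import math

def bin_index(value, lower, width):
    """Index of the bin [lower + i*width, lower + (i+1)*width) containing value.

    The left boundary is included and the right boundary is excluded.
    """
    return math.floor((value - lower) / width)

# With bins 0-2, 2-4, 4-6, ... the value 2 belongs to the second bin (index 1).
print(bin_index(1.99, 0, 2), bin_index(2, 0, 2))
```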

Summary:

Relative frequency histograms:

Suppose we want to know the percentage of matches in which Sachin scored less than 20 runs. It is difficult to answer this from the histograms. So, as with qualitative data, we will use relative frequency histograms.



Here we calculate the relative frequency table first and then plot the relative frequency histogram.

Relative frequency histograms also help in comparing two charts, as they are in percentages and give the real picture. Suppose I want to compare Sachin and Virat Kohli. In what percentage of matches did Sachin and Virat each score fewer than 10 runs?





Frequency Polygons:

If we want to compare more datasets,
  • We have to check a different chart for each dataset individually, which is cumbersome.
  • We can overlap the charts, but it's hard to distinguish between the individual histograms.
  • We can draw grouped bar charts, but we will lose the overall trend.
So we go for frequency polygons.


Here the trend is still intact and we can compare more datasets too by grouping. 



Again frequency polygons are easier to compare and will give us the real picture. 



Frequency polygons for continuous data:

We can similarly compare continuous data using frequency polygons.


Cumulative frequency polygons:

Suppose I want to know in how many matches Sachin scored less than 60 runs. From the frequency polygons, I have to add up successive bins to find the answer. Instead, I can use cumulative frequency polygons.



Again, if I want the answer in percentages, I can use a cumulative relative frequency polygon.

Again, comparing becomes easier with cumulative relative frequency polygons.
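The cumulative idea can be sketched with a running total over the bin counts. The bins and counts here are hypothetical, chosen just to show the mechanics.

```python
from itertools import accumulate

# Hypothetical binned match counts; not actual career data.
bins = ["0-19", "20-39", "40-59", "60-79", "80-99"]
counts = [10, 6, 4, 3, 2]

# Running total: number of matches up to and including each bin.
cumulative = list(accumulate(counts))
total = cumulative[-1]

# Dividing by the total gives the cumulative relative frequency.
cumulative_relative = [c / total for c in cumulative]

# "Less than 60 runs" is everything up to and including the 40-59 bin.
position = bins.index("40-59")
matches_below_60 = cumulative[position]
share_below_60 = cumulative_relative[position]

print(matches_below_60, f"{100 * share_below_60:.0f}%")
```

The last cumulative value always equals the total number of matches, and the last cumulative relative value is always 1.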


Stem and Leaf plots:



  • These are similar to histograms but have more detail in the bars, with the first digit of the value as the stem and the remaining digits as the leaves.
  • As the numbers get larger, we take two or three digits as the stem. This is much like choosing the bin size.
  • They are efficient for small datasets, where it's easy to see the patterns, but not useful if the dataset is large.
  • They are good for comparing two datasets side by side.
  • They can be drawn for continuous data by rounding the numbers.
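A rough sketch of a stem and leaf plot in Python. It assumes non-negative integers and uses the last digit as the leaf and everything before it as the stem, which matches the "first digit as stem" description for two-digit values. The scores are made up.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical scores; not real data.
scores = [12, 15, 21, 27, 27, 34, 8]
plot = stem_and_leaf(scores)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Each row reads like a bar of a histogram lying on its side, except that the individual values are still visible.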



Monday, February 1, 2021

Statistics

Statistics in Data science:

The following are the steps involved in data science,

Data collection - If we have to conduct experiments and collect data, we need statistics to design randomized controlled experiments.
Processing data - For standardization and normalization
Describing data - Summary statistics need statistics
Modelling data - Statistical modelling or algorithmic modelling

What is statistics?

Statistics is the process of collecting data, describing it and drawing inferences from it. In statistics we are usually interested in studying a large collection of people or objects. But it's expensive to study the whole population, or we may not be able to study the whole population due to constraints. As a solution, we select a sample from the population and conduct the study on it.

Population: A population is the total collection of all the objects we are interested in studying.
Sample: A sample is a subgroup of the population that we study to draw inferences about the population.

What are the inferences/parameters we are trying to draw?
Say,
The proportion of citizens in favor of a candidate
The average mileage of the cars manufactured by a company
The variance in the yield of the farms in a state

When we measure these on a sample of the population instead of the whole population, the result is a statistic.

Parameter: A parameter is any numeric property of the entire population.
Statistic: A statistic is any numeric property of a sample of the population, which is used as an estimate for the corresponding parameter of the population.

How to select a sample?

A sample should be representative of the population; only then will the estimate be useful.
A few sampling strategies,
Simple random sampling
Stratified sampling
Cluster sampling
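A rough sketch of the three strategies using only Python's random module. The population of 100 farms, the "state" and "village" fields and the helper function names are all assumptions made up for this example.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 100 farms: "state" is the stratum, "village" the cluster.
population = [
    {"id": i, "state": "A" if i < 60 else "B", "village": i // 10}
    for i in range(100)
]

def group_by(units, key):
    groups = {}
    for unit in units:
        groups.setdefault(unit[key], []).append(unit)
    return groups

def simple_random_sample(units, k):
    # Every unit has the same chance of being picked.
    return rng.sample(units, k)

def stratified_sample(units, key, fraction):
    # Sample the same fraction from every stratum so each one is represented.
    sample = []
    for members in group_by(units, key).values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

def cluster_sample(units, key, n_clusters):
    # Pick whole clusters at random and keep every unit inside them.
    clusters = group_by(units, key)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in chosen for unit in clusters[c]]

simple = simple_random_sample(population, 10)
stratified = stratified_sample(population, "state", 0.1)
clustered = cluster_sample(population, "village", 2)
print(len(simple), len(stratified), len(clustered))
```

Note how the stratified sample always keeps the 60/40 split between the two states, while the simple random sample only does so on average.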

How to design an experiment to collect data?

Suppose we want to study whether consuming walnuts daily helps in decreasing cholesterol. While studying the effect of one variable (walnuts) on another (cholesterol), we should ensure we nullify the effect of lurking variables (exercise, smoking).
As a solution, we conduct randomized controlled experiments, where we randomly split the participants into two groups. One group is given just a placebo and the other is given the walnuts. Because the assignment is random, the lurking variables are spread roughly equally across both groups, so their effect cancels out when the groups are compared.

How to describe and summarize data?

Usually data is stored in a tabular format. In this format it's difficult to answer even simple questions. So we plot the data as a graph or draw the distribution of the data, from which we can visually get an idea of the answers we are looking for. Apart from this, we can also summarize data using the mean, median, mode, variance and standard deviation.
So in descriptive statistics, we will learn different plots and measures like
(Relative) Frequency charts
(Relative) Frequency polygons
Histograms
Stem and leaf plots
Box plots
Scatter plots
Measures of centrality and spread

Why do we need probability theory?

A sampling strategy is said to be truly random (unbiased) only if every element in the population has an equal chance of becoming a part of the sample.

So what do we mean by 'Chance'?

The branch of mathematics that deals with chances and probabilities is called probability theory.
If we observe a trend in a small sample, what is the chance that it will be reflected in other samples or in the entire population?

For example, if I calculate the mean of a small sample, what is the chance that the mean of the entire population is close to this mean?

How do we give guarantees for the estimate made from the sample?

Suppose we have a population of 10 and a sample size of two. We can choose two elements from 10 in 10*9/2 = 45 different ways (10*9 = 90 ways if the order of selection matters). Now say we take 5 such samples of two elements each and find the mean of each sample. We get 5 means, one for each sample. Let's look at the distribution of these means. From this distribution we can build an interval around our estimate and say how confident we are, for example 95%, that the mean of the entire population falls within that interval.
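To get a feel for the distribution of sample means, here is a small sketch that lists every possible sample of two from a population of 10. The population values are made up. It only illustrates that sample means vary from sample to sample but centre on the population mean; building a proper confidence interval comes later.

```python
from itertools import combinations
from statistics import mean

# Hypothetical population of 10 values; not real data.
population = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# Every possible unordered sample of size two.
samples = list(combinations(population, 2))
sample_means = [mean(sample) for sample in samples]

print(len(samples))                          # number of distinct samples
print(min(sample_means), max(sample_means))  # how far sample means can stray
print(mean(sample_means), mean(population))  # centre of the sample means vs population mean
```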

To do this we will learn: 
Point estimates
Distributions of sampling statistics
Interval estimates

What is a hypothesis and how do we test it?

A hypothesis is an assumption that we want to test to see if it is true. For example, Bumrah's mean bowling speed is more than 90 mph.

To do this, we will learn:
Hypothesis testing:
One population, two populations and Multiple population
Z - Tests, T - Tests and Analysis of variance (ANOVA)

How to model relationship between the variables?

Suppose we want to model the relationship between two variables (decrease in cholesterol and number of days of treatment). We use statistical modelling, i.e. a simple relationship like a linear equation.

            y = m x + c , where parameters m and c will be estimated from the data sample.

As m and c are estimated from the data sample and not from the population, we have to deal with uncertainty, i.e. are we 99% sure that the estimated m and c are close to the real m and c of the entire population?
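As a sketch of how m and c could be estimated from a sample, here is the ordinary least squares formula written out in plain Python. The numbers for days and cholesterol decrease are invented for illustration, and fit_line is a helper name I made up.

```python
def fit_line(xs, ys):
    """Estimate m and c in y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    c = mean_y - m * mean_x
    return m, c

# Hypothetical sample: days of treatment vs decrease in cholesterol.
days = [0, 10, 20, 30, 40]
decrease = [1, 4, 5, 8, 9]

m, c = fit_line(days, decrease)
print(f"y = {m:.2f} x + {c:.2f}")
```

A different sample would give slightly different m and c, which is exactly the uncertainty described above.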

To do this, we will learn:
Linear regression 
Estimating parameters
Estimating confidence bands 
Measuring goodness of fit.

How well does the model fit the data:

Consider the below hypothesis.

In cricket, the five ways of getting dismissed are equally likely. The model would look as below.


Now I try to estimate the probabilities from the data sample. Say I take the last 100 matches played and model as below,


Now I am interested in the question: "Are the variations observed in the sample significant, or are they due to random chance?"

To do this we will learn,
Chi Square test
Determine the goodness of fit
Determine if two variables are independent
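A rough sketch of the chi-square goodness of fit idea for the dismissal example. The dismissal names and the observed counts over 100 dismissals are assumptions invented for illustration. The critical value is the standard chi-square table value for 4 degrees of freedom at the 5% level.

```python
# Hypothetical observed counts over 100 dismissals; not real data.
observed = {"caught": 40, "bowled": 25, "lbw": 15, "run out": 12, "stumped": 8}

total = sum(observed.values())
# Under the "equally likely" model every category expects the same count.
expected = total / len(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum((count - expected) ** 2 / expected for count in observed.values())
degrees_of_freedom = len(observed) - 1

# Standard table value for 4 degrees of freedom at the 5% significance level.
critical_value = 9.488

print(f"chi-square = {chi_square:.1f}, df = {degrees_of_freedom}")
if chi_square > critical_value:
    print("The variation is too large to be explained by chance alone.")
else:
    print("The variation is consistent with random chance.")
```

With these made-up counts the statistic is far above the table value, so the "equally likely" model would be rejected for this sample.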

We will be covering the below topics in the next chapters.

 



Engineering Data Science systems

Systems thinking:

Engineering aspects of Data Science:

This is a way of thinking about a system as a whole rather than focusing on one particular issue. Suppose we have to do a sort. We could use the bubble sort algorithm, but it doesn't take advantage of the caches. Instead, we can use quick sort, which makes better use of the cache and is thus more efficient. Hence, while building a system, always look at the larger picture rather than just the particular issue.

In data science, systems thinking needs domain knowledge, hacking skills, and math and stats.

Roles involved in Data Science:

To have a system perspective of data science, one needs business knowledge, programming, statistics and communication. A data analyst has business knowledge, statistics and communication skills. A research analyst has communication, statistics and programming skills.

Processes involved in Data Science:

Engineering systems of data science involves two parts,
  • Process 
  • Programming.

So to become a data engineer, a person should know not just the programming but also the process.

Process :

Process has two components.

1.  Flow of steps (What are the steps I take?)

2. Agile improvement. (How to improve each step in an agile way?)

One such process followed in data science is CRISP-DM.

CRISP-DM:

1. Business understanding. One needs to understand the context one is working on and should be able to specify the problems we are trying to solve.

2. Data understanding. What data do I have? will this solve the problem I have?

3. Data preparation.

4. Data modelling. How do I model? How do I analyze the hypothesis?

5. Evaluate the model.

6. Deploy the system.

All the steps in the process are iterative and are repeated many times till we achieve the expected output. Data science follows the MVP (minimum viable product) approach, where we first build a simple but complete system and then build upon the existing system iteratively. This is agile improvement.

Programming tools:

No-code environments like IBM Watson and Amazon Lex: paid interfaces that anyone can use to analyze data.

Spreadsheets and BI tools like Microsoft Excel, Google Sheets, Power BI and Tableau.

Programming languages and environments like Weka, MATLAB, Mathematica, Python and R.

High-performance stacks like Hadoop and Spark.

Why Python?

1. Python is beginner friendly

2. Python is increasingly the popular choice for data science

3. Python is good for production and planning

4. Availability of open source libraries which are used in Data science

5. Python is useful beyond data science, e.g. as a scripting language, for web applications and for programming IoT devices.

Disadvantages

Python is an interpreted language, which can make it slower than compiled languages.



Sunday, January 31, 2021

Introduction to Data Science


What is Data Science?

Data science is an assortment of several tasks, and the emphasis on each task depends on the application, so its definition varies widely. The data science tasks include:

  • Collection of data
  • Storage of data
  • Processing of data
  • Description of data
  • Modeling of data
So data science is the science of collecting, storing, processing, describing and modeling data.

Collection of data:

It depends on
  • the question the data scientist is trying to answer.
  • the environment the data scientist is working in.
To understand, let's consider three scenarios.

1) Suppose Amazon, an e-commerce site, wants to know which products customers buy together. Here Amazon already has the data. The data scientist just has to use some SQL queries to collect the data and some code (Python, R) to process it further.

2) Suppose a political party wants to know people's opinion of a new policy it has rolled out. Here the data is not readily available in a relational database but exists in social media forums as posts and tweets. The data scientist has to collect the data by crawling and scraping (through programming), and needs to know how to work with APIs and have the hacking skills to access the web pages.

3) Suppose we want to find the effect of the type of seed, irrigation and fertilizer on the yield. Here the data scientist needs to design experiments to collect the data, as three different factors are involved. They will need an intermediate level of programming and knowledge of statistics and databases.
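For the first scenario, once the order data has been pulled out of the database, the "bought together" question can be sketched as counting product pairs per basket. The order data and the function name below are made up for illustration; a real store would fetch the baskets with a SQL query first.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each inner list is one customer's basket.
orders = [
    ["phone", "phone case", "charger"],
    ["phone", "phone case"],
    ["laptop", "mouse"],
    ["phone", "charger"],
    ["laptop", "mouse", "laptop bag"],
]


def count_pairs(baskets):
    """Count how often each pair of products appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so that (a, b) and (b, a) are the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return pair_counts


pair_counts = count_pairs(orders)
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```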

Storing Data:


In the 1970s, the operational and transactional data of organizations (employee details, customer details and invoices) was mostly structured, so it was stored in tables, which made it easy to add, update and delete records. This led to the development of relational databases.

Later, in the 1990s, as organizations grew in size and had sub-departments, they had to maintain different details of the same customer. For example, a bank has different sections like the loans department, cards department and investments department, and the details of the same customer sit in different relational databases. This led to the development of the data warehouse, an integrated repository that is optimized for analytics.

Later, in the 2000s, in the internet era, a lot of unstructured data was generated on social media and on platforms like YouTube, Instagram and SoundCloud. The data generated is huge, and the velocity at which it is generated is also very high. Because of the variety, volume and velocity of the data, this is called big data. This led to the invention of data lakes. A data lake is a collection of structured and unstructured data where all the data is simply stored without curation or cleaning.

| Relational database | Data warehouse | Data lake |
|---|---|---|
| Structured data | Structured data | Unstructured data |
| Optimized for SQL queries | Curated data | Uncurated data |
| | Optimized for analytics | |


Skills required for storing data:

Programming and Engineering
Knowledge of Relational database
Knowledge of NoSQL
Knowledge of data warehouse
Knowledge of data lakes (Hadoop)

Processing Data:


Following are the steps involved in data processing

1) Data wrangling/Munging:


Wrangling involves extracting, transforming and loading data from an external source into the organizational database. Suppose a book store wants to store the details of its courier service, which is outsourced. The format the courier service uses may differ from that of the book store's database. Say the courier service stores its data in JSON files (a semi-structured format) and the book store stores its data in a relational database. The data then has to be extracted from the JSON files, transformed into the format of the organizational database and loaded into it. This is called wrangling or munging.
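The extract, transform and load steps for the courier example could look roughly like this. It is only a sketch: the JSON layout, the field names and the `shipments` table are assumptions I made up, and an in-memory SQLite database stands in for the book store's relational database.

```python
import json
import sqlite3

# Hypothetical courier export: semi-structured JSON with nested fields.
courier_json = """
[
  {"tracking_id": "C1001", "order": {"id": 1, "city": "Chennai"}, "status": "DELIVERED"},
  {"tracking_id": "C1002", "order": {"id": 2, "city": "Madurai"}, "status": " In Transit "}
]
"""


def extract(raw_text):
    """Extract: parse the courier's JSON into Python objects."""
    return json.loads(raw_text)


def transform(records):
    """Transform: flatten nested fields and tidy the status text."""
    return [
        (
            record["order"]["id"],
            record["tracking_id"],
            record["order"]["city"],
            record["status"].strip().lower(),
        )
        for record in records
    ]


def load(rows, connection):
    """Load: insert the flat rows into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS shipments ("
        "order_id INTEGER PRIMARY KEY, tracking_id TEXT, city TEXT, status TEXT)"
    )
    connection.executemany("INSERT INTO shipments VALUES (?, ?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")  # stand-in for the store's database
load(transform(extract(courier_json)), connection)
rows = connection.execute(
    "SELECT order_id, status FROM shipments ORDER BY order_id"
).fetchall()
print(rows)
```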

2) Data Cleaning:

    • Fill in the missing values (for example, with the mean of the column).
    • Standardize keywords. Suppose someone entered 'half' instead of 'half sleeve'. Replace it with the correct word.
    • Correct the spelling errors.
    • Identify and remove outliers.
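A minimal sketch of the first three cleaning steps on a few shirt records follows. The records and the lookup table of corrections are hypothetical, and outlier removal is left out to keep the example short.

```python
from statistics import mean

# Hypothetical shirt records; None marks a missing price.
records = [
    {"sleeve": "half", "price": 500.0},
    {"sleeve": "half sleeve", "price": None},
    {"sleeve": "full sleve", "price": 700.0},
    {"sleeve": "Full Sleeve ", "price": 600.0},
]

# Assumed lookup table mapping known variants and misspellings to one keyword.
SLEEVE_FIXES = {"half": "half sleeve", "full sleve": "full sleeve"}


def clean(rows):
    """Fill missing prices with the column mean and standardize keywords."""
    known_prices = [row["price"] for row in rows if row["price"] is not None]
    fill_value = mean(known_prices)
    cleaned = []
    for row in rows:
        sleeve = row["sleeve"].strip().lower()
        cleaned.append(
            {
                "sleeve": SLEEVE_FIXES.get(sleeve, sleeve),
                "price": fill_value if row["price"] is None else row["price"],
            }
        )
    return cleaned


cleaned = clean(records)
for row in cleaned:
    print(row)
```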

3) Data Scaling, Normalizing and Standardizing: 

Scaling:

Kms to Meters
Rupees to Dollars

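Normalizing:

Converting the data to zero-mean data.

Standardizing:

Rescaling values to lie between 0 and 1, so that all columns are on a uniform scale.

Note that usage varies: many texts and libraries swap these two names, calling the zero-mean (and unit-variance) transform standardizing and the 0-to-1 rescaling normalizing.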
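The three transforms can be sketched in a few lines. I have named the functions after what they do (unit conversion, mean-centering, min-max rescaling) rather than after the labels above, because the naming differs between sources. The sample distances are made up.

```python
from statistics import mean

distances_km = [1.0, 2.0, 3.0, 4.0]  # hypothetical column of distances


def scale(values, factor):
    """Scaling: change of units, e.g. kilometres to metres with factor 1000."""
    return [value * factor for value in values]


def center(values):
    """Shift the data so that its mean becomes zero."""
    average = mean(values)
    return [value - average for value in values]


def min_max(values):
    """Rescale the data so that it lies between 0 and 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are equal, so the range is zero")
    return [(value - low) / (high - low) for value in values]


print(scale(distances_km, 1000))
print(center(distances_km))
print(min_max(distances_km))
```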

If data processing is to be performed on big data (millions of rows), performance becomes a key consideration. So we distribute the data processing among different processors and aggregate the results. This is called distributed processing. Hadoop (MapReduce) allows us to deal with large amounts of data efficiently.
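The split, process and aggregate idea can be sketched on a single machine. This is not Hadoop's API: threads stand in for separate machines, and the function names are my own. Each worker counts words in its own chunk (the "map" step) and the partial counts are then merged (the "reduce" step).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(rows, chunk_count):
    """Split the rows into roughly equal chunks, one per worker."""
    size = max(1, -(-len(rows) // chunk_count))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_chunk(chunk):
    """Map step: count the words within one chunk only."""
    return Counter(word for line in chunk for word in line.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def word_count(lines, workers=3):
    chunks = split_into_chunks(lines, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partial_counts, Counter())


lines = ["big data is big", "data lakes store data", "spark and hadoop"]
counts = word_count(lines)
print(counts.most_common(2))
```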


Skills required:
Programming skills
Hadoop
SQL and NoSQL databases
Basic statistics

Describing data:

1) Visualizing:

Representing data in graphical form, as it is then easier to draw insights from it.
Bar charts, scatter plots and grouped bar charts are examples of graphs.

2) Summarizing data:

Mean, median, mode and variance. These numbers are important as they give an idea of the typical data.
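These summary numbers are all available in Python's built-in `statistics` module. The marks below are made-up sample data, and I use the population variance (`pvariance`) here; `statistics.variance` would give the sample variance instead.

```python
import statistics

marks = [35, 40, 40, 45, 50, 90]  # hypothetical exam marks

summary = {
    "mean": statistics.mean(marks),
    "median": statistics.median(marks),
    "mode": statistics.mode(marks),
    "variance": statistics.pvariance(marks),  # population variance
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The single high mark of 90 pulls the mean above the median, which is why it helps to look at more than one of these numbers.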

Skills required:
Statistics
Python
Tableau
Excel

Modelling:

Statistical modelling:

    • Modelling underlying data distribution.
    • Modelling underlying relations in data (between various variables).
    • Formulate and test Hypotheses.
    • Give statistical guarantees.
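As one small illustration of formulating and testing a hypothesis, here is a sketch of an exact permutation test. The yields are invented numbers for two fertilizers, and the permutation test is just one possible choice of test. The null hypothesis is that the fertilizer makes no difference, so every way of splitting the ten plots into two groups of five is equally likely.

```python
from itertools import combinations

# Hypothetical yields (kg per plot) for two fertilizers.
yield_a = [20, 21, 19, 22, 23]
yield_b = [26, 27, 25, 28, 29]


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test on the difference of the means."""
    pooled = group_a + group_b
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen]
        second = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Count the splits that are at least as extreme as the real one.
        if abs(mean(first) - mean(second)) >= observed:
            extreme += 1
    return extreme / total


p_value = permutation_test(yield_a, yield_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value means a difference this large would rarely arise by chance alone, which is the kind of statistical guarantee the list above refers to.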

Algorithmic Modelling(Machine Learning):

    • Formulate a function f, with y = f(x), using data and optimization techniques.
    • For a new data point (say, a new patient), plug in the value of x to get the value of y.
    • Focus on prediction (don't care about the underlying phenomena).
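The "learn f from data, then plug in a new x" idea can be sketched with the simplest possible choice of f, a straight line fitted by least squares. This is only an illustration with made-up numbers; real algorithmic models are usually far more flexible than a line.

```python
def fit_line(xs, ys):
    """Find the slope and intercept that minimize the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def make_predictor(slope, intercept):
    """Return the learned function f, so that y = f(x)."""
    def f(x):
        return slope * x + intercept
    return f


# Hypothetical training data that happens to follow y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
f = make_predictor(slope, intercept)
print(f(10))  # prediction for a new, unseen x
```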

| Statistical modelling | Algorithmic modelling |
|---|---|
| Simple, intuitive models | Complex, flexible models |
| More suited for low-dimensional data | Can work with high-dimensional data |
| Robust statistical analysis is possible | Not suitable for robust statistical analysis |
| Focus on interpretability | Focus on prediction |
| Data-lean models | Data-hungry models |
| More of statistics | More of ML, DL |
| Linear regression, logistic regression, linear discriminant analysis | Linear regression, logistic regression, linear discriminant analysis, decision trees, k-NNs, SVMs, naïve Bayes, multi-layered neural networks |



Deep Learning: when you have a large amount of high-dimensional data and want to learn very complex relationships between input and output, you use a specific class of complex ML models and algorithms, collectively referred to as Deep Learning.

Skills required:
Inferential statistics
Probability theory
Calculus
Optimization algorithms
ML and DL
Python packages and frameworks (NumPy, SciPy, scikit-learn, TensorFlow, PyTorch, Keras)


Artificial Intelligence:

Problem solving:

No data and no modelling. Here we build a system to solve a problem with simple rules. Like a maze game.
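A maze solver is a nice example of this: there is no data and no model, only simple rules (move up, down, left or right, and never walk through a wall). Here is a minimal sketch using breadth-first search. The maze layout and the symbols are my own assumptions: S is the start, E is the exit and # is a wall.

```python
from collections import deque

# Hypothetical maze: S = start, E = exit, # = wall, . = free cell.
MAZE = [
    "S.#",
    ".##",
    "..E",
]


def solve_maze(maze):
    """Return a shortest path from S to E as a list of (row, col), or None."""
    rows, cols = len(maze), len(maze[0])
    start = next(
        (r, c) for r in range(rows) for c in range(cols) if maze[r][c] == "S"
    )
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        (row, col), path = queue.popleft()
        if maze[row][col] == "E":
            return path
        # Rule: move one step up, down, left or right.
        for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (row + d_row, col + d_col)
            inside = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            # Rule: stay inside the maze and never enter a wall.
            if inside and maze[nxt[0]][nxt[1]] != "#" and nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None  # no route to the exit


print(solve_maze(MAZE))
```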

Knowledge representation and Reasoning: 

 No data and no modelling. Here we build a system to solve a problem with complex rules based on the knowledge and reasoning. 
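Reasoning over encoded knowledge can be sketched as forward chaining: keep applying "if these facts hold, conclude that" rules until nothing new can be derived. The rules and facts below are toy examples I made up, not taken from any real system.

```python
# Hypothetical knowledge base: (set of conditions, conclusion).
RULES = [
    ({"has fur", "says meow"}, "is a cat"),
    ({"is a cat"}, "is a mammal"),
    ({"has feathers"}, "is a bird"),
]


def infer(facts, rules):
    """Forward chaining: apply the rules until no new fact is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are already known.
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


print(sorted(infer({"has fur", "says meow"}, RULES)))
```

Note that "is a mammal" is reached in two steps, through the intermediate conclusion "is a cat", which is the reasoning part that simple one-step rules do not have.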

Decision making: 

Expert systems are systems built from rules given by field experts and encoded using knowledge representation. The execution of the rules and the reasoning are done by a program. The drawback of expert systems is that they don't work if the rules are too complex, inexpressible or unknown. So the alternative approach is to learn from a large amount of data, i.e. Machine Learning. Here we intersect with data science.

When you have a large amount of high-dimensional data and you want to learn very complex relationships between input and output, you use a specific class of complex ML models and algorithms, collectively referred to as Deep Learning. Here again we intersect with data science.

In Reinforcement Learning, the environment is dynamic, information is partial, decision making is sequential, there is no explicit supervision at each step, and rewards from the environment are one-off (as in a chess match). The decisions are made by learning from large amounts of data. Hence we again intersect with data science.

So this data driven part of AI intersects with Data science. 

Descriptive statistics describes data (for example, a chart or graph).

Inferential statistics allows you to make predictions (“inferences”) from that data.

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward.
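The "cumulative reward" can be sketched for a chess-like game where only the final move is rewarded. The discount factor `gamma` is an assumption on my part (it is a common convention in RL, not something covered above); with `gamma=1.0` the result is just the plain sum of rewards.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical chess-like episode: no reward until the win on the last move.
chess_rewards = [0, 0, 0, 1]

print(discounted_return(chess_rewards, gamma=1.0))
print(discounted_return(chess_rewards, gamma=0.9))
```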