Sunday, January 31, 2021

Introduction to Data Science


What is Data Science?

Because data science is an assortment of several tasks, and the emphasis on each task depends on the application, its definition varies widely. The data science tasks include:

  • Collection of data
  • Storage of data
  • Processing of data
  • Description of data
  • Modeling of data
So data science is the science of collecting, storing, processing, describing and modeling data.

Collection of data:

It depends on
  • the question the data scientist is trying to answer,
  • the environment in which the data scientist is working.
To understand this, let's consider three scenarios.

1) Suppose Amazon, an e-commerce site, wants to know which products customers buy together. Here Amazon already has the data. The data scientist just has to use some SQL queries to collect the data and some code (Python, R) to process it further.
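As a sketch of such a query, here is a small self-contained example using Python's sqlite3 (the table and column names are made up) that counts which product pairs appear in the same order:

```python
import sqlite3

# Hypothetical order-items table; a self-join pairs up products that
# appear in the same order, and GROUP BY counts each pair.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_items (order_id INTEGER, product TEXT);
INSERT INTO order_items VALUES
  (1, 'book'), (1, 'bookmark'),
  (2, 'book'), (2, 'bookmark'),
  (3, 'book'), (3, 'pen');
""")
rows = conn.execute("""
SELECT a.product, b.product, COUNT(*) AS times_together
FROM order_items a
JOIN order_items b
  ON a.order_id = b.order_id AND a.product < b.product
GROUP BY a.product, b.product
ORDER BY times_together DESC;
""").fetchall()
print(rows)  # ('book', 'bookmark', 2) ranks first
```

The condition `a.product < b.product` keeps each pair only once (and drops a product paired with itself).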

2) Suppose a political party wants to know people's opinion of a new policy it has rolled out. Here the data is not readily available in a relational database; it exists on social media forums as posts and tweets. The data scientist has to collect it by crawling and scraping (through programming) and needs to know how to work with APIs, along with the hacking skills to access the web pages.

3) Suppose we want to find the effect of the type of seed, irrigation and fertilizer on crop yield. Here the data scientist needs to design experiments to collect data, as three different factors are involved. This calls for intermediate-level programming skills and knowledge of statistics and databases.

Storing Data:


In the 1970s, organizations' operational and transactional data (employee and customer details, invoices) were mostly structured, and so were stored in tables (which led to the development of relational databases), as tables made adding, updating and deleting records easy.

Later, in the 1990s, as organizations grew in size and split into sub-departments, they had to maintain different details of the same customer. For example, a bank has different sections, such as the loans, cards and investments departments, each holding details of the same customer in a different relational database. This led to the development of the data warehouse: an integrated repository that supports analytics and is optimized for it.

Later, in the 2000s, the internet era, a lot of unstructured data was generated on social media platforms such as YouTube, Instagram and SoundCloud. The volume of data generated is huge, and the velocity at which it is generated is also very high. Because of the variety, volume and velocity of the data, it is called big data. This led to the invention of data lakes. A data lake is a collection of structured and unstructured data where all the data is simply stored, without curation or cleaning.

Relational Database       | Data warehouse          | Data lakes
--------------------------|-------------------------|------------------
Structured data           | Structured data         | Unstructured data
Optimized for SQL queries | Curated data            | Uncurated data
                          | Optimized for analytics |


Skills required for storing data:

Programming and Engineering
Knowledge of Relational database
Knowledge of NoSQL
Knowledge of data warehouse
Knowledge of data lakes (Hadoop)

Processing Data:


The following steps are involved in data processing:

1) Data wrangling/Munging:


Wrangling involves extracting, transforming and loading data from an external source into the organizational database. Suppose a book store wants to store the details of its outsourced courier service. The formats the courier service uses may differ from the book store's database. Say the courier service stores its data in JSON files (a semi-structured format) while the book store uses a relational database. The data then has to be extracted from the JSON files, transformed into the format of the organizational database, and loaded into it. This is called wrangling or munging.
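A minimal sketch of this extract-transform-load step in Python, with made-up field names for the courier's JSON records:

```python
import json
import sqlite3

# Hypothetical courier record as it might arrive in a JSON file.
raw = '[{"trackingNo": "C101", "deliveredOn": "2021-01-30", "fee": "45.50"}]'

records = json.loads(raw)                                     # extract
rows = [(r["trackingNo"], r["deliveredOn"], float(r["fee"]))  # transform
        for r in records]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courier (tracking_no TEXT, delivered_on TEXT, fee REAL)")
conn.executemany("INSERT INTO courier VALUES (?, ?, ?)", rows)  # load
print(conn.execute("SELECT * FROM courier").fetchall())
```

The transform step here is just renaming fields and converting the fee from a string to a number; real pipelines do much more of this.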

2) Data Cleaning:

    • Fill in missing values (e.g. with the mean of the column).
    • Standardize keywords. Suppose someone entered 'half' instead of 'half sleeve'; replace it with the correct term.
    • Correct spelling errors.
    • Identify and remove outliers.
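The cleaning steps above can be sketched in a few lines of Python (the column values and the outlier threshold are made up):

```python
# Toy "sleeve" keyword column and a numeric column with a gap and an outlier.
sleeves = ["half", "full sleeve", "half sleeve"]
lengths = [70.0, None, 72.0, 71.0, 900.0]   # None = missing, 900 = outlier

# Standardize keywords.
sleeves = ["half sleeve" if s == "half" else s for s in sleeves]

# Remove outliers (here: a simple fixed threshold).
lengths = [x for x in lengths if x is None or x < 200]

# Fill missing values with the mean of the remaining values.
known = [x for x in lengths if x is not None]
mean = sum(known) / len(known)
lengths = [mean if x is None else x for x in lengths]
print(sleeves, lengths)
```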

3) Data Scaling, Normalizing and Standardizing: 

Scaling:

Converting units, e.g. kilometres to metres or rupees to dollars.

Normalizing:

Rescaling values to lie between 0 and 1 (min-max normalization).

Standardizing:

Shifting and scaling the data to zero mean and unit variance (z-scores).
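A minimal sketch of both transformations on a toy list of values (note that the names 'normalize' and 'standardize' are used inconsistently across texts; here min-max rescaling maps values to [0, 1] and z-scoring gives zero mean):

```python
data = [2.0, 4.0, 6.0, 8.0]   # made-up values

# Min-max normalization: map the smallest value to 0 and the largest to 1.
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

# Z-score standardization: subtract the mean, divide by the std deviation.
mean = sum(data) / len(data)
var = sum((x - mean) ** 2 for x in data) / len(data)  # population variance
std = var ** 0.5
standardized = [(x - mean) / std for x in data]       # zero mean, unit variance

print(normalized)
print(standardized)
```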

If data processing is to be performed on big data (millions of rows), performance becomes a key consideration, so we distribute the processing among different processors and aggregate the results. This is called distributed processing. Hadoop (MapReduce) makes it possible to deal with large amounts of data efficiently.
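The map-reduce idea can be sketched in plain Python with a word count: the map step computes partial counts per chunk (in Hadoop, this part runs in parallel on many machines) and the reduce step aggregates the partial results:

```python
from functools import reduce

# Made-up data, pre-split into chunks as a distributed file system would.
chunks = [["data", "science"], ["data", "lake"], ["data"]]

def map_chunk(chunk):
    # Map phase: partial word counts for one chunk.
    counts = {}
    for word in chunk:
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(a, b):
    # Reduce phase: aggregate two partial results.
    for word, n in b.items():
        a[word] = a.get(word, 0) + n
    return a

partials = [map_chunk(c) for c in chunks]  # parallelizable across machines
total = reduce(merge, partials, {})
print(total)  # {'data': 3, 'science': 1, 'lake': 1}
```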


Skills required:
Programming skills
Hadoop
SQL and NoSQL databases
Basic statistics

Describing data:

1) Visualizing:

Representing data in graphical form, as it is easy to draw insights from it.
Bar charts, scatter plots and grouped bar charts are examples of such graphs.
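As a toy illustration (normally matplotlib or Tableau would be used), a bar chart can even be sketched as text; the categories and counts here are made up:

```python
# One text bar per category: the bar length encodes the count.
sales = {"books": 12, "pens": 5, "bags": 8}
bars = [f"{name:>6} | {'#' * count} {count}" for name, count in sales.items()]
print("\n".join(bars))
```

Even this crude chart makes it obvious at a glance which category dominates.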

2) Summarizing data:

Mean, median, mode and variance. These numbers are important as they give an idea of the typical data.
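Python's standard statistics module computes all four numbers; the sample below is made up:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
print("mean:", statistics.mean(data))           # average value
print("median:", statistics.median(data))       # middle value
print("mode:", statistics.mode(data))           # most frequent value
print("variance:", statistics.pvariance(data))  # spread around the mean
```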

Skills required:
Statistics
Python
Tableau
Excel

Modelling:

Statistical modelling:

    • Modelling underlying data distribution.
    • Modelling underlying relations in data (between various variables).
    • Formulate and test hypotheses.
    • Give statistical guarantees.

Algorithmic Modelling (Machine Learning):

    • Formulate f using data and optimization techniques.
    • For a new data point (e.g. a new patient), plug in the value of x to get the value of y.
    • Focus on prediction (don't care about the underlying phenomena).
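As a sketch of "formulating f from data", here is a least-squares fit of a straight line y = a·x + b on a made-up dataset, after which a new x is plugged in to predict y:

```python
# Toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.1, 8.0]

# Closed-form least-squares estimates of slope a and intercept b.
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

def f(x):
    # The learned model: plug in a new x to predict y.
    return a * x + b

print(f(5.0))  # prediction for an unseen input
```

With more variables or a nonlinear relationship, the same idea is carried out by more complex models (trees, SVMs, neural networks) fitted with iterative optimization.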

Statistical Modelling                   | Algorithmic Modelling
----------------------------------------|----------------------------------------------
Simple, intuitive models                | Complex, flexible models
More suited to low-dimensional data     | Can work with high-dimensional data
Robust statistical analysis is possible | Not suitable for robust statistical analysis
Focus on interpretability               | Focus on prediction
Data-lean models                        | Data-hungry models
More of statistics                      | More of ML, DL
Linear regression, logistic regression, linear discriminant analysis | Linear regression, logistic regression, linear discriminant analysis, decision trees, k-NNs, SVMs, Naïve Bayes, multi-layered neural networks



When you have a large amount of high-dimensional data and want to learn very complex relationships between input and output, you use a specific class of complex ML models and algorithms collectively referred to as Deep Learning.

Skills required:
Inferential statistics
Probability theory
Calculus
Optimization algorithms
ML and DL
Python packages and frameworks (NumPy, SciPy, scikit-learn, TensorFlow, PyTorch, Keras)


Artificial Intelligence:

Problem solving:

No data and no modelling. Here we build a system that solves a problem with simple rules, like a maze game.
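A maze solver of this kind can be sketched with a breadth-first search whose only rule is "move to an adjacent open cell not yet visited"; the grid below is made up ('#' is a wall, S the start, G the goal):

```python
from collections import deque

maze = ["S.#",
        ".##",
        "..G"]
rows, cols = len(maze), len(maze[0])
start, goal = (0, 0), (2, 2)

# Breadth-first search: explore cells in order of distance from the start,
# so the first time the goal is reached is via a shortest path.
queue = deque([(start, 0)])
seen = {start}
steps_to_goal = None
while queue:
    (r, c), d = queue.popleft()
    if (r, c) == goal:
        steps_to_goal = d
        break
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and maze[nr][nc] != "#" and (nr, nc) not in seen:
            seen.add((nr, nc))
            queue.append(((nr, nc), d + 1))

print(steps_to_goal)  # shortest number of moves from S to G
```

No data is collected and nothing is modelled: the behaviour comes entirely from the fixed rule.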

Knowledge representation and Reasoning: 

No data and no modelling. Here we build a system that solves a problem with complex rules, based on knowledge representation and reasoning.

Decision making: 

Expert systems are built from rules given by field experts, with the rules encoded using knowledge representation; the execution of the rules and the reasoning are done by a program. The drawback of expert systems is that they do not work if the rules are complex, inexpressible or simply unknown. The alternative approach is to learn from large amounts of data, i.e. Machine Learning; here we intersect with data science. When you have a large amount of high-dimensional data and want to learn very complex relationships between input and output, you use a specific class of complex ML models and algorithms collectively referred to as Deep Learning; here again we intersect with data science. In Reinforcement Learning, where the environment is dynamic, information is partial, decision making is sequential, there is no explicit supervision at each step, and rewards from the environment are one-off (as in a chess match), decisions are also made by learning from large amounts of data; hence we again intersect with data science.

So this data driven part of AI intersects with Data science. 

Descriptive statistics describes data (for example, a chart or graph).

Inferential statistics allows you to make predictions (“inferences”) from that data.

Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize cumulative reward.