What is Data Science?
Because data science is an assortment of several tasks, and the emphasis on each task depends on the application, its definition varies widely. The data science tasks include:
- Collection of data
- Storage of data
- Processing of data
- Description of data
- Modeling of data
Collection of data:
How the data is collected depends on:
- the question the data scientist is trying to answer.
- the environment the data scientist is working in.
Storing Data:

| Relational Database | Data warehouse | Data lakes |
| --- | --- | --- |
| Structured data | Structured data | Un-structured data |
| Optimised for SQL queries | Curated data | Un-curated data |
| | Optimized for analytics | |

Skills required:
- Programming and engineering
- Knowledge of relational databases
- Knowledge of NoSQL
- Knowledge of data warehouses
- Knowledge of data lakes (Hadoop)
Processing Data:
1) Data wrangling/Munging:
Wrangling involves extracting data from an external source, transforming it, and loading it into the organizational database. Suppose a bookstore outsources deliveries and wants to store the courier service's records. The format the courier service uses may differ from the bookstore's database: say the courier service stores its data in JSON files (a semi-structured format) while the bookstore uses a relational database. The data therefore has to be extracted from the JSON files, transformed into the format of the organizational database, and loaded into it. This is called wrangling or munging.
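A minimal sketch of the extract-transform step described above. The JSON field names (`consignment`, `book_id`, `delivered`) are made up for illustration; a real courier feed would have its own schema.

```python
import json

# Hypothetical courier-service record in JSON (semi-structured).
courier_json = '{"consignment": "C101", "book_id": 42, "delivered": "2021-03-05"}'

def wrangle(record_json):
    """Extract a JSON record and transform it into the flat row
    shape the bookstore's relational table would expect."""
    record = json.loads(record_json)          # extract
    return (record["consignment"],            # transform: pick and
            record["book_id"],                # order the fields to
            record["delivered"])              # match the table schema

row = wrangle(courier_json)                   # ready to load (INSERT)
print(row)
```

The "load" step would then be a single `INSERT` of this tuple into the bookstore's deliveries table.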
2) Data Cleaning:
- Fill in missing values (e.g. with the mean of the column).
- Standardize keywords. Suppose someone entered 'half' instead of 'half sleeve'; replace it with the correct term.
- Correct spelling errors.
- Identify and remove outliers.
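The cleaning steps above can be sketched in a few lines of Python. The column values and the outlier rule (drop anything above twice the column mean) are made up for illustration:

```python
from statistics import mean

# Toy bookstore columns with typical quality problems.
sleeves = ["half", "full sleeve", "half sleeve", "half sleeve"]
prices = [250.0, None, 300.0, 350.0]

# 1) Fill missing values with the mean of the column.
col_mean = mean(p for p in prices if p is not None)
prices = [col_mean if p is None else p for p in prices]

# 2) Standardize keywords: map shorthand to the canonical term.
fixes = {"half": "half sleeve"}
sleeves = [fixes.get(s, s) for s in sleeves]

# 3) Identify and remove outliers (crude rule: anything more
#    than double the column mean).
prices = [p for p in prices if p <= 2 * col_mean]
```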
3) Data Scaling, Normalizing and Standardizing:
Scaling:
Converting units, e.g. kilometres to metres, rupees to dollars.
Normalizing:
Rescaling values into a fixed range, typically 0 to 1 (min-max normalization), so that all features are on a uniform scale.
Standardizing:
Transforming the data to have zero mean (and unit variance).
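A sketch of all three transformations on a toy column of values (the data is made up for illustration):

```python
from statistics import mean, pstdev

data = [10.0, 20.0, 30.0, 40.0]

# Scaling: a change of units, e.g. kilometres to metres.
metres = [x * 1000 for x in data]

# Normalizing (min-max): squeeze values into the range 0 to 1.
lo, hi = min(data), max(data)
normalized = [(x - lo) / (hi - lo) for x in data]

# Standardizing (z-score): shift and scale to zero mean
# and unit standard deviation.
mu, sigma = mean(data), pstdev(data)
standardized = [(x - mu) / sigma for x in data]
```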
If data processing is to be performed on big data (millions of rows), performance becomes a key consideration, so we distribute the processing among different processors and aggregate the results. This is called distributed processing. Hadoop (MapReduce) allows us to deal with large amounts of data efficiently.
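The map-reduce idea can be sketched on a single machine: split the rows into chunks, compute a partial result per chunk (the map step, done by separate workers in a real cluster), then aggregate the partials (the reduce step). Here we compute an overall mean over a stand-in dataset:

```python
from functools import reduce

rows = list(range(1, 1001))          # stand-in for millions of rows

# Map: split the rows into chunks and compute a partial
# (sum, count) per chunk, as each worker would in a cluster.
chunks = [rows[i:i + 250] for i in range(0, len(rows), 250)]
partials = [(sum(chunk), len(chunk)) for chunk in chunks]

# Reduce: aggregate the partial results into the final answer.
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
overall_mean = total / count
```

Note that the reduce step works on the small list of partials, never on the raw rows, which is what makes the approach scale.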
Skills required:
- Programming skills
- Hadoop
- SQL and NoSQL databases
- Basic statistics
Describing data:
1) Visualizing:
Representing data in graphical form makes it easy to draw insights from it. Bar charts, scatter plots and grouped bar charts are examples of such graphs.
2) Summarizing data:
Mean, median, mode and variance. These numbers are important as they give an idea of the typical data.
Skills required:
- Statistics
- Python
- R
- Tableau
- Excel
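As a sketch, the summary numbers from the describing-data step (mean, median, mode, variance) can all be computed with Python's standard library; the ratings column is made up for illustration:

```python
from statistics import mean, median, mode, pvariance

ratings = [4, 5, 3, 4, 4, 2, 5]      # hypothetical customer ratings

summary = {
    "mean": mean(ratings),
    "median": median(ratings),
    "mode": mode(ratings),
    "variance": pvariance(ratings),
}
print(summary)
```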
Modelling:
Statistical modelling:
- Modelling underlying data distribution.
- Modelling underlying relations in data (between various variables).
- Formulate and test Hypotheses.
- Give statistical guarantees.
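A small sketch of the first and last bullets: fit a distribution to data, then use the fitted model to attach a probability to a statement about future data. The sales figures are made up for illustration:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample: daily sales figures.
sales = [98, 102, 101, 97, 103, 99, 100, 100]

# Model the underlying distribution by fitting a Normal to the data.
model = NormalDist(mu=mean(sales), sigma=stdev(sales))

# The fitted model lets us attach probabilities ("statistical
# guarantees") to statements, e.g. the chance that a day's
# sales stay at or below 105.
p_below_105 = model.cdf(105)
```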
Algorithmic Modelling (Machine Learning):
- Formulate f from the data using optimization techniques.
- For a new patient, plug in the value of x to get the value of y.
- Focus on prediction (don't care about the underlying phenomena).
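The two steps above ("formulate f from data", then "plug in x to get y") can be sketched with least-squares line fitting; the patient data below is made up for illustration:

```python
def fit_line(xs, ys):
    """Learn f(x) = a*x + b from data by least squares: a minimal
    stand-in for formulating f using data and optimization."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

# Hypothetical patient data: x = dosage, y = measured response.
f = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
prediction = f(5)     # plug in a new patient's x to get y
```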
| Statistical Modelling | Algorithmic Modelling |
| --- | --- |
| Simple, intuitive models | Complex, flexible models |
| More suited for low-dimensional data | Can work with high-dimensional data |
| Robust statistical analysis is possible | Not suitable for robust statistical analysis |
| Focus on interpretability | Focus on prediction |
| Data-lean models | Data-hungry models |
| More of statistics | More of ML, DL |
| Linear regression, Logistic regression, Linear discriminant analysis | Decision trees, k-NNs, SVMs, Naïve Bayes, multi-layered neural networks |
When you have a large amount of high-dimensional data and you want to learn very complex relationships between output and input, you use a specific class of complex ML models and algorithms collectively referred to as Deep Learning.
Skills required:
- Inferential statistics
- Probability theory
- Calculus
- Optimization algorithms
- ML and DL
- Python packages and frameworks (numpy, scipy, scikit-learn, TF, pytorch, keras)
Artificial Intelligence:
Problem solving:
No data and no modelling. Here we build a system that solves a problem with simple rules, like a maze game.
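The maze example can be sketched as pure rule-based search: breadth-first search over the grid, with no data and no learned model involved. The maze layout is made up for illustration (`S` start, `G` goal, `#` wall):

```python
from collections import deque

maze = ["S.#",
        ".##",
        "..G"]

def solve(maze):
    """Return the length of the shortest S-to-G path by
    breadth-first search, or None if no path exists."""
    rows, cols = len(maze), len(maze[0])
    start = next((r, c) for r in range(rows) for c in range(cols)
                 if maze[r][c] == "S")
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), steps = queue.popleft()
        if maze[r][c] == "G":
            return steps
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and maze[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), steps + 1))
    return None

print(solve(maze))
```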
Knowledge representation and Reasoning:
No data and no modelling. Here we build a system that solves a problem with complex rules, based on knowledge representation and reasoning.
Decision making:
Expert systems are built from rules given by field experts, with the rules encoded using knowledge representation; a program then executes the rules and performs the reasoning. The drawback of expert systems is that they don't work if the rules are complex, inexpressible, or simply unknown. The alternative approach is to learn from a large amount of data, i.e. Machine Learning; here we intersect with data science. When you have a large amount of high-dimensional data and want to learn very complex relationships between output and input, you use a specific class of complex ML models and algorithms collectively referred to as Deep Learning; here again we intersect with data science. In Reinforcement Learning, the environment is dynamic, information is partial, decision making is sequential, there is no explicit supervision at each step, and rewards from the environment may be one-off (as in a chess match); decisions are made by learning from large amounts of data, so here too we intersect with data science.
So this data-driven part of AI intersects with data science.
Descriptive statistics describes data (for example, a chart or graph).
Inferential statistics allows you to make predictions (“inferences”) from that data.
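The contrast can be made concrete in a few lines: a descriptive summary of the sample itself, then an inference about the unseen population via a rough confidence interval. The sample values, and the use of a normal approximation with z = 1.96, are assumptions for illustration:

```python
from statistics import mean, stdev
from math import sqrt

sample = [12, 15, 11, 14, 13, 16, 12, 15]    # made-up measurements

# Descriptive: summarize the data we actually have.
sample_mean = mean(sample)

# Inferential: estimate where the population mean lies, via a
# rough 95% confidence interval (normal approximation, z = 1.96).
std_err = stdev(sample) / sqrt(len(sample))
ci = (sample_mean - 1.96 * std_err, sample_mean + 1.96 * std_err)
```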
