Monday, February 1, 2021

Engineering Data Science systems

Systems thinking:

Engineering aspects of Data Science:

This is a way of thinking about a system as a whole rather than focusing on one particular issue. Suppose we have to sort some data. We could use the bubble sort algorithm, but it needs many passes over the data and is slow on large inputs. Instead we can use quicksort, which does far less work on average and makes good use of the CPU caches, which makes it more efficient. Hence, while building a system, always look at the larger picture rather than only the particular issue.
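The sorting comparison can be tried out with a small sketch. The function names, the input size and the simple list-based quicksort here are my own choices for illustration. Python cannot show cache effects directly, so this only shows the difference in the amount of work the two algorithms do.

```python
import random
import time


def bubble_sort(items):
    """Sort a copy by repeatedly swapping adjacent out-of-order pairs."""
    data = list(items)
    for end in range(len(data) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
                swapped = True
        if not swapped:
            break  # no swaps in this pass, so the list is already sorted
    return data


def quick_sort(items):
    """Sort a copy with a simple quicksort that builds new lists."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quick_sort(smaller) + equal + quick_sort(larger)


def time_it(sort_fn, items):
    """Return the seconds taken by one call of sort_fn on items."""
    start = time.perf_counter()
    sort_fn(items)
    return time.perf_counter() - start


random.seed(0)  # fixed seed so the same input is used on every run
sample = [random.randint(0, 10_000) for _ in range(2_000)]

print("bubble sort seconds:", time_it(bubble_sort, sample))
print("quick sort seconds: ", time_it(quick_sort, sample))
```

The exact timings depend on the machine, but the gap should widen quickly as the input grows.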

In data science, systems thinking needs domain knowledge, hacking skills, and math and statistics.

Roles involved in Data Science:

To have a systems perspective of data science, one needs business knowledge, programming, statistics and communication. A data analyst has business knowledge, statistics and communication. A research analyst has communication, statistics and programming.

Processes involved in Data Science:

Engineering data science systems involves two parts:
  • Process
  • Programming

So to become a data engineer, a person should know not just programming but also the process.

Process:

Process has two components.

1. Flow of steps. (What are the steps I take?)

2. Agile improvement. (How do I improve each step in an agile way?)

One such process followed in data science is CRISP-DM.

CRISP-DM:

1. Business understanding. One needs to understand the context one is working in and be able to specify the problems one is trying to solve.

2. Data understanding. What data do I have? Will it solve the problem I have?

3. Data preparation.

4. Data modelling. How do I model? How do I analyze the hypothesis?

5. Evaluate the model.

6. Deploy the system.

All the steps in the process are iterative and are repeated many times until we achieve the expected output. Data science follows the MVP (minimum viable product) approach, where we first build a simple but complete system and then build on it iteratively. This is agile improvement.
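The build, evaluate and improve loop can be sketched in a few lines. Everything here is assumed for illustration: the toy data is invented, the target error is hypothetical, and the two models (a mean baseline, then a straight line) just stand in for "simple first version" and "next iteration". For brevity the error is measured on the same data used for fitting, which a real project would avoid.

```python
from statistics import mean

# Toy data, invented purely for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]


def fit_mean(xs, ys):
    """First MVP: always predict the average of the known outputs."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Next iteration: least-squares straight line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def evaluate(model, xs, ys):
    """Mean absolute error of the model on the given data."""
    return mean(abs(model(x) - y) for x, y in zip(xs, ys))


target_error = 0.5  # hypothetical business requirement
candidates = [("mean baseline", fit_mean), ("straight line", fit_line)]

# Model, evaluate, and move to the next iteration only if needed.
for name, fit in candidates:
    model = fit(xs, ys)
    error = evaluate(model, xs, ys)
    print(f"{name}: mean absolute error = {error:.2f}")
    if error <= target_error:
        print(f"'{name}' meets the target, so this version can be deployed")
        break
```

The first version is complete end to end even though it is crude, and the loop only adds complexity when the evaluation says it is needed.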

Programming tools:

No-code environments like IBM Watson and Amazon Lex. These are paid interfaces that anyone can use to analyze data.

Spreadsheets and BI tools like Microsoft Excel, Google Sheets, Power BI and Tableau.

Programming languages and environments like Weka, MATLAB, Mathematica, Python and R.

High-performance stacks like Hadoop and Spark.

Why Python?

1. Python is beginner friendly

2. Python is increasingly a popular choice for data science

3. Python is good for production and planning

4. Open source libraries used in data science are readily available

5. Python is useful beyond data science, e.g. as a scripting language, for web applications and for programming IoT devices.

Disadvantages:

Python is an interpreted language, which can make it slower than compiled languages.
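A rough way to see this overhead is to time a loop written in Python against the built-in `sum`, which in the standard CPython interpreter is implemented in C. This is only a sketch: the list size and repeat count are arbitrary choices, and the timings will vary by machine.

```python
import timeit

numbers = list(range(100_000))


def loop_sum(values):
    """Add the values one by one in interpreted Python code."""
    total = 0
    for v in values:
        total += v
    return total


# Run each version 20 times and report the total seconds taken.
loop_time = timeit.timeit(lambda: loop_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print(f"Python loop:  {loop_time:.4f} s")
print(f"built-in sum: {builtin_time:.4f} s")
```

Both versions compute the same result. The difference is in how much of the work runs as interpreted Python code, which is also why data science libraries push their heavy loops into compiled code.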


