Defining Data Science for Architects

December 30, 2021 by Mikhail Vladimirov
Categories: Architecture, Data Science and Business Analytics

This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects.

1.1 What is Data Science?

Data science focuses on the extraction of knowledge and business insights from data
- It does so by leveraging techniques and theories from many applied and pure science fields such as statistics, pattern recognition, machine learning, data warehousing, data visualization, scalable and high-performance computing, etc.

1.2 Data Science, Machine Learning, AI?

Machine learning (ML) is a subset of data science that uses existing data to train ML algorithms to make predictions or take action on new (never seen before) data
- Existing (training) data can be either labeled (classified by humans) or unlabeled
ML is also sometimes referred to as data mining or predictive analytics
Data science includes, in addition to ML, statistics, advanced data analysis, data visualization, data engineering, etc.
Artificial Intelligence (AI) aims at automating/augmenting/substituting complex human activities through a number of specialized computer-assisted solutions
- Some of the solutions are based on deep learning through neural networks

1.3 The Data Science Ecosystem

Source: http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png

Notes:

Another take on the Data Science Skill Sets

The Data Science Skill Sets Venn Diagram

Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

1.4 Tools of the Trade

Stand-alone Python:
- Modules and libraries: scikit-learn, NumPy, pandas, matplotlib, seaborn
- Dev tools: Jupyter notebooks, Visual Studio Code, PyCharm
The Apache Spark scalable platform:
- A choice of programming languages: Python (called PySpark), Scala, and Java
- Spark ML module
- Dev tools: Spark Shell, Jupyter notebooks
R statistical programming language
Deep Learning:
- TensorFlow with its high-level Python API called Keras; PyTorch

1.5 The Data-Related Roles

Data-driven organizations establish the following three data-related roles which are highly interconnected:
- Data Scientist
  - Someone who uses existing data to train machine learning (ML) algorithms to make predictions and/or generalize (take actions) on new (never seen before) data; practitioners in the field apply scientific experimentation techniques when working with data, trying out different ML models
- Data Analyst
  - Someone who uses traditional business intelligence (BI) tools to understand, describe, categorize, and report on the existing data
- Data Engineer
  - Someone who supports the above two roles with their data needs; most of these activities fall under the category of ETL (Extract, Transform, and Load) processes

1.6 Data Scientists at Work

Jeff Hammerbacher, who built the “Data” team at Facebook, described the work done by their data science group as follows:
- “… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analysis to other members of the organization.”

Notes:

Data analysts and data scientists will do themselves and their organizations a big favor by learning basic data engineering skills.

As Maxime Beauchemin wrote in his article [ http://bit.ly/DATENG2019 ]:

“I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.

I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely.

My team was at forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs to traditional methods.

We were pioneers. We were data engineers!”

1.7 Examples of Data Science Projects

Build correlation models based on user requests/searches/product reviews (or any other data collected from users) to predict users’ choices
Engage user data in a feedback loop in which it contributes to improving the company’s products and services
Develop a new customer segmentation model for the marketing department
Recommendation systems (to facilitate cross-selling)
Sentiment analysis
Fraud detection

1.8 The Concept of a Data Product

One of the facets of data science as a discipline is to identify the data aspect of user activities
In some cases, a separate data product needs to be created that would help gain insight into user activities
An early data product on the Web was the CDDB database (http://en.wikipedia.org/wiki/CDDB) built by Gracenote for the task of identifying CDs
- Problem: The audio CD format does not include metadata about the CD (the disc or track titles, performing artists, etc.)
- Solution: A digital “fingerprint” of a CD is created by performing calculations on the CD’s track lengths which then can be used as an index to the CD metadata stored in the CDDB online database
Now, with this data product in place, a whole range of usage/business analytics can be performed using it

1.9 Applied Data Science at Google

Google’s PageRank algorithm was among the first to rank websites in their search engine results based on the number and quality of links pointing to a page
- Google built their infrastructure around this concept
During the Swine Flu epidemic of 2009, Google used their search data to predict flu trends around the world
- Google identified a correlation between how many people search for flu-related topics and how many people actually have flu symptoms

1.10 Data Science and ML Terminology: Features and Observations

In data science, machine learning (ML), and statistics, features are variables (like the year a house was built, number of rooms in a house, presence of a pool, etc.) that are used in making predictions (e.g. the price of the house); they are also called predictors or independent variables
- A feature is similar to a relational table’s column (entity attribute, or property)
- Features are the inputs for an ML model
The value that you predict using features is referred to as the response, outcome, predicted variable, or dependent variable
An observation is a data point, a single recorded instance of a phenomenon in a problem domain, a.k.a. a sample or an example
- An observation is like a table’s row or record
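
A minimal sketch of this terminology in plain Python. The house data and the names `observations`, `feature_names`, `response_name`, `X`, and `y` are made up for illustration and are not part of the course material:

```python
# Each dict is one observation (a table row); each key is a column.
observations = [
    {"year_built": 1995, "rooms": 3, "has_pool": 0, "price": 250_000},
    {"year_built": 2010, "rooms": 4, "has_pool": 1, "price": 410_000},
    {"year_built": 1978, "rooms": 2, "has_pool": 0, "price": 180_000},
]

feature_names = ["year_built", "rooms", "has_pool"]  # predictors (independent variables)
response_name = "price"                               # response (dependent variable)

# X holds the features of every observation, y holds the matching responses.
X = [[row[name] for name in feature_names] for row in observations]
y = [row[response_name] for row in observations]

print(X[0], y[0])  # [1995, 3, 0] 250000
```

By convention, the feature matrix is usually called X and the response vector y, which is also how they appear in the scikit-learn snippet later in this tutorial.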

Notes:

For more terminology used in data science and ML, visit

https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html#glossary-instance , and/or

https://developers.google.com/machine-learning/crash-course/glossary

1.11 Terminology: Labels and Ground Truth

A label is a type/class of object that we assign to an observation
- Observations (examples) can be labeled or unlabeled:
  - Labeled observations are mostly used in classification; their labels are referred to as the “ground truth”
  - Unlabeled observations occur when we have no recourse to labels; in so-called unsupervised ML, we let a machine learning algorithm label those observations using some sort of data grouping/clustering mechanism
Labels are also used in linear regression models to denote the numeric values we are trying to predict (e.g. the global temperature in the year 2051)

1.12 Label Examples

Label examples:
- Trading recommendation: Buy, Sell, Hold
- E-mail category: Spam, Non-spam
- Disease outbreak category: Outbreak, Endemic, Epidemic, Pandemic
- House sale price: A numeric value
Labels are usually encoded with some numeric values, e.g. the trading recommendations Buy, Sell, and Hold could be encoded as 0, 1, and 2
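
As a rough illustration of such an encoding, the sketch below assigns integer codes in order of first appearance. The helper name `encode_labels` is hypothetical, and the specific codes are just one possible choice:

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, in order of first appearance."""
    codes = {}
    for label in labels:
        if label not in codes:
            codes[label] = len(codes)
    return [codes[label] for label in labels], codes

recommendations = ["Buy", "Sell", "Hold", "Buy", "Hold"]
encoded, mapping = encode_labels(recommendations)
print(mapping)  # {'Buy': 0, 'Sell': 1, 'Hold': 2}
print(encoded)  # [0, 1, 2, 0, 2]
```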

1.13 Terminology: Continuous and Categorical Features

Features can be of two types:
- Continuous: something that can be physically or theoretically measured in numeric values, e.g. blood pressure, size of a black hole, plane speed, humidity, etc.
- Categorical: discrete, enumerated values like hurricane category, day of the week, car type, etc.; this feature type, in turn, is divided into nominal and ordinal features:
  - Nominal categories have no ordering, e.g. card suits: hearts, diamonds, spades, and clubs (ordering may be card game-specific)
  - Ordinal categories imply some sort of ordering, e.g. the ranks in each suit of playing cards: Ace, 2, 3, 4, …, J, Q, K

Notes:

Feature types visually

1.14 Encoding Categorical Features using One-Hot Encoding Scheme

When dealing with categorical features (nominal or ordinal), e.g. trading recommendations made: buy, sell, or hold, you need to have a way to encode the input examples for further processing
The common encoding technique is the “One-Hot” scheme, which works as follows (see the next slide for an example):
- Introduce as many variables/features as there are distinct values in the categorical feature; the variables are usually named after the <feature name>_<categorical value>, e.g. trade_buy, trade_sell, and trade_hold
- Initialize the variables using the one-hot encoding scheme:
  - Assign 1 to the variable if the observation has the matching category name; 0 otherwise
In the one-hot transformation, you, essentially, end up with as many new features as there are levels in that categorical variable

1.15 Example of ‘One-Hot’ Encoding Scheme

In our go-to trading actions example, the one-hot encoding will create these three new variables/features (that may be named trade_buy, trade_sell, and trade_hold) holding either 0 or 1:

trade_buy  trade_sell  trade_hold
1          0           0           # the mapping of 'buy'
0          0           1           # the mapping of 'hold'
1          0           0           # the mapping of 'buy' again
0          1           0           # the mapping of 'sell'

As you can see, the new three-feature set forms a sparse matrix
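
The mapping above can be reproduced with a few lines of plain Python. This is only a sketch: the function name `one_hot` and its parameters are hypothetical, and the category order is passed in explicitly so that the columns come out in the same order as in the example:

```python
def one_hot(values, feature_name, categories):
    """Return (column_names, rows) for a one-hot encoding of `values`."""
    # One new column per distinct category, named <feature name>_<categorical value>.
    columns = [f"{feature_name}_{category}" for category in categories]
    # 1 where the observation matches the category, 0 otherwise.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows

trades = ["buy", "hold", "buy", "sell"]
columns, rows = one_hot(trades, "trade", ["buy", "sell", "hold"])
print(columns)  # ['trade_buy', 'trade_sell', 'trade_hold']
for row in rows:
    print(row)
# [1, 0, 0]
# [0, 0, 1]
# [1, 0, 0]
# [0, 1, 0]
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.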

1.16 Gartner’s Magic Quadrant for Data Science and Machine Learning Platforms (a Labeling Example)

Notes:

Gartner Research developed a process for assessing the ability of companies in a specific industry to innovate and deliver value to their customers. Gartner labels companies according to their ability to execute and the completeness of their strategy and vision using four categories (Niche Players, Visionaries, Challengers, and Leaders); in addition to this labeling, it blends in the spatial positioning of individual companies within the magic quadrant.

1.17 Machine Learning in a Nutshell

At the core of ML lies the concept of distance between observations (data points) that helps measure the degree of proximity/affinity/similarity between them
During the model training phase, ML algorithms infer an object grouping/classification decision boundary (a function) that discriminates between objects using distances between observations
Regression ML models predict a dependent (response) variable based on the historical/known values of at least one independent (explanatory/predictor) variable
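
One way to see how distances drive classification is a 1-nearest-neighbor classifier, sketched below in plain Python. The technique is chosen here purely for illustration (it is not singled out by the course text), and the data points and the names `euclidean` and `nearest_neighbor_label` are made up:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points with the same number of features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_label(training_points, training_labels, new_point):
    """Return the label of the training observation closest to new_point."""
    distances = [euclidean(point, new_point) for point in training_points]
    closest = distances.index(min(distances))
    return training_labels[closest]

points = [(1.0, 1.0), (1.5, 2.0), (6.0, 6.5), (7.0, 6.0)]
labels = ["circle", "circle", "triangle", "triangle"]

print(nearest_neighbor_label(points, labels, (2.0, 1.5)))  # circle
print(nearest_neighbor_label(points, labels, (6.5, 7.0)))  # triangle
```

A new observation simply inherits the label of its closest known neighbor, so the decision boundary is implied by the distances rather than written down explicitly.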

1.18 Common Distance Metrics

For continuous numeric variables, the Minkowski distance is used, which has this generic form:
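
For reference, the standard textbook definition of the Minkowski distance of order p between two n-dimensional points x and y can be written as follows (the symbols D, x, y, and n are notation chosen here, not taken from the original slide):

```latex
D(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}
```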

The Minkowski distance has three special cases:
- For p=1, the distance is known as the Manhattan distance (a.k.a. the L1 norm)
- For p=2, the distance is known as the Euclidean distance (a.k.a. the L2 norm)
- When p → +infinity, the distance is known as the Chebyshev distance
For categorical data and equal-length strings (e.g. in some text classification scenarios), a commonly used distance metric is the Hamming distance

Notes:

Calculating an L2 norm of a two-feature vector defined as [3, 4] — essentially, a hypotenuse of a right-angled triangle with sides 3 and 4 (the Pythagorean theorem) — using the NumPy API (two options):

The “hard” way:

import numpy as np
x = np.array([3,4])
np.sqrt(np.sum(x * x))  # 5.0

The “easy” way:

from numpy.linalg import norm
norm(x, ord=2)    # 5.0
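
The special cases listed in this section, plus the Hamming distance, can also be sketched without NumPy. The function names below are hypothetical helpers written for this example, not library calls:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length numeric sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """Limit of the Minkowski distance as p grows: the largest coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

origin, point = [0, 0], [3, 4]
print(minkowski(origin, point, 1))    # 7.0 (Manhattan, L1)
print(minkowski(origin, point, 2))    # 5.0 (Euclidean, L2)
print(chebyshev(origin, point))       # 4
print(hamming("karolin", "kathrin"))  # 3
```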

1.19 The Euclidean Distance

The most commonly used distance in ML for continuous numeric variables is the Euclidean distance
In Cartesian coordinates, if we have two data points in Euclidean n-space: p and q, the distance from p to q (or from q to p) is given by the Pythagorean formula:
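
For reference, with p = (p_1, …, p_n) and q = (q_1, …, q_n), the standard form of that formula is shown below (the function name d and the index i are notation chosen here):

```latex
d(p, q) = d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
```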

1.20 Decision Boundary Examples (Object Classification)

We have a two-feature dataset (plotted along the X and Y coordinates) of objects of two classes (depicted as red circles and golden triangles).

Adapted from https://www.semanticscholar.org/

1.21 What is a Model?

In data science and ML, a model is a formula, an algorithm, or a prediction function that establishes a relationship between
- features (predictors)
  - that act as the model’s input and
- labels (the output/predicted variable)
  - that act as the model’s output
A model is trained to predict (make an inference of) the labels or predict continuous values

1.22 Training a Model to Make Predictions

There are two major life-cycle phases of an ML model:
- Model training (fitting)
  - You train or let your model learn on labeled observations (examples) fed into the model
  - During training, the model seeks a set of weights — variable coefficients — (and biases, if applicable) that minimize loss
    - Loss is a quantitative measure of error, e.g. the mean squared error
- Inference (predicting)
  - Here you use your trained model to calculate/predict the labels of unlabeled observations (examples) or numeric values
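
The two phases can be sketched with a one-feature linear model trained by gradient descent on the mean squared error. This is a simplified illustration, not how any particular library trains its models; the data, the learning rate, the number of epochs, and the function names are all assumptions made for this example:

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Training (fitting): search for the weight w and bias b that minimize the MSE."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(w, b, x):
    """Inference: apply the trained weight and bias to a new observation."""
    return w * x + b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

w, b = train(xs, ys)
print(round(w, 3), round(b, 3))       # close to 2.0 and 1.0
print(round(predict(w, b, 10.0), 3))  # close to 21.0
```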

1.23 Types of Machine Learning

There are three main types of machine learning (ML):
- unsupervised learning
- supervised learning, and
- reinforcement learning
In this course, we will be dealing only with the first two types: unsupervised and supervised learning
FYI: The goal of reinforcement learning is to instruct computer-based algorithms to select actions that maximize a domain-specific gain or minimize a cost (which, essentially, emulates the way humans learn)

1.24 Supervised vs Unsupervised Machine Learning

Supervised learning (SL) defines a target variable that needs to be predicted/estimated by applying an SL algorithm using predictor (independent) variables (features).
SL algorithms are built on top of mathematical formulas with predictive capacity
SL uses labeled examples
Classification and regression are examples of SL algorithms

Unsupervised learning (UL) is the opposite of SL: UL does not have the concept of a target value that needs to be found or estimated; rather, a UL algorithm, for example, can deal with the task of grouping (forming a cluster of) similar items together based on some automatically defined or discovered criteria of data elements’ affinity (automatic classification technique)
UL uses unlabeled examples
In essence, UL attempts to extract patterns without much human intervention

Notes:

Some classification systems are referred to as expert systems that are created in order to let computers take much of the technical drudgery out of data processing leaving humans with the authority, in most cases, to make the final decision.

1.25 Supervised Machine Learning Algorithms

Some of the more popular supervised ML algorithms are:
- Decision Trees/Random Forest
- k-Nearest Neighbors (kNN)
- Naive Bayes
- Regression (simple linear, multiple, locally weighted, etc.)
- Support Vector Machines (SVMs)
- Logistic Regression

1.26 Unsupervised Machine Learning Algorithms

Some of the more popular unsupervised ML algorithms are:
- k-Means
- Hierarchical clustering
- Gaussian mixture models
- Dimensionality reduction falls into the realm of unsupervised learning:
  - PCA, Isomap, t-SNE (2-D visualizations of high-dimensional datasets)

1.27 Which ML Algorithm to Choose?

Notes:

The rules below may help you get your direction but those are not written in stone.

If you are trying to find the probability of an event or predict a value based on existing historical observations, look at the supervised learning (SL) algorithms. Otherwise, refer to unsupervised learning (UL).

If you are dealing with discrete (categorical) values like TRUE:FALSE, bad:good:excellent, buy:hold:sell, etc., you need to go with the classification algorithms of SL.

If you are dealing with continuous numerical values, you need to go with the regression algorithms of SL.

If you want to let the machine categorize data into a number of groups, you need to go with the clustering algorithms of UL.
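
These rules of thumb can be condensed into a tiny helper. The function name and its two boolean parameters are hypothetical, and the result is a starting point rather than a firm recommendation:

```python
def suggest_algorithm_family(has_target, target_is_continuous=False):
    """Apply the rules of thumb above to pick a family of ML algorithms."""
    if not has_target:
        # Nothing to predict: let the machine group the data.
        return "unsupervised learning: clustering"
    if target_is_continuous:
        # Predicting a continuous numerical value.
        return "supervised learning: regression"
    # Predicting a discrete value such as buy:hold:sell.
    return "supervised learning: classification"

print(suggest_algorithm_family(has_target=True, target_is_continuous=True))
# supervised learning: regression
print(suggest_algorithm_family(has_target=True))
# supervised learning: classification
print(suggest_algorithm_family(has_target=False))
# unsupervised learning: clustering
```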

1.28 Bias-Variance (Underfitting vs Overfitting) Trade-off

Underfitting is a property of a model that makes it less accurate by virtue of being too generic, or biased
- Such a model is usually too simple: it fails to account for some important regularities in the training data and has low variance in its predictions
Overfitting is the opposite of underfitting: it makes your model too sensitive to information noise/variance in your training data
- Usually, this property is exhibited by more complex models that try to describe your training data as closely as possible
A good model strikes a balance between bias and overreaction to variance (a bias-variance balance/trade-off)
The bias-variance trade-off applies to classification and regression models (supervised learning)

Notes:

Balancing Off the Bias-Variance Ratio

The common techniques to balance off the bias-variance ratio are dimensionality reduction, feature selection, and regularization.

Dimensionality reduction is the process of transforming the original feature set into another one with fewer features: features may be dropped or combined using some inter-feature relationships.

Examples of dimensionality reduction:

Compressing a video stream by reducing the number of colors and/or pixels
Creating a digest (executive summary) of some textual material

Regularization techniques introduce a penalty (sort of a dial knob) that can programmatically decrease high variance by increasing the model’s bias (and vice versa); generally, this leads to smoother decision boundaries and simpler ML models.

Another way to decrease variance is by getting larger training sets.

Many ML algorithms offer some configuration mechanisms (called hyperparameters) to control bias and variance.

scikit-learn’s Ridge regression algorithm improves on ordinary linear regression models by introducing the alpha hyperparameter, which is a penalty on the size of the regression coefficients, e.g.:

from sklearn import linear_model
# X (feature matrix) and y (target values) are assumed to be prepared beforehand
regModel = linear_model.Ridge(alpha=0.01)  # a larger alpha means a stronger penalty
regModel.fit(X, y)

…

To learn more about regularization support in scikit-learn, visit http://scikit-learn.org/stable/modules/linear_model.html
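
To see what such a penalty does, consider a simplified case with one feature and no intercept, where the penalized least-squares coefficient has the closed form w = Σxy / (Σx² + alpha). The sketch below is not scikit-learn’s implementation; the function name and the data are made up for illustration:

```python
def ridge_coefficient(xs, ys, alpha):
    """Closed-form ridge solution for a single feature without an intercept."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

for alpha in (0.0, 1.0, 10.0):
    print(alpha, round(ridge_coefficient(xs, ys, alpha), 3))
# 0.0 2.0
# 1.0 1.867
# 10.0 1.167
```

With alpha = 0 the result is the ordinary least-squares coefficient; as alpha grows, the coefficient is pulled toward zero, trading a little bias for lower variance.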

1.29 Underfitting vs Overfitting (a Regression Model Example) Visually

1.30 ML Model Evaluation

The quality of ML models — their predictive capability — is commonly evaluated using the following metrics:
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Coefficient of determination, denoted by R²
- Confusion matrix
Whatever metric you are using, you need to assess the ability of your model to make accurate predictions (generalize) on new data, i.e. data not seen during training, where the model can simply memorize all the signals, nooks, and patterns that exist in the data
Note: ML practitioners use the term performance to mean how correct a model is in making predictions
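
The four metrics listed above are simple enough to compute by hand. The plain-Python sketch below uses made-up predictions and hypothetical function names; libraries such as scikit-learn provide their own, more complete versions:

```python
def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Regression metrics
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(mean_squared_error(y_true, y_pred))   # 0.5
print(mean_absolute_error(y_true, y_pred))  # 0.5
print(round(r_squared(y_true, y_pred), 3))  # 0.9

# Classification metric
actual = ["spam", "spam", "non-spam", "non-spam", "spam"]
predicted = ["spam", "non-spam", "non-spam", "non-spam", "spam"]
print(confusion_matrix(actual, predicted, ["spam", "non-spam"]))  # [[2, 1], [0, 2]]
```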

Notes:

There are a number of other model evaluation metrics, e.g. the ROC (Receiver Operating Characteristic) curve, that we are not discussing here.
