This tutorial is adapted from the Web Age course https://www.webagesolutions.com/courses/WA3057-data-science-and-data-engineering-for-architects.
1.1 What is Data Science?
- Data science focuses on the extraction of knowledge and business insights from data
- It does so by leveraging techniques and theories from many applied and pure science fields such as statistics, pattern recognition, machine learning, data warehousing, data visualization, scalable and high-performance computing, etc.
1.2 Data Science, Machine Learning, AI?
- Machine learning (ML) is a subset of data science that uses existing data to train ML algorithms to make predictions or take action on new (never seen before) data
- Existing (training) data can be either labeled (classified by humans) or unlabeled
- ML is also sometimes referred to as data mining or predictive analytics
- Data science includes, in addition to ML, statistics, advanced data analysis, data visualization, data engineering, etc.
- Artificial Intelligence (AI) aims at automating/augmenting/substituting complex human activities through a number of specialized computer-assisted solutions
- Some of the solutions are based on deep learning through neural networks
1.3 The Data Science Ecosystem
Source: http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png
Notes:
Another take on the Data Science Skill Sets
The Data Science Skill Sets Venn Diagram
Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
1.4 Tools of the Trade
- Stand-alone Python:
- Modules and libraries: scikit-learn, NumPy, pandas, matplotlib, seaborn
- Dev tools: Jupyter notebooks, Visual Studio Code, PyCharm
- The Apache Spark scalable platform:
- A choice of programming languages: Python (called PySpark), Scala, and Java
- Spark ML module
- Dev tools: Spark Shell, Jupyter notebooks
- R statistical programming language
- Deep Learning:
- TensorFlow with its high-level Python API called Keras; PyTorch
1.5 The Data-Related Roles
- Data-driven organizations establish the following three data-related roles which are highly interconnected:
- Data Scientist
- Someone who uses existing data to train machine learning (ML) algorithms to make predictions and/or generalize (take actions) on new (never seen before) data; practitioners in the field apply scientific experimentation techniques when working with data trying out different ML models
- Data Analyst
- Someone who uses traditional business intelligence (BI) tools to understand, describe, categorize, and report on the existing data
- Data Engineer
- Someone whose activities mostly fall under the category of ETL (Extract, Transform, and Load) processes and are carried out in support of the data needs of the above two roles
1.6 Data Scientists at Work
- Jeff Hammerbacher, who built the “Data” team at Facebook, described the work done by their data science group as follows:
- “… on any given day, a team member could author a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm for some data-intensive product or service in Hadoop, or communicate the results of our analysis to other members of the organization.”
Notes:
Data analysts and data scientists will do themselves and their organizations a big favor by learning basic data engineering skills.
As Maxime Beauchemin wrote in his article [ http://bit.ly/DATENG2019 ]:
“I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer.
I wasn’t promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we’d created for ourselves was a new discipline entirely.
My team was at forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs to traditional methods.
We were pioneers. We were data engineers!”
1.7 Examples of Data Science Projects
- Build correlation models based on user requests/searches/product reviews (or any other data collected from users) to predict users’ choices
- Engage user data in a feedback loop in which it contributes to improving the company’s products and services
- Develop a new customer segmentation model for the marketing department
- Recommendation systems (to facilitate cross-selling)
- Sentiment analysis
- Fraud detection
1.8 The Concept of a Data Product
- One of the facets of data science as a discipline is to identify the data aspect of user activities
- In some cases, a separate data product needs to be created that would help gain insight into user activities
- An early data product on the Web was the CDDB database (http://en.wikipedia.org/wiki/CDDB) built by Gracenote for the task of identifying CDs
- Problem: The audio CD format does not include metadata about the CD (the disc or track titles, performing artists, etc.)
- Solution: A digital “fingerprint” of a CD is created by performing calculations on the CD’s track lengths which then can be used as an index to the CD metadata stored in the CDDB online database
- Now, with this data product in place, a whole range of usage/business analytics can be performed using it
1.9 Applied Data Science at Google
- Google’s PageRank algorithm was among the first to rank websites in their search engine results based on the number and quality of links pointing to a page
- Google built their infrastructure around this concept
- During the Swine Flu epidemic of 2009, Google used their search data to predict flu trends around the world
- Google identified a correlation between how many people search for flu-related topics and how many people actually have flu symptoms
1.10 Data Science and ML Terminology: Features and Observations
- In data science, machine learning (ML), and statistics, features are variables (like the year a house was built, number of rooms in a house, presence of a pool, etc.) that are used in making predictions (e.g. the price of the house); they are also called predictors or independent variables
- A feature is similar to a relational table’s column (entity attribute, or property)
- Features are the inputs for an ML model
- The value that you predict using features is referred to as response, or outcome, or predicted variable, or dependent variable
- An observation is a data point, a single recorded instance of a phenomenon in a problem domain, a.k.a. a sample or an example
- An observation is like a table’s row or record
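As a minimal sketch of the terminology above, the snippet below lays out a tiny housing dataset in plain Python: each row of `X` is one observation, each column is one feature, and `y` holds the response. The feature names and all values are hypothetical and chosen only for illustration.

```python
# Hypothetical housing data: the names and numbers are illustrative assumptions.
feature_names = ["year_built", "num_rooms", "has_pool"]

# Each inner list is one observation (a table row);
# each position in it is one feature (a table column).
X = [
    [1995, 5, 0],
    [2008, 7, 1],
    [1978, 4, 0],
]

# The response (dependent variable): one sale price per observation.
y = [250_000, 410_000, 180_000]

n_observations = len(X)
n_features = len(X[0])

# Pull out a single feature (column) across all observations.
num_rooms = [row[feature_names.index("num_rooms")] for row in X]
```

Keeping the features in `X` and the response in `y` mirrors the naming used by many ML libraries, but it is only a convention.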
Notes:
For more terminology used in data science and ML, visit
https://ml-cheatsheet.readthedocs.io/en/latest/glossary.html#glossary-instance
1.11 Terminology: Labels and Ground Truth
- A label is a type/class of object that we assign to an observation
- You can have labeled and unlabeled observations (examples). Labeled observations are mostly used in classification and are referred to as the “ground truth”. Unlabeled observations occur when we have no recourse to labels and let a machine learning algorithm group those observations using some sort of data grouping/clustering mechanism in so-called unsupervised ML
- Labels are also used in linear regression models to denote the numeric values we are trying to predict (e.g. the global temperature in the year 2051)
1.12 Label Examples
- Label examples:
- Trading recommendation: Buy, Sell, Hold
- E-mail category: Spam, Non-spam
- Disease outbreak category: Outbreak, Endemic, Epidemic, Pandemic
- House sale price: A numeric value
- Labels are usually encoded with some numeric values, e.g. the trading recommendations Buy, Sell, and Hold could be encoded as 0, 1, and 2
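The numeric encoding of labels mentioned above can be sketched with a plain dictionary. The particular code assigned to each label follows the Buy/Sell/Hold example; the list of observations is a made-up assumption.

```python
# Map each class label to an integer code, following the Buy/Sell/Hold example.
label_to_code = {"Buy": 0, "Sell": 1, "Hold": 2}
code_to_label = {code: label for label, code in label_to_code.items()}

# Hypothetical labels for four observations.
labels = ["Buy", "Hold", "Buy", "Sell"]

# Encode the labels as integers, then decode them back to strings.
encoded = [label_to_code[label] for label in labels]
decoded = [code_to_label[code] for code in encoded]
```

The integer codes here are arbitrary identifiers: the fact that 2 is greater than 0 does not mean that Hold "outranks" Buy.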
1.13 Terminology: Continuous and Categorical Features
- Features can be of two types:
- Continuous: something that can be physically or theoretically measured in numeric values, e.g. blood pressure, size of a black hole, plane speed, humidity, etc.
- Categorical: discrete, enumerated values like hurricane category, day of the week, car type, etc.; this feature type, in turn, is divided into nominal and ordinal features:
- Nominal categories have no ordering, e.g. card suits: hearts, diamonds, spades, and clubs (ordering may be card game-specific)
- Ordinal categories imply some sort of ordering, e.g. the ranks in each suit of playing cards: Ace, 2, 3, 4, …, J, Q, K
Notes:
Feature types visually
1.14 Encoding Categorical Features using One-Hot Encoding Scheme
- When dealing with categorical features (nominal or ordinal), e.g. trading recommendations made: buy, sell, or hold, you need to have a way to encode the input examples for further processing
- The common encoding technique is the “One-Hot” scheme, which works as follows (see the next slide for an example):
- Introduce as many variables/features as there are distinct values in the categorical feature; the variables are usually named after the <feature name>_<categorical value>, e.g. trade_buy, trade_sell, and trade_hold
- Initialize the variables using the one-hot encoding scheme:
- Assign 1 to the variable if the observation has the matching category name; 0 otherwise
- In the one-hot transformation, you, essentially, end up with as many new features as there are levels in that categorical variable
1.15 Example of ‘One-Hot’ Encoding Scheme
- In our go-to trading actions example, the one-hot encoding will create these three new variables/features (that may be named trade_buy, trade_sell, and trade_hold) holding either 0 or 1:
    trade_buy  trade_sell  trade_hold
    1          0           0           # the mapping of 'buy'
    0          0           1           # the mapping of 'hold'
    1          0           0           # the mapping of 'buy' again
    0          1           0           # the mapping of 'sell'
- As you can see, the new three-feature set forms a sparse matrix
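The one-hot scheme for the trading actions example can be sketched in a few lines of plain Python. The helper `one_hot` and its parameters are hypothetical names introduced for this illustration, not part of any library.

```python
def one_hot(values, feature_name, categories=None):
    """Return (column_names, rows) for a one-hot encoding of `values`."""
    if categories is None:
        # Default: categories in order of first appearance (deterministic).
        categories = list(dict.fromkeys(values))
    columns = [f"{feature_name}_{category}" for category in categories]
    # One row per observation: 1 in the matching category's column, 0 elsewhere.
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return columns, rows


trades = ["buy", "hold", "buy", "sell"]

# Fix the column order explicitly to match the trading example.
columns, rows = one_hot(trades, "trade", categories=["buy", "sell", "hold"])

for row in rows:
    print(row)
```

In practice you would more likely reach for an existing implementation such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder`; the hand-written version is only meant to show the mechanics.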
1.16 Gartner’s Magic Quadrant for Data Science and Machine Learning Platforms (a Labeling Example)
Notes:
Gartner Research developed a process for assessing the ability of companies in a specific industry to innovate and deliver value to their customers. In addition to labeling companies according to their ability to execute and the completeness of their strategy and vision using four categories (Niche Players, Visionaries, Challengers, and Leaders), Gartner adds the spatial positioning of individual companies within its Magic Quadrant.
1.17 Machine Learning in a Nutshell
- At the core of many ML algorithms lies the concept of distance between observations (data points) that helps measure the degree of proximity/affinity/similarity between them
- During the model training phase, ML algorithms infer an object grouping/classification decision boundary (a function) that performs object discrimination using distances between observations
- Regression ML models predict a dependent (response) variable based on the historical/known values of at least one independent (explanatory/predictor) variable
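To make the role of distance concrete, here is a minimal sketch of a one-nearest-neighbor classifier: a new observation receives the label of the closest training observation. The data points and class names are hypothetical, and `math.dist` assumes Python 3.8 or later.

```python
import math


def nearest_neighbor_label(query, observations, labels):
    """Return the label of the training observation closest to `query`."""
    # Euclidean distance from the query to every training observation.
    distances = [math.dist(query, obs) for obs in observations]
    closest_index = distances.index(min(distances))
    return labels[closest_index]


# Hypothetical two-feature training data with two classes.
observations = [(1.0, 1.0), (2.0, 1.0), (6.0, 5.0), (7.0, 7.0)]
labels = ["circle", "circle", "triangle", "triangle"]

prediction_a = nearest_neighbor_label((1.5, 2.0), observations, labels)
prediction_b = nearest_neighbor_label((6.0, 6.0), observations, labels)
print(prediction_a, prediction_b)
```

This is the simplest possible distance-based model; it stores the training data instead of learning an explicit decision function.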
1.18 Common Distance Metrics
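- For continuous numeric variables, the Minkowski distance is used, which has this generic form: $D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$, where x and y are two n-dimensional observations and p ≥ 1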
- The Minkowski distance has three special cases:
- For p=1, the distance is known as the Manhattan distance (a.k.a. the L1 norm)
- For p=2, the distance is known as the Euclidean distance (a.k.a. the L2 norm)
- When p → +infinity, the distance is known as the Chebyshev distance
- For categorical or binary-encoded data (as in some text classification scenarios), a commonly used distance metric is the Hamming distance
Notes:
Calculating an L2 norm of a two-feature vector defined as [3, 4] — essentially, a hypotenuse of a right-angled triangle with sides 3 and 4 (the Pythagorean theorem) — using the NumPy API (two options):
The “hard” way:
    import numpy as np

    x = np.array([3, 4])
    np.sqrt(np.sum(x * x))  # 5.0
The “easy” way:
    from numpy.linalg import norm

    norm(x, ord=2)  # 5.0
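The Minkowski distance and its special cases from this section can also be sketched without NumPy. The function names below are hypothetical, and the two points reuse the 3-4-5 triangle from the note above.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance between equal-length sequences x and y, for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def chebyshev_distance(x, y):
    """Limit of the Minkowski distance as p grows without bound."""
    return max(abs(a - b) for a, b in zip(x, y))


origin, point = [0, 0], [3, 4]

manhattan = minkowski_distance(origin, point, p=1)        # |3| + |4|
euclidean = minkowski_distance(origin, point, p=2)        # sqrt(3**2 + 4**2)
chebyshev = chebyshev_distance(origin, point)             # max(|3|, |4|)
near_chebyshev = minkowski_distance(origin, point, p=50)  # close to the limit

print(manhattan, euclidean, chebyshev, near_chebyshev)
```

Raising p puts more and more weight on the largest coordinate difference, which is why a large p already lands close to the Chebyshev distance.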
1.19 The Euclidean Distance
- The most commonly used distance in ML for continuous numeric variables is the Euclidean distance
- In Cartesian coordinates, if we have two data points in Euclidean n-space, p and q, the distance from p to q (or from q to p) is given by the Pythagorean formula: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
1.20 Decision Boundary Examples (Object Classification)
We have a two-feature (plotted along the X and Y coordinates) dataset of objects of two classes (depicted as red circles and golden triangles).
Adapted from https://www.semanticscholar.org/
1.21 What is a Model?
- In data science and ML, a model is a formula, an algorithm, or a prediction function that establishes a relationship between
- features (predictors)
- that act as the model’s input and
- labels (the output/predicted variable)
- that act as the model’s output
- A model is trained to predict (make an inference of) the labels or predict continuous values
1.22 Training a Model to Make Predictions
- There are two major life-cycle phases of an ML model:
- Model training (fitting)
- You train or let your model learn on labeled observations (examples) fed into the model
- During training, the model seeks a set of weights — variable coefficients — (and biases, if applicable) that minimize loss
- Loss is a quantitative measure of error, e.g. the mean squared error
- Inference (predicting)
- Here you use your trained model to calculate/predict the labels of unlabeled observations (examples) or numeric values
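The two phases can be illustrated with a minimal sketch that fits a straight line by gradient descent on the mean squared error and then uses it for inference. Gradient descent is one possible training procedure chosen here for illustration; the data, learning rate, and number of epochs are assumptions.

```python
def mean_squared_error(xs, ys, weight, bias):
    """Loss: the average squared difference between predictions and labels."""
    return sum((weight * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Training: adjust the weight and bias step by step to reduce the MSE."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [weight * x + bias - y for x, y in zip(xs, ys)]
        # Gradients of the MSE with respect to the weight and the bias.
        grad_weight = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_bias = 2 * sum(errors) / n
        weight -= learning_rate * grad_weight
        bias -= learning_rate * grad_bias
    return weight, bias


# Hypothetical labeled observations generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

weight, bias = fit_line(xs, ys)      # model training (fitting)
prediction = weight * 10.0 + bias    # inference on a new observation, x = 10
final_loss = mean_squared_error(xs, ys, weight, bias)

print(round(weight, 3), round(bias, 3), round(prediction, 3))
```

Libraries such as scikit-learn hide this loop behind `fit` and `predict` methods, and may use a different solver altogether.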
1.23 Types of Machine Learning
- There are three main types of machine learning (ML):
- unsupervised learning
- supervised learning, and
- reinforcement learning
- In this course, we will be dealing only with the first two types: unsupervised and supervised learning
- FYI: The goal of reinforcement learning is to instruct computer-based algorithms to select actions that maximize a domain-specific gain or minimize a cost (which, essentially, emulates the way humans learn)
1.24 Supervised vs Unsupervised Machine Learning
Supervised learning (SL):
- SL defines a target variable that needs to be predicted/estimated by applying an SL algorithm using predictor (independent) variables (features)
- SL algorithms are built on top of mathematical formulas with predictive capacity
- SL uses labeled examples
- Classification and regression are examples of SL algorithms
Unsupervised learning (UL):
- UL is the opposite of SL: UL does not have the concept of a target value that needs to be found or estimated; rather, a UL algorithm, for example, can deal with the task of grouping (forming a cluster of) similar items together based on some automatically defined or discovered criteria of data elements’ affinity (automatic classification technique)
- UL uses unlabeled examples
- In essence, UL attempts to extract patterns without much human intervention
Notes:
Some classification systems are referred to as expert systems that are created in order to let computers take much of the technical drudgery out of data processing leaving humans with the authority, in most cases, to make the final decision.
1.25 Supervised Machine Learning Algorithms
- Some of the more popular supervised ML algorithms are:
- Decision Trees/Random Forest
- k-Nearest Neighbors (kNN)
- Naive Bayes
- Regression (simple linear, multiple, locally weighted, etc.)
- Support Vector Machines (SVMs)
- Logistic Regression
1.26 Unsupervised Machine Learning Algorithms
- Some of the more popular unsupervised ML algorithms are:
- k-Means
- Hierarchical clustering
- Gaussian mixture models
- Dimensionality reduction falls into the realm of unsupervised learning:
- PCA, Isomap, t-SNE (2-D visualizations of high-dimensional datasets)
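As an illustration of how a clustering algorithm groups unlabeled data, below is a simplified one-dimensional sketch of k-Means. The data and the hand-picked initial centroids are assumptions made to keep the run deterministic; real implementations usually pick the initial centroids randomly or with a scheme such as k-means++.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(point - centroids[i]))
            clusters[nearest].append(point)
        # Update step: recompute each centroid (keep it if its cluster is empty).
        centroids = [sum(cluster) / len(cluster) if cluster else centroids[i]
                     for i, cluster in enumerate(clusters)]
    return centroids, clusters


# Hypothetical unlabeled observations with two visible groups.
points = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(points, centroids=[1.0, 12.0])

print(centroids)
print(clusters)
```

No labels are supplied at any point: the two groups emerge purely from the distances between the observations.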
1.27 Which ML Algorithm to Choose?
Notes:
The rules below may help you find your direction, but they are not set in stone.
If you are trying to find the probability of an event or predict a value based on existing historical observations, look at the supervised learning (SL) algorithms. Otherwise, refer to unsupervised learning (UL).
If you are dealing with discrete (categorical) values like TRUE:FALSE, bad:good:excellent, buy:hold:sell, etc., you need to go with the classification algorithms of SL.
If you are dealing with continuous numerical values, you need to go with the regression algorithms of SL.
If you want to let the machine categorize data into a number of groups, you need to go with the clustering algorithms of UL.
1.28 Bias-Variance (Underfitting vs Overfitting) Trade-off
- Underfitting is a property of a model that makes it less accurate by virtue of being too generic, or biased
- Such a model is too simple: it fails to account for some important regularities in the training data, and it has low variance in its predictions
- Overfitting is the opposite of underfitting: it makes your model too sensitive to noise/variance in your training data
- Usually, this property is exhibited by more complex models that try to describe your training data as closely as possible
- A good model strikes a balance between bias and overreaction to variance (the bias-variance trade-off)
- The bias-variance trade-off applies to classification and regression models (supervised learning)
Notes:
Balancing Off the Bias-Variance Ratio
The common techniques to balance off the bias-variance ratio are
- Dimensionality reduction, feature selection, and regularization
Dimensionality reduction is the process of transforming the original feature set into another one with fewer features: features may be dropped or combined using some inter-feature relationships.
Examples of dimensionality reduction:
- Compressing a video stream by reducing the number of colors and/or pixels
- Creating a digest (executive summary) of some textual material
Regularization techniques introduce a penalty (a sort of dial) that can programmatically decrease high variance by increasing the model’s bias (and vice versa); generally, this leads to smoother decision boundaries and simpler ML models.
Another way to decrease variance is by getting larger training sets.
Many ML algorithms offer some configuration mechanisms (called hyperparameters) to control bias and variance.
scikit-learn’s Ridge regression algorithm improves on ordinary linear regression models by introducing the alpha hyperparameter, which is a penalty on the size of the regression coefficients, e.g.:
    from sklearn import linear_model

    # X holds the training features and y the training labels
    regModel = linear_model.Ridge(alpha=0.01)
    regModel.fit(X, y)
…
To learn more about regularization support in scikit-learn, visit http://scikit-learn.org/stable/modules/linear_model.html
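To see how a penalty on the coefficient size trades variance for bias, here is a minimal sketch of ridge regression with a single feature and no intercept, where the minimizer has a simple closed form. This simplified setting is an assumption made for illustration (scikit-learn's Ridge also fits an intercept by default), and the function name and data are hypothetical.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w that minimizes sum((y - w * x)**2) + alpha * w**2 (no intercept)."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + alpha)


# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

slope_no_penalty = ridge_slope(xs, ys, alpha=0.0)   # ordinary least-squares slope
slope_mild = ridge_slope(xs, ys, alpha=1.0)         # slightly shrunk toward zero
slope_strong = ridge_slope(xs, ys, alpha=14.0)      # strongly shrunk toward zero

print(slope_no_penalty, slope_mild, slope_strong)
```

A larger alpha pulls the coefficient further toward zero: the fit to the training data gets worse (more bias), but the model reacts less to noise in any particular training set (less variance).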
1.29 Underfitting vs Overfitting (a Regression Model Example) Visually
1.30 ML Model Evaluation
Notes:
There are a number of other model evaluation metrics, e.g. the ROC (Receiver Operating Characteristic) curve, that we are not discussing here.
1.31 Mean Squared Error (MSE) and Mean Absolute Error (MAE)
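As a minimal sketch, both metrics can be computed directly from their standard definitions: MSE averages the squared prediction errors, while MAE averages their absolute values. The true and predicted values below are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true values and model predictions (errors: 1, 0, and -2).
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

mse = mean_squared_error(y_true, y_pred)    # (1 + 0 + 4) / 3
mae = mean_absolute_error(y_true, y_pred)   # (1 + 0 + 2) / 3

print(mse, mae)
```

Because the errors are squared, MSE penalizes large errors more heavily than MAE does. The `sklearn.metrics` module in scikit-learn provides functions with the same names.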