WA3419

Practical Data Science and Machine Learning with Python Training

This Data Science and Machine Learning (ML) with Python training course teaches attendees how to extract meaning from data so they can solve real-world problems, make informed decisions, and uncover hidden patterns. Through hands-on labs and practical exercises, attendees master essential Python libraries, explore powerful algorithms, and learn to apply real-world best practices to data science and ML.

Course Details

Duration

4 days

Prerequisites

Working knowledge of Python and familiarity with the NumPy and pandas libraries

Target Audience

  • Data Practitioners
  • Business Analysts
  • Software Engineers
  • IT Architects

Skills Gained

  • Understand the connection between data science and ML and their distinct roles
  • Master the key concepts and terminology of data science and machine learning
  • Build a machine learning project roadmap
  • Cleanse, mend, and prepare your data for analysis
  • Create data visualizations and perform exploratory data analysis (EDA) to reveal hidden patterns and trends in data
  • Master common data science and ML algorithms for both supervised and unsupervised learning tasks

Course Outline

  • Defining Data Science and Machine Learning
    • What is Data Science?
    • The Data Science Domain Diagram
    • The Scientific Method
    • Data Scientist vs Machine Learning Specialist
    • The Common Activities and Tasks
    • Making Predictions ...
    • A Testimony
    • “Classical” Machine Learning and Machine Learning using Neural Networks
    • Tools of the Trade
    • Data Analytics Gotchas
    • There is a Lot to Learn ...
    • Summary
  • Core Data Science and Machine Learning Concepts and Terminology
    • Doing the Labs and Hands-on Exercises
    • The Shared Concepts and Terminology
    • Features
    • Targets (Labels)
    • Feature Importance
    • Observations (Examples/Cases)
    • Continuous Features
    • Categorical Features
    • Feature Types Visually
    • Hands-on Exercises
    • The Machine Learning Process
    • What is a Machine Learning Model?
    • And Mathematically Speaking ...
    • The Desired Model Properties
    • Parametric and Non-parametric Models
    • Modeling Error Factors
    • Underfitting and Overfitting
    • One Way to Visualize Bias and Variance
    • Underfitting vs Overfitting Visualization
    • Balancing the Bias-Variance Trade-off
    • Supervised and Unsupervised ML
    • Feature Engineering
    • Feature Blending (Creating Synthetic Features)
    • Scaling of the Features
    • The z-Score Normalization Formula
    • Hands-on Exercises
    • The Linear Algebra Functions in NumPy
    • Vectors
    • Visualizing the Vector
    • Vectors Geometrically
    • Vector Notation
    • Vector Scalar Operations
    • Vector Addition and Subtraction Operations
    • Matrices
    • Matrix Operations
    • Matrix Multiplication
    • Matrix Multiplication Visually
    • Common Representations
    • Matrix Transpose Operation
    • Hands-on Exercises
    • Common Distance Metrics
    • The L2 Norm
    • Cosine Similarity and Distance
    • Hands-on Exercises
    • Dimensionality Reduction (DR)
    • Common DR Algorithms and Techniques
    • The Advantages of DR
    • Hands-on Exercises
    • The Mean Squared Error (MSE)
    • Mean Absolute Error (MAE)
    • Hands-on Exercises
    • The Error Rate
    • Confusion Matrix
    • Confusion Matrix Components
    • The Binary Classification Confusion Matrix
    • Example of a Binary Confusion Matrix
    • Example of a Multi-Class Confusion Matrix
    • Precision and Recall
    • The F1 Score
    • Hands-on Exercises
    • Measuring the Impurity
    • The Shannon Formula
    • Gini Impurity
    • Example of Using Gini Impurity Formula
    • Visualization of a Decision Tree Algorithm For Binary Classification
    • Hands-on Exercises
    • After a Job Well Done ...
    • Summary
  • Machine Learning Project Roadmap
    • The Data-Related Roles
    • The Typical Machine Learning Data Processing Pipeline
    • The Data Discovery Phase
    • The Data Harvesting Phase
    • The Data Priming Phase
    • Exploratory Data Analysis
    • Feature Importance
    • Feature Importance Metrics
    • The Model Planning Phase
    • The Typical Machine Learning Process Diagram
    • Which ML Algorithm to Choose?
    • The Model Building Process
    • Data Splitting into Training, Validation, and Test Datasets
    • Data Splitting Considerations
    • Avoiding Test Data Leakage
    • Cross-Validation Technique
    • Training Error vs Validation Error Diagram
    • The Grid Search Technique
    • Grid Search Caveats
    • Data Splitting, Cross-Validation, and Grid Search Flow Chart
    • Communicating the Results
    • Incorporate Feedback
    • The Operational Aspect
    • Data Processing Workflow Engines
    • Scikit-learn Pipelines
    • Production Roll-out
    • Summary
  • Data Repairing Methods and Techniques
    • Quality is not an Act
    • Losing a Wheel
    • Data Irregularities
    • Dealing with the Missing Data
    • Missing Data Representations
    • Getting Information About the Missing Data (NaN)
    • Dropping Rows/Columns with NaNs
    • Dropping Duplicate Rows
    • The dropna() Function
    • Examples of Using dropna()
    • Interpolating Missing Data in pandas
    • Examples of Interpolating Missing Data in pandas
    • Replacing the Missing Values with the User-defined Value
    • Examples of Using fillna()
    • Using NumPy to Mask Data
    • Example of Using a Masked Array
    • Summary
  • Summarizing and Analyzing Data with Descriptive Statistics
    • What are Descriptive Statistics?
    • Geometric Visualization of Central Tendency Measures
    • Calculating Descriptive Statistics and Summary Measures in pandas
    • Calculations Along Axes
    • Examples of Axis-Specific DataFrame Operations
    • The nlargest() and nsmallest() Methods
    • Using NumPy for Calculating Descriptive Statistics Measures
    • Finding Min and Max in NumPy
    • Boxplots
    • The Boxplot Visually
    • Correlation
    • Multicollinearity
    • Non-uniformity of a Probability Distribution
    • The Skew Measure
    • Dealing with Skewed Data
    • Summary
  • Data Visualization and EDA in Python
    • The Quote of the Day
    • Data Visualization
    • What is Exploratory Data Analysis?
    • Data Visualization in Python
    • Using the Graphics Libraries in Python
    • Matplotlib
    • Getting Started with matplotlib
    • Visualization with pandas
    • Using matplotlib.pyplot.plot() Example
    • The matplotlib.pyplot.bar() Function
    • The matplotlib.pyplot.pie() Function
    • X and Y Axis Labeling
    • Labeling Plotted Elements
    • Displaying a Part of the Graph
    • Style Themes
    • Setting the Global Runtime Configuration Parameters
    • Annotating Plots with Text
    • Showing Images
    • Figures
    • The matplotlib.pyplot.subplot() Function
    • A Subplot Example
    • The Axes Object
    • The Subplotting Idiom with Axes
    • Boxplots in Matplotlib
    • Saving Figures to a File
    • Seaborn
    • Getting Started with seaborn
    • Histograms and KDE
    • Plotting Bivariate Distributions
    • Scatter plots in seaborn
    • Pair plots in seaborn
    • Heatmaps
    • A Seaborn Scatterplot with Varying Point Sizes and Hues
    • Plotly
    • ggplot
    • Summary
  • Supervised Learning: Regression Models
    • What is Regression?
    • Linear Regression
    • The (Linear) Regression Mathematical Formulation
    • p-values in Linear Regression (LR)
    • Non-linear Regression Models
    • Polynomial Regression
    • Spline Regression
    • The Locally-Weighted Linear Regression
    • Machine Learning Algorithms for Regression
    • Fitting the Regression Model
    • Gradient Descent (GD)
    • The GD Algorithm in a Nutshell
    • Stochastic Gradient Descent (SGD)
    • Decision Trees
    • Decision Tree Regressor
    • Random Forest Regression
    • Gradient Boosting
    • XGBRegressor
    • Evaluating Regression Model Accuracy
    • The Coefficient of Determination (the R² Score)
    • The Mean Squared Error Score
    • Problems with the Input Data
    • Summary
  • Supervised Machine Learning: Classification Models
    • Doing the Labs and Hands-on Exercises
    • Classification Defined
    • Predicting that a Creature is a Duck
    • Machine Learning Classification Tasks Examples
    • How is Classification Performed?
    • Two Categories of Classification Algorithms
    • Instance-Based Learning
    • Model-Based Learning
    • Classification Algorithms
    • k-Nearest Neighbors and Radius Neighbors Algorithms
    • kNN's Characteristics
    • The k-Nearest Neighbors Algorithm Visually
    • Hands-on Exercise
    • Support-Vector Machines (SVMs)
    • SVM Limitations
    • SVM Classification Visually
    • SVM Mathematical Formulation
    • Dealing with Non-Linear Class Boundaries
    • Examples of SVM Decision Boundaries
    • Hands-on Exercise
    • Decision Trees
    • An Annotated Decision Tree
    • Decision Trees in a Nutshell
    • Decision Trees: What the Customer Really Needed ...
    • Properties of the Decision Tree Algorithm
    • Controlling the Overfitting of Decision Trees
    • The Measure-and-Divide Policy
    • Information Gain
    • Gini Impurity
    • Example of Using Gini Impurity Formula
    • Random Forests
    • Example of Random Forests Model's Rules
    • Hands-on Exercise
    • Logistic Regression
    • Logistic Regression's Problem Domain
    • Logistic Regression vs Ordinary Linear Regression
    • The Probabilistic Dimension of Logistic Regression
    • The Math Behind
    • Multi-class Logistic Regression
    • A Classification Example with Logistic Regression
    • Hands-On Exercise
    • The Naive Bayes Classifier
    • The Naive Bayes Classifier Implementations
    • Bayes' Theorem in a Nutshell
    • The Classification Algorithm
    • The Final Brick in the Wall ...
    • Classification of Documents with Naive Bayes
    • Summary
  • Unsupervised Machine Learning: Clustering
    • Doing the Labs and Hands-on Exercises
    • Unsupervised Machine Learning
    • Unsupervised Machine Learning Algorithms
    • Unsupervised Learning Type: Clustering
    • Clustering Examples
    • k-means Clustering
    • k-means Clustering in a Nutshell
    • k-means Clustering Visually
    • k-means Characteristics
    • The k-means Objective Function
    • Global Minimum vs Local Minimum Explained
    • Hands-on Exercise
    • The Silhouette Coefficient (Score)
    • Interpreting the Silhouette Score Values
    • The Silhouette Scores Visually
    • Hands-on Exercise
    • Hierarchical Clustering
    • Hierarchical Clustering Strategies
    • Common Cluster Linkage Criteria
    • An Agglomerative Clustering Example
    • How Many Clusters Are There?
    • The Mean Shift Clustering
    • Mean Shift Algorithm's Properties
    • Affinity Propagation
    • Example of Clustering using Affinity Propagation
    • Mini-Batch k-means
    • Mini-Batch k-means in scikit-learn
    • BIRCH
    • What's in a Name?
    • Trees and Nodes
    • The BIRCH Hyperparameters
    • The BIRCH Algorithm in scikit-learn
    • Hands-on Exercise
    • DBSCAN
    • DBSCAN Characteristics
    • DBSCAN Concepts
    • The DBSCAN Algorithm in a Nutshell
    • A DBSCAN Clustering Example
    • A Comparison of The Clustering Algorithms in scikit-learn
    • Summary
  • Unsupervised ML: Dimensionality Reduction
    • PCA
    • Where PCA Can Help
    • PCA and Data Variance
    • PCA Components Visually
    • PCA Variants
    • Importance of Feature Scaling in PCA Visually
    • Handling Large Datasets
    • Dimensionality Reduction in Action
    • The t-SNE Embeddings
    • t-SNE Visually
    • Summary
  • Introduction to Natural Language Processing
    • What is Natural Language Processing (NLP)?
    • Typical NLP Use Cases
    • How Do Machines Understand Text?
    • Popular NLP and Text Mining Libraries
    • Getting the Text Data (Data Collection)
    • Text Formats
    • Common Text Preprocessing Activities
    • Text Normalization
    • The Stop Words
    • Stemming
    • Lemmatization
    • The POS Tagging
    • Named-Entity Recognition
    • Text Corpus Vocabulary
    • Documents as Vectors
    • OOV Tokens
    • The Bag of Words
    • N-Grams
    • TF-IDF
    • The Feature Hashing Trick
    • Cosine Similarity and Distance
    • Limitations of BoW and TF-IDF Representation Schemes
    • Word Embedding
    • Creating Word Embeddings
    • The Word2vec Model
    • Gensim in Action (Bring in Your Own Protractor)
    • Summary
  • Introduction to Inferential Analytics (Supplementary Chapter)
    • Descriptive Statistics vs. Inferential Statistics
    • Population and Sample Visually
    • Population Parameters vs Sample Statistics
    • Examples
    • Estimating (Population) Parameters
    • The Null and Alternative Hypotheses
    • Making Hypotheses ...
    • The Classical Hypothesis-Testing Methodology
    • A Court Proceeding Example
    • Hypothesis Testing in Production Example
    • The Great Tragedy of Science ...
    • Types of Errors
    • Normal Distribution
    • The Normal Distribution and the 68-95-99.7 Empirical Rule
    • Normal Distribution in Statistics
    • The Standard Normal Distribution
    • The Normal Distribution PDF Simplification
    • The Box Plot and PDF Relationship Visually
    • A Confidence Interval
    • The Population Percentiles
    • Percentiles for Samples
    • The Z-Score
    • Getting from Z-Score Back to X
    • Finding Z for the Sampling Distribution of the Mean
    • Quiz
    • Z-Score Tables
    • Z-Score Table Variations
    • The Cumulative Z-Score Table
    • Finding a Z-Score Example
    • A Two-Tail Z-Score Example
    • A Quiz
    • Defining the Confidence Interval
    • The Level of Confidence
    • The Power of a Test
    • Putting it All Together
    • The Algorithm
    • The Calculations
    • The p-Value
    • Putting the p-Value to Test
    • T-Tests
    • T-Score
    • Chi-Square Tests
    • The Chi-Square Statistic
    • Chi-Square Test Types
    • ANOVA
  • Introduction to Python (Supplementary Chapter)
    • What is Python?
    • The Friendly Python
    • The Zen of Python (Abbreviated), by Tim Peters
    • Python Documentation
    • Where Can I Use Python?
    • Python Development Environments
    • PEP8
    • Variables
    • Types
    • Variable Names
    • More on Variables
    • More on Built-In Types
    • The Layout of Python Programs
    • Comments and Triple-Delimited String Literals
    • Sample Python Code
    • Getting Help on Python Objects
    • None
    • Strings
    • String Public Method Names
    • Useful String Methods: split()
    • String Formatting
    • The Boolean Type
    • Boolean Operators
    • Relational Operators
    • Numbers
    • "Easy Numbers"
    • Looking Up the Runtime Type of a Variable
    • Divisions
    • Assignment-with-Operation
    • Control Flow With The if-elif-else Triad
    • Conditional Expressions (a.k.a. Ternary Operator)
    • The While-Break-Continue Triad
    • The for Loop
    • The range() Function
    • Examples of Using range()
    • Exceptions and Errors
    • The try-except-finally Construct
    • Lists
    • Slicing Lists
    • Main List Methods
    • A Preamble to List Comprehension
    • List Comprehension
    • Zipping Lists
    • The enumerate() Function
    • Dictionaries
    • Creating Dictionaries
    • Getting and Setting Dictionary Values
    • Iterating Over Dictionaries
    • Sets
    • Creating Sets
    • Set Methods
    • Set Operations
    • Examples of Set Operations
    • Finding Unique Elements in a List
    • Common Collection Functions and Operators
    • Tuples
    • Unpacking Tuples
    • Built-in Functions
    • Functions
    • The "Call by Sharing" Parameter Passing
    • Global and Local Variable Scopes
    • Using Global Variables in Functions
    • Default Parameters
    • Named Parameters
    • Dealing with an Arbitrary Number of Parameters
    • Keyword Function Parameters
    • Python Modules
    • Importing Modules
    • Working with Files
    • Reading and Writing Files
    • Summary

Lab Exercises

  • Lab 1. Learning the Colab Jupyter Notebook Environment
  • Lab 2. Core Data Science and ML Concepts
  • Lab 3. Data Repairing and Priming
  • Lab 4. Obtaining Descriptive Statistics
  • Lab 5. Understanding Statistical Concepts (Optional)
  • Lab 6. Data Visualization and EDA in Python
  • Lab 7. Linear Regression Models
  • Lab 8. Regression Models
  • Lab 9. Classification with kNN
  • Lab 10. Classification with SVM and Using Grid Search
  • Lab 11. Classification with Random Forest Project (Optional)
  • Lab 12. Logistic Regression Models
  • Lab 13. Decision Tree Classifier Visualization
  • Lab 14. Assessing Feature Importance
  • Lab 15. Understanding the k-means Algorithm
  • Lab 16. Understanding the Silhouette Score
  • Lab 17. BIRCH (Optional)
  • Lab 18. Understanding PCA
  • Lab 19. Understanding t-SNE
  • Lab 20. Introduction to NLP
  • Lab 21. Using Naive Bayes Classifier for Sentiment Analysis
  • Lab 22. Getting Started with spaCy
  • Lab 23. Word Embeddings (Optional)