WA3057
Data Science and Data Engineering for Architects Training
This intensive Data Science and Data Engineering training course teaches architects the theoretical and practical aspects of applying Data Science and Data Engineering methods in real-world scenarios. Students master the relevant concepts, terminology, theory, and tools used in the field, and gain hands-on experience through practical labs that solidify data science and data engineering skills.
Course Details
Duration
4 days
Skills Gained
- Leverage Python for data science tasks and explore key libraries like pandas and NumPy
- Master data visualization techniques using Matplotlib and Seaborn to create impactful insights
- Grasp core data science concepts like machine learning and its role in problem-solving
- Build and implement supervised machine learning algorithms with scikit-learn
- Clean and manipulate data using pandas for effective analysis
- Apply unsupervised learning techniques for data exploration and segmentation
- Understand common challenges faced in data science projects and develop strategies to overcome them
- Communicate data insights effectively through compelling visualizations
Prerequisites
- Working knowledge of Python, or a programming background sufficient to pick up Python's syntax quickly
- Familiarity with core statistical concepts (variance, correlation, etc.)
Target Audience
- IT Architects
- Technical Managers
Course Outline
- Python for Data Science
- Python Data Science-Centric Libraries
- SciPy
- NumPy
- pandas
- Scikit-learn
- Matplotlib
- Seaborn
- Python Dev Tools and REPLs
- IPython
- Jupyter Notebooks
- Anaconda
- Data Visualization in Python
- Why Do I Need Data Visualization?
- Data Visualization in Python
- Getting Started with matplotlib
- A Basic Plot
- Scatter Plots
- Figures
- Saving Figures to a File
- Seaborn
- Getting Started with seaborn
- Histograms and KDE
- Plotting Bivariate Distributions
- Scatter Plots in seaborn
- Pair plots in seaborn
- Heatmaps
- A Seaborn Scatterplot with Varying Point Sizes and Hues
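As a preview of the visualization topics above, here is a minimal Matplotlib sketch (illustrative data only, not part of the course materials) showing a basic plot, a scatter plot, and saving a figure to a file; Seaborn builds on these same Matplotlib primitives.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")                  # a basic line plot
ax.scatter(x[::10], np.sin(x[::10]), label="samples")  # a scatter plot on the same axes
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()
fig.savefig("basic_plot.png")  # saving the figure to a file
saved = Path("basic_plot.png").exists()
```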
- Introduction to NumPy
- What is NumPy?
- The First Take on NumPy Arrays
- The ndarray Data Structure
- Understanding Axes
- Indexing Elements in a NumPy Array
- Re-Shaping
- Commonly Used Array Metrics
- Commonly Used Aggregate Functions
- Sorting Arrays
- Vectorization
- Vectorization Visually
- Broadcasting
- Broadcasting Visually
- Filtering
- Array Arithmetic Operations
- Reductions: Finding the Sum of Elements by Axis
- Array Slicing
- 2-D Array Slicing
- The Linear Algebra Functions
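The NumPy topics above (axes, re-shaping, broadcasting, filtering, reductions, and slicing) can be previewed in a few lines; the array values below are illustrative only.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)        # re-shaping a 1-D array into a 3x4 ndarray
col_sums = a.sum(axis=0)               # reduction along axis 0 (down the rows)
row_max = a.max(axis=1)                # aggregate function applied per row
b = a + np.array([10, 20, 30, 40])     # broadcasting a 1-D array across all rows
evens = a[a % 2 == 0]                  # boolean filtering
block = a[1:, :2]                      # 2-D slicing: rows 1 onward, first two columns
```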
- Introduction to pandas
- What is pandas?
- The DataFrame Object
- The DataFrame's Value Proposition
- Creating a pandas DataFrame
- Getting DataFrame Metrics
- Accessing DataFrame Columns
- Accessing DataFrame Rows
- Accessing DataFrame Cells
- Deleting Rows and Columns
- Adding a New Column to a DataFrame
- Getting Descriptive Statistics of DataFrame Columns
- Getting Descriptive Statistics of DataFrames
- Sorting DataFrames
- Reading From CSV Files
- Writing to a CSV File
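The pandas DataFrame operations listed above can be sketched as follows (the column names and values are hypothetical, chosen only to illustrate the API):

```python
import pandas as pd

# creating a pandas DataFrame from a dict of columns
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [90, 70, 85]})

df["passed"] = df["score"] >= 80          # adding a new (derived) column
mean_score = df["score"].mean()           # a descriptive statistic of a column
cell = df.loc[1, "name"]                  # accessing a single cell by row label and column name
top = df.sort_values("score", ascending=False)  # sorting the DataFrame
```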
- Repairing and Normalizing Data
- Repairing and Normalizing Data
- Dealing with Missing Data
- Sample Data Set
- Getting Info on Null Data
- Dropping a Column
- Interpolating Missing Data in pandas
- Replacing the Missing Values with the Mean Value
- Scaling (Normalizing) the Data
- Data Preprocessing with scikit-learn
- Scaling with the scale() Function
- The MinMaxScaler Object
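A minimal sketch of the repair-and-normalize steps above, assuming scikit-learn's `MinMaxScaler` and an illustrative one-column data set: count the nulls, replace missing values with the column mean, then scale to the [0, 1] range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0]})  # sample data with one missing value

n_null = df["x"].isnull().sum()                # getting info on null data
df["x"] = df["x"].fillna(df["x"].mean())       # replacing missing values with the mean
df["x_scaled"] = MinMaxScaler().fit_transform(df[["x"]])  # scaling to [0, 1]
```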
- Defining Data Science
- What is Data Science?
- Data Science, Machine Learning, AI?
- The Data Science Ecosystem
- Tools of the Trade
- The Data-Related Roles
- Data Scientists at Work
- Examples of Data Science Projects
- The Concept of a Data Product
- Applied Data Science at Google
- Data Science and ML Terminology: Features and Observations
- Terminology: Labels and Ground Truth
- Label Examples
- Terminology: Continuous and Categorical Features
- Encoding Categorical Features using One-Hot Encoding Scheme
- Example of 'One-Hot' Encoding Scheme
- Gartner's Magic Quadrant for Data Science and Machine Learning Platforms (a Labeling Example)
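The one-hot encoding scheme mentioned above can be demonstrated with pandas' `get_dummies` (the `color` column and its values are hypothetical): each category becomes its own 0/1 indicator column.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red"]})  # a categorical feature

# one-hot encoding: one indicator column per category
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```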
- Machine Learning in a Nutshell
- Common Distance Metrics
- The Euclidean Distance
- Decision Boundary Examples (Object Classification)
- What is a Model?
- Training a Model to Make Predictions
- Types of Machine Learning
- Supervised vs Unsupervised Machine Learning
- Supervised Machine Learning Algorithms
- Unsupervised Machine Learning Algorithms
- Which ML Algorithm to Choose?
- Bias-Variance (Underfitting vs Overfitting) Trade-off
- Underfitting vs Overfitting (a Regression Model Example) Visually
- ML Model Evaluation
- Mean Squared Error (MSE) and Mean Absolute Error (MAE)
- Coefficient of Determination
- Confusion Matrix
- The Binary Classification Confusion Matrix
- The Typical Machine Learning Process
- A Better Algorithm or More Data?
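The evaluation metrics above (Euclidean distance, MSE, MAE, and the binary confusion matrix) can be previewed with scikit-learn and NumPy; the `y_true`/`y_pred` values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, confusion_matrix

# Euclidean distance between two points
dist = np.linalg.norm(np.array([3.0, 4.0]) - np.array([0.0, 0.0]))

# regression errors: MSE averages squared residuals, MAE averages absolute ones
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4) / 3
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2) / 3

# binary classification confusion matrix: rows = actual, columns = predicted
cm = confusion_matrix([1, 0, 1, 1], [1, 0, 0, 1])
```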
- The Typical Data Processing Pipeline in Data Science
- Data Discovery Phase
- Data Harvesting Phase
- Data Cleaning/Priming/Enhancing Phase
- Exploratory Data Analysis and Feature Selection
- ML Model Planning Phase
- Feature Engineering
- ML Model Building Phase
- Capacity Planning and Resource Provisioning
- Communicating the Results
- Production Roll-out
- Data Science Gotchas
- Overview of the scikit-learn Library
- The scikit-learn Library
- The Navigational Map of ML Algorithms Supported by scikit-learn
- Developer Support
- scikit-learn Estimators, Models, and Predictors
- Annotated Example of the LinearRegression Estimator
- Annotated Example of the Support Vector Classification Estimator
- Data Splitting into Training and Test Datasets
- Data Splitting in scikit-learn
- Cross-Validation Technique
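A minimal sketch of scikit-learn's estimator API, data splitting, and cross-validation as listed above, using a synthetic, perfectly linear data set (values chosen only so the fit is easy to check):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 1  # synthetic target: y = 3x + 1

# splitting data into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# the estimator API: fit on training data, predict on unseen data
model = LinearRegression().fit(X_train, y_train)
preds = model.predict(X_test)

# 5-fold cross-validation on the full data set
scores = cross_val_score(LinearRegression(), X, y, cv=5)
```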
- Classification Algorithms (Supervised Machine Learning)
- Classification (Supervised ML) Use Cases
- Classifying with k-Nearest Neighbors
- k-Nearest Neighbors Algorithm Visually
- Decision Trees
- Decision Tree Terminology
- Decision Tree Classification in the Context of Information Theory
- Using Decision Trees
- Properties of the Decision Tree Algorithm
- The Simplified Decision Tree Algorithm
- Random Forest
- Properties of the Random Forest Algorithm
- Support Vector Machines (SVMs)
- SVM Classification Visually
- Properties of SVMs
- Dealing with Non-Linear Class Boundaries
- Logistic Regression (Logit)
- The Sigmoid Function
- Logistic Regression Classification Example
- Logistic Regression's Problem Domain
- Naive Bayes Classifier (Supervised Learning)
- Naive Bayesian Probabilistic Model in a Nutshell
- Bayes Formula
- Document Classification with Naive Bayes
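Two of the classifiers covered above, k-Nearest Neighbors and Logistic Regression, can be compared on a tiny hypothetical 1-D data set (two well-separated classes) to show that scikit-learn exposes them through the same fit/predict interface:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# toy 1-D data: class 0 clusters near 0, class 1 near 10 (illustrative values)
X = [[0.0], [1.0], [2.0], [9.0], [10.0], [11.0]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)   # vote among the 3 nearest points
logit = LogisticRegression().fit(X, y)                # sigmoid-based decision boundary

knn_pred = knn.predict([[1.5], [9.5]])
logit_pred = logit.predict([[1.5], [9.5]])
```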
- Unsupervised Machine Learning Algorithms
- PCA
- PCA and Data Variance
- PCA Properties
- Importance of Feature Scaling Visually
- Unsupervised Learning Type: Clustering
- Clustering vs Classification
- Clustering Examples
- k-means Clustering
- k-means Clustering in a Nutshell
- k-means Characteristics
- Global vs Local Minimum Explained
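The unsupervised techniques above, k-means clustering and PCA, can be previewed on two synthetic 2-D blobs (the blob locations and spread are arbitrary, chosen so the clusters are clearly separable):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# two well-separated blobs in 2-D
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
X = np.vstack([blob_a, blob_b])

# k-means: partition the points into 2 clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# PCA: project onto the single direction of maximum variance
pca = PCA(n_components=1)
X1 = pca.fit_transform(X)
```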
- Lab Exercises
- Lab 1. Learning the Colab Jupyter Notebook Environment
- Lab 2. Data Visualization in Python
- Lab 3. Understanding NumPy
- Lab 4. Data Repairing
- Lab 5. Understanding Common Metrics
- Lab 6. Coding the kNN Algorithm in NumPy (Optional)
- Lab 7. Understanding Machine Learning Datasets in scikit-learn
- Lab 8. Building Linear Regression Models
- Lab 9. Spam Detection with Random Forest
- Lab 10. Spam Detection with Support Vector Machines
- Lab 11. Spam Detection with Logistic Regression
- Lab 12. Comparing Classification Algorithms
- Lab 13. Feature Engineering and EDA
- Lab 14. Understanding PCA