WA3020
Data Engineering Bootcamp using Python and PySpark Training
This hands-on Data Engineering Bootcamp teaches attendees the foundations of data engineering using Python, PySpark, and Spark SQL. Students learn how to build production-ready, data-driven solutions and gain an end-to-end understanding of the data engineering workflow.
Course Details
Duration
5 days
Prerequisites
- Some working experience in any programming language (students will be introduced to programming in Python).
- Basic understanding of SQL and data processing concepts, including data grouping and aggregation.
Target Audience
- Data Engineers
Skills Gained
- Understanding Data Availability and Consistency
- Performing A/B Testing Data Engineering Tasks
- Working in the Databricks Community Cloud Lab Environment
- Using Python Variables
- Working with Dates and Times
- Applying the if, for, and try Constructs
- Working with Dictionaries
- Using Sets and Tuples
- Writing Functions and Applying Functional Programming
- Understanding NumPy and pandas
- Using PySpark
Course Outline
- Big Data Concepts and Systems Overview for Data Engineers
- Gartner's Definition of Big Data
- The Big Data Confluence Diagram
- A Practical Definition of Big Data
- Challenges Posed by Big Data
- The Traditional Client-Server Processing Pattern
- Enter Distributed Computing
- Data Physics
- Data Locality (Distributed Computing Economics)
- The CAP Theorem
- Mechanisms to Guarantee a Single CAP Property
- Eventual Consistency
- NoSQL Systems CAP Triangle
- Big Data Sharding
- Sharding Example
- Apache Hadoop
- Hadoop Ecosystem Projects
- Other Hadoop Ecosystem Projects
- Hadoop Design Principles
- Hadoop's Main Components
- Hadoop Simple Definition
- Hadoop Component Diagram
- HDFS
- Storing Raw Data in HDFS and Schema-on-Demand
- MapReduce Defined
- MapReduce Shared-Nothing Architecture
- MapReduce Phases
- The Map Phase
- The Reduce Phase
- Similarity with SQL Aggregation Operations
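The Map and Reduce phases translate naturally into plain Python. Below is a minimal, single-machine sketch of the pattern (illustrative only, not Hadoop code) that shows its similarity to a SQL GROUP BY / COUNT aggregation:

```python
# A single-machine sketch of the MapReduce pattern (illustrative, not Hadoop code).
from collections import defaultdict

lines = ["big data is big", "data is king"]

# Map phase: emit a (key, value) pair -- here, (word, 1) -- for each input record
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the emitted pairs by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: aggregate each group independently (shared-nothing),
# just like SQL's SELECT word, COUNT(*) ... GROUP BY word
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'king': 1}
```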
- Defining Data Engineering
- Data is King
- Translating Data into Operational and Business Insights
- What is Data Engineering
- The Data-Related Roles
- The Data Science Skill Sets
- The Data Engineer Role
- Core Skills and Competencies
- An Example of a Data Product
- What is Data Wrangling (Munging)?
- The Data Exchange Interoperability Options
- Data Processing Phases
- Typical Data Processing Pipeline
- Data Discovery Phase
- Data Harvesting Phase
- Data Priming Phase
- Exploratory Data Analysis
- Model Planning Phase
- Model Building Phase
- Communicating the Results
- Production Roll-out
- Data Logistics and Data Governance
- Data Processing Workflow Engines
- Apache Airflow
- Data Lineage and Provenance
- Apache NiFi
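As a taste of a workflow engine, here is a minimal Apache Airflow DAG sketch wiring a data-harvesting step to a data-priming step. It assumes Airflow 2.x is installed; the task names and pipeline shape are illustrative only:

```python
# A minimal Apache Airflow 2.x DAG sketch; the pipeline shape is illustrative.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def harvest():
    print("harvesting raw data ...")

def prime():
    print("priming (cleansing) the harvested data ...")

with DAG(dag_id="example_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    harvest_task = PythonOperator(task_id="harvest", python_callable=harvest)
    prime_task = PythonOperator(task_id="prime", python_callable=prime)

    harvest_task >> prime_task  # harvest runs before prime
```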
- Python 3 Introduction
- What is Python?
- Python Documentation
- Where Can I Use Python?
- Which version of Python am I running?
- Running Python Programs
- Python Shell
- Dev Tools and REPLs
- IPython
- Jupyter
- The Anaconda Python Distribution
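A quick way to answer "which version of Python am I running?" from inside any of these shells:

```python
import sys

print(sys.version)       # full version string of the running interpreter
print(sys.version_info)  # structured form, e.g. sys.version_info.major
```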
- Python Variables and Types
- Variables and Types
- More on Variables
- Assigning Multiple Values to Multiple Variables
- More on Types
- Variable Scopes
- The Layout of Python Programs
- Comments and Triple-Delimited String Literals
- Sample Python Code
- PEP8
- Getting Help on Python Objects
- Null (None)
- Strings
- Finding Index of a Substring
- String Splitting
- Raw String Literals
- String Formatting and Interpolation
- String Public Method Names
- The Boolean Type
- Boolean Operators
- Relational Operators
- Numbers
- \"Easy Numbers\"
- Looking Up the Runtime Type of a Variable
- Divisions
- Assignment-with-Operation
- Dates and Times
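A minimal sketch touching the variable and type topics in this module (multiple assignment, runtime types, string interpolation, divisions, None, and dates):

```python
# Illustrative values only.
from datetime import datetime, timedelta

name, year = "Python", 1991           # assigning multiple values to multiple variables
print(type(name), type(year))         # looking up the runtime type of a variable
print(f"{name} appeared in {year}")   # string formatting and interpolation
print(7 / 2, 7 // 2, 7 % 2)           # true division, floor division, remainder
nothing = None                        # Python's null value
print(nothing is None)                # -> True
now = datetime.now()                  # dates and times
print(now + timedelta(days=7))        # the same time one week from now
```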
- Control Statements and Data Collections
- Control Flow with The if-elif-else Triad
- An if-elif-else Example
- Conditional Expressions (a.k.a. Ternary Operator)
- The while-break-continue Triad
- The for Loop
- The range() Function
- Examples of Using range()
- The try-except-finally Construct
- The assert Expression
- Lists
- Main List Methods
- List Comprehension
- Zipping Lists
- Enumerate
- Dictionaries
- Working with Dictionaries
- Other Dictionary Methods
- Sets
- Set Methods
- Set Operations
- Set Operations Examples
- Finding Unique Elements in a List
- Common Collection Functions and Operators
- Tuples
- Unpacking Tuples
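The control-flow and collection topics above combine naturally; a short illustrative sketch:

```python
# Illustrative values only.
nums = [3, 1, 4, 1, 5]

# The if-elif-else triad
if len(nums) > 3:
    label = "long"
elif nums:
    label = "short"
else:
    label = "empty"

squares = [n * n for n in nums]        # list comprehension
pairs = dict(zip(["a", "b"], [1, 2]))  # zipping two lists into a dictionary
unique = set(nums)                     # finding unique elements in a list
x, y = (2, 3)                          # unpacking a tuple

for i, n in enumerate(nums):           # enumerate yields (index, value) pairs
    print(i, n)

# The try-except-finally construct
try:
    nums[99]
except IndexError as e:
    print("caught:", e)
finally:
    print(label, squares, pairs, unique, x, y)
```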
- Functions and Modules
- Built-in Functions
- Functions
- The \"Call by Sharing\" Parameter Passing
- Global and Local Variable Scopes
- Default Parameters
- Named Parameters
- Dealing with Arbitrary Number of Parameters
- Keyword Function Parameters
- What is Functional Programming (FP)?
- Concept: Pure Functions
- Concept: Recursion
- Concept: Higher-Order Functions
- Lambda Functions in Python
- Examples of Using Lambdas
- Lambdas in the Sorted Function
- Python Modules
- Importing Modules
- Installing Modules
- Listing Methods in a Module
- Creating Your Own Modules
- Creating a Module's Entry Point
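A brief sketch of the function-related topics in this module (default, named, and arbitrary parameters, plus lambdas as higher-order function arguments); all names are illustrative:

```python
def greet(name, greeting="Hello"):   # default parameter
    return f"{greeting}, {name}!"

def summarize(*args, **kwargs):      # arbitrary positional and keyword parameters
    return len(args), sorted(kwargs)

print(greet("Ada"))                           # -> Hello, Ada!
print(greet(name="Ada", greeting="Hi"))       # named parameters
print(summarize(1, 2, 3, unit="m", scale=2))  # -> (3, ['scale', 'unit'])

# Lambdas with higher-order functions
words = ["pear", "fig", "banana"]
print(sorted(words, key=lambda w: len(w)))    # lambda as the key in sorted()
print(list(map(lambda w: w.upper(), words)))
```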
- File I/O and Useful Modules
- Reading Command-Line Parameters
- Hands-On Exercise (N/A in the Databricks Community Cloud)
- Working with Files
- Reading and Writing Files
- Random Numbers
- Regular Expressions
- The re Object Methods
- Using Regular Expressions Examples
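A compact sketch of the file I/O, random-number, and regular-expression topics above; the file name is illustrative:

```python
import random
import re
import sys

print(sys.argv)                    # reading command-line parameters

with open("notes.txt", "w") as f:  # writing a file
    f.write("alpha 1\nbeta 22\n")

with open("notes.txt") as f:       # reading it back
    text = f.read()

print(random.randint(1, 6))        # a random integer between 1 and 6, inclusive

# Regular expressions: capture each lowercase word followed by digits
print(re.findall(r"([a-z]+)\s+(\d+)", text))  # [('alpha', '1'), ('beta', '22')]
```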
- Practical Introduction to NumPy
- NumPy
- The First Take on NumPy Arrays
- The ndarray Data Structure
- Getting Help
- Understanding Axes
- Indexing Elements in a NumPy Array
- Understanding Types
- Re-Shaping
- Commonly Used Array Metrics
- Commonly Used Aggregate Functions
- Sorting Arrays
- Vectorization
- Vectorization Visually
- Broadcasting
- Broadcasting Visually
- Filtering
- Array Arithmetic Operations
- Reductions: Finding the Sum of Elements by Axis
- Array Slicing
- 2-D Array Slicing
- Slicing and Stepping Through
- The Linear Algebra Functions
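A minimal NumPy sketch of the topics above (shapes and axes, aggregation, vectorization, broadcasting, filtering, and slicing); the array values are illustrative:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)  # re-shape 12 elements into 3 rows x 4 columns
print(a.dtype, a.shape, a.ndim)  # commonly used array metrics

print(a.sum(axis=0))             # reduction: sum of elements down each column
print(a.mean())                  # an aggregate over the whole array

b = a + 10                       # vectorized arithmetic -- no Python loop
c = a * np.array([1, 0, 1, 0])   # broadcasting a 1-D array across each row
print(a[a > 5])                  # boolean filtering
print(a[1:, ::2])                # 2-D slicing with a step
print(b[0], c[0])
```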
- Practical Introduction to pandas
- What is pandas?
- The Series Object
- Accessing Values and Indexes in Series
- Setting Up Your Own Index
- Using the Series Index as a Lookup Key
- Can I Pack a Python Dictionary into a Series?
- The DataFrame Object
- The DataFrame's Value Proposition
- Creating a pandas DataFrame
- Getting DataFrame Metrics
- Accessing DataFrame Columns
- Accessing DataFrame Rows
- Accessing DataFrame Cells
- Using iloc
- Using loc
- Examples of Using loc
- DataFrames are Mutable via Object Reference!
- The Axes
- Deleting Rows and Columns
- Adding a New Column to a DataFrame
- Appending / Concatenating DataFrame and Series Objects
- Example of Appending / Concatenating DataFrames
- Re-indexing Series and DataFrames
- Getting Descriptive Statistics of DataFrame Columns
- Navigating Rows and Columns For Data Reduction
- Getting Descriptive Statistics of DataFrames
- Applying a Function
- Sorting DataFrames
- Reading From CSV Files
- Writing to the System Clipboard
- Writing to a CSV File
- Fine-Tuning the Column Data Types
- Changing the Type of a Column
- What May Go Wrong with Type Conversion
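A minimal pandas sketch of the Series and DataFrame topics above; the column names and CSV file are illustrative:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])  # Series with a custom index
print(s["b"])                                       # the index as a lookup key

df = pd.DataFrame({"city": ["Oslo", "Lima", "Pune"],
                   "temp": ["13", "22", "31"]})
print(df.shape, df.dtypes)           # DataFrame metrics
print(df.loc[0, "city"])             # label-based cell access with loc
print(df.iloc[-1])                   # position-based row access with iloc

df["temp"] = df["temp"].astype(int)  # fine-tuning a column's data type
df["warm"] = df["temp"] > 20         # adding a new column
print(df.describe())                 # descriptive statistics of the columns
print(df.sort_values("temp"))        # sorting the DataFrame

df.to_csv("cities.csv", index=False)  # writing to a CSV file
df2 = pd.read_csv("cities.csv")       # ...and reading it back
```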
- Data Grouping and Aggregation with pandas
- Data Aggregation and Grouping
- Sample Data Set
- The pandas.core.groupby.SeriesGroupBy Object
- Grouping by Two or More Columns
- Emulating SQL's WHERE Clause
- The Pivot Tables
- Cross-Tabulation
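The grouping and aggregation topics above, sketched on a tiny illustrative data set:

```python
import pandas as pd

df = pd.DataFrame({"dept":  ["IT", "IT", "HR", "HR"],
                   "level": ["jr", "sr", "jr", "sr"],
                   "pay":   [50, 90, 45, 80]})

print(df.groupby("dept")["pay"].mean())            # a SeriesGroupBy aggregation
print(df.groupby(["dept", "level"])["pay"].sum())  # grouping by two columns
print(df[df["pay"] > 60])                          # emulating SQL's WHERE clause
print(pd.pivot_table(df, values="pay", index="dept", columns="level"))
print(pd.crosstab(df["dept"], df["level"]))        # cross-tabulation
```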
- Repairing and Normalizing Data
- Repairing and Normalizing Data
- Dealing with the Missing Data
- Sample Data Set
- Getting Info on Null Data
- Dropping a Column
- Interpolating Missing Data in pandas
- Replacing the Missing Values with the Mean Value
- Scaling (Normalizing) the Data
- Data Preprocessing with scikit-learn
- Scaling with the scale() Function
- The MinMaxScaler Object
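A short sketch of repairing and normalizing a small illustrative data set with pandas and scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, scale

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [10.0, 20.0, np.nan]})

print(df.isnull().sum())          # getting info on null data
print(df.interpolate())           # interpolating the missing values
df_filled = df.fillna(df.mean())  # replacing NaNs with the column mean

print(scale(df_filled))                         # zero mean, unit variance
print(MinMaxScaler().fit_transform(df_filled))  # rescale each column to [0, 1]
```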
- Data Visualization in Python
- Data Visualization
- Data Visualization in Python
- Matplotlib
- Getting Started with matplotlib
- The matplotlib.pyplot.plot() Function
- The matplotlib.pyplot.bar() Function
- The matplotlib.pyplot.pie() Function
- The matplotlib.pyplot.subplot() Function
- A Subplot Example
- Figures
- Saving Figures to a File
- Seaborn
- Getting Started with seaborn
- Histograms and KDE
- Plotting Bivariate Distributions
- Scatter plots in seaborn
- Pair plots in seaborn
- Heatmaps
- A Seaborn Scatterplot with Varying Point Sizes and Hues
- ggplot
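Minimal matplotlib and seaborn sketches of the plot types above; this assumes a recent seaborn (0.11+) and uses its built-in "tips" sample data set:

```python
import matplotlib.pyplot as plt
import seaborn as sns

x = [1, 2, 3, 4]
y = [10, 20, 15, 30]

fig, (ax1, ax2) = plt.subplots(1, 2)  # two subplots in one figure
ax1.plot(x, y)                        # a line plot
ax2.bar(x, y)                         # a bar plot
fig.savefig("plots.png")              # saving the figure to a file

tips = sns.load_dataset("tips")       # a built-in seaborn sample data set
sns.histplot(tips["total_bill"], kde=True)  # histogram with a KDE curve
sns.scatterplot(data=tips, x="total_bill", y="tip",
                size="size", hue="day")     # varying point sizes and hues
plt.show()
```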
- Python as a Cloud Scripting Language
- Python's Value
- Python on AWS
- AWS SDK For Python (boto3)
- What is Serverless Computing?
- How Functions Work
- The AWS Lambda Event Handler
- What is AWS Glue?
- PySpark on Glue - Sample Script
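A boto3 and AWS Lambda sketch; it assumes valid AWS credentials are configured, and the handler below is a stub for illustration, not a deployable function:

```python
import boto3

s3 = boto3.client("s3")  # AWS SDK for Python (boto3)
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

def lambda_handler(event, context):  # the AWS Lambda event handler signature
    # 'event' carries the trigger payload; 'context' carries runtime metadata
    return {"statusCode": 200, "body": "ok"}
```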
- Introduction to Apache Spark
- What is Apache Spark
- The Spark Platform
- Spark vs Hadoop's MapReduce (MR)
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Spark Application Architecture
- The Driver Process
- The Executor and Worker Processes
- Spark Shell
- Jupyter Notebook Shell Environment
- Spark Applications
- The spark-submit Tool
- The spark-submit Tool Configuration
- Interfaces with Data Storage Systems
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark SQL, DataFrames, and Catalyst Optimizer
- Project Tungsten
- Spark Machine Learning Library
- Spark (Structured) Streaming
- GraphX
- Extending Spark Environment with Custom Modules and Files
- Spark 3
- Spark 3 Updates at a Glance
- The Spark Shell
- The Spark Shell
- The Spark v2+ Command-Line Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- Jupyter Notebook Shell Environment
- Example of a Jupyter Notebook Web UI (Databricks Cloud)
- The Spark Context (sc) and Spark Session (spark)
- Creating a Spark Session Object in Spark Applications
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files
- Saving Files
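Creating a Spark session and loading/saving files in a PySpark application, sketched below; the file paths are illustrative. In the PySpark shell, the spark and sc objects already exist, so this boilerplate is unnecessary there:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session in a stand-alone PySpark application
spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
sc = spark.sparkContext  # the companion Spark context object

df = spark.read.csv("data.csv", header=True, inferSchema=True)  # loading a file
df.write.mode("overwrite").parquet("data.parquet")              # saving a file
```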
- Spark RDDs
- The Resilient Distributed Dataset (RDD)
- Ways to Create an RDD
- Supported Data Types
- RDD Operations
- RDDs are Immutable
- Spark Actions
- RDD Transformations
- Other RDD Operations
- Chaining RDD Operations
- RDD Lineage
- The Big Picture
- What May Go Wrong
- Miscellaneous Pair RDD Operations
- RDD Caching
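A sketch of chaining RDD transformations and actions; it assumes a live SparkContext sc (for example, in the PySpark shell):

```python
rdd = sc.parallelize(["big data", "data is king"])  # one way to create an RDD

# Transformations are lazy and return new, immutable RDDs
words = rdd.flatMap(lambda line: line.split())
pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)  # a pair-RDD operation

counts.cache()           # RDD caching for reuse
print(counts.collect())  # an action triggers evaluation of the whole lineage
print(counts.count())
```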
- Parallel Data Processing with Spark
- Running Spark on a Cluster
- Data Partitioning
- Data Partitioning Diagram
- Single Local File System RDD Partitioning
- Multiple File RDD Partitioning
- Special Cases for Small-sized Files
- Parallel Data Processing of Partitions
- Spark Application, Jobs, and Tasks
- Stages and Shuffles
- The "Big Picture"
- Introduction to Spark SQL
- What is Spark SQL?
- Uniform Data Access with Spark SQL
- Using JDBC Sources
- Hive Integration
- What is a DataFrame?
- Creating a DataFrame in PySpark
- Commonly Used DataFrame Methods and Properties in PySpark
- Grouping and Aggregation in PySpark
- The "DataFrame to RDD" Bridge in PySpark
- The SQLContext Object
- Examples of Spark SQL / DataFrame (PySpark Example)
- Converting an RDD to a DataFrame Example
- Example of Reading / Writing a JSON File
- Performance, Scalability, and Fault-tolerance of Spark SQL
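A Spark SQL / DataFrame sketch covering creation, aggregation, the SQL interface, the RDD bridge, and JSON I/O; it assumes a live SparkSession spark, and the column names are illustrative:

```python
rows = [("IT", 90), ("IT", 50), ("HR", 80)]
df = spark.createDataFrame(rows, ["dept", "pay"])  # creating a DataFrame

df.printSchema()
df.groupBy("dept").avg("pay").show()  # grouping and aggregation

df.createOrReplaceTempView("staff")   # register the DataFrame for SQL queries
spark.sql("SELECT dept, MAX(pay) FROM staff GROUP BY dept").show()

rdd = df.rdd                          # the "DataFrame to RDD" bridge
print(rdd.take(2))

df.write.mode("overwrite").json("staff.json")  # writing a JSON file
df2 = spark.read.json("staff.json")            # ...and reading it back
```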
- Lab Exercises
- Lab 1. Data Availability and Consistency
- Lab 2. A/B Testing Data Engineering Tasks Project
- Lab 3. Learning the Databricks Community Cloud Lab Environment
- Lab 4. Python Variables
- Lab 5. Dates and Times
- Lab 6. The if, for, and try Constructs
- Lab 7. Understanding Lists
- Lab 8. Dictionaries
- Lab 9. Sets
- Lab 10. Tuples
- Lab 11. Functions
- Lab 12. Functional Programming
- Lab 13. File I/O
- Lab 14. Using HTTP and JSON
- Lab 15. Random Numbers
- Lab 16. Regular Expressions
- Lab 17. Understanding NumPy
- Lab 18. A NumPy Project
- Lab 19. Understanding pandas
- Lab 20. Data Grouping and Aggregation
- Lab 21. Repairing and Normalizing Data
- Lab 22. Data Visualization and EDA with pandas and seaborn
- Lab 23. Correlating Cause and Effect
- Lab 24. Learning PySpark Shell Environment
- Lab 25. Understanding Spark DataFrames
- Lab 26. Learning the PySpark DataFrame API
- Lab 27. Data Repair and Normalization in PySpark
- Lab 28. Working with Parquet File Format in PySpark and pandas