WA3208
Programming on Azure Databricks with PySpark, SQL, and Scala Training
This intensive hands-on training course teaches participants the relevant parts of the (Azure) Databricks cloud platform to get them up to speed quickly. It also offers a unique opportunity to work with multiple programming languages and systems, including PySpark, SQL, and Scala, so that participants can determine which language or system is best suited for each task at hand.
Course Details
Duration
3 days
Prerequisites
Practical knowledge of data processing and experience using at least one programming language.
Target Audience
- Data Engineers
- Data and Business Analysts
- Information Architects
- Technical Managers
Skills Gained
- Navigating the Azure Databricks platform: workspaces, clusters, notebooks, and the Databricks File System (DBFS)
- Programming in Databricks notebooks with PySpark, SQL, and Scala, and determining which language or system is best suited for each task
- Processing and analyzing data with Spark SQL, DataFrames, and pandas
- Performing data visualization and exploratory data analysis (EDA) with matplotlib and seaborn
Course Outline
- Azure Databricks
- Creating an Azure Databricks Workspace UI
- The Azure Databricks Service Blade
- The Databricks Dashboard
- Databricks Cluster Creation UI
- Databricks File System (DBFS)
- Databricks Integration with Data Lake
- Automation Jobs
- Databricks Developer Experience
- Development Environments
- Which Databricks-Supported Language Should I Use?
- Notebook Runtime Flavor Configuration
- The Notebook UI
- Creating Tables
- Create a New Table UI
- Creating a Table from a DBFS File
- Creating Your Table Visually with Databricks UI (The Preview Screen)
- Querying a Databricks Table using SQL (see the sketch after this topic list)
- A Data Profile Visualization Example
- Performing Exploratory Data Analysis (EDA) with Data Charts
- Spark and Databricks
- Real-time Transformations
- Databricks Machine Learning (ML)
- The Cost of Doing Business on Databricks
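As a taste of the "Creating Tables" and "Querying a Databricks Table using SQL" topics above, here is a minimal sketch, assuming a Databricks notebook where the `spark` session object is predefined; the `sales` table and its columns are hypothetical.

```python
# Minimal sketch: query a Databricks table with Spark SQL from a notebook cell.
# Assumes a Databricks notebook, where `spark` (a SparkSession) is predefined.
# The `sales` table and its `region`/`amount` columns are hypothetical.
df = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")
df.show(10)  # print the first 10 result rows to the notebook output
```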
- Introduction to Apache Spark
- What is Apache Spark?
- The Spark Platform
- Spark vs Hadoop's MapReduce (MR)
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Spark Application Architecture
- The Driver Process
- The Executor and Worker Processes
- Spark Shell
- Jupyter Notebook Shell Environment
- Spark Applications
- The spark-submit Tool (see the sketch after this topic list)
- The spark-submit Tool Configuration
- Interfaces with Data Storage Systems
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark SQL, DataFrames, and Catalyst Optimizer
- Project Tungsten
- Spark Machine Learning Library
- Spark (Structured) Streaming
- GraphX
- Extending Spark Environment with Custom Modules and Files
- Spark 3
- Spark 3 Updates at a Glance
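To make the "Spark Applications" and "The spark-submit Tool" topics concrete, here is a minimal word-count sketch, assuming a local PySpark installation; the file and path names are hypothetical.

```python
# word_count.py -- a minimal standalone Spark application (hypothetical name).
# Submit it to a cluster (or run locally) with the spark-submit tool, e.g.:
#   spark-submit --master local[4] word_count.py
from pyspark.sql import SparkSession

# Applications create their own session; shells and notebooks get one for free.
spark = SparkSession.builder.appName("word-count").getOrCreate()

lines = spark.read.text("input.txt")          # DataFrame with one "value" column
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()        # adds a "count" column
counts.orderBy("count", ascending=False).show(20)

spark.stop()
```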
- The Spark Shell
- The Spark v.2+ Command-Line Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- Jupyter Notebook Shell Environment
- Example of a Jupyter Notebook Web UI (Databricks Cloud)
- The Spark Context (sc) and Spark Session (spark)
- Creating a Spark Session Object in Spark Applications
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files (see the sketch after this topic list)
- Saving Files
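A minimal sketch of the "Loading Files" and "Saving Files" topics, assuming the PySpark shell, where `spark` (SparkSession) and `sc` (SparkContext) are predefined; the paths are hypothetical.

```python
# In the PySpark shell, `spark` and `sc` already exist; no setup is needed.
# The input and output paths below are hypothetical.
df = spark.read.csv("/data/input.csv", header=True, inferSchema=True)
df.printSchema()   # inspect the inferred column names and types

# Save the data back out as Parquet, replacing any previous run's output.
df.write.mode("overwrite").parquet("/data/output.parquet")
```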
- Introduction to Spark SQL
- What is Spark SQL?
- Uniform Data Access with Spark SQL
- Using JDBC Sources
- Hive Integration
- What is a DataFrame?
- Creating a DataFrame in PySpark
- Creating a DataFrame in PySpark (Cont'd)
- Commonly Used DataFrame Methods and Properties in PySpark
- Commonly Used DataFrame Methods and Properties in PySpark (Cont'd)
- Grouping and Aggregation in PySpark (see the sketch after this topic list)
- The "DataFrame to RDD" Bridge in PySpark
- The SQLContext Object
- Converting an RDD to a DataFrame Example
- Performance, Scalability, and Fault-tolerance of Spark SQL
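A minimal sketch combining several of the topics above: creating a DataFrame in PySpark, grouping and aggregation, querying via SQL, and the "DataFrame to RDD" bridge. The employee data is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Creating a DataFrame in PySpark from in-memory rows (hypothetical data).
df = spark.createDataFrame(
    [("Alice", "HR", 50000), ("Bob", "IT", 65000), ("Carol", "IT", 72000)],
    ["name", "dept", "salary"],
)

# Grouping and aggregation: average salary per department.
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()

# SQL over the same data: register a temporary view and query it.
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, COUNT(*) AS headcount FROM employees GROUP BY dept").show()

# The "DataFrame to RDD" bridge: the underlying RDD of Row objects.
print(df.rdd.map(lambda row: row.name).collect())
```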
- Introduction to pandas
- What is pandas?
- Conversion Between PySpark and pandas DataFrames (see the sketch after this topic list)
- The pandas API on Spark
- The pandas DataFrame Object
- The DataFrame's Value Proposition
- Creating a pandas DataFrame
- Getting DataFrame Metrics
- Accessing DataFrame Columns
- Accessing DataFrame Rows
- Accessing DataFrame Cells
- Deleting Rows and Columns
- Adding a New Column to a DataFrame
- Getting Descriptive Statistics of DataFrame Columns
- Getting Descriptive Statistics of DataFrames
- Sorting DataFrames
- Reading From CSV Files
- Writing to a CSV File
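A minimal sketch of the pandas topics above, including conversion between PySpark and pandas DataFrames; the data is hypothetical, and `toPandas()` should only be used on results small enough to fit on the driver.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-bridge").getOrCreate()

# A small pandas DataFrame built from hypothetical in-memory data.
pdf = pd.DataFrame({"city": ["Boston", "Austin", "Denver"],
                    "temp": [58, 91, 73]})
print(pdf.describe())           # descriptive statistics of the numeric columns
print(pdf.sort_values("temp"))  # sorting a pandas DataFrame

# pandas -> PySpark: distribute the in-memory frame as a Spark DataFrame.
sdf = spark.createDataFrame(pdf)
sdf.show()

# PySpark -> pandas: collect a (small!) result back to the driver.
hot = sdf.filter(sdf.temp > 60).toPandas()
print(hot)
```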
- Data Visualization with seaborn in Python
- Data Visualization
- Data Visualization in Python
- Matplotlib
- Getting Started with matplotlib
- Figures
- Saving Figures to a File
- Seaborn
- Getting Started with seaborn
- Histograms and KDE (see the sketch after this topic list)
- Plotting Bivariate Distributions
- Scatter Plots in seaborn
- Pair Plots in seaborn
- Heatmaps
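A minimal sketch of the seaborn topics above (histograms with KDE, heatmaps, and saving figures to a file), assuming seaborn 0.11 or later and using its bundled `tips` sample dataset so the example is self-contained.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# seaborn ships sample datasets; "tips" keeps the sketch self-contained
# (load_dataset fetches it over the network on first use).
tips = sns.load_dataset("tips")

# Histogram with a KDE overlay (requires seaborn >= 0.11).
sns.histplot(data=tips, x="total_bill", kde=True)
plt.savefig("total_bill_hist.png")  # saving a figure to a file
plt.clf()

# Heatmap of pairwise correlations over the numeric columns.
sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.savefig("tips_heatmap.png")
```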
- Lab Exercises
- Lab 1. Learning the Databricks Community Cloud Lab Environment
- Lab 2. Learning PySpark Shell Environment
- Lab 3. Loading Data into Databricks from Azure Blob Storage (For Review Only)
- Lab 4. Understanding Spark DataFrames
- Lab 5. Learning the PySpark DataFrame API
- Lab 6. Processing Data in PySpark using the DataFrame API (Project)
- Lab 7. Working with Pivot Tables in PySpark (Project)
- Lab 8. Data Visualization and EDA in PySpark
- Lab 9. Data Visualization and EDA in PySpark (Project)
- Lab 10. Creating a Table in Databricks
- Lab 11. SQL Notebooks in Databricks
- Lab 12. Spark Scala Introduction
- Lab 13. Scala Programming with Datasets and DataFrames