WA3208

Programming on Azure Databricks with PySpark, SQL, and Scala Training

This intensive hands-on training course teaches participants the relevant parts of the Azure Databricks cloud platform to get them up to speed quickly. It also offers a unique opportunity to work with multiple programming languages and systems, including PySpark, SQL, and Scala, so that participants can determine which language or system is best suited to each task at hand.

Course Details

Duration

3 days

Prerequisites

Practical knowledge of data processing and experience using at least one programming language.

Target Audience

  • Data Engineers
  • Data and Business Analysts
  • Information Architects
  • Technical Managers

Skills Gained

Participants learn the relevant parts of the Azure Databricks cloud platform and gain hands-on experience with PySpark, SQL, and Scala, enabling them to determine which language or system is best suited to each task at hand.

Course Outline
  • Azure Databricks
    • Azure Databricks
    • Creating an Azure Databricks Workspace UI
    • The Azure Databricks Service Blade
    • The Databricks Dashboard
    • Databricks Cluster Creation UI
    • Databricks File System (DBFS)
    • Databricks Integration with Data Lake
    • Automation Jobs
    • Databricks Developer Experience
    • Development Environments
    • Which Databricks-Supported Language Should I Use?
    • Notebook Runtime Flavor Configuration
    • The Notebook UI
    • Creating Tables
    • Create a New Table UI
    • Creating a Table from a DBFS File
    • Creating Your Table Visually with Databricks UI (The Preview Screen)
    • Querying a Databricks Table using SQL
    • A Data Profile Visualization Example
    • Performing Exploratory Data Analysis (EDA) with Data Charts
    • Spark and Databricks
    • Real-time Transformations
    • Databricks Machine Learning (ML)
    • The Cost of Doing Business on Databricks
  • Introduction to Apache Spark
    • What is Apache Spark?
    • The Spark Platform
    • Spark vs Hadoop's MapReduce (MR)
    • Common Spark Use Cases
    • Languages Supported by Spark
    • Running Spark on a Cluster
    • The Spark Application Architecture
    • The Driver Process
    • The Executor and Worker Processes
    • Spark Shell
    • Jupyter Notebook Shell Environment
    • Spark Applications
    • The spark-submit Tool
    • The spark-submit Tool Configuration
    • Interfaces with Data Storage Systems
    • The Resilient Distributed Dataset (RDD)
    • Datasets and DataFrames
    • Spark SQL, DataFrames, and Catalyst Optimizer
    • Project Tungsten
    • Spark Machine Learning Library
    • Spark (Structured) Streaming
    • GraphX
    • Extending Spark Environment with Custom Modules and Files
    • Spark 3
    • Spark 3 Updates at a Glance
  • The Spark Shell
    • The Spark Shell
    • The Spark v.2+ Command-Line Shells
    • The Spark Shell UI
    • Spark Shell Options
    • Getting Help
    • Jupyter Notebook Shell Environment
    • Example of a Jupyter Notebook Web UI (Databricks Cloud)
    • The Spark Context (sc) and Spark Session (spark)
    • Creating a Spark Session Object in Spark Applications
    • The Shell Spark Context Object (sc)
    • The Shell Spark Session Object (spark)
    • Loading Files
    • Saving Files
  • Introduction to Spark SQL
    • What is Spark SQL?
    • Uniform Data Access with Spark SQL
    • Using JDBC Sources
    • Hive Integration
    • What is a DataFrame?
    • Creating a DataFrame in PySpark
    • Commonly Used DataFrame Methods and Properties in PySpark
    • Grouping and Aggregation in PySpark
    • The "DataFrame to RDD" Bridge in PySpark
    • The SQLContext Object
    • Converting an RDD to a DataFrame Example
    • Performance, Scalability, and Fault-tolerance of Spark SQL
  • Introduction to pandas
    • What is pandas?
    • Conversion Between PySpark and pandas DataFrames
    • Pandas API on Spark
    • The pandas DataFrame Object
    • The DataFrame's Value Proposition
    • Creating a pandas DataFrame
    • Getting DataFrame Metrics
    • Accessing DataFrame Columns
    • Accessing DataFrame Rows
    • Accessing DataFrame Cells
    • Deleting Rows and Columns
    • Adding a New Column to a DataFrame
    • Getting Descriptive Statistics of DataFrame Columns
    • Getting Descriptive Statistics of DataFrames
    • Sorting DataFrames
    • Reading From CSV Files
    • Writing to a CSV File
  • Data Visualization with seaborn in Python
    • Data Visualization
    • Data Visualization in Python
    • Matplotlib
    • Getting Started with matplotlib
    • Figures
    • Saving Figures to a File
    • Seaborn
    • Getting Started with seaborn
    • Histograms and KDE
    • Plotting Bivariate Distributions
    • Scatter plots in seaborn
    • Pair plots in seaborn
    • Heatmaps
  • Lab Exercises
    • Lab 1. Learning the Databricks Community Cloud Lab Environment
    • Lab 2. Learning PySpark Shell Environment
    • Lab 3. Loading Data into Databricks from Azure Blob Storage (For Review Only)
    • Lab 4. Understanding Spark DataFrames
    • Lab 5. Learning the PySpark DataFrame API
    • Lab 6. Processing Data in PySpark using the DataFrame API (Project)
    • Lab 7. Working with Pivot Tables in PySpark (Project)
    • Lab 8. Data Visualization and EDA in PySpark
    • Lab 9. Data Visualization and EDA in PySpark (Project)
    • Lab 10. Creating a Table in Databricks
    • Lab 11. SQL Notebooks in Databricks
    • Lab 12. Spark Scala Introduction
    • Lab 13. Scala Programming with Datasets and DataFrames
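As a taste of the PySpark DataFrame and Spark SQL material outlined above, here is a minimal sketch. The data, column names, and app name are hypothetical, and it assumes a local pyspark installation; on Databricks a `spark` session is already provided, so the builder call below is only needed outside the platform.

```python
# Minimal PySpark sketch (hypothetical data): DataFrame creation,
# filtering, and querying a temp view with SQL.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks, `spark` already exists; locally we build one.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Creating a DataFrame in PySpark
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("carol", 29)],
    ["name", "age"],
)

# Commonly used DataFrame methods: filter with a column expression
adults = df.filter(F.col("age") > 30)
n_adults = adults.count()          # 2 rows have age > 30

# Querying a table with SQL after registering a temp view
df.createOrReplaceTempView("people")
avg_age = spark.sql(
    "SELECT AVG(age) AS avg_age FROM people"
).collect()[0]["avg_age"]          # (34 + 36 + 29) / 3 = 33.0

print(n_adults)   # 2
print(avg_age)    # 33.0
```

The same query can be written either through the DataFrame API (`groupBy`/`agg`) or as SQL against a temp view; the course compares both styles.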
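The pandas module of the outline covers single-machine DataFrames. A minimal sketch with hypothetical data, illustrating creation, column access, descriptive statistics, adding a derived column, and sorting:

```python
# Minimal pandas sketch (hypothetical data) of the DataFrame
# operations listed in the pandas module.
import pandas as pd

# Creating a pandas DataFrame
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune"],
    "temp_c": [4.0, 19.5, 31.0],
})

# Accessing a column and getting a descriptive statistic
mean_temp = df["temp_c"].mean()            # (4 + 19.5 + 31) / 3

# Adding a new column derived from an existing one
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Sorting by a column, then reading a cell from the top row
hottest = df.sort_values("temp_c", ascending=False).iloc[0]["city"]

print(round(mean_temp, 2))   # 18.17
print(hottest)               # Pune
```

Conversion between this single-machine representation and distributed PySpark DataFrames (and the Pandas API on Spark) is covered in the pandas module.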