WA3208
Programming on Azure Databricks with PySpark, SQL, and Scala Training
This intensive hands-on training course teaches participants the relevant parts of the (Azure) Databricks cloud platform to get them up to speed quickly. It also offers a unique opportunity to work with multiple programming languages and systems, including PySpark, SQL, and Scala, so that participants can determine which language or system is best suited for each task at hand.
Course Details
Duration
3 days
Prerequisites
Practical knowledge of data processing and experience using at least one programming language.
Target Audience
- Data Engineers
- Data and Business Analysts
- Information Architects
- Technical Managers
Skills Gained
- Navigating the Azure Databricks platform: workspaces, clusters, notebooks, and the Databricks File System (DBFS)
- Programming in Databricks notebooks with PySpark, SQL, and Scala, and determining which language or system is best suited for each task
- Processing and analyzing data with Spark SQL, DataFrames, and pandas
- Performing data visualization and exploratory data analysis (EDA) with matplotlib and seaborn
Course Outline
- Azure Databricks
- Creating an Azure Databricks Workspace UI
- The Azure Databricks Service Blade
- The Databricks Dashboard
- Databricks Cluster Creation UI
- Databricks File System (DBFS)
- Databricks Integration with Data Lake
- Automation Jobs
- Databricks Developer Experience
- Development Environments
- Which Databricks-Supported Language Should I Use?
- Notebook Runtime Flavor Configuration
- The Notebook UI
- Creating Tables
- Create a New Table UI
- Creating a Table from a DBFS File
- Creating Your Table Visually with Databricks UI (The Preview Screen)
- Querying a Databricks Table using SQL (see the sketch after this topic list)
- A Data Profile Visualization Example
- Performing Exploratory Data Analysis (EDA) with Data Charts
- Spark and Databricks
- Real-time Transformations
- Databricks Machine Learning (ML)
- The Cost of Doing Business on Databricks
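As a taste of the "Creating Tables" and "Querying a Databricks Table using SQL" topics above, here is a minimal sketch, assuming a Databricks notebook where the `spark` session object is predefined; the `sales` table and its columns are hypothetical.

```python
# Minimal sketch: query a Databricks table with Spark SQL from a notebook cell.
# Assumes a Databricks notebook, where `spark` (a SparkSession) is predefined.
# The `sales` table and its `region`/`amount` columns are hypothetical.
df = spark.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
    ORDER BY total_sales DESC
""")
df.show(10)  # print the first 10 result rows to the notebook output
```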
- Introduction to Apache Spark
- What is Apache Spark?
- The Spark Platform
- Spark vs Hadoop's MapReduce (MR)
- Common Spark Use Cases
- Languages Supported by Spark
- Running Spark on a Cluster
- The Spark Application Architecture
- The Driver Process
- The Executor and Worker Processes
- Spark Shell
- Jupyter Notebook Shell Environment
- Spark Applications
- The spark-submit Tool (see the sketch after this topic list)
- The spark-submit Tool Configuration
- Interfaces with Data Storage Systems
- The Resilient Distributed Dataset (RDD)
- Datasets and DataFrames
- Spark SQL, DataFrames, and Catalyst Optimizer
- Project Tungsten
- Spark Machine Learning Library
- Spark (Structured) Streaming
- GraphX
- Extending Spark Environment with Custom Modules and Files
- Spark 3
- Spark 3 Updates at a Glance
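To make the "Spark Applications" and "The spark-submit Tool" topics concrete, here is a minimal word-count sketch, assuming a local PySpark installation; the file and path names are hypothetical.

```python
# word_count.py -- a minimal standalone Spark application (hypothetical name).
# Submit it to a cluster (or run locally) with the spark-submit tool, e.g.:
#   spark-submit --master local[4] word_count.py
from pyspark.sql import SparkSession

# Applications create their own session; shells and notebooks get one for free.
spark = SparkSession.builder.appName("word-count").getOrCreate()

lines = spark.read.text("input.txt")          # DataFrame with one "value" column
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()        # adds a "count" column
counts.orderBy("count", ascending=False).show(20)

spark.stop()
```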
- The Spark Shell
- The Spark v.2+ Command-Line Shells
- The Spark Shell UI
- Spark Shell Options
- Getting Help
- Jupyter Notebook Shell Environment
- Example of a Jupyter Notebook Web UI (Databricks Cloud)
- The Spark Context (sc) and Spark Session (spark)
- Creating a Spark Session Object in Spark Applications
- The Shell Spark Context Object (sc)
- The Shell Spark Session Object (spark)
- Loading Files (see the sketch after this topic list)
- Saving Files
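A minimal sketch of the "Loading Files" and "Saving Files" topics, assuming the PySpark shell, where `spark` (SparkSession) and `sc` (SparkContext) are predefined; the paths are hypothetical.

```python
# In the PySpark shell, `spark` and `sc` already exist; no setup is needed.
# The input and output paths below are hypothetical.
df = spark.read.csv("/data/input.csv", header=True, inferSchema=True)
df.printSchema()   # inspect the inferred column names and types

# Save the data back out as Parquet, replacing any previous run's output.
df.write.mode("overwrite").parquet("/data/output.parquet")
```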
- Introduction to Spark SQL
- What is Spark SQL?
- Uniform Data Access with Spark SQL
- Using JDBC Sources
- Hive Integration
- What is a DataFrame?
- Creating a DataFrame in PySpark
- Creating a DataFrame in PySpark (Cont'd)
- Commonly Used DataFrame Methods and Properties in PySpark
- Commonly Used DataFrame Methods and Properties in PySpark (Cont'd)
- Grouping and Aggregation in PySpark (see the sketch after this topic list)
- The "DataFrame to RDD" Bridge in PySpark
- The SQLContext Object
- Converting an RDD to a DataFrame Example
- Performance, Scalability, and Fault-tolerance of Spark SQL
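A minimal sketch combining several of the topics above: creating a DataFrame in PySpark, grouping and aggregation, querying via SQL, and the "DataFrame to RDD" bridge. The employee data is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Creating a DataFrame in PySpark from in-memory rows (hypothetical data).
df = spark.createDataFrame(
    [("Alice", "HR", 50000), ("Bob", "IT", 65000), ("Carol", "IT", 72000)],
    ["name", "dept", "salary"],
)

# Grouping and aggregation: average salary per department.
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()

# SQL over the same data: register a temporary view and query it.
df.createOrReplaceTempView("employees")
spark.sql("SELECT dept, COUNT(*) AS headcount FROM employees GROUP BY dept").show()

# The "DataFrame to RDD" bridge: the underlying RDD of Row objects.
print(df.rdd.map(lambda row: row.name).collect())
```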
- Introduction to pandas
- What is pandas?
- Conversion Between PySpark and pandas DataFrames (see the sketch after this topic list)
- The pandas API on Spark
- The pandas DataFrame Object
- The DataFrame's Value Proposition
- Creating a pandas DataFrame
- Getting DataFrame Metrics
- Accessing DataFrame Columns
- Accessing DataFrame Rows
- Accessing DataFrame Cells
- Deleting Rows and Columns
- Adding a New Column to a DataFrame
- Getting Descriptive Statistics of DataFrame Columns
- Getting Descriptive Statistics of DataFrames
- Sorting DataFrames
- Reading From CSV Files
- Writing to a CSV File
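A minimal sketch of the pandas topics above, including conversion between PySpark and pandas DataFrames; the data is hypothetical, and `toPandas()` should only be used on results small enough to fit on the driver.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-bridge").getOrCreate()

# A small pandas DataFrame built from hypothetical in-memory data.
pdf = pd.DataFrame({"city": ["Boston", "Austin", "Denver"],
                    "temp": [58, 91, 73]})
print(pdf.describe())           # descriptive statistics of the numeric columns
print(pdf.sort_values("temp"))  # sorting a pandas DataFrame

# pandas -> PySpark: distribute the in-memory frame as a Spark DataFrame.
sdf = spark.createDataFrame(pdf)
sdf.show()

# PySpark -> pandas: collect a (small!) result back to the driver.
hot = sdf.filter(sdf.temp > 60).toPandas()
print(hot)
```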
- Data Visualization with seaborn in Python
- Data Visualization
- Data Visualization in Python
- Matplotlib
- Getting Started with matplotlib
- Figures
- Saving Figures to a File
- Seaborn
- Getting Started with seaborn
- Histograms and KDE (see the sketch after this topic list)
- Plotting Bivariate Distributions
- Scatter Plots in seaborn
- Pair Plots in seaborn
- Heatmaps
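A minimal sketch of the seaborn topics above (histograms with KDE, heatmaps, and saving figures to a file), assuming seaborn 0.11 or later and using its bundled `tips` sample dataset so the example is self-contained.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# seaborn ships sample datasets; "tips" keeps the sketch self-contained
# (load_dataset fetches it over the network on first use).
tips = sns.load_dataset("tips")

# Histogram with a KDE overlay (requires seaborn >= 0.11).
sns.histplot(data=tips, x="total_bill", kde=True)
plt.savefig("total_bill_hist.png")  # saving a figure to a file
plt.clf()

# Heatmap of pairwise correlations over the numeric columns.
sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.savefig("tips_heatmap.png")
```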
- Lab Exercises
- Lab 1. Learning the Databricks Community Cloud Lab Environment
- Lab 2. Learning PySpark Shell Environment
- Lab 3. Loading Data into Databricks from Azure Blob Storage (For Review Only)
- Lab 4. Understanding Spark DataFrames
- Lab 5. Learning the PySpark DataFrame API
- Lab 6. Processing Data in PySpark using the DataFrame API (Project)
- Lab 7. Working with Pivot Tables in PySpark (Project)
- Lab 8. Data Visualization and EDA in PySpark
- Lab 9. Data Visualization and EDA in PySpark (Project)
- Lab 10. Creating a Table in Databricks
- Lab 11. SQL Notebooks in Databricks
- Lab 12. Spark Scala Introduction
- Lab 13. Scala Programming with Datasets and DataFrames