WA3262

Practical Data Quality Training

This hands-on Practical Data Quality Training course covers the core concepts underlying data quality and introduces the methods, practices, and techniques used to achieve and maintain high levels of data quality.

Course Details

Duration

2 days

Prerequisites

Practical work experience in data processing environments.

Target Audience

  • Business Analysts
  • Data Engineers
  • Software Developers
  • Architects
  • Technical Managers

Course Outline

  • Data Quality Introduction
    • Data Quality Defined
    • Data Quality Dimensions/Properties
    • Interpreting Data Quality Properties
    • The Typical Data Analytics (Machine Learning) Pipeline
    • Data Quality Assurance
    • Common Factors Contributing to Poor Data Quality
    • Is Bad Data Quality a Good or a Bad Thing?
    • Data Quality is a Shared Concern
    • Data Governance
    • Common Issues that can be Prevented through Effective Governance
    • The Data Steward Role
    • Common Steps to Overcome Data Quality Issues
    • Data Observability
    • Application Performance Monitoring (APM) and Observability Magic Quadrant
    • Data Quality and Data Observability Relationship
    • A Glossary of Business Terms
    • Data Dictionaries
    • Example of a Data Dictionary
    • SLAs
    • SLAs and Non-Functional Requirements
    • The Great, Fast, and Cheap Quality Diagram
  • Measuring the Quality of the Data
    • Measuring Data Quality
    • Common Corrective Measures for Data Quality Problems
    • Descriptive Statistics
    • Correlation
    • Normal Distribution and Z-Score
    • Non-uniformity of a Probability Distribution
    • Shannon Entropy
    • Gini Impurity
    • Confusion Matrix
    • The Binary Classification Confusion Matrix
    • A Binary Classification Confusion Matrix Visually
  • Methods and Techniques for Data Quality
    • Connecting to the Digital Realm
    • States of Digital Data
    • The Methods and Techniques to Ensure Data Quality
    • Maintenance
    • Automation
    • Workflow (Pipeline) Orchestration Systems
    • Example of a Workflow Orchestration System: Apache NiFi
    • NiFi Processor Types
    • Building a Simple Data Flow in the NiFi Designer
    • Logging
    • Logging Levels
    • Data Formats
    • Interoperable Data
    • Timeliness
    • Efficient Storage with Columnar Formats
    • Storage and Querying Efficiencies of the Parquet Columnar Storage Format
    • Assertions
    • The assert Expression in Python
    • Two Types of Errors
    • Runtime Errors/Exceptions
    • Life after an Exception
    • Assertions vs Errors (Exceptions)
    • Data Validation
    • Data Normalization
    • DDL-based Data Validation
    • An SQL DDL Schema with Constraints Example
    • Apache Hive and Schema-on-Demand
    • XML and JSON Schemas
    • The Schema Production and Consumption Diagram
    • Regular Expressions
    • Regular Expressions Elements
    • What is Unit Testing and Why Should I Care?
    • Unit Testing and Test-Driven Development
    • TDD Benefits
    • Testing for Failure
    • Logging and Monitoring
  • Data Consistency
    • The Consistency Consensus
    • The Two-phase Commit (2PC) Protocol
    • The CAP Theorem
    • Mechanisms for Guaranteeing a Single CAP Property
    • The CAP Triangle
    • Eventual Consistency
    • How eBay Preempts Possible Database Corruption
    • The Saga Pattern
    • Saga Log and Execution Coordinator
    • The Saga Happy Path
    • A Saga Compensatory Requests Example
    • The Event Sourcing Pattern
    • Event Sourcing Example
    • Applying Efficiencies to Event Sourcing
    • Time Accuracy and Consistency
    • Network Time Protocol (NTP)
  • Data Quality Best Practices