WA3262
Practical Data Quality Training
This hands-on course covers the core concepts underlying data quality and introduces the methods, practices, and techniques used to measure, achieve, and maintain it.
Course Details
Duration
2 days
Prerequisites
Practical work experience in data processing environments.
Target Audience
- Business Analysts
- Data Engineers
- Software Developers
- Architects
- Technical Managers
Course Outline
- Data Quality Introduction
- Data Quality Defined
- Data Quality Dimensions/Properties
- Interpreting Data Quality Properties
- The Typical Data Analytics (Machine Learning) Pipeline
- Data Quality Assurance
- Common Factors Contributing to Poor Data Quality
- Is Bad Data Quality a Good or a Bad Thing?
- Data Quality is a Shared Concern
- Data Governance
- Common Issues that can be Prevented through Effective Governance
- The Data Steward Role
- Common Steps to Overcome Data Quality Issues
- Data Observability
- Application Performance Monitoring (APM) and Observability Magic Quadrant
- Data Quality and Data Observability Relationship
- A Glossary of Business Terms
- Data Dictionaries
- Example of a Data Dictionary
- SLAs
- SLAs and Non-Functional Requirements
- The Great, Fast, and Cheap Quality Diagram
- Measuring the Quality of the Data
- Measuring Data Quality
- Common Corrective Measures for Data Quality Problems
- Descriptive Statistics
- Correlation
- Normal Distribution and Z-Score
- Non-uniformity of a Probability Distribution
- Shannon Entropy
- Gini Impurity
- Confusion Matrix
- The Binary Classification Confusion Matrix
- A Binary Classification Confusion Matrix Visually
- Methods and Techniques for Data Quality
- Connecting to the Digital Realm
- States of Digital Data
- The Methods and Techniques to Ensure Data Quality
- Maintenance
- Automation
- Workflow (Pipeline) Orchestration Systems
- Example of a Workflow Orchestration System: Apache NiFi
- NiFi Processor Types
- Building a Simple Data Flow in the NiFi Designer
- Logging
- Logging Levels
- Data Formats
- Interoperable Data
- Timeliness
- Efficient Storage with Columnar Formats
- Storage and Querying Efficiencies of the Parquet Columnar Storage Format
- Assertions
- The assert Expression in Python
- Two Types of Errors
- Runtime Errors/Exceptions
- Life after an Exception
- Assertions vs Errors (Exceptions)
- Data Validation
- Data Normalization
- DDL-based Data Validation
- An SQL DDL Schema with Constraints Example
- Apache Hive and Schema-on-Demand
- XML and JSON Schemas
- The Schema Production and Consumption Diagram
- Regular Expressions
- Regular Expression Elements
- What is Unit Testing and Why Should I Care?
- Unit Testing and Test-Driven Development
- TDD Benefits
- Testing for Failure
- Logging and Monitoring
- Data Consistency
- The Consistency Consensus
- The Two-phase Commit (2PC) Protocol
- The CAP Theorem
- Mechanisms for Guaranteeing a Single CAP Property
- The CAP Triangle
- Eventual Consistency
- How eBay Preempts Possible Database Corruption
- The Saga Pattern
- Saga Log and Execution Coordinator
- The Saga Happy Path
- A Saga Compensatory Requests Example
- The Event Sourcing Pattern
- Event Sourcing Example
- Applying Efficiencies to Event Sourcing
- Time Accuracy and Consistency
- Network Time Protocol (NTP)
- Data Quality Best Practices