DP-900 Study Notes

DP-900 Exam guide: https://learn.microsoft.com/en-us/credentials/certifications/exams/dp-900/

LEARNING PATH 1: Microsoft Azure Data Fundamentals: Explore core data concepts

LEARNING PATH 2: Microsoft Azure Data Fundamentals: Explore relational data in Azure

LEARNING PATH 3: Microsoft Azure Data Fundamentals: Explore non-relational data in Azure

LEARNING PATH 4: Microsoft Azure Data Fundamentals: Explore data analytics in Azure


LEARNING PATH 1: Explore core data concepts (DP-900)

Identify data formats

Structured Data - fixed schema, tabular, tables and columns with relational model

Semi-Structured Data - has some structure but allows certain variations

Unstructured Data - documents, images, audio/video/binary files

Data stores

  • File stores
  • Databases

Explore file storage

  • Delimited text files - good for structured data
  • JavaScript Object Notation (JSON) - good for structured and semi-structured data.
  • Extensible Markup Language (XML)
  • Binary Large Object (BLOB)
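
As a rough illustration of the text formats above, the sketch below reads the same made-up customer record from delimited text, JSON and XML using only the Python standard library. The field names and values are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in three text formats.
delimited_text = "id,name,city\n1,Ana,Lisbon\n"
json_text = '{"id": 1, "name": "Ana", "city": "Lisbon", "phones": ["555-0100"]}'
xml_text = '<customer id="1"><name>Ana</name><city>Lisbon</city></customer>'

# Delimited text: every row has the same columns (structured).
csv_rows = list(csv.DictReader(io.StringIO(delimited_text)))

# JSON: nested values such as a list of phones are allowed (semi-structured).
json_record = json.loads(json_text)

# XML: data is described by elements and attributes.
xml_root = ET.fromstring(xml_text)
xml_record = {"id": xml_root.get("id"), "name": xml_root.findtext("name")}

print(csv_rows[0]["name"], json_record["phones"], xml_record["name"])
```

Note that the delimited file returns every value as text, while JSON keeps numbers and lists as typed values.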

Optimized file formats

  • Avro - row based data format, with a header description of structure
  • ORC (Optimized Row Columnar format) - organizes data into columns rather than rows, stored in stripes of data
  • Parquet - columnar data format, row groups
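
The difference between row-based and columnar layouts can be sketched with plain Python structures. This only illustrates the idea; it is not how Avro, ORC or Parquet files are actually encoded, and the records are invented.

```python
# Hypothetical sales records.
rows = [
    {"product": "apple", "qty": 3, "price": 0.5},
    {"product": "pear", "qty": 1, "price": 0.8},
    {"product": "plum", "qty": 7, "price": 0.3},
]

# Row-based layout (the idea behind Avro): each record is stored together.
row_layout = [tuple(r.values()) for r in rows]

# Columnar layout (the idea behind ORC and Parquet): each column is stored together.
column_layout = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate over one column only needs to read that column.
total_qty = sum(column_layout["qty"])
print(row_layout[0], column_layout["qty"], total_qty)
```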

Explore databases

  • Relational databases - store and query structured data
  • Non-relational databases
    • Key-value databases
    • Document databases
    • Column-family databases
    • Graph databases
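
A minimal sketch of the four non-relational models above, using in-memory Python structures as stand-ins for real databases. All keys, names and relationships are hypothetical.

```python
# Key-value: an opaque value looked up by a unique key.
key_value = {"session:42": b"\x01\x02\x03"}

# Document: the value is a structured document whose fields can be queried.
documents = [
    {"id": 1, "name": "Ana", "address": {"city": "Lisbon"}},
    {"id": 2, "name": "Ben", "phones": ["555-0100"]},
]
in_lisbon = [d["name"] for d in documents if d.get("address", {}).get("city") == "Lisbon"]

# Column-family: columns are grouped into named families for each row key.
column_family = {
    "cust-1": {"identity": {"name": "Ana"}, "contact": {"email": "ana@example.com"}},
}

# Graph: nodes (entities) connected by edges (relationships).
edges = [("Ana", "manages", "Ben"), ("Ben", "works_with", "Liv")]
managed_by_ana = [dst for src, rel, dst in edges if src == "Ana" and rel == "manages"]

print(in_lisbon, column_family["cust-1"]["contact"], managed_by_ana)
```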

Transactional data processing

A small, discrete unit of work, often processed in high volume

Online Transactional Processing (OLTP) solutions

ACID semantics

  • Atomicity - succeeds completely or fails completely
  • Consistency - one valid state to another
  • Isolation - no concurrent interference
  • Durability - committed data are persistent
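
Atomicity and consistency can be seen in a small sketch using Python's built-in `sqlite3` module. The `account` table, the `transfer` helper and the amounts are hypothetical; SQLite stands in here for any transactional database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move funds between two accounts as one atomic unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, target)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, source)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 30)       # both updates are committed together
failed = transfer(conn, 1, 2, 500)  # the debit violates the CHECK, so the credit is undone too
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(ok, failed, balances)
```

The second transfer leaves the balances exactly as they were after the first one, because the already applied credit is rolled back along with the failed debit.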

Analytical data processing

Read only snapshot data at a given time or a series of snapshots

  • Data Extracted, Transformed and Loaded (ETL) into Data lake
  • Data loaded into a schema of tables, (two ways)
    • Spark based data lakehouse with tabular abstraction
    • Data warehouse with fully relational SQL engine
  • Data aggregated and loaded into an online analytical processing (OLAP) model
  • Data in data lake, data warehouse and analytical model can be queried

Data lakes - large scale data analytics, file based data processing

Data warehouse - store data in optimized relational schema

Data lakehouses - data lakes with data warehouse capabilities

OLAP model - storage optimized for analytical workloads, drill up and down

Different types of users,

  • Data Scientist - works with data files in data lake, explore and model data
  • Data Analyst - query data in data warehouse
  • Business User - consume pre-aggregated data in an OLAP, dashboard or report
  • Database administrators - manage database, maintain, performance, optimization
  • Data engineers - manage infrastructure, pipelines, transfer and transform data
  • Data analysts - create visualizations, insights, analytical models

Data services,

  • Azure SQL - SQL Database, SQL Managed Instance, SQL VM
  • Azure Database for open source databases - MySQL, MariaDB, PostgreSQL
  • Azure Cosmos DB - Global-scale NoSQL database; supports JSON documents, key-value pairs, column-families and graphs
  • Azure Storage - Blob storage, file shares and tables
  • Azure Data factory - Data pipelines to transfer and transform data, used by data engineers to build extract, transform, and load (ETL) solution
  • Azure Synapse Analytics - unified PaaS for Pipelines, SQL, Spark, KQL querying
  • Azure Databricks - combines Apache Spark with SQL database semantics
  • Azure HDInsight - Azure hosted clusters for open source big data processing, Apache Spark, Hadoop, HBase, Kafka
  • Azure Stream Analytics - real time processing of data streams for analytics
  • Azure Data Explorer - high performance querying of log and telemetry data, with Azure Synapse Analytics (IoT)
  • Microsoft Purview - enterprise wide data governance and discoverability
  • Microsoft Fabric - SaaS analytics platform for lakehouse

LEARNING PATH 2: Explore relational data in Azure (DP-900)

Understand relational data

Structured data with relationships

Different data types for columns

Most databases support ANSI standard data types

Understand normalization

Minimizes data duplication

  • Each entity -> own table
  • Attributes -> into columns
  • Row -> Unique Primary Key
  • Relationships -> with Foreign keys
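
The normalization rules above can be sketched with `sqlite3` from the Python standard library. The `customer` and `customer_order` tables and their rows are hypothetical: each entity gets its own table, each row has a primary key, and the relationship is expressed with a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE customer_order (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product TEXT NOT NULL
);
INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO customer_order VALUES (10, 1, 'Laptop'), (11, 1, 'Mouse'), (12, 2, 'Desk');
""")

# The customer name is stored once and joined in when needed.
joined = conn.execute("""
    SELECT c.name, o.product
    FROM customer_order AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(joined)
```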

Explore SQL

Structured Query Language: Microsoft SQL Server, MySQL, PostgreSQL, MariaDB, and Oracle

SELECT, INSERT, UPDATE, DELETE, CREATE, and DROP

Additional proprietary extensions,

  • Transact-SQL by SQL Server database engine
  • PL/pgSQL (pgSQL) used by PostgreSQL
  • PL/SQL used by Oracle

SQL statement types

  • DDL - Data Definition Language (CREATE, ALTER, DROP, RENAME) - manage objects
  • DCL - Data Control Language (GRANT, DENY, REVOKE) - manage access to objects
  • DML - Data Manipulation Language (SELECT, INSERT, UPDATE, DELETE) - manipulate data

Describe database objects

    1. View - Virtual table based on a SELECT query (CREATE VIEW Deliveries)
    2. Stored Procedure - SQL statements that run on command, optionally with parameters (CREATE PROCEDURE RenameProduct)
    3. Index - Builds a sorted copy of the specified column with pointers to the table rows (CREATE INDEX idx_ProductName)
      • Helps find rows quickly in massive amounts of data
      • Creates a tree-based structure for the query optimizer to use
      • Consumes storage space
      • Must be maintained as the underlying data changes
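
A small sketch of a view and an index, again using `sqlite3` as a stand-in database. The `product` table and the `cheap_product` view are hypothetical; `idx_ProductName` follows the name used in the notes. Stored procedures are left out because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL);
INSERT INTO product VALUES (1, 'Desk', 120.0), (2, 'Lamp', 15.5), (3, 'Chair', 45.0);

-- View: a named SELECT query that can be queried like a table.
CREATE VIEW cheap_product AS SELECT name, price FROM product WHERE price < 50;

-- Index: a sorted structure on the name column that speeds up lookups.
CREATE INDEX idx_ProductName ON product(name);
""")

cheap = conn.execute("SELECT name FROM cheap_product ORDER BY name").fetchall()

# Ask the query optimizer how it would run a lookup by name.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM product WHERE name = 'Lamp'"
).fetchall()
uses_index = any("idx_ProductName" in str(row[-1]) for row in plan)
print(cheap, uses_index)
```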

Azure SQL Services and Capabilities

  • SQL Server on Azure VM - lift and shift features
  • Azure SQL Managed Instance (PaaS) - automated updates, backups and maintenance (multiple databases)
  • Azure SQL Database (PaaS) - core capabilities of an on-prem database (Single DB or Elastic Pool)
  • Azure SQL Edge - IoT workloads, for streaming time-series data

Azure Services for Open-Source

  • Azure Database for MySQL - based on community edition
  • Azure Database for MariaDB - based on community edition
  • Azure Database for PostgreSQL - execute stored procedures in the database itself using pgsql; comes with PostgreSQL Flexible Server for more control and configuration

LEARNING PATH 3: Explore non-relational data in Azure (DP-900)

Azure Blob Storage

Store massive amounts of unstructured binary data, provides an API as well.

Azure Storage account provides blob containers to store the files.

Types of Blobs,

  • Block blobs - optimized for infrequently changed files
  • Page blobs - optimized for random read/write
  • Append blobs - can only append data in small chunks

Access Tiers, (can migrate between as well)

  • Hot - for frequently accessed data
  • Cool - for infrequently accessed data (cheaper)
  • Archive - used for historical data, can take hours to rehydrate before it can be read (cheapest storage)

Azure DataLake Storage Gen2

Integrated into Azure Storage; provides a hierarchical file system for analytical data lakes

Can be mounted as a distributed file system by Hadoop in Azure HDInsight, Azure Databricks and Azure Synapse Analytics

To activate, enable the Hierarchical Namespace option in the Azure Storage account

Azure Files

Cloud based network shares

Created in an Azure Storage account

100 TB in a single account and 1 TB in a single file

  • Standard tier - with hard disk storage
  • Premium tier - with solid state storage

Supported Network file protocols,

  • Server Message Block (SMB) - compatible with multiple operating systems
  • Network File System (NFS) - for Linux and macOS, requires the Premium tier

Azure Tables

Used for storing semi-structured key-value data (Denormalized data)

Uses partitioning to group related data for faster access; each row is identified by a partition key and a row key
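
The partition key and row key idea can be sketched with a nested dictionary. This is an in-memory stand-in, not the Azure Tables SDK; the `upsert` and `get` helpers and the sample entities are hypothetical.

```python
from collections import defaultdict

# In-memory stand-in for a key-value table: partition key -> row key -> properties.
table = defaultdict(dict)

def upsert(partition_key, row_key, **properties):
    """Store an entity under its (partition key, row key) pair."""
    table[partition_key][row_key] = properties

def get(partition_key, row_key):
    """Point lookup: go to one partition, then to one row inside it."""
    return table[partition_key][row_key]

# Rows in the same table may carry different properties (semi-structured).
upsert("Lisbon", "cust-001", name="Ana", email="ana@example.com")
upsert("Lisbon", "cust-002", name="Rui")
upsert("Oslo", "cust-003", name="Liv", phone="555-0101")

# Entities that share a partition key are grouped together.
lisbon_customers = sorted(table["Lisbon"])
print(get("Oslo", "cust-003"), lisbon_customers)
```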

Azure Cosmos DB

Fully managed and serverless

Fast read write performance and auto scaling

For handling massive amounts of data

Provides an API to work with different data stores

Uses,

  • IoT and telematics.
  • Retail and Marketing
  • Gaming
  • Web and Mobile applications

APIs,

  • Cosmos DB for NoSQL - uses JSON format, works with SQL syntax
  • Cosmos DB for MongoDB - uses Binary JSON (BSON) and the MongoDB Query Language (MQL)
  • Cosmos DB for PostgreSQL - native PostgreSQL that scales up to multiple nodes
  • Cosmos DB for Table - store data in key-value tables
  • Cosmos DB for Apache Cassandra - uses column-family storage structure
  • Cosmos DB for Apache Gremlin - uses graph data structure, nodes and edges

LEARNING PATH 4: Explore data analytics in Azure (DP-900)

Data warehousing architecture

Elements,

  • Data ingestion and processing

    • Extract, Transform and Load (ETL)
    • Extract, Load and Transform (ELT)
  • Analytical data store

    • data warehouse
    • data lakes
  • Analytical data model

    • pre-aggregates the data for analytics
  • Data visualization

    • create reports and visualizations

Data ingestion pipeline

Created using,

  • Azure Data factory
  • Azure Synapse Analytics
  • Microsoft Fabric

Pipelines orchestrate ETL processes
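
A toy ETL pipeline in plain Python shows the three stages in order. It is only a sketch: the `extract`, `transform` and `load` functions, the sample data and the `sales` table are hypothetical, and a real pipeline would be built in one of the services listed above.

```python
import csv
import io
import sqlite3

# Hypothetical source extract containing one bad row.
raw = "name,amount\nana,10\nben,not-a-number\nliv,5\n"

def extract(text):
    """Extract: read the raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with invalid amounts and tidy up names."""
    clean = []
    for row in rows:
        if row["amount"].isdigit():
            clean.append((row["name"].title(), int(row["amount"])))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the analytical store."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount INTEGER)")
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(raw)))
loaded = conn.execute("SELECT name, amount FROM sales ORDER BY name").fetchall()
print(loaded)
```

In an ELT design the raw rows would be loaded first and the cleaning step would run inside the target store instead.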

Analytical data stores

  • Data warehouses - data stored in a schema optimized for analytics (structured data)
  • Data lakehouses - data is stored in a distributed file system optimized for fast access (semi-structured data)

PaaS Solutions

Azure Synapse Analytics - high performance relational data warehousing, easy user interface

Azure Databricks - built on Apache Spark, good for multi-cloud environments

Azure HDInsight - for open-source data analytics clusters, not very user friendly

Microsoft Fabric

SaaS with a single open format in OneLake

Single integrated environment

Batch Processing

  • Data is collected and stored, then processed together

  • All the data must be ready before processing

  • A problem with the data can cause the whole process to stop

Stream Processing

  • Data is constantly monitored and processed in real time

  • Ideal for time-critical operations

Many large scale analytics solutions have both batch and stream processing included in their architecture
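
The contrast between the two approaches can be sketched in a few lines of Python. The sensor readings and both function names are hypothetical.

```python
from typing import Iterable, Iterator

readings = [21.0, 22.5, 19.5, 23.0]  # hypothetical sensor values

def batch_average(collected: list[float]) -> float:
    """Batch: all data is collected first, then processed together."""
    return sum(collected) / len(collected)

def streaming_average(source: Iterable[float]) -> Iterator[float]:
    """Stream: update and emit the result as each value arrives."""
    total, count = 0.0, 0
    for value in source:
        total += value
        count += 1
        yield total / count

batch_result = batch_average(readings)
stream_results = list(streaming_average(iter(readings)))
print(batch_result, stream_results)
```

The streaming version produces an up-to-date answer after every reading, and its last value matches the batch result once all the data has arrived.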

Architecture for Stream processing

  • data generated by sensors -> captured in a streaming source (queue processor) -> data is processed by a query -> results pushed to a dashboard or on to further processing
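
The flow above can be mimicked with an in-memory queue. This is a simplified sketch, not an Azure service: the events, the `query` function, its threshold and the `dashboard` list are all hypothetical.

```python
from collections import deque

# Hypothetical sensor events captured on a queue (the streaming source).
queue = deque([
    {"sensor": "s1", "temp": 20.5},
    {"sensor": "s2", "temp": 31.0},
    {"sensor": "s1", "temp": 35.2},
])
dashboard = []  # stand-in for the output sink

def query(event, threshold=30.0):
    """The processing step: keep only readings above the threshold."""
    return event if event["temp"] > threshold else None

# Process events one at a time, in arrival order.
while queue:
    result = query(queue.popleft())
    if result is not None:
        dashboard.append(result)

print(dashboard)
```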

Real-time analytics in Azure

  • Azure Stream Analytics - PaaS solution for streaming jobs
  • Spark Structured Streaming - Open source, complex solutions with Azure Synapse Analytics, Azure Databricks, Azure HDInsight
  • Azure Data Explorer - optimized for ingesting and querying batch or streaming data

Sources (ingesting data) for Stream Processing

  • Azure Event Hubs
  • Azure IoT Hub
  • Azure Data Lake Storage Gen2
  • Apache Kafka

Sinks (output data) for Stream Processing

  • Azure Event Hubs
  • Azure Data Lake Storage Gen2 / Azure Blob Storage
  • Azure SQL Database, Synapse Analytics, Databricks
  • Microsoft Power BI

Azure Stream Analytics

Service for complex event processing and analysis of streaming data

Stream Analytics queries run perpetually, processing data as it arrives

Create a Stream Analytics job in Azure, or a Stream Analytics cluster for more complex workloads

Apache Spark on Microsoft Azure

Distributed processing framework for large scale data analytics

Used with,

  • Azure Synapse Analytics
  • Azure Databricks
  • Azure HDInsight

Use Spark Structured Streaming API for ingestion, processing and outputting results from a data stream

Delta Lake

Open source storage layer with support for transactional consistency, schema enforcement and other warehousing features

Azure Synapse Analytics and Azure Databricks include support for Delta Lake

Realtime Analytics in Microsoft Fabric

Real time data analytics and ingestion from multiple sources

Use an eventstream to capture real-time event data

Microsoft Power BI - suite of tools and services for building interactive data visualizations

Import data into Power BI Desktop -> Power BI Service -> View the reports in Web browser or Phone App

Core concepts of data modeling

  • Dimensions - Entities to aggregate by (Customer, Product)
  • Measures - Numeric values to analyze
  • Facts - Numeric measures to be aggregated

Model forms a multi-dimensional structure (cube)

  • Star schema - fact table is related to one or more dimension tables
  • Snowflake schema - dimension tables are related to additional tables

Attribute hierarchies - quickly drill-up and down for aggregated values
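
A star schema and a simple drill-down can be sketched with `sqlite3`. The table names (`fact_sales`, `dim_product`, `dim_customer`) and all figures are hypothetical; the point is that the fact table holds numeric measures while the dimension tables hold the entities to aggregate by.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity INTEGER,
    revenue INTEGER
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Computers'), (2, 'Mouse', 'Accessories'), (3, 'Keyboard', 'Accessories');
INSERT INTO dim_customer VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO fact_sales VALUES (1, 1, 1, 900), (2, 1, 2, 40), (3, 2, 1, 50), (2, 2, 1, 20);
""")

# Aggregate the revenue measure at the category level of the product hierarchy.
by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

# Drill down one level: revenue per product within a single category.
by_product = conn.execute("""
    SELECT p.name, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    WHERE p.category = 'Accessories'
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
print(by_category, by_product)
```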

Data Visualizations,

  • Tables and text
  • Bar and Column charts
  • Line Charts
  • Pie Charts
  • Scatter Plots
  • Maps
  • Interactive Reports in Power BI