03-data-understanding-to-preparation.md

Data Understanding

Understanding the data

  • Use descriptive statistics to understand the data
  • Univariate statistics describe one variable at a time
  • Typical statistics are mean, median, min, max, and stddev
  • Pairwise correlations indicate relationships worth further study and may need adjusting for potential confounding variables
  • Histograms are graphs that show the frequency distribution of numerical data using rectangles
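
The statistics listed above can be sketched with the Python standard library. This is a minimal illustration, not a prescribed method: the `ages` and `stay_days` columns are made-up sample values, and `summarize`, `pearson`, and `histogram` are hypothetical helper names introduced here.

```python
import math
import statistics
from collections import Counter

# Hypothetical sample columns (assumed values, not from the notes).
ages = [54, 61, 67, 70, 72, 75, 78, 81, 84, 90]
stay_days = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9]

def summarize(values):
    """Univariate statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by the start of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

age_summary = summarize(ages)
age_stay_corr = pearson(ages, stay_days)
age_hist = histogram(ages, 10)

# Text histogram: one row of '#' "rectangles" per bin.
for start, count in age_hist.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```

A strong correlation here would only flag the age and length-of-stay pair as worth a closer look; it says nothing yet about why the two move together.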

Look at data quality

  • Missing values
  • Invalid values
  • Duplicate values
  • Formatting issues
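
The four quality checks above could be sketched as follows. The records, field names, the 0 to 120 age range, and the ISO date rule are all assumptions chosen for illustration; real validity rules depend on the dataset.

```python
import re

# Hypothetical patient records; fields and rules are illustrative assumptions.
records = [
    {"id": "P001", "age": 67, "admit_date": "2023-01-15"},
    {"id": "P002", "age": None, "admit_date": "2023-02-03"},  # missing value
    {"id": "P003", "age": -4, "admit_date": "2023-02-10"},    # invalid value
    {"id": "P001", "age": 67, "admit_date": "2023-01-15"},    # duplicate row
    {"id": "P004", "age": 58, "admit_date": "03/12/2023"},    # formatting issue
]

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")  # assumed expected date format

def quality_report(rows):
    """Return the indexes of rows affected by each kind of quality problem."""
    seen = set()
    report = {"missing": [], "invalid": [], "duplicate": [], "formatting": []}
    for i, row in enumerate(rows):
        # Missing: any field left empty.
        if any(value is None for value in row.values()):
            report["missing"].append(i)
        # Invalid: present but outside the assumed plausible range.
        age = row["age"]
        if age is not None and not 0 <= age <= 120:
            report["invalid"].append(i)
        # Duplicate: an identical row was already seen.
        key = tuple(sorted(row.items()))
        if key in seen:
            report["duplicate"].append(i)
        seen.add(key)
        # Formatting: date not written in the assumed ISO layout.
        if not ISO_DATE.fullmatch(row["admit_date"]):
            report["formatting"].append(i)
    return report

report = quality_report(records)
print(report)
```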

Run an iterative process

  • Collect the data and refine your understanding of it over multiple iterations

Data Preparation

  • Cleansing data
  • Takes the most time, roughly 70-90% of a data science project
  • Analogy: chopping the onions takes more time than cooking, but the recipe cannot be cooked until the onions are chopped
  • Data understanding: What does it mean to "prepare" and "clean" the data?
  • Data preparation: What are ways in which data is prepared?
  • Feature engineering is the use of domain knowledge and expertise to create features that enable the machine learning process
  • Feature engineering works like a funnel to identify and filter the best candidate variables to use in ML
  • Data can be randomly divided into a training set and a testing set
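
A random split into training and testing sets might look like the sketch below. The helper name `train_test_split`, the 25% test fraction, and the fixed seed are assumptions made for this example, not values given in the notes.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split them into (train, test)."""
    shuffled = list(rows)                  # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 prepared records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible between runs, which makes later model comparisons easier to trust.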

Case Study

  • Define CHF (congestive heart failure) precisely, as there are many types of CHF
  • Define the CHF readmission criteria and use clinical expertise
  • Collect and aggregate all available transactions and records of each CHF patient
  • Standardize the data so each patient shares the same columns
  • Run a literature review to ensure data quality and correctness of procedures
  • Consolidate all data into a single table
  • Define the target, patient features and diagnosis flags such as existing illnesses
  • Divide the dataset into training and testing sets
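
The aggregation and consolidation steps above can be sketched as turning transaction-level records into one row per patient. Everything concrete here is hypothetical: the field names, the `FLAGGED_DIAGNOSES` list, and the sample transactions are invented for illustration and do not come from the case study itself.

```python
from collections import defaultdict

# Hypothetical transaction-level records (assumed fields and values).
transactions = [
    {"patient_id": "A", "type": "admission", "diagnosis": "CHF", "readmitted": False},
    {"patient_id": "A", "type": "admission", "diagnosis": "diabetes", "readmitted": True},
    {"patient_id": "B", "type": "admission", "diagnosis": "CHF", "readmitted": False},
    {"patient_id": "B", "type": "office_visit", "diagnosis": "hypertension", "readmitted": False},
]

# Assumed list of existing illnesses to expose as diagnosis flags.
FLAGGED_DIAGNOSES = ["diabetes", "hypertension"]

def build_patient_table(rows):
    """Aggregate transactions into one row per patient with identical columns."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["patient_id"]].append(row)

    table = []
    for patient_id, events in sorted(grouped.items()):
        record = {
            "patient_id": patient_id,
            # Patient feature: number of admissions on record.
            "n_admissions": sum(e["type"] == "admission" for e in events),
            # Target: was the patient ever readmitted?
            "target_readmitted": any(e["readmitted"] for e in events),
        }
        # Diagnosis flags: one boolean column per existing illness.
        for dx in FLAGGED_DIAGNOSES:
            record[f"has_{dx}"] = any(e["diagnosis"] == dx for e in events)
        table.append(record)
    return table

patient_table = build_patient_table(transactions)
for row in patient_table:
    print(row)
```

Because every patient row is built from the same template, the resulting table has the same columns for each patient and can then be divided into training and testing sets.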

Correlation /= Causation

While there is a correlation between ice cream and weight gain, it does not equal causation. Think of it a different way. If a person ate 500 calories of ice cream each day and nothing else they would lose weight. Ice cream is not the cause, it is the restricted number of calories causing the weight loss. The reverse holds true for weight gain. - @DecisionSkills / YT Channel
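
One way to see the point numerically is a toy dataset in which a third variable drives both quantities. The sketch below is an assumption-laden illustration, not real nutrition data: `high_calorie` is a hypothetical flag for overall calorie intake, and the other two columns are in arbitrary units.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up people: the confounder shifts both columns by the same amount,
# while the small within-group variation in each column is unrelated.
high_calorie = [0, 0, 0, 0, 1, 1, 1, 1]
ice_cream = [10 * z + a for z, a in zip(high_calorie, [0, 0, 1, 1] * 2)]
weight_gain = [10 * z + b for z, b in zip(high_calorie, [0, 1, 0, 1] * 2)]

# Across everyone, ice cream and weight gain look tightly linked.
overall_r = pearson(ice_cream, weight_gain)

# Within each calorie group, the link disappears.
within_r = {}
for group in (0, 1):
    xs = [x for x, z in zip(ice_cream, high_calorie) if z == group]
    ys = [y for y, z in zip(weight_gain, high_calorie) if z == group]
    within_r[group] = pearson(xs, ys)

print(round(overall_r, 3), within_r)
```

Once the hypothetical confounder is held fixed, the apparent relationship vanishes, which is why pairwise correlations are treated as leads for further study and not as conclusions.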

Summary

  • Data understanding covers tasks such as building the dataset and ensuring it is fit for the question we are trying to answer
  • Data understanding uses various analytic approaches such as descriptive statistics, predictive statistics, or both
  • Data understanding should help complete the definition of the data requirements

References