[ ITDM ] Lecture 3 - Data Cleaning

  • Outline:
    • Data
    • Data selection
    • Data cleaning
    • Data normalization
    • Data discretization
      • equal width, equal depth, cluster analysis
    • Dimension reduction
      • Feature selection
      • Feature extraction: PCA

Data

  • A collection of data objects and their attributes
  • Attribute: property or characteristic of an object

    • variable, field, characteristic, or feature
  • A collection of attrubures describe an object

    • record, point, case, sample, entity, or instance

Types of attrubutes

  • Nominal
    • id numbers, eye color, zip codes
  • Ordinal
    • rankings, grades, height
  • Interval
    • calendar dates, temperatures in Celsius or Fahrenheit
  • Ratio
    • temperature in Kelvin, length, time, counts

Properties of Attributes

  • Distinctness: =&!=
  • Order: <>
  • Addition: + -
  • Multiplication: */

  • Nominal attribute: distinctness
  • Ordinal attribute: distinctness & order
  • interval attribute: distinctness, order & addition
  • ratio attribute: all four properties

Data Selection

Sampling

  • main technique
  • often used for both preliminary investigation of the data and the final data analysis
  • Why: Obtaining the entire dataset is too expensive or impossible
    • Processing the data is too expensive or time consuming

Key Principle of Sampling: Representative

  • using a sample should work as well as the entire dataset
  • representative if it has approximately the same distribution of properties as the original set of data

Types of Sampling

  • Simple random sampling
  • Sampling without replacement
  • Sampling with replacement

  • Stratified sampling

1
2
3
4
5
6
7
8
9
sample(x, size, replace=FALSE, prob=NULL)
#sample 5 observations from grade
set.seed(1)
sampled_rows = sample(1:nrow(grade),5)
sampled_grade = grade[sampled_rows,]

#or equally using:
set.seed(1)
sampled_grade = grade[saple(1:nrow(grade), 5),]

Data Cleaning

Steps of Data Cleaning

  • real-world data tends to be incomplete, noisy, and inconsistent
    • Handle missing values
    • Quality-check
    • Deduplicate

Missing values

  • Types:
    • Missing completely at random
    • missing at random
    • missing not at random

Handling Missing Values

  • Delete the observation
  • Fill in missing values with:
    • the attribute mean, max, min, 0 or specified value
    • the attribute mean for all samples belonging to the same class (smarter)
    • the most probable value: inference-based such as Bayesian formula or decision tree

Data Normalization

Decimal Scaling

  • $v’ = \frac{v}{10^j}$
    • where j is the smallest integer such that max(|v’|)<1

Min-Max Normalization

  • linear transformation among the original data
  • $v’ = \frac{v-min_A}{max_A-min_A}(newMax_A-newMin_A)+new_min_A$

Z-Score Normalization

  • 正态化
1
2
3
x=c(1,3,5,7)
scale(x, center=T, scale=F) #不收拢
scale(x) #标准化

Data Discretization

  • Divide the range of a continuous attribute into intervals
  • Reasons
    • algorithms only accept categorical attributes
    • reduce data size by discretization

Equal-width Discretization

  • L = $\frac{max-min}{N}$
  • Characteristics:
    • The most straightforward method, but not handle well on skewed data.
    • Sensitive to N and outliers

Equal-depth Discretization

  • Divides the range into N intervals, each containing approximately the same number of samples.
  • Characteristics
    • Good data scaling
    • Not so sensitive to outliers

Cluster Analysis

  • Discretize a numeric attribute by partitioning the values of the attribute into clusters or groups.
  • Characteristics
    -