29  EDA with dlookr


https://rpubs.com/linggaajiandika/EDA

29.1 00. Introduction

Exploratory Data Analysis (EDA) is usually the first step in the data analysis process. The approach was developed by John Tukey in the 1970s. In statistics, exploratory data analysis means analyzing data sets to summarize their main characteristics, often with visual methods.

Note: some data-processing steps, such as cbind or selecting variables, are skipped because they are not the focus of this tutorial.

29.2 01. Load Package

# install.packages("dlookr")
library(dplyr)   # a grammar of data manipulation
library(tibble)  # a modern take on data frames
library(dlookr)  # tools for data diagnosis, exploration, and transformation (main library)

29.3 02. Load Dataset

The dataset used is airquality, which ships with R. It contains daily air quality measurements in New York from May to September 1973.

# View Data
head(airquality)
A data.frame: 6 × 6
  Ozone Solar.R  Wind  Temp Month   Day
  <int>   <int> <dbl> <int> <int> <int>
1    41     190   7.4    67     5     1
2    36     118   8.0    72     5     2
3    12     149  12.6    74     5     3
4    18     313  11.5    62     5     4
5    NA      NA  14.3    56     5     5
6    28      NA  14.9    66     5     6

29.4 03. Data Diagnosis

The first step is a simple diagnosis of the data.

29.4.1 3.1 General Data Diagnosis

diagnose(airquality)
A tibble: 6 × 6
variables types   missing_count missing_percent unique_count unique_rate
<chr>     <chr>           <int>           <dbl>        <int>       <dbl>
Ozone     integer            37       24.183007           68  0.44444444
Solar.R   integer             7        4.575163          118  0.77124183
Wind      numeric             0        0.000000           31  0.20261438
Temp      integer             0        0.000000           40  0.26143791
Month     integer             0        0.000000            5  0.03267974
Day       integer             0        0.000000           31  0.20261438
  • variables : variable names
  • types : the data type of the variables
  • missing_count : number of missing values
  • missing_percent : percentage of missing values
  • unique_count : number of unique values
  • unique_rate : ratio of unique values to observations (unique_count / number of observations)
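
The column definitions above can be checked by recomputing them by hand. The following is a minimal base R sketch, not dlookr's actual implementation. The helper name manual_diagnose is hypothetical, and the sketch assumes that NA counts as one distinct value in unique_count, which is how unique() behaves by default.

```r
# Hypothetical helper: recompute the diagnose() columns with base R only.
manual_diagnose <- function(df) {
  n <- nrow(df)
  # Number of NA entries per column
  missing_count <- vapply(df, function(x) sum(is.na(x)), integer(1))
  # Assumption: NA counts as one distinct value, as unique() does by default
  unique_count <- vapply(df, function(x) length(unique(x)), integer(1))
  data.frame(
    variables       = names(df),
    types           = vapply(df, function(x) class(x)[1], character(1)),
    missing_count   = missing_count,
    missing_percent = 100 * missing_count / n,
    unique_count    = unique_count,
    unique_rate     = unique_count / n,
    row.names       = NULL
  )
}

manual <- manual_diagnose(airquality)
print(manual)
```

Comparing the printed result with the diagnose(airquality) output is a quick way to confirm how each column is defined, in particular that missing_percent is on a 0 to 100 scale while unique_rate is on a 0 to 1 scale.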

29.4.2 3.2 Diagnose Numeric Variable

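diagnose_numeric() reports on numeric variables only: summary statistics plus counts of zero values (zero), negative values (minus), and outliers (outlier).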

diagnose_numeric(airquality)
A data.frame: 6 × 10
variables   min     Q1       mean median     Q3   max  zero minus outlier
<chr>     <dbl>  <dbl>      <dbl>  <dbl>  <dbl> <dbl> <int> <int>   <int>
Ozone       1.0  18.00  42.129310   31.5  63.25 168.0     0     0       2
Solar.R     7.0 115.75 185.931507  205.0 258.75 334.0     0     0       0
Wind        1.7   7.40   9.957516    9.7  11.50  20.7     0     0       3
Temp       56.0  72.00  77.882353   79.0  85.00  97.0     0     0       0
Month       5.0   6.00   6.993464    7.0   8.00   9.0     0     0       0
Day         1.0   8.00  15.803922   16.0  23.00  31.0     0     0       0

# Optional: generate an HTML diagnosis report (left disabled here)
# diagnose_web_report(airquality)
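
The zero, minus, and outlier columns of the diagnose_numeric() output can be approximated with base R. This is a minimal sketch under stated assumptions: the helper name count_flags is hypothetical, and outliers are assumed to be the points beyond the boxplot whiskers as returned by boxplot.stats(), which may not be exactly the rule dlookr applies.

```r
# Hypothetical helper: count zeros, negatives, and boxplot-rule outliers.
count_flags <- function(x) {
  c(
    zero    = sum(x == 0, na.rm = TRUE),
    minus   = sum(x < 0, na.rm = TRUE),
    # Assumption: an outlier lies more than 1.5 times the hinge spread
    # beyond the hinges; boxplot.stats() drops NA values itself.
    outlier = length(boxplot.stats(x)$out)
  )
}

# One row per variable, one column per count
flags <- t(vapply(airquality, count_flags, c(zero = 0, minus = 0, outlier = 0)))
print(flags)
```

For airquality, this rule flags the same number of outliers per variable as the table above, which suggests (but does not prove) that the outlier column follows the boxplot convention.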