Exploratory Data Analysis in R

Categories: R, Data Manipulations

Author: Levent Bulut

Data Manipulations in R

Exploratory Data Analysis (EDA) is an essential step in making sense of large and complex datasets: it allows us to uncover patterns, identify outliers, and gain a deeper understanding of our data. First, we will go over some basic data manipulations in R.

Our data is stored in an R object called mydata.

mydata <- read.csv("mlb11.csv", header = TRUE)
library(corrplot)
library(dplyr)
library(ggplot2)
library(knitr)

Check the column names and the data dimensions

In R, you can check the column names of a data frame using either the colnames() or the names() function.

colnames(mydata)
 [1] "team"         "runs"         "at_bats"      "hits"         "homeruns"    
 [6] "bat_avg"      "strikeouts"   "stolen_bases" "wins"         "new_onbase"  
[11] "new_slug"     "new_obs"      "Region"      

You can check the dimensions of a data frame using the dim() function, which returns the number of rows followed by the number of columns.

dim(mydata)
[1] 36 13
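
If you only need one of the two dimensions, base R also provides nrow() and ncol(); a quick sketch:

nrow(mydata)  # number of rows
ncol(mydata)  # number of columns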

You can use the str() function to check the data structure of an object.

str(mydata)
'data.frame':   36 obs. of  13 variables:
 $ team        : chr  "Texas Rangers" "Boston Red Sox" "Detroit Tigers" "Kansas City Royals" ...
 $ runs        : int  855 875 787 730 762 718 867 721 735 615 ...
 $ at_bats     : int  5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 ...
 $ hits        : int  1599 1600 1540 1560 1513 1477 1452 1422 1429 1442 ...
 $ homeruns    : int  210 203 169 129 162 108 222 185 163 95 ...
 $ bat_avg     : num  0.283 0.28 0.277 0.275 0.273 0.264 0.263 0.261 0.258 0.258 ...
 $ strikeouts  : int  930 1108 1143 1006 978 1085 1138 1083 1201 1164 ...
 $ stolen_bases: int  143 102 49 153 57 130 147 94 118 118 ...
 $ wins        : int  96 90 95 71 90 77 97 96 73 56 ...
 $ new_onbase  : num  0.34 0.349 0.34 0.329 0.341 0.335 0.343 0.325 0.329 0.311 ...
 $ new_slug    : num  0.46 0.461 0.434 0.415 0.425 0.391 0.444 0.425 0.41 0.374 ...
 $ new_obs     : num  0.8 0.81 0.773 0.744 0.766 0.725 0.788 0.75 0.739 0.684 ...
 $ Region      : chr  "South" "North" "North" "North" ...
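
Since dplyr is already loaded, its glimpse() function offers a similar compact overview; a minimal sketch:

# One line per column, showing the type and the first few values
glimpse(mydata)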

Assign multiple columns as factors

We can use lapply() to apply the factor() function to a selection of columns. The resulting data frame mydata will have the columns "team" and "Region" stored as factors.

Note that lapply() applies the factor() function to each column separately. This is a convenient approach when you have many columns to convert to factors.

# Store the target column names in a vector (avoid the name "list",
# which would mask base R's list() function)
cols <- c('team', 'Region')
mydata[cols] <- lapply(mydata[cols], factor)
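
A quick check that the conversion worked:

# Both columns should now report class "factor"
sapply(mydata[cols], class)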

Check missing values in a data frame

You can use the is.na() function to check for missing values in a vector or data frame. To count the missing values in a data frame, you can use the sum() function in combination with is.na(): sum() counts the TRUE values, which is the total number of missing entries.

Check the total number of missing values in the column hits. is.na(mydata$hits) creates a logical vector that is TRUE wherever hits is missing. The sum() function then counts the number of missing values.

sum(is.na(mydata$hits))
[1] 6

Check the total number of missing values in mydata. This code sums up the missing values across all columns.

sum(is.na(mydata))
[1] 66

Check the total number of missing values in each column in mydata

sapply(mydata, function(x) sum(is.na(x)))
        team         runs      at_bats         hits     homeruns      bat_avg 
           0            6            6            6            6            6 
  strikeouts stolen_bases         wins   new_onbase     new_slug      new_obs 
           6            6            6            6            6            6 
      Region 
           0 
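
Because is.na(mydata) returns a logical matrix, colSums() gives the same per-column counts; a minimal sketch:

# TRUE counts as 1, so the column sums are the per-column NA counts
colSums(is.na(mydata))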

Remove or drop rows with NA using the na.omit() function:

You can use the na.omit() function to remove rows with missing values from a data frame. Let's remove all rows with a missing value from mydata and call the reduced dataset mydata1. Note that na.omit() returns a new data frame and does not modify the original one.

mydata1 <- na.omit(mydata)
# Check the total number of missing values in mydata1
sum(is.na(mydata1))
[1] 0
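
To see how many rows were dropped, compare the dimensions before and after:

dim(mydata)   # original data
dim(mydata1)  # after na.omit()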

Alternatively, we can use the complete.cases() function for the same task. Let's remove all rows with a missing value from mydata and call the reduced dataset mydata2:

mydata2 <- mydata[complete.cases(mydata),]

# Check the total number of missing values in mydata2
sum(is.na(mydata2))
[1] 0

What if we want to remove rows based on missing values in a single column? Let's remove the rows of mydata where the hits column has a missing value and store the new dataset as mydata3.

mydata3 <- mydata[complete.cases(mydata$hits), ]

Here, complete.cases(mydata$hits) creates a logical vector that identifies the rows with a non-missing value in the hits column.
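
For a single column, complete.cases() is equivalent to negating is.na(); a minimal sketch:

# The two logical vectors agree element by element
all(complete.cases(mydata$hits) == !is.na(mydata$hits))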

Data Filtering in R

The subset() function is one way to create a subset of a data frame based on logical conditions. For instance, we can use subset() to remove missing values from a data frame. The code below takes mydata and selects the rows where the runs column has no missing value; note that inside subset() we can refer to the column by its bare name.

mydata3 <- subset(mydata, !is.na(runs))

The filter() function from the dplyr package is another way to filter a data frame. We can achieve the same goal with the following line of code; like subset(), filter() lets us refer to columns by their bare names.

# dplyr was loaded above
mydata3 <- filter(mydata, !is.na(runs))
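
With dplyr, the same filter is often written with the pipe operator; a minimal sketch:

# Pipe mydata into filter(); rows with a missing runs value are dropped
mydata3 <- mydata %>% filter(!is.na(runs))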

Correlation

Correlation analysis is another powerful tool in data analysis that helps us understand the relationship between two variables. One common measure is Pearson's correlation coefficient, which measures the strength and direction of the linear relationship between two continuous variables.

To calculate the Pearson correlation coefficient between two variables in R, we can use the cor() function. For example, let's calculate the correlation between the runs and wins variables in the mydata2 dataset:

cor(mydata2$runs, mydata2$wins)
[1] 0.6008088
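
As a sanity check, we can compute Pearson's r directly from its definition and compare it with the cor() output; a minimal sketch:

# r = covariance of x and y divided by the product of their standard deviations
x <- mydata2$runs
y <- mydata2$wins
sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))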

The correlation is positive and fairly strong. Let's plot it below. The code below creates a scatter plot with a regression line that shows the relationship between runs and wins: method = "lm" draws the fitted line, and se = TRUE adds a confidence band around it.

library(ggplot2)

ggplot(mydata2, aes(x = runs, y = wins)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
`geom_smooth()` using formula = 'y ~ x'
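
The message simply reports the default smoothing formula. To silence it, pass the formula explicitly; a minimal sketch:

ggplot(mydata2, aes(x = runs, y = wins)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE)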

We can use the corrplot package for a nicer visualization of the full correlation matrix.

# Drop the non-numeric columns, compute the correlation matrix, and call it Correl
Correl <- mydata2 %>% select(-Region, -team) %>% cor()

# Round the correlation coefficients in Correl to 2 decimal places
Correl <- round(Correl, 2)

# Visualize the correlation matrix using a correlation plot
corrplot(Correl, method = "circle")

# Alternatively, method = "number" prints the correlation coefficients themselves
corrplot(Correl, method = "number")

Now, let's sort the correlation coefficients by absolute value and list the top n variables that have the highest correlation (in absolute value) with a variable of interest, say runs. The runs variable is stored in the first column of our correlation matrix.

# Sort the rows of Correl by the absolute value of the first column (the
# correlations with runs), in descending order, and call the result sorted_matrix
sorted_matrix <- Correl[order(abs(Correl[, 1]), decreasing = TRUE), ]

# Keep only the first 5 rows of the first column; top_5_variables stores
# each variable's correlation coefficient with runs
top_5_variables <- sorted_matrix[1:5, 1]

# Print the top 5 variables
top_5_variables
      runs    new_obs new_onbase   new_slug    bat_avg 
      1.00       1.00       0.99       0.99       0.94
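
Note that runs tops the list because every variable has a correlation of 1 with itself. To list only the other variables, skip the first row; a minimal sketch:

# Drop the self-correlation and keep the next 5 strongest correlates of runs
sorted_matrix[2:6, 1]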