Exploratory Data Analysis is an essential step in making sense of large and complex datasets as it allows us to uncover patterns, identify outliers, and gain a deeper understanding of our data. First, we will go over some basic data manipulations in R.
We can use lapply() to apply the factor() function to several columns at once. The resulting data frame mydata will have the columns “team” and “Region” stored as factors.
Note that lapply() applies factor() to each column separately. This is convenient when you have many columns to convert to factors.
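The conversion described above might look like the following minimal sketch. The data frame toy and its values are made up for illustration. Only the column names team, Region and runs are borrowed from mydata.

```r
# Toy stand-in for mydata; the values are invented for illustration
toy <- data.frame(
  team   = c("A", "B", "C"),
  Region = c("East", "West", "East"),
  runs   = c(700, 650, 720),
  stringsAsFactors = FALSE
)

# Names of the columns to convert
cols <- c("team", "Region")

# lapply() calls factor() on each selected column and returns a list,
# which is assigned back into the same columns of the data frame
toy[cols] <- lapply(toy[cols], factor)

# Confirm the class of every column
sapply(toy, class)
```

Assigning back into toy[cols] keeps the remaining columns, such as runs, untouched.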
You can use the is.na() function to check for missing values in a vector or data frame. To count them, combine is.na() with sum(): is.na() returns TRUE for every missing entry, and sum() counts the TRUE values. Applied to a whole data frame, this gives the total number of missing values across all columns, not a count per column.
Check the total number of missing values in the hits column. is.na(mydata$hits) creates a logical vector that is TRUE wherever hits is missing, and sum() counts those TRUE values.
sum(is.na(mydata$hits))
[1] 6
Check the total number of missing values in mydata. This code sums the missing values across all columns.
sum(is.na(mydata))
[1] 66
Check the total number of missing values in each column of mydata. One way to do this is colSums(is.na(mydata)).
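A per-column count can be sketched with colSums() on top of is.na(). The toy data frame below is made up for illustration and only borrows the column names hits, runs and wins from mydata.

```r
# Toy stand-in for mydata with a few missing values (invented numbers)
toy <- data.frame(
  hits = c(1400, NA, 1350, NA),
  runs = c(700, 650, NA, 720),
  wins = c(90, 81, 85, 88)
)

# is.na() returns a logical matrix with the same shape as the data frame;
# colSums() adds up the TRUE values column by column
na_per_column <- colSums(is.na(toy))
na_per_column
```

Here hits has two missing values, runs has one and wins has none, so the result is a named vector with one count per column.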
Remove or drop rows with NA using the na.omit() function:
You can use the na.omit() function to remove rows with missing values from a data frame. Let’s remove all rows with a missing value from mydata and call the reduced dataset mydata1. Note that na.omit() returns a new data frame object and does not modify the original data frame.
mydata1 <- na.omit(mydata)
# Check the total number of missing values in mydata1
sum(is.na(mydata1))
[1] 0
Alternatively, we can use the complete.cases() function to do the same task. Let’s remove all rows with a missing value from mydata and call the reduced dataset mydata2.
mydata2 <- mydata[complete.cases(mydata), ]
# Check the total number of missing values in mydata2
sum(is.na(mydata2))
[1] 0
What if we want to remove rows based on missing values in a single column? Let’s remove the rows of mydata where the hits column is missing and store the new dataset as mydata3.
mydata3 <- mydata[complete.cases(mydata$hits), ]
complete.cases(mydata$hits) creates a logical vector that is TRUE for the rows where hits is not missing, so only those rows are kept.
Data Filtering in R
The subset() function is one way to create a subset of a data frame based on logical conditions. For instance, we can use subset() to remove rows with missing values. The code below takes mydata and selects the rows where the runs column is not missing.
mydata3 <- subset(mydata, !is.na(runs))
The filter() function from the dplyr package is another way to filter a data frame, and it can achieve the same goal.
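A minimal sketch of the dplyr version, assuming the dplyr package is installed. The data frame toy is made up for illustration. With the real data, mydata would take its place.

```r
library(dplyr)

# Toy stand-in for mydata (invented numbers)
toy <- data.frame(
  runs = c(700, NA, 720),
  wins = c(90, 81, 88)
)

# filter() keeps the rows for which the condition is TRUE,
# here the rows where runs is not missing
toy_filtered <- toy %>% filter(!is.na(runs))

nrow(toy_filtered)
```

Unlike base subsetting with [ ], filter() refers to columns by bare name, so there is no need to write toy$runs inside the call.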
Correlation analysis is another powerful tool in data analysis that helps us understand the relationship between two variables. One common measure is called Pearson’s correlation which measures the linear relationship between two continuous variables.
To calculate the Pearson correlation coefficient between two variables in R, we can use the cor function. For example, let’s calculate the correlation between the mydata2 dataset’s runs and wins variables:
cor(mydata2$runs , mydata2$wins)
[1] 0.6008088
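To see what cor() computes, the sketch below rebuilds the Pearson coefficient by hand as the covariance divided by the product of the two standard deviations, then compares it with the built-in result. The vectors x and y are made-up numbers, not values from mydata2.

```r
# Two short made-up vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson correlation from its definition:
# covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in version (method = "pearson" is the default)
r_builtin <- cor(x, y)

# The two agree up to floating point tolerance
all.equal(r_manual, r_builtin)
```

By default cor() returns NA as soon as either vector contains a missing value, which is one reason to work with the complete-case data frame mydata2 here.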
The correlation is positive and moderately strong. Let’s plot it below. The code below creates a scatter plot with a fitted regression line that shows the relationship between runs and wins. method = "lm" draws the fitted line, and se = TRUE adds the confidence band around it.
library(ggplot2)
ggplot(mydata2, aes(x = runs, y = wins)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE)
`geom_smooth()` using formula = 'y ~ x'
We can use the corrplot package to visualize the whole correlation matrix.
library(dplyr)
library(corrplot)

# Keep only the continuous variables in mydata2, calculate the correlation matrix and call it Correl
mydata2 %>% select(-Region, -team) %>% cor() -> Correl

# Round the coefficients in the Correl object to 2 decimal places
Correl <- round(Correl, 2)

# Visualize the correlation matrix using a correlation plot
corrplot(Correl, method = "circle")
# Alternatively, you can use the method="number" option to print the correlation coefficient numbers
Now, let’s sort the correlation coefficients by absolute value and list the top n variables that have the highest correlation (in absolute value) with a variable of interest, say runs. The runs variable is stored in the first column of our correlation matrix.
# Sort the rows of the correlation matrix in descending order of the absolute value
# of the first column, which holds the Pearson correlation between each variable and runs
sorted_matrix <- Correl[order(abs(Correl[, 1]), decreasing = TRUE), ]

# Keep the first 5 rows of column 1, so top_5_variables only stores the
# correlation coefficients against runs. The first entry is runs itself, with a correlation of 1.
top_5_variables <- sorted_matrix[1:5, 1]

# Print the top 5 variables
top_5_variables
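Because every variable correlates perfectly with itself, runs always lands at the top of this ranking. A possible variation drops the variable of interest before sorting. The sketch below uses a small hand-written matrix, Correl_toy, with invented coefficients. The column name errors is hypothetical and not taken from the dataset.

```r
# Hand-written symmetric correlation matrix (invented values)
vars <- c("runs", "hits", "wins", "errors")
Correl_toy <- matrix(
  c( 1.00,  0.80,  0.60, -0.30,
     0.80,  1.00,  0.50, -0.20,
     0.60,  0.50,  1.00, -0.40,
    -0.30, -0.20, -0.40,  1.00),
  nrow = 4,
  dimnames = list(vars, vars)
)

# Correlations of every variable with runs, selected by column name
runs_cor <- Correl_toy[, "runs"]

# Drop the trivial self-correlation before ranking
runs_cor <- runs_cor[names(runs_cor) != "runs"]

# Rank by absolute value and keep the top 2
top_2 <- head(runs_cor[order(abs(runs_cor), decreasing = TRUE)], 2)
top_2
```

Selecting the column by name rather than by position also keeps the code working if the column order of the correlation matrix changes.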