Exploratory Data Analysis
This post addresses the following prompt:
Exploratory Data Analysis (EDA) is often the first step of dealing with data. However, EDA is somewhat of an art and something that you get better at with experience. Often, people new to EDA don’t know what they should be looking for! For your blog post, write up the strategy you use for EDA. What is your overall goal when doing an EDA? What methods do you think are important? What things do you try to look for?
The overall goal when doing EDA is to be able to describe the dataset at hand. In order to “describe” it, we can create graphs and visualizations which allow us to identify patterns and other useful information within the data, such as whether there are outliers. Creating numeric summaries is also helpful in allowing us to describe data.
The strategy I use for EDA is first understanding what outcome we are seeking. Next, I take a deeper look at the raw data by using the str()
function or the head()
function. Once I have a good sense of what variables are in the dataset, how big the dataset is, etc., I would calculate some summary statistics for quantitative variables. Lastly, I would create graphics to explore certain relationships between variables and to look for patterns, outliers, etc. using the ggplot2
function. I believe that the method of using both summary statistics and graphs are important in EDA.