R Programming Tutorial: Reading and Analyzing CSV Dataset with Data Frames
This tutorial provides a comprehensive guide to reading a CSV dataset file into R, creating a data frame, and performing various data frame operations. The tutorial uses a sample dataset, “sales_data.csv,” which contains sales information from different store locations. Step-by-step examples demonstrate how to read, inspect, manipulate, and analyze the data using R. The dataset is provided as a separate artifact, and this tutorial assumes it is saved as “sales_data.csv” in your R working directory.
1. Introduction
A data frame in R is a two-dimensional structure ideal for storing and analyzing tabular data, such as datasets stored in CSV files. This tutorial focuses on reading a CSV file into a data frame and performing common data frame operations, including inspection, filtering, aggregation, and visualization. The sample dataset contains sales data with columns for store ID, location, date, sales amount, units sold, and product category.
2. Setting Up the Environment
Before starting, ensure you have R installed and the dataset file “sales_data.csv” saved in your R working directory. You can check or set your working directory using:
# Check current working directory
getwd()
# Set working directory (replace with your path if needed)
setwd("path/to/your/directory")
3. Reading the CSV File into a Data Frame
The read.csv()
function is used to read a CSV file into a data frame.
Example: Reading the Sales Dataset
# Step 1: Read the CSV file
sales_data <- read.csv("sales_data.csv", stringsAsFactors = FALSE)
# Step 2: Display the first few rows
head(sales_data)
Output:
StoreID Location Date Sales UnitsSold Category
1 1 New York 2023-01-01 1500 50 Electronics
2 2 Los Angeles 2023-01-01 1200 40 Clothing
3 3 Chicago 2023-01-01 800 30 Books
4 4 New York 2023-01-02 1600 55 Electronics
5 5 Los Angeles 2023-01-02 1300 45 Clothing
6 6 Chicago 2023-01-02 900 35 Books
Explanation:
read.csv()
loads the CSV file into a data frame namedsales_data
.stringsAsFactors = FALSE
ensures character columns (e.g.,Location
,Category
) are not converted to factors.head()
displays the first six rows by default, providing a quick preview.
4. Inspecting the Data Frame
Inspecting the data frame helps understand its structure, data types, and content.
Example: Inspecting the Sales Data
# Step 1: View the structure
str(sales_data)
# Step 2: Display summary statistics
summary(sales_data)
# Step 3: Check dimensions
dim(sales_data)
Output:
# str(sales_data)
'data.frame': 10 obs. of 6 variables:
$ StoreID : int 1 2 3 4 5 6 7 8 9 10
$ Location : chr "New York" "Los Angeles" "Chicago" "New York" ...
$ Date : chr "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02" ...
$ Sales : int 1500 1200 800 1600 1300 900 1400 1100 850 1700
$ UnitsSold : int 50 40 30 55 45 35 48 38 32 60
$ Category : chr "Electronics" "Clothing" "Books" "Electronics" ...
# summary(sales_data)
StoreID Location Date Sales UnitsSold Category
Min. : 1.00 Length:10 Length:10 Min. : 800 Min. :30.0 Length:10
1st Qu.: 3.25 Class :character Class :character 1st Qu.: 925 1st Qu.:34.5 Class :character
Median : 5.50 Mode :character Mode :character Median :1250 Median :42.5 Mode :character
Mean : 5.50 Mean :1235 Mean :43.3
3rd Qu.: 7.75 3rd Qu.:1525 3rd Qu.:49.5
Max. :10.00 Max. :1700 Max. :60.0
# dim(sales_data)
[1] 10 6
Explanation:
str()
shows column names, data types (e.g., integer, character), and a preview of values.summary()
provides statistical summaries for numeric columns (e.g.,Sales
,UnitsSold
) and length for character columns.dim()
returns the number of rows (10) and columns (6).
5. Converting Date Column
The Date
column is read as a character. Convert it to a proper date format for analysis.
Example: Converting Date
# Step 1: Convert Date to Date type
sales_data$Date <- as.Date(sales_data$Date, format = "%Y-%m-%d")
# Step 2: Verify conversion
str(sales_data$Date)
Output:
Date[1:10], format: "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02" ...
Explanation:
as.Date()
converts theDate
column to a Date object.format = "%Y-%m-%d"
matches the CSV’s date format (YYYY-MM-DD).str()
confirms the column is now of typeDate
.
6. Accessing and Modifying the Data Frame
Data frames can be accessed and modified using indexing or column names.
Example: Accessing and Modifying Sales Data
# Step 1: Access the Sales column
sales_values <- sales_data$Sales
print(sales_values)
# Step 2: Access rows for New York
ny_sales <- sales_data[sales_data$Location == "New York", ]
print(ny_sales)
# Step 3: Modify Sales for StoreID 1
sales_data$Sales[sales_data$StoreID == 1] <- 1550
print(sales_data[sales_data$StoreID == 1, ])
# Step 4: Add a new column for Revenue per Unit
sales_data$RevenuePerUnit <- sales_data$Sales / sales_data$UnitsSold
print(head(sales_data))
Output:
# sales_values
[1] 1500 1200 800 1600 1300 900 1400 1100 850 1700
# ny_sales
StoreID Location Date Sales UnitsSold Category
1 1 New York 2023-01-01 1500 50 Electronics
4 4 New York 2023-01-02 1600 55 Electronics
7 7 New York 2023-01-03 1400 48 Electronics
10 10 New York 2023-01-04 1700 60 Electronics
# Modified StoreID 1
StoreID Location Date Sales UnitsSold Category
1 1 New York 2023-01-01 1550 50 Electronics
# head(sales_data) with RevenuePerUnit
StoreID Location Date Sales UnitsSold Category RevenuePerUnit
1 1 New York 2023-01-01 1550 50 Electronics 31.0
2 2 Los Angeles 2023-01-01 1200 40 Clothing 30.0
3 3 Chicago 2023-01-01 800 30 Books 26.7
4 4 New York 2023-01-02 1600 55 Electronics 29.1
5 5 Los Angeles 2023-01-02 1300 45 Clothing 28.9
6 6 Chicago 2023-01-02 900 35 Books 25.7
Explanation:
- Use
$
to access a column (e.g.,sales_data$Sales
). - Filter rows using logical conditions (e.g.,
Location == "New York"
). - Modify specific values using conditional indexing.
- Add a new column by computing
RevenuePerUnit
(Sales ÷ UnitsSold).
7. Filtering and Subsetting
Filtering extracts specific rows or columns based on conditions.
Example: Filtering Sales Data
# Step 1: Filter sales above 1400
high_sales <- subset(sales_data, Sales > 1400)
print(high_sales)
# Step 2: Select specific columns
location_sales <- sales_data[, c("Location", "Sales", "Category")]
print(head(location_sales))
# Step 3: Filter Electronics category in New York
ny_electronics <- subset(sales_data, Location == "New York" & Category == "Electronics")
print(ny_electronics)
Output:
# high_sales
StoreID Location Date Sales UnitsSold Category RevenuePerUnit
1 1 New York 2023-01-01 1550 50 Electronics 31.0
4 4 New York 2023-01-02 1600 55 Electronics 29.1
10 10 New York 2023-01-04 1700 60 Electronics 28.3
# head(location_sales)
Location Sales Category
1 New York 1550 Electronics
2 Los Angeles 1200 Clothing
3 Chicago 800 Books
4 New York 1600 Electronics
5 Los Angeles 1300 Clothing
6 Chicago 900 Books
# ny_electronics
StoreID Location Date Sales UnitsSold Category RevenuePerUnit
1 1 New York 2023-01-01 1550 50 Electronics 31.0
4 4 New York 2023-01-02 1600 55 Electronics 29.1
7 7 New York 2023-01-03 1400 48 Electronics 29.2
10 10 New York 2023-01-04 1700 60 Electronics 28.3
Explanation:
subset()
filters rows based on conditions (e.g.,Sales > 1400
).- Select columns using a vector of column names.
- Combine conditions with
&
for complex filtering (e.g.,Location == "New York" & Category == "Electronics"
).
8. Aggregating Data
Aggregation summarizes data, such as calculating means or sums by group.
Example: Aggregating Sales Data
# Step 1: Calculate average sales by Location
avg_sales_location <- aggregate(Sales ~ Location, data = sales_data, mean)
print(avg_sales_location)
# Step 2: Calculate total units sold by Category
total_units_category <- aggregate(UnitsSold ~ Category, data = sales_data, sum)
print(total_units_category)
Output:
# avg_sales_location
Location Sales
1 Chicago 850.0
2 Los Angeles 1200.0
3 New York 1562.5
# total_units_category
Category UnitsSold
1 Books 97
2 Clothing 123
3 Electronics 213
Explanation:
aggregate()
groups data byLocation
orCategory
and appliesmean
orsum
.- The result is a new data frame with one row per group.
9. Sorting Data
Sorting organizes the data frame for better analysis.
Example: Sorting Sales Data
# Step 1: Sort by Sales (descending)
sorted_by_sales <- sales_data[order(-sales_data$Sales), ]
print(head(sorted_by_sales))
# Step 2: Sort by Date (ascending)
sorted_by_date <- sales_data[order(sales_data$Date), ]
print(head(sorted_by_date))
Output:
# head(sorted_by_sales)
StoreID Location Date Sales UnitsSold Category RevenuePerUnit
10 10 New York 2023-01-04 1700 60 Electronics 28.3
4 4 New York 2023-01-02 1600 55 Electronics 29.1
1 1 New York 2023-01-01 1550 50 Electronics 31.0
7 7 New York 2023-01-03 1400 48 Electronics 29.2
5 5 Los Angeles 2023-01-02 1300 45 Clothing 28.9
2 2 Los Angeles 2023-01-01 1200 40 Clothing 30.0
# head(sorted_by_date)
StoreID Location Date Sales UnitsSold Category RevenuePerUnit
1 1 New York 2023-01-01 1550 50 Electronics 31.0
2 2 Los Angeles 2023-01-01 1200 40 Clothing 30.0
3 3 Chicago 2023-01-01 800 30 Books 26.7
4 4 New York 2023-01-02 1600 55 Electronics 29.1
5 5 Los Angeles 2023-01-02 1300 45 Clothing 28.9
6 6 Chicago 2023-01-02 900 35 Books 25.7
Explanation:
order()
with-
sorts in descending order (e.g.,-sales_data$Sales
).- Without
-
,order()
sorts in ascending order (e.g.,sales_data$Date
).
10. Visualizing Data
Basic visualization helps explore trends. We’ll use the base R plot()
function for simplicity.
Example: Visualizing Sales Trends
# Step 1: Plot Sales over Date by Location
plot(sales_data$Date[sales_data$Location == "New York"],
sales_data$Sales[sales_data$Location == "New York"],
type = "l", col = "blue", xlab = "Date", ylab = "Sales",
main = "Sales Trend in New York")
Output:
- A line plot showing sales trends for New York over the four days.
Explanation:
plot()
creates a line plot (type = "l"
) for New York’s sales.- Subset data using
sales_data$Location == "New York"
to focus on one location. - Customize with
xlab
,ylab
, andmain
for clarity.
11. Exporting the Data Frame
Save the modified data frame to a new CSV file.
Example: Exporting to CSV
# Step 1: Export the modified data frame
write.csv(sales_data, "sales_data_modified.csv", row.names = FALSE)
# Step 2: Verify by reading back
modified_data <- read.csv("sales_data_modified.csv")
head(modified_data)
Output:
StoreID Location Date Sales UnitsSold Category RevenuePerUnit
1 1 New York 2023-01-01 1550 50 Electronics 31.0
2 2 Los Angeles 2023-01-01 1200 40 Clothing 30.0
3 3 Chicago 2023-01-01 800 30 Books 26.7
4 4 New York 2023-01-02 1600 55 Electronics 29.1
5 5 Los Angeles 2023-01-02 1300 45 Clothing 28.9
6 6 Chicago 2023-01-02 900 35 Books 25.7
Explanation:
write.csv()
saves the data frame to “sales_data_modified.csv”.row.names = FALSE
excludes row indices.- Reading back verifies the export.
12. Best Practices
- Verify File Path: Ensure the CSV file is in the correct directory.
- Check Data Types: Use
str()
to confirm column types (e.g., convert dates). - Handle Missing Values: Check for
NA
values if the dataset is large. - Comment Code: Add comments for clarity and reproducibility.
- Save Modified Data: Export changes to avoid losing work.
13. Conclusion
This tutorial demonstrated how to read a CSV dataset into a data frame in R, inspect its structure, manipulate data, and perform analyses like filtering, aggregation, sorting, and visualization. The sample dataset “sales_data.csv” provided a practical context for learning. By mastering these techniques, you can efficiently work with real-world datasets in R.