Descriptive Statistics using R
Objective
Analyzing a student’s transcript dataset to understand performance metrics.
Dataset Overview
Columns: year, semester, course_number, credits, letter_grade, numerical_value (GPA)
Four academic years of data
Grades for both major and non-major courses
Loading the Dataset
df = read.csv("https://raw.githubusercontent.com/ahmedmoustafa/datasets/main/transcript/transcript.csv")
head(df)
year      semester  course_number  credits  letter_grade  numerical_value
Freshman  Fall      COMP100        3        A-            3.7
Freshman  Fall      COMP110        3        A             4.0
Freshman  Fall      HUMA181        3        A-            3.7
Freshman  Fall      SOCI181        3        A-            3.7
Freshman  Fall      ELEC181        3        B+            3.3
Freshman  Spring    COMP120        3        A             4.0
Measures of Central Tendency - Mode
Mode: The value that appears most frequently in a set.
As discussed before, the mode is more appropriate for qualitative data values.
So, let’s compute the mode of the letter_grade column.
However, R has no built-in function to compute the statistical mode directly (the base mode() function returns an object's storage mode instead).
Therefore, we need to install the DescTools package
if (!require(DescTools))
  install.packages("DescTools", repos = "http://cran.us.r-project.org")
Now we can run the Mode() function from the DescTools package
library(DescTools)
Mode(df$letter_grade)
[1] "A"
attr(,"freq")
[1] 20
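If installing a package is not an option, the mode can also be obtained in base R by counting values with table(). The sketch below is only an illustration: stat_mode is a hypothetical helper name and grades is a made-up vector, so neither comes from the transcript dataset.

```r
# Hypothetical helper (not part of base R or DescTools):
# returns the most frequent value(s) as character strings
stat_mode <- function(x) {
  counts <- table(x)                    # frequency of each distinct value
  names(counts)[counts == max(counts)]  # every value tied for the highest count
}

# Made-up grades, for illustration only
grades <- c("A", "A-", "A", "B+", "A", "A-")
stat_mode(grades)
```

Because the result is taken from the names of the frequency table, it is always a character vector, and ties return more than one value.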
Measures of Spread - Range
Range: Difference between the largest and smallest values, \[ \text{Range} = x_{\text{max}} - x_{\text{min}} \]
max(df$numerical_value) - min(df$numerical_value)
range(df$numerical_value) # Base R (built-in): returns the minimum and maximum, not their difference
Range(df$numerical_value) # From DescTools
[1] 0.7
attr(,"bounds")
[1] 3.3 4.0
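Since base R's range() returns the two bounds rather than a single number, the spread itself can be obtained by wrapping it in diff(). A minimal sketch on made-up values (x is not the transcript data):

```r
# Made-up grade points, for illustration only
x <- c(3.3, 3.7, 4.0, 3.7, 4.0)

range(x)        # the two bounds: minimum and maximum
diff(range(x))  # maximum minus minimum, i.e. the range as a single number
```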
Measures of Spread - IQR
Interquartile Range (IQR): Difference between the third and first quartiles, \[ \text{IQR} = Q_3 - Q_1 \]
First Quartile:
Also known as the lower quartile or the 25th percentile.
It is the value below which 25% of the data falls. In other words, it cuts off the lowest 25% of the data.
Third Quartile:
Also known as the upper quartile or the 75th percentile.
It is the value below which 75% of the data falls, meaning it cuts off the lowest 75% of data points.
quantile(df$numerical_value)
  0%  25%  50%  75% 100%
3.30 3.70 3.85 4.00 4.00
quantiles = quantile(df$numerical_value)
quantiles[4] - quantiles[2]
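Base R also provides IQR(), which performs this subtraction in one step. The sketch below checks that both routes agree on a small made-up vector (x and q are illustrative names, not part of the slides):

```r
# Made-up grade points, for illustration only
x <- c(3.3, 3.7, 3.7, 4.0, 4.0)

# Request only the first and third quartiles
q <- quantile(x, probs = c(0.25, 0.75))

manual_iqr  <- unname(q[2] - q[1])  # Q3 - Q1, with the "75%" name dropped
builtin_iqr <- IQR(x)               # same quantity from the built-in function

all.equal(manual_iqr, builtin_iqr)
```

Indexing by name, e.g. q["75%"] - q["25%"], is a little more robust than indexing by position if the set of requested probabilities ever changes.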
Measures of Spread - Standard Deviation & Variance
Standard Deviation: Measures the amount of variation or dispersion of a set of values, \[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]
Variance: \[ \text{Var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = s^2 \]
s = sd(df$numerical_value)
s^2
Var(df$numerical_value) # From DescTools
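To connect the formulas to the functions, the sample variance and standard deviation can be computed term by term and compared with base R's var() and sd(). This is a sketch on a made-up vector; x, variance, and std_dev are illustrative names.

```r
# Made-up values, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
variance <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
std_dev <- sqrt(variance)

all.equal(variance, var(x))  # agrees with the built-in variance
all.equal(std_dev, sd(x))    # agrees with the built-in standard deviation
```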
Measures of Spread - Mean Absolute Deviation (MAD)
Mean Absolute Deviation (MAD): A measure of dispersion representing the average distance of each data point from the mean
\[ MAD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]
mean(abs(df$numerical_value - mean(df$numerical_value)))
Because it is built on the mean, MAD is sensitive to outliers, though less so than the standard deviation, which squares each deviation.
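The effect of a single extreme value can be seen on a small made-up example. In the sketch below, mean_abs_dev is a hypothetical helper name, and x and x_out are invented vectors. Note that base R's mad() is a different statistic: it computes the median absolute deviation (scaled by a constant by default), not the mean absolute deviation defined above.

```r
# Hypothetical helper implementing the formula above
mean_abs_dev <- function(x) mean(abs(x - mean(x)))

x     <- c(1, 2, 3, 4, 5)   # no outlier
x_out <- c(1, 2, 3, 4, 50)  # last value replaced by an outlier

mean_abs_dev(x)      # average distance from the mean, no outlier
mean_abs_dev(x_out)  # grows sharply once the outlier is present

sd(x)                # standard deviation for comparison
sd(x_out)

mad(x)               # base R: scaled MEDIAN absolute deviation, not the mean version
mad(x_out)           # unchanged here, because medians ignore the extreme value
```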
Exercise - Major GPA vs. non-Major GPA
Using the provided dataset, compare the GPA (the numerical_value column) of the student in their major courses versus their non-major courses. For this dataset, Computer Science courses are the major courses, and their course numbers start with "COMP".
Hint: You might find the startsWith() function in R useful to filter rows based on the course number.
Solution - Major GPA vs. non-Major GPA
We can search for the rows with major courses using startsWith()
flag = startsWith(df$course_number, "COMP")
flag
[1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE
[13] FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[25] FALSE TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
[37] TRUE FALSE FALSE FALSE
TRUE: a major course
FALSE: a non-major course
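The logical vector works as a row selector: positions marked TRUE are kept and the rest are dropped. A small sketch with made-up vectors (courses, points, and is_major are illustrative names) shows the mechanics before applying them to the data frame:

```r
# Made-up course numbers and grade points, for illustration only
courses <- c("COMP100", "HUMA181", "COMP110", "SOCI181")
points  <- c(3.7, 3.7, 4.0, 3.3)

# TRUE where the course number begins with "COMP"
is_major <- startsWith(courses, "COMP")
is_major

points[is_major]   # grade points of the major courses
points[!is_major]  # negating the flag selects the non-major courses
sum(is_major)      # TRUE counts as 1, so this is the number of major courses
```

startsWith() only accepts character vectors. If course_number was read in as a factor (the default for read.csv() before R 4.0), it would need to be wrapped in as.character() first.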
major_courses = df[flag, ]
head(major_courses)
    year       semester  course_number  credits  letter_grade  numerical_value
1   Freshman   Fall      COMP100        3        A-            3.7
2   Freshman   Fall      COMP110        3        A             4.0
6   Freshman   Spring    COMP120        3        A             4.0
7   Freshman   Spring    COMP130        3        A-            3.7
11  Sophomore  Fall      COMP200        4        A             4.0
12  Sophomore  Fall      COMP210        4        A             4.0
major_gpa = median(major_courses$numerical_value)
major_gpa
To filter for non-major courses, we can just negate flag, i.e., !flag
[1] FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
[13] TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
[25] TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
[37] FALSE TRUE TRUE TRUE
TRUE: a non-major course
FALSE: a major course
nonmajor_courses = df[!flag, ]
head(nonmajor_courses)
    year      semester  course_number  credits  letter_grade  numerical_value
3   Freshman  Fall      HUMA181        3        A-            3.7
4   Freshman  Fall      SOCI181        3        A-            3.7
5   Freshman  Fall      ELEC181        3        B+            3.3
8   Freshman  Spring    HUMA191        4        B+            3.3
9   Freshman  Spring    SOCI191        2        A-            3.7
10  Freshman  Spring    ELEC191        3        A-            3.7
nonmajor_gpa = median(nonmajor_courses$numerical_value)
nonmajor_gpa
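The medians above treat every course equally, regardless of its credits. A GPA is commonly computed as a credit-weighted mean, so one possible alternative is weighted.mean() with the credits column as weights. The sketch below uses a made-up data frame named toy that borrows the dataset's column names. It is an assumption about how a GPA might be computed, not the method used in these slides.

```r
# Made-up rows reusing the dataset's column names, for illustration only
toy <- data.frame(
  course_number    = c("COMP200", "COMP210", "HUMA191", "SOCI191"),
  credits          = c(4, 4, 4, 2),
  numerical_value  = c(4.0, 3.7, 3.3, 3.7),
  stringsAsFactors = FALSE  # keep course_number as character for startsWith()
)

is_major <- startsWith(toy$course_number, "COMP")

# Credit-weighted mean: sum(credits * grade points) / sum(credits)
major_w    <- weighted.mean(toy$numerical_value[is_major],  toy$credits[is_major])
nonmajor_w <- weighted.mean(toy$numerical_value[!is_major], toy$credits[!is_major])

major_w
nonmajor_w
```

With weights, a 4-credit course influences the result twice as much as a 2-credit course, which the unweighted median and mean do not capture.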
Using the summary() function
summary(major_courses$numerical_value)
summary(nonmajor_courses$numerical_value)
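Both groups can also be summarized in a single call by grouping on the flag, for instance with tapply(). The sketch below uses made-up vectors (points, is_major, and group are illustrative names) rather than the transcript data:

```r
# Made-up grade points and major/non-major flags, for illustration only
points   <- c(3.7, 4.0, 3.7, 3.3, 4.0, 3.7)
is_major <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Turn the logical flag into readable group labels
group <- ifelse(is_major, "major", "non-major")

tapply(points, group, summary)           # one summary() per group
means   <- tapply(points, group, mean)   # per-group means
medians <- tapply(points, group, median) # per-group medians

means
medians
```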