Descriptive Statistics using R

The Dataset: Student Transcript

Objective

Analyzing a student’s transcript dataset to understand performance metrics.

Dataset Overview

  • Columns: year, semester, course_number, credits, letter_grade, numerical_value (GPA)
  • Four academic years of data
  • Grades for both major and non-major courses

Loading the Dataset

df = read.csv("https://raw.githubusercontent.com/ahmedmoustafa/datasets/main/transcript/transcript.csv")
head(df)
year      semester  course_number  credits  letter_grade  numerical_value
Freshman  Fall      COMP100        3        A-            3.7
Freshman  Fall      COMP110        3        A             4.0
Freshman  Fall      HUMA181        3        A-            3.7
Freshman  Fall      SOCI181        3        A-            3.7
Freshman  Fall      ELEC181        3        B+            3.3
Freshman  Spring    COMP120        3        A             4.0

Measures of Central Tendency - Mean & Median

  • Mean: The average of a set of numbers, \(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)
mean(df$numerical_value)
[1] 3.78
  • Weighted Mean: The mean where some values contribute more than others, \(\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}\)
sum(df$numerical_value * df$credits)/sum(df$credits)
[1] 3.790244
weighted.mean(df$numerical_value, w = df$credits)
[1] 3.790244
  • Trimmed Mean: The mean after removing a specified fraction of the lowest and highest values; trim = 0.1 drops 10% of the observations from each end.
mean(df$numerical_value, trim = 0.1)
[1] 3.8125
  • Median: The middle value in a sorted list of numbers (the average of the two middle values when the count is even).
median(df$numerical_value)
[1] 3.85
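
To make the trim argument concrete, here is a minimal sketch on a small made-up vector (hypothetical values, not taken from the transcript) that contains one unusually low grade. With 10 values and trim = 0.1, R should drop one value from each end before averaging, so the result is expected to match averaging the middle eight sorted values by hand.

```r
# Hypothetical grade points (illustrative only, not from the transcript)
x <- c(4.0, 3.7, 3.3, 4.0, 1.0, 3.7, 4.0, 3.7, 3.3, 4.0)

plain_mean   <- mean(x)               # pulled down by the single 1.0
trimmed_mean <- mean(x, trim = 0.1)   # drops 10% of the values from each end
manual_trim  <- mean(sort(x)[2:9])    # by hand: drop the lowest and the highest value
med          <- median(x)             # middle of the sorted values

print(c(mean = plain_mean, trimmed = trimmed_mean, manual = manual_trim, median = med))
```

The trimmed mean and the median stay close to the bulk of the grades, while the plain mean is dragged toward the low value.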

Measures of Central Tendency - Mode

  • Mode: The value that appears most frequently in a set.
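  • As discussed before, the mode is more appropriate for qualitative (categorical) data.
  • So, let’s compute the mode of the letter_grade column.
  • However, R has no built-in function that computes the statistical mode directly (the base mode() function reports an object's storage type instead).
  • One convenient option is the DescTools package, which we install if it is missing: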
if(!require(DescTools))
  install.packages("DescTools",repos = "http://cran.us.r-project.org")
  • Now we can run the Mode() function from the DescTools package
library(DescTools)
Mode(df$letter_grade)
[1] "A"
attr(,"freq")
[1] 20
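
For comparison, the mode can also be found with base R alone by tabulating the values. The sketch below uses the six letter grades shown in the head(df) output; the helper name base_mode is hypothetical and not part of any package.

```r
# Letter grades from the first six rows of the transcript
grades <- c("A-", "A", "A-", "A-", "B+", "A")

# Hypothetical helper: return every value that reaches the highest count
base_mode <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  names(counts)[counts == max(counts)]    # keep the value(s) with the top frequency
}

print(base_mode(grades))
```

Because the result comes from names(), it is always a character vector, even for numeric input, and it contains more than one element when several values tie for the highest count.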

Descriptive Statistics using DescTools

Measure              Function                 Description
Mode                 Mode(data)               Computes the mode. Returns multiple modes if they exist.
Mean                 Mean(data)               Computes the arithmetic mean.
Weighted Mean        Mean(data, weights = w)  Computes the weighted mean using the weights w.
Median               Median(data)             Computes the median.
Trimmed Mean         Mean(data, trim = t)     Computes the trimmed mean; t is the fraction (0 to 0.5) of observations trimmed from each end.
Standard Deviation   SD(data)                 Computes the sample standard deviation.
Variance             Var(data)                Computes the sample variance.
Range                Range(data)              Computes the range (difference between max and min).
Interquartile Range  IQR(data)                Computes the interquartile range (IQR() comes from base R's stats package).

Measures of Spread - Range

  • Range: Difference between the largest and smallest values, \[ \text{Range} = x_{\text{max}} - x_{\text{min}} \]
max(df$numerical_value) - min(df$numerical_value)
[1] 0.7
range(df$numerical_value) # Base R (built-in): returns the minimum and maximum, not their difference
[1] 3.3 4.0
Range(df$numerical_value) # From DescTools
[1] 0.7
attr(,"bounds")
[1] 3.3 4.0
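
Since base R's range() returns the two bounds rather than a single number, one common idiom is to pass its result to diff(). A minimal sketch, using the numerical_value entries of the six rows shown in the head(df) output:

```r
# numerical_value of the first six rows of the transcript
x <- c(3.7, 4.0, 3.7, 3.7, 3.3, 4.0)

bounds <- range(x)       # c(minimum, maximum)
spread <- diff(bounds)   # maximum - minimum

print(bounds)
print(spread)
```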

Measures of Spread - IQR

  • Interquartile Range (IQR): Difference between the third and first quartiles, \[ \text{IQR} = Q_3 - Q_1 \]
    • First Quartile:
      • Also known as the lower quartile or the 25th percentile.
      • It is the value below which 25% of the data falls. In other words, it cuts off the lowest 25% of the data.
    • Third Quartile:
      • Also known as the upper quartile or the 75th percentile.
      • It is the value below which 75% of the data falls, meaning it cuts off the lowest 75% of data points.
quantile(df$numerical_value)
  0%  25%  50%  75% 100% 
3.30 3.70 3.85 4.00 4.00 
quantiles = quantile(df$numerical_value)
quantiles[4] - quantiles[2]
75% 
0.3 
IQR(df$numerical_value)
[1] 0.3
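
Instead of indexing the full five-number output by position, the two quartiles can be requested directly with the probs argument of quantile(). The sketch below again uses the first six numerical_value entries, so its result is not expected to equal the 0.3 obtained for the full dataset.

```r
# numerical_value of the first six rows of the transcript
x <- c(3.7, 4.0, 3.7, 3.7, 3.3, 4.0)

# Request only the first and third quartiles
q <- quantile(x, probs = c(0.25, 0.75))

iqr_manual  <- unname(q[2] - q[1])   # Q3 - Q1, with the "75%" label dropped
iqr_builtin <- IQR(x)                # same quantity from the built-in function

print(q)
print(c(manual = iqr_manual, builtin = iqr_builtin))
```

Note that quantile() supports several interpolation rules through its type argument; both calls above rely on the default, so they should agree with each other.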

Measures of Spread - Standard Deviation & Variance

  • Standard Deviation: Measures the amount of variation or dispersion of a set of values, \[s = \sqrt {\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\]
sd(df$numerical_value)
[1] 0.2613574
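  • Variance: The average squared deviation from the mean (using the same n-1 denominator), i.e., the square of the standard deviation, \[\text{Var}(x) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = s^2\]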
s = sd(df$numerical_value)
s^2
[1] 0.06830769
Var(df$numerical_value) # From DescTools
[1] 0.06830769
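
To connect the formulas to the built-in functions, the following sketch computes the sample variance and standard deviation step by step and compares them with var() and sd(). It uses the first six numerical_value entries as a small example, so the numbers differ from the full-dataset results above.

```r
# numerical_value of the first six rows of the transcript
x <- c(3.7, 4.0, 3.7, 3.7, 3.3, 4.0)

n          <- length(x)
deviations <- x - mean(x)                      # x_i - x_bar
var_manual <- sum(deviations^2) / (n - 1)      # sample variance (n - 1 denominator)
sd_manual  <- sqrt(var_manual)                 # sample standard deviation

print(c(var_manual = var_manual, var_builtin = var(x)))
print(c(sd_manual = sd_manual, sd_builtin = sd(x)))
```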

Measures of Spread - Mean Absolute Deviation (MAD)

  • Mean Absolute Deviation (MAD): A measure of dispersion representing the average distance of each data point from the mean

\[ MAD = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]

mean(abs(df$numerical_value - mean(df$numerical_value)))
[1] 0.22
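  • MAD is affected by outliers because it is built on the mean, although less strongly than the standard deviation, which squares each deviation.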
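
A point that is easy to trip over: base R's mad() function computes the median absolute deviation, not the mean absolute deviation defined above, and by default it multiplies the result by a constant of 1.4826. The sketch below, on the first six numerical_value entries, shows the two quantities side by side.

```r
# numerical_value of the first six rows of the transcript
x <- c(3.7, 4.0, 3.7, 3.7, 3.3, 4.0)

# Mean absolute deviation: average distance from the mean
mean_abs_dev <- mean(abs(x - mean(x)))

# Median absolute deviation: median distance from the median
# constant = 1 switches off the default 1.4826 scaling factor
median_abs_dev <- mad(x, constant = 1)

print(c(mean_abs_dev = mean_abs_dev, median_abs_dev = median_abs_dev))
```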

Exercise - Major GPA vs. non-Major GPA

  • Using the provided dataset, compare the GPA (the numerical_value column) of the student in their major courses versus the non-major courses. For this dataset, Computer Science courses are the major courses, and their course numbers start with "COMP".

  • Hint: You might find the startsWith() function in R useful to filter rows based on the course number.

Solution - Major GPA vs. non-Major GPA

  • Major Courses

We can identify the rows of major courses using startsWith():

flag = startsWith(df$course_number, "COMP")
flag
 [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[13] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
[25] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE
[37]  TRUE FALSE FALSE FALSE
  • TRUE : a major course
  • FALSE : a non-major course
major_courses = df[flag, ]
head(major_courses)
    year       semester  course_number  credits  letter_grade  numerical_value
1   Freshman   Fall      COMP100        3        A-            3.7
2   Freshman   Fall      COMP110        3        A             4.0
6   Freshman   Spring    COMP120        3        A             4.0
7   Freshman   Spring    COMP130        3        A-            3.7
11  Sophomore  Fall      COMP200        4        A             4.0
12  Sophomore  Fall      COMP210        4        A             4.0
major_gpa = median(major_courses$numerical_value)
major_gpa
[1] 4
  • Non-Major Courses

To filter for non-major courses, we can negate flag with the ! operator:

!flag
 [1] FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
[13]  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
[25]  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE
[37] FALSE  TRUE  TRUE  TRUE
  • TRUE : a non-major course
  • FALSE : a major course
nonmajor_courses = df[!flag, ]
head(nonmajor_courses)
    year      semester  course_number  credits  letter_grade  numerical_value
3   Freshman  Fall      HUMA181        3        A-            3.7
4   Freshman  Fall      SOCI181        3        A-            3.7
5   Freshman  Fall      ELEC181        3        B+            3.3
8   Freshman  Spring    HUMA191        4        B+            3.3
9   Freshman  Spring    SOCI191        2        A-            3.7
10  Freshman  Spring    ELEC191        3        A-            3.7
nonmajor_gpa = median(nonmajor_courses$numerical_value)
nonmajor_gpa
[1] 3.7
  • Using the summary() function
summary(major_courses$numerical_value)
Min.  1st Qu.  Median  Mean    3rd Qu.  Max.
3.3   3.7      4       3.8375  4        4
summary(nonmajor_courses$numerical_value)
Min.  1st Qu.  Median  Mean      3rd Qu.  Max.
3.3   3.7      3.7     3.741667  4        4
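
The solution above summarizes each group with the median. A GPA is more commonly reported as a credit-weighted mean, so as an alternative sketch (not the approach used above) the comparison can be repeated with weighted.mean(). To stay self-contained, the example rebuilds only the twelve rows visible in the head() outputs; the name toy is hypothetical, and the results apply to these twelve rows, not to the full transcript.

```r
# The twelve rows shown in the head() outputs above (rows 1 to 12 of the transcript)
toy <- data.frame(
  course_number   = c("COMP100", "COMP110", "HUMA181", "SOCI181", "ELEC181", "COMP120",
                      "COMP130", "HUMA191", "SOCI191", "ELEC191", "COMP200", "COMP210"),
  credits         = c(3, 3, 3, 3, 3, 3, 3, 4, 2, 3, 4, 4),
  numerical_value = c(3.7, 4.0, 3.7, 3.7, 3.3, 4.0, 3.7, 3.3, 3.7, 3.7, 4.0, 4.0),
  stringsAsFactors = FALSE   # keep course_number as character for startsWith()
)

# TRUE for major (Computer Science) courses, FALSE otherwise
is_major <- startsWith(toy$course_number, "COMP")

# Credit-weighted mean of the grade points within each group
gpa_major    <- weighted.mean(toy$numerical_value[is_major],  w = toy$credits[is_major])
gpa_nonmajor <- weighted.mean(toy$numerical_value[!is_major], w = toy$credits[!is_major])

print(c(major = gpa_major, non_major = gpa_nonmajor))
```

Applying the same two weighted.mean() calls to major_courses and nonmajor_courses would give the credit-weighted GPAs for the full dataset.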