R Essential Functions

Miscellaneous Functions

unique()

  • The unique() function returns a vector or data frame with duplicated elements (or duplicated rows) removed.
  • Example:
x = c(1, 2, 2, 3, 4, 4, 5)
unique(x)
[1] 1 2 3 4 5
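
The data-frame case mentioned above can be sketched as follows. The data frame df_dup and its columns are made up for illustration; for a data frame, unique() drops duplicated rows rather than individual values.

```r
# Hypothetical data frame with one repeated row
df_dup = data.frame(id = c(1, 2, 2, 3), grp = c("a", "b", "b", "c"))

# unique() keeps the first occurrence of each distinct row
df_unique = unique(df_dup)
nrow(df_unique)      # 3 rows remain

# duplicated() flags the repeats that unique() drops
duplicated(df_dup)   # FALSE FALSE TRUE FALSE
```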

any() and all()

  • any() returns TRUE if any of the values are TRUE.
v = c(FALSE, FALSE, TRUE)
any(v)
[1] TRUE
  • all() returns TRUE if all of the values are TRUE.
all(v)
[1] FALSE
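
In practice any() and all() are usually applied to the result of a comparison. The sketch below uses a made-up vector x with a missing value to show how NA affects the answer and how na.rm = TRUE changes it.

```r
# Hypothetical measurements, one of them missing
x = c(4, 8, 15, NA)

any(x > 10)                # TRUE: 15 exceeds 10, so the NA does not matter
all(x > 0)                 # NA: the missing value could be <= 0
all(x > 0, na.rm = TRUE)   # TRUE once the NA is dropped
```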

ifelse()

  • The ifelse() function tests a condition on each element of a vector and returns one value where the condition is TRUE and another where it is FALSE.

  • Example:

numbers = 1:10
numbers
 [1]  1  2  3  4  5  6  7  8  9 10
ifelse(numbers %% 2 == 0, "Even", "Odd")
 [1] "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even" "Odd"  "Even"
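
A further sketch, assuming a made-up vector of exam scores: ifelse(test, yes, no) works element by element, and nesting one call inside another gives a three-way choice.

```r
# Hypothetical exam scores
scores = c(45, 72, 88, 60)

# Two-way choice, element-wise, like the even/odd example
result = ifelse(scores >= 60, "Pass", "Fail")
result   # "Fail" "Pass" "Pass" "Pass"

# Nesting ifelse() gives a three-way choice
grade = ifelse(scores >= 80, "High", ifelse(scores >= 60, "Medium", "Low"))
grade    # "Low" "Medium" "High" "Medium"
```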

cbind() and rbind()

  • cbind() combines vectors, matrices, or data frames by columns.
  • rbind() combines vectors, matrices, or data frames by rows.
A = matrix(1:4, ncol=2)
A
1 3
2 4
B = matrix(5:8, ncol=2)
B
5 7
6 8
cbind(A, B)
1 3 5 7
2 4 6 8
rbind(A, B)
1 3
2 4
5 7
6 8
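
The same functions also accept plain vectors. In this sketch the vectors x and y are made up; cbind() turns each vector into a column and rbind() turns each into a row, using the variable names as labels.

```r
# Two hypothetical vectors of equal length
x = c(1, 2, 3)
y = c(10, 20, 30)

by_col = cbind(x, y)   # 3 x 2 matrix: each vector becomes a column
by_row = rbind(x, y)   # 2 x 3 matrix: each vector becomes a row

dim(by_col)        # 3 2
dim(by_row)        # 2 3
colnames(by_col)   # "x" "y"
```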

Set Functions

  • A set is a collection of distinct elements.
  • Set functions perform operations on sets of elements.

Set Theory

Union (\(A \cup B\))

  • The set of all elements in A, or in B, or in both.
  • \(A \cup B = \{x | x \in A \text{ or } x \in B\}\)
  • union()
  • Example:
A = c(1, 2, 3, 4)
B = c(3, 4, 5, 6)
union(A, B)
[1] 1 2 3 4 5 6

Intersection (\(A \cap B\))

  • The set of all elements that are both in A and B.
  • \(A \cap B = \{x | x \in A \text{ and } x \in B\}\)
  • intersect()
  • Example:
A = c(1, 2, 3, 4)
B = c(3, 4, 5, 6)
intersect(A, B)
[1] 3 4

Set Difference (\(A - B\))

  • The set of all elements that are in A but not in B.
  • \(A - B = \{x | x \in A \text{ and } x \notin B\}\)
  • setdiff()
  • Example:
A = c(1, 2, 3, 4)
B = c(3, 4, 5, 6)
setdiff(A, B)
[1] 1 2
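
Unlike union and intersection, set difference depends on the order of its arguments. The sketch below shows both directions and, as an illustrative extra, combines them into the symmetric difference (the name sym_diff is made up here).

```r
A = c(1, 2, 3, 4)
B = c(3, 4, 5, 6)

setdiff(A, B)   # 1 2
setdiff(B, A)   # 5 6

# Symmetric difference: elements in exactly one of the two sets
sym_diff = union(setdiff(A, B), setdiff(B, A))
sym_diff        # 1 2 5 6
```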

Subset (\(A \subseteq B\))

  • A is a subset of B if every element of A is also an element of B.
  • \(A \subseteq B \iff (\forall x)(x \in A \implies x \in B)\)
  • Example:
A = c(1, 2, 3, 4)
all(A %in% c(1, 2, 3, 4, 5))
[1] TRUE
all(A %in% c(1, 2, 3))
[1] FALSE
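
Base R has no dedicated subset test for vectors, so the all(A %in% B) idiom above can be wrapped in a small helper. The function name is_subset is hypothetical, chosen for this sketch.

```r
# Hypothetical helper: TRUE when every element of a is also in b
is_subset = function(a, b) {
  all(a %in% b)
}

is_subset(c(1, 2), c(1, 2, 3))   # TRUE
is_subset(c(1, 7), c(1, 2, 3))   # FALSE
is_subset(c(), c(1, 2, 3))       # TRUE: the empty set is a subset of every set
```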

Set Equality (\(A = B\))

  • Two sets are equal if they have exactly the same elements.
  • \(A = B \iff (A \subseteq B) \text{ and } (B \subseteq A)\)
  • setequal()
  • Example:
A = c(1, 2, 3, 4)
B = c(3, 4, 5, 6)
setequal(A, B)
[1] FALSE
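
Because sets have no order and no repeats, setequal() ignores both. A short sketch with a made-up vector C that holds the same elements as A in a different order, with one repeat:

```r
A = c(1, 2, 3, 4)
C = c(4, 3, 2, 1, 1)   # hypothetical: same elements, different order, one repeat

setequal(A, C)    # TRUE: order and duplicates are ignored
identical(A, C)   # FALSE: identical() compares the vectors exactly
```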

Random Sampling

Simple Random Sample (SRS): A subset of a population chosen so that every possible sample of a given size has an equal chance of being selected. Each individual or item in the population therefore has the same probability of being included, and selection is entirely by chance, without bias.

With vs. Without Replacement

Sampling With Replacement (SWR): After an individual or item is selected, it is returned to the population before the next draw, so it can be chosen more than once. This method is particularly useful when the population is small or when every draw must be made from the same, full population.

Sampling Without Replacement (SWOR): Once an individual or item is selected, it is not returned to the population and cannot be selected again. This method is common when the population is large, or when keeping the population the same for every draw is not important.
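
The difference between the two schemes can be sketched with sample() and its replace argument. The population of five items and the seed are arbitrary choices for illustration.

```r
set.seed(1)          # arbitrary seed so the draws are reproducible
population = 1:5     # hypothetical small population

swor = sample(population, size = 5, replace = FALSE)  # a permutation: no repeats
swr  = sample(population, size = 8, replace = TRUE)   # size may exceed the population

anyDuplicated(swor) == 0   # TRUE: every element appears exactly once
length(swr)                # 8, so some values must repeat

# Without replacement, asking for more items than exist is an error
res = try(sample(population, size = 8), silent = TRUE)
inherits(res, "try-error") # TRUE
```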

The sample() function

  • The sample() function draws random samples from a vector.

  • Syntax:

sample(x, size, replace = FALSE, prob = NULL)
  • Example (the values drawn change from run to run):
sample(1:10, size = 5)
[1]  3  9  7 10  5
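
The prob argument in the syntax above assigns a weight to each element of x. As a sketch, assuming a made-up biased coin that lands heads 80% of the time:

```r
set.seed(42)   # arbitrary seed for reproducibility

# Hypothetical biased coin: heads with probability 0.8
flips = sample(c("H", "T"), size = 1000, replace = TRUE, prob = c(0.8, 0.2))

prop_heads = mean(flips == "H")
prop_heads     # close to 0.8
```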

Example: Coin Flip Simulation

Part 1: Single Simulation

Write an R function coin_flip() that simulates flipping a coin. The function should return "H" (for heads) or "T" (for tails).

coin_flip = function() {
  flip = sample(c("H", "T"), size = 1)
  return(flip)
}

coin_flip()
[1] "T"

Part 2: Multiple Simulations

Now, extend your function to perform multiple simulations of coin flips and return the number of heads and tails.

coin_flip = function(n) {
  flips = sample(c("H", "T"), size = n, replace = TRUE)
  return(table(flips))
}

coin_flip(5)
H T
3 2

Part 3: Analysis

Analyze the results of your multiple simulations. What do you observe as the number of flips increases?

coin_flip(10)/10
H T
0.7 0.3
coin_flip(100)/100
H T
0.49 0.51
coin_flip(1000)/1000
H T
0.482 0.518
coin_flip(10000)/10000
H T
0.4993 0.5007
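
One way to answer the question in Part 3: as the number of flips grows, the observed proportions settle closer to the true probability of 0.5. The sketch below tabulates this for several sample sizes; the helper prop_heads and the seed are assumptions made for illustration, not part of the exercise.

```r
set.seed(2024)   # arbitrary seed so the run is reproducible

# Proportion of heads in n fair coin flips
prop_heads = function(n) {
  flips = sample(c("H", "T"), size = n, replace = TRUE)
  mean(flips == "H")
}

n_values = c(10, 100, 1000, 10000, 100000)
props = sapply(n_values, prop_heads)

# The distance from 0.5 tends to shrink as n grows
data.frame(n = n_values, prop_heads = props, abs_error = abs(props - 0.5))
```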

The Apply Functions

The apply functions in R provide a concise and efficient way to apply a function to the elements of data structures such as vectors, lists, data frames, or matrices.

Comparison of the Apply Functions

The three most common apply functions differ in the input they accept and in the shape of the result they return.

| Function | Description | Usage | Example |
|---|---|---|---|
| apply() | Applies a function over the margins of an array or matrix. | apply(X, MARGIN, FUN, ...) | apply(matrix(1:9, nrow = 3), 1, sum) |
| lapply() | Applies a function to each element of a list, returning a list. | lapply(X, FUN, ...) | lapply(list(1:5, 6:10), sum) |
| sapply() | Similar to lapply(), but tries to simplify the result. | sapply(X, FUN, ..., simplify = TRUE) | sapply(list(1:5, 6:10), sum) |
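
The calls listed for the three functions can be run side by side to compare what each returns. This is a minimal sketch; the variable names are chosen for illustration.

```r
m = matrix(1:9, nrow = 3)

row_sums = apply(m, 1, sum)   # MARGIN = 1: one result per row    -> 12 15 18
col_sums = apply(m, 2, sum)   # MARGIN = 2: one result per column -> 6 15 24

parts = list(1:5, 6:10)
as_list   = lapply(parts, sum)   # list of length 2
as_vector = sapply(parts, sum)   # simplified to a vector: 15 40

class(as_list)     # "list"
class(as_vector)   # "integer"
```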

Example: Calculating summary statistics

Creating a numeric named list

numeric_list = list(a = 1:5, b = 3:7, c = 10:14)
numeric_list
$a
[1] 1 2 3 4 5

$b
[1] 3 4 5 6 7

$c
[1] 10 11 12 13 14

Applying mean() using lapply()

lapply(numeric_list, mean)
$a
[1] 3

$b
[1] 5

$c
[1] 12

Applying sum() using sapply()

sapply(numeric_list, sum)
 a  b  c 
15 25 60 

The sweep() Function

The sweep() function in R allows you to perform operations on arrays by “sweeping” out values of a summary statistic across margins.

Syntax of sweep()

sweep(x, MARGIN, STATS, FUN = "-", ...)
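  • x: the array to sweep the statistics out of.
  • MARGIN: the dimension(s) of x that STATS corresponds to (1 = rows, 2 = columns).
  • STATS: the summary statistic to sweep out, with one value per row or column selected by MARGIN.
  • FUN: the function applied to each element and its statistic; the default "-" subtracts.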
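
To illustrate the MARGIN argument, here is a sketch that sweeps row means (MARGIN = 1) out of a small matrix, relying on the default FUN = "-".

```r
mat = matrix(1:9, nrow = 3)

# MARGIN = 1: STATS holds one value per row
row_means = apply(mat, 1, mean)            # 4 5 6
row_centered = sweep(mat, 1, row_means)    # default FUN = "-"

row_centered
#      [,1] [,2] [,3]
# [1,]   -3    0    3
# [2,]   -3    0    3
# [3,]   -3    0    3

rowMeans(row_centered)   # 0 0 0
```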

Example: Centering

Sample matrix

mat = matrix(1:9, nrow = 3)
mat
1 4 7
2 5 8
3 6 9

Calculate column means

col_means = apply(mat, 2, mean)
col_means
[1] 2 5 8

Center the matrix by subtracting column means

centered_mat = sweep(mat, 2, col_means)
centered_mat
-1 -1 -1
0 0 0
1 1 1

Example: Scaling

Calculate max for each column

col_maxs = apply(mat, 2, max)
col_maxs
[1] 3 6 9

Scale the matrix by dividing by column maxs

scaled_mat = sweep(mat, 2, col_maxs, FUN = "/")
scaled_mat
0.3333333 0.6666667 0.7777778
0.6666667 0.8333333 0.8888889
1.0000000 1.0000000 1.0000000

The \(Z\)-scores

The \(Z\)-score of an observation is a metric that indicates how many standard deviations an element is from the mean of the whole set.

\(z = \frac{x - \mu}{\sigma}\)

where:

  • \(x\) is the raw score,
  • \(\mu\) is the mean of the population, and
  • \(\sigma\) is the standard deviation of the population.

Note: The \(Z\)-score is unitless, i.e., it has no units of measurement.
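
A worked instance of the formula, using made-up numbers (an exam score of 85 in a population with mean 70 and standard deviation 10):

```r
# Hypothetical values, chosen only to illustrate the formula
x = 85       # raw score
mu = 70      # population mean
sigma = 10   # population standard deviation

z = (x - mu) / sigma
z   # 1.5, i.e. 1.5 standard deviations above the mean
```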

Steps to Calculate \(Z\)-scores

  1. Calculate means and standard deviations for each column.
  2. Center data by subtracting mean values using sweep().
  3. Divide by standard deviation to standardize using sweep().
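
The three steps can be collected into one function. This is a sketch under stated assumptions: the name z_score_matrix, the seed, and the small test matrix are made up. The final line compares the result with base R's scale(), which centers and scales columns by default.

```r
# Hypothetical helper that follows the three steps for a numeric matrix
z_score_matrix = function(m) {
  col_means = apply(m, 2, mean)            # step 1: column means
  col_sds   = apply(m, 2, sd)              #         and standard deviations
  centered  = sweep(m, 2, col_means, "-")  # step 2: center
  sweep(centered, 2, col_sds, "/")         # step 3: divide by the sd
}

set.seed(10)                       # arbitrary seed
m = matrix(rnorm(20), ncol = 4)    # small made-up matrix
z = z_score_matrix(m)

# scale() performs the same centering and scaling by default
all.equal(z, scale(m), check.attributes = FALSE)   # TRUE
```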

Example: \(Z\)-score Calculation

Create a sample matrix with random data

data_matrix = matrix(rnorm(100), ncol=10)
data_matrix
0.7720323 -0.0236110 0.0700578 0.5499061 -0.8822566 -0.5208829 1.2807723 -2.2090599 1.3172705 -0.7553829
1.1352704 0.9457801 1.0327003 0.9907401 0.5247179 -0.0071550 -1.0176867 2.6768273 -0.2675364 -0.8119132
-0.3556700 -1.3081242 -0.2164506 0.1440630 0.9130161 -0.0147323 2.1405846 -1.1328485 -0.1673459 -0.6183830
0.3695155 3.1066999 -2.7725347 -0.7530670 0.1043633 0.6799525 -0.8807629 1.8317829 -0.4001457 -1.4748364
1.5371188 -1.4549000 -0.9155727 -0.8977446 -1.1268794 -1.8834163 0.2358407 -1.1359993 0.9284241 -0.5760723
0.4551764 0.9378018 0.8704419 1.2940218 -0.0578897 0.4757888 -0.6402558 -0.2850556 2.0390161 0.4287817
2.0869911 1.4019592 0.8649523 0.0084196 0.2193383 -1.8027268 2.0185865 -0.7242505 -0.5220771 -1.5277922
-2.0259216 -0.6943732 -0.7830958 -0.2226866 0.1012604 -1.6709143 0.0444366 0.2291196 -0.4010404 -0.3937708
-0.3312715 0.4724834 -0.6503196 -0.9633905 1.1221714 -0.6869030 0.0825795 1.6361345 -0.1416749 0.3748363
1.0570444 1.5559028 0.3398075 -0.2588452 -0.0381289 -0.5262985 1.0498150 -0.3115017 -0.1811098 -0.0527232

Calculate column means

col_means = apply(data_matrix, 2, mean)
col_means
 [1]  0.47002857  0.49396187 -0.21600136 -0.01085833  0.08797128 -0.59572878
 [7]  0.43139097  0.05751487  0.22037806 -0.54072559

Calculate column standard deviations

col_sds = apply(data_matrix, 2, sd)
col_sds

Center the matrix by subtracting column means

data_matrix_centered = sweep(data_matrix, 2, col_means, FUN = "-")
data_matrix_centered
0.3020037 -0.5175729 0.2860591 0.5607645 -0.9702279 0.0748459 0.8493813 -2.2665748 1.0968924 -0.2146573
0.6652418 0.4518182 1.2487017 1.0015984 0.4367466 0.5885738 -1.4490777 2.6193124 -0.4879144 -0.2711876
-0.8256986 -1.8020861 -0.0004493 0.1549214 0.8250448 0.5809965 1.7091936 -1.1903634 -0.3877239 -0.0776574
-0.1005131 2.6127380 -2.5565334 -0.7422087 0.0163920 1.2756813 -1.3121539 1.7742681 -0.6205237 -0.9341108
1.0670902 -1.9488618 -0.6995713 -0.8868863 -1.2148507 -1.2876875 -0.1955503 -1.1935142 0.7080460 -0.0353468
-0.0148522 0.4438399 1.0864433 1.3048802 -0.1458610 1.0715176 -1.0716467 -0.3425705 1.8186380 0.9695073
1.6169625 0.9079973 1.0809537 0.0192779 0.1313670 -1.2069980 1.5871955 -0.7817654 -0.7424552 -0.9870666
-2.4959502 -1.1883350 -0.5670945 -0.2118283 0.0132891 -1.0751855 -0.3869544 0.1716047 -0.6214184 0.1469548
-0.8013000 -0.0214784 -0.4343183 -0.9525322 1.0342001 -0.0911742 -0.3488115 1.5786196 -0.3620529 0.9155619
0.5870158 1.0619410 0.5558089 -0.2479869 -0.1261002 0.0694302 0.6184240 -0.3690166 -0.4014878 0.4880023

Divide by the standard deviation to get z-scores

z_scores = sweep(data_matrix_centered, 2, col_sds, FUN = "/")
z_scores
0.2594376 -0.3669410 0.2494626 0.7242210 -1.3855529 0.0806603 0.7410379 -1.4707892 1.2424232 -0.3215159
0.5714790 0.3203233 1.0889509 1.2935530 0.6237045 0.6342973 -1.2642395 1.6996821 -0.5526487 -0.4061875
-0.7093201 -1.2776158 -0.0003918 0.2000792 1.1782214 0.6261313 1.4911762 -0.7724314 -0.4391654 -0.1163160
-0.0863462 1.8523396 -2.2294670 -0.9585541 0.0234089 1.3747829 -1.1447812 1.1513295 -0.7028520 -1.3991208
0.9166887 -1.3816747 -0.6100727 -1.1454036 -1.7348913 -1.3877218 -0.1706067 -0.7744760 0.8019864 -0.0529427
-0.0127588 0.3146669 0.9474508 1.6852379 -0.2082996 1.1547587 -0.9349521 -0.2222953 2.0599267 1.4521380
1.3890589 0.6437382 0.9426635 0.0248972 0.1876013 -1.3007640 1.3847397 -0.5072906 -0.8409608 -1.4784385
-2.1441573 -0.8424878 -0.4945441 -0.2735738 0.0189778 -1.1587116 -0.3375962 0.1113550 -0.7038654 0.2201104
-0.6883604 -0.0152275 -0.3787544 -1.2301845 1.4769097 -0.0982571 -0.3043186 1.0243724 -0.4100885 1.3713381
0.5042786 0.7528789 0.4847023 -0.3202722 -0.1800798 0.0748239 0.5395405 -0.2394563 -0.4547554 0.7309349
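
A quick sanity check on any matrix of z-scores is that each column has mean 0 and standard deviation 1. The sketch below regenerates a matrix of the same shape with an arbitrary seed, so its values differ from those printed above.

```r
set.seed(7)                                   # arbitrary seed
data_matrix = matrix(rnorm(100), ncol = 10)   # same shape as the example above

z_scores = sweep(sweep(data_matrix, 2, colMeans(data_matrix), "-"),
                 2, apply(data_matrix, 2, sd), "/")

# Each standardized column should have mean 0 and standard deviation 1
round(colMeans(z_scores), 10)       # all 0
round(apply(z_scores, 2, sd), 10)   # all 1
```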

Example: Data Frame Normalization Using sweep()

Normalize a data frame df (with columns X1, X2, X3 each containing \(10\) random integers between \(1\) and \(100\)) by subtracting the median and dividing by the interquartile range of each column.

Creating a data frame

set.seed(123)
df = data.frame(X1 = sample(1:100, 10), 
                X2 = sample(1:100, 10), 
                X3 = sample(1:100, 10))

df
X1 X2 X3
31 90 7
79 91 42
51 69 9
14 99 83
67 57 36
42 92 78
50 9 81
43 93 43
97 72 76
25 26 15

Calculating medians

medians = sapply(df, median) # or apply(df, 2, median)
medians
  X1   X2   X3 
46.5 81.0 42.5 

Calculating interquartile ranges

iqrs = sapply(df, IQR) # or apply(df, 2, IQR)
iqrs
   X1    X2    X3 
29.25 31.75 57.25 

Normalizing the data frame

normalized_df = sweep(sweep(df, 2, medians, "-"), 2, iqrs, "/")
normalized_df
X1 X2 X3
-0.5299145 0.2834646 -0.6200873
1.1111111 0.3149606 -0.0087336
0.1538462 -0.3779528 -0.5851528
-1.1111111 0.5669291 0.7074236
0.7008547 -0.7559055 -0.1135371
-0.1538462 0.3464567 0.6200873
0.1196581 -2.2677165 0.6724891
-0.1196581 0.3779528 0.0087336
1.7264957 -0.2834646 0.5851528
-0.7350427 -1.7322835 -0.4803493
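
The median/IQR normalization can be wrapped in a reusable function. The name robust_scale is hypothetical. After normalization every column should have median 0 and interquartile range 1, which the last two lines check.

```r
# Hypothetical helper: subtract the median, divide by the IQR, column by column
robust_scale = function(d) {
  medians = sapply(d, median)
  iqrs    = sapply(d, IQR)
  sweep(sweep(d, 2, medians, "-"), 2, iqrs, "/")
}

set.seed(123)
df = data.frame(X1 = sample(1:100, 10),
                X2 = sample(1:100, 10),
                X3 = sample(1:100, 10))

normalized_df = robust_scale(df)

sapply(normalized_df, median)   # 0 for every column
sapply(normalized_df, IQR)      # 1 for every column
```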

Example: Generating and Analyzing Height Data

Generate a dataset that simulates the heights (in centimeters) of 1000 individuals. Assume an average height of 170 cm and a standard deviation of 10 cm. Follow these steps:

  • Create the dataset using a normal distribution with the given mean and standard deviation using the rnorm() function.
  • Calculate the \(Z\)-scores for the entire dataset.
  • Determine how many standard deviations a height of 185 cm is from the mean.

Generating the height dataset

heights = rnorm(1000, mean = 170, sd = 10)
head(heights)
[1] 167.8413 166.6509 159.1430 169.1458 180.7061 168.5461

Displaying the distribution of heights

hist(heights)

Calculating the \(Z\)-scores

zscores = (heights - mean(heights))/sd(heights)
head(zscores)
[1] -0.17243735 -0.29236253 -1.04868498 -0.04103338  1.12352467 -0.10144587

Displaying the distribution of \(Z\)-scores

hist(zscores)

Calculating the \(Z\)-score of a specific height

zscore = (185 - mean(heights))/sd(heights)
zscore
[1] 1.55608
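
The sample-based value can be compared with the one implied by the stated parameters, (185 - 170) / 10 = 1.5. In this sketch the seed is arbitrary, so the simulated heights differ from those printed above.

```r
set.seed(1)   # arbitrary seed; not the one used for the output above
heights = rnorm(1000, mean = 170, sd = 10)

z_theoretical = (185 - 170) / 10                     # exactly 1.5
z_sample      = (185 - mean(heights)) / sd(heights)  # close to 1.5

# Share of simulated individuals taller than 185 cm
mean(heights > 185)
```

The two values differ slightly because the sample mean and standard deviation only estimate 170 and 10.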

Displaying the \(Z\)-score of a specific height

hist(zscores)
abline(v=zscore, col="red", lwd=2)

Comparison between apply() and sweep()

| Feature | apply() | sweep() |
|---|---|---|
| Purpose | Apply a function over the margins of an array or matrix to summarize or transform it. | Apply an arithmetic operation to an array, “sweeping” out a summary statistic. |
| Usage | Summarizing data with a function over specified margins (rows or columns). | Adjusting data with a summary statistic, for operations such as centering or scaling. |
| Functionality | Applies a wide range of functions for summarizing or transforming data across dimensions. | Performs arithmetic with a summary statistic; often used after summarizing data with apply(). |
| Arguments | apply(X, MARGIN, FUN, ...), where X is the array, MARGIN specifies rows (1) or columns (2), and FUN is the function to apply. | sweep(x, MARGIN, STATS, FUN = "-", ...), where x is the array, MARGIN specifies the dimension, STATS is the summary statistic, and FUN is the arithmetic function to apply. |
| Return Value | A vector, array, or list holding the results, which may differ in dimension from the input. | An adjusted array with the same dimensions as the input, with the operation applied element-wise. |
| Exclusive Actions | Can return different structures (vector, array, list) depending on the function and margin; can work with higher-dimensional arrays beyond matrices. | Directly performs arithmetic sweep operations using a summary statistic; ideal for data adjustments after using apply() to calculate the statistic. |
| Limitations | Cannot directly adjust data using a summary statistic; additional steps are required to integrate the summary before or after using apply(). | Not designed for summarizing data; it requires pre-calculated statistics to perform the sweep operation. |
| Flexibility | Can use any function, including user-defined ones, for summarization or transformation; more general-purpose in data manipulation. | Mostly used with arithmetic operators; a custom FUN must accept the array and the expanded statistics as its first two arguments. |
| Common Use Case | Computing aggregate statistics such as means or sums across rows or columns; general data manipulation tasks that require applying a function. | Standardizing or normalizing data; centering data by subtracting the mean, or scaling it by dividing by the standard deviation, after calculating these with apply(). |

Testing Data Types

  • R provides is.*() functions to test the type or structure of an object; is.na() tests for missing values instead.

Examples:

is.character("Hello")
[1] TRUE
is.numeric(10)
[1] TRUE
is.na(NA)
[1] TRUE
is.vector(c(1, 2, 3))
[1] TRUE
is.matrix(matrix(1:4, ncol=2))
[1] TRUE
is.data.frame(data.frame(x=1:4, y=4:1))
[1] TRUE
is.factor(factor(c("a", "b", "a")))
[1] TRUE
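
The is.*() tests answer yes/no questions; class() and typeof() report what an object actually is. A short sketch with made-up values:

```r
x = 1:3   # integer vector

is.numeric(x)     # TRUE: integers count as numeric
is.integer(x)     # TRUE
is.character(x)   # FALSE

class(x)          # "integer"
typeof(3.14)      # "double"

# is.na() works element-wise and reports missing values, not a type
is.na(c(1, NA, 3))   # FALSE TRUE FALSE
```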