Getting Started with R

Part 2

The CSV & TSV File Formats

CSV (Comma-Separated Values) and TSV (Tab-Separated Values) are plain text formats used for storing data in a tabular structure. Both formats human-readable and easy to handle in many programming environments, including R.

CSV (Comma-Separated Values):
- Fields (columns) are separated by commas.
- Lines (rows) are separated by line breaks.
- Commonly used due to its simplicity and broad application in systems that handle tabular data.
- Example:
```
Name,Age,Occupation
Alice,28,Engineer
Bob,35,Data Scientist
```
TSV (Tab-Separated Values):
- Fields are separated by tabs.
- Often preferred when data values may contain commas, to avoid confusion.
- Example:
```
Name    Age    Occupation
Alice    28    Engineer
Bob    35    Data Scientist
```

Loading Data in R

There are different ways (functions) to read (or load, or import) data files into R.
One simple and easy way is using the read.csv() function.
Example:

df = read.csv("filename.csv")

The Dataset: The Happiness Index 2019

The World Happiness Report 2019

Reading the Happines Dataset

df = read.csv("https://raw.githubusercontent.com/ahmedmoustafa/datasets/main/happiness/happiness2019.csv")
head(df)

country	category	score	gdp_per_capita	social_support	healthy_life_expectancy	freedom_to_make_life_choices	generosity	perceptions_of_corruption
Afghanistan	Underdeveloped	3.203	0.350	0.517	0.361	0.000	0.158	0.025
Albania	Transitioning	4.719	0.947	0.848	0.874	0.383	0.178	0.027
Algeria	Developing	5.211	1.002	1.160	0.785	0.086	0.073	0.114
Argentina	Developing	6.086	1.092	1.432	0.881	0.471	0.066	0.050
Armenia	Transitioning	4.559	0.850	1.055	0.815	0.283	0.095	0.064
Australia	Developed	7.228	1.372	1.548	1.036	0.557	0.332	0.290

Exploring the Structure of the Dataset

Shape of the Data: check the dimensions (number of rows and columns) of the dataset

dim(df)

[1] 155   9

paste("Number of rows (countries):", nrow(df))

[1] "Number of rows (countries): 155"

paste("Number of columns (attributes):", ncol(df))

[1] "Number of columns (attributes): 9"

Column Names: generate a list of all the attributes/columns in the dataset

colnames(df)

[1] "country"                      "category"                    
[3] "score"                        "gdp_per_capita"              
[5] "social_support"               "healthy_life_expectancy"     
[7] "freedom_to_make_life_choices" "generosity"                  
[9] "perceptions_of_corruption"

Column Data Types: understand the kind of data each column holds (numeric, character, factor, etc.).

sapply(df, class)

                     country                     category 
                 "character"                  "character" 
                       score               gdp_per_capita 
                   "numeric"                    "numeric" 
              social_support      healthy_life_expectancy 
                   "numeric"                    "numeric" 
freedom_to_make_life_choices                   generosity 
                   "numeric"                    "numeric" 
   perceptions_of_corruption 
                   "numeric"

Categorical Variables

Which columns make sense to be converted to factor? category is a qualitative variable.

df$category = factor(df$category)
levels(df$category)

[1] "Developed"      "Developing"     "Transitioning"  "Underdeveloped"

It is actually an ordinal qualitative variable. So, instead of the default levels (alphabetical), let’s provide a real order.

df$category = factor(df$category, levels = c("Developed", "Transitioning", "Developing", "Underdeveloped"))
levels(df$category)

[1] "Developed"      "Transitioning"  "Developing"     "Underdeveloped"

The Happiest Country

We need to determine the highest score using the max() function, then locate the index (position) of the country with that max score using the which() function.

df$country[which(df$score == max(df$score))]

[1] "Finland"

Alternatively, there is also the 2-in-1 function which.max()

df$country[which.max(df$score)]

[1] "Finland"

The Least Happy Country

df$country[which(df$score == min(df$score))]

[1] "South Sudan"

df$country[which.min(df$score)]

[1] "South Sudan"

The Top 10 Happiest Countries

We need to obtain the descending order() of the countries according to the score column then obtain the first 10:

Using the decreasing parameter:

df$country[order(df$score, decreasing = TRUE)][1:10]

 [1] "Finland"     "Denmark"     "Norway"      "Iceland"     "Netherlands"
 [6] "Switzerland" "Sweden"      "New Zealand" "Canada"      "Austria"

Using the negative (-) scores:

df$country[order(-df$score)][1:10]

 [1] "Finland"     "Denmark"     "Norway"      "Iceland"     "Netherlands"
 [6] "Switzerland" "Sweden"      "New Zealand" "Canada"      "Austria"

The Top 10 Happiest Countries

The full records of the top 10 happiest countries:

df[order(-df$score)[1:10], ]

	country	category	score	gdp_per_capita	social_support	healthy_life_expectancy	freedom_to_make_life_choices	generosity	perceptions_of_corruption
44	Finland	Developed	7.769	1.340	1.587	0.986	0.596	0.153	0.393
37	Denmark	Developed	7.600	1.383	1.573	0.996	0.592	0.252	0.410
105	Norway	Developed	7.554	1.488	1.582	1.028	0.603	0.271	0.341
58	Iceland	Developed	7.494	1.380	1.624	1.026	0.591	0.354	0.118
99	Netherlands	Developed	7.488	1.396	1.522	0.999	0.557	0.322	0.298
133	Switzerland	Developed	7.480	1.452	1.526	1.052	0.572	0.263	0.343
132	Sweden	Developed	7.343	1.387	1.487	1.009	0.574	0.267	0.373
100	New Zealand	Developed	7.307	1.303	1.557	1.026	0.585	0.330	0.380
24	Canada	Developed	7.278	1.365	1.505	1.039	0.584	0.285	0.308
7	Austria	Developed	7.246	1.376	1.475	1.016	0.532	0.244	0.226

Egypt’s Happy Score & Rank

Find the row number (index) with the country equals (==) “Egypt” to obtain the score in that row (at that index)

df$score[which(df$country == "Egypt")]

[1] 4.166

Similarly, we obtain the row number of Egypt then use that index to obtain the corresponding rank of the score, after ranking the scores

rank(-df$score)[which(df$country == "Egypt")]

[1] 136

Note the use of the negative sign (-) above with the score to switch the direction of ranking from ascending (which is the default) to descending

A Glimpse of Data Visualization in R

Using the basic plot function in R, we can visualize the relationship between two variables as a scatter plot.

For example, let’s investigate the relationship between the score (on the y-axis) and the gdp_per_capita

Relationship Between Happiness and GDP, Visually

plot(df$score ~ df$gdp_per_capita)

Relationship Between Happiness and GDP, Quantitatively

cor(df$score, df$gdp_per_capita)

[1] 0.7937202

Both the graph and the correlation coefficient suggest a strong association between population happiness and the country’s GDP.