Packages for Data Analysis¶

We will discuss four Python libraries that are commonly used for data analysis: Pandas, Matplotlib, NumPy, and Seaborn.

Importing Libraries in Python¶

In Python, libraries or packages are collections of pre-written code that can be imported and used in your own programs.
The most common way to import a library is using the import statement followed by the name of the library.
For example, to import the math library, you would use the following statement:

In [2]:

import math

math.sqrt(25)

Out[2]:

5.0

Importing Libraries with Aliases¶

You can also import a library with an alias using the as keyword.
This can be useful when you want to use a shorter name for a library in your code.
For example, to import the math library with the alias m, you would use the following statement:

In [3]:

import math as m

m.sqrt(25)

Out[3]:

5.0
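Python also lets you import individual names from a library with from ... import .... The following is a minimal sketch using only the standard math module; the variable names are illustrative choices, not part of any convention.

```python
import math              # import the whole module
import math as m         # the same module under a shorter alias
from math import sqrt    # import a single name directly

# All three forms call the same underlying function
root_full = math.sqrt(25)
root_alias = m.sqrt(25)
root_direct = sqrt(25)

print(root_full, root_alias, root_direct)  # 5.0 5.0 5.0
print(m is math)                           # True: an alias is just another name
```

Importing the whole module (with or without an alias) keeps it obvious where each function comes from, which is why the examples in this notebook use that style.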

Installing a library¶

If a package is not already installed, you can install it from within your notebook using:

%pip install library_name

For example, to install Pandas:

%pip install pandas

Pandas¶

Pandas is a Python library that is used for data manipulation and analysis.
It provides data structures for efficiently storing and manipulating large datasets.
Let's start with an example of how to use Pandas to load a CSV file.

Titanic Dataset¶

The Titanic dataset contains data on the passengers of the Titanic, including their survival status, age, gender, class, and other attributes.
The table has 887 rows and 8 columns.
Here's a glimpse of the table:

Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
0	3	Mr. Owen Harris Braund	male	22	1	0	7.25

Loading a CSV file with Pandas¶

In [5]:

import pandas as pd

url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
titanic = pd.read_csv(url) # Load Titanic dataset

titanic.shape # The dimension of the dataframe (the table)

Out[5]:

(887, 8)
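read_csv is not limited to URLs: it also accepts a local file path or any file-like object. As a minimal sketch that runs without a network connection, the example below parses a small in-memory CSV. The CSV content and the names csv_text and sample are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk or at a URL
csv_text = """Name,Age,Fare
Alice,29,7.25
Bob,41,71.28
Carol,35,8.05
"""

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column names taken from the header row
print(sample.dtypes)         # column types inferred by Pandas
```

Checking shape and dtypes right after loading is a quick way to confirm that the file was parsed the way you expected.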

Viewing Data¶

Once you've loaded data into a DataFrame, you can start exploring it using various Pandas functions.
The head() and tail() methods are useful for quickly viewing the first and last few rows of a DataFrame.

In [6]:

titanic.head() # Print the first few rows of the DataFrame

Out[6]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
0	0	3	Mr. Owen Harris Braund	male	22.0	1	0	7.2500
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	0	71.2833
2	1	3	Miss. Laina Heikkinen	female	26.0	0	0	7.9250
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	0	53.1000
4	0	3	Mr. William Henry Allen	male	35.0	0	0	8.0500

In [7]:

titanic.tail() # Print the last few rows of the DataFrame

Out[7]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
882	0	2	Rev. Juozas Montvila	male	27.0	0	0	13.00
883	1	1	Miss. Margaret Edith Graham	female	19.0	0	0	30.00
884	0	3	Miss. Catherine Helen Johnston	female	7.0	1	2	23.45
885	1	1	Mr. Karl Howell Behr	male	26.0	0	0	30.00
886	0	3	Mr. Patrick Dooley	male	32.0	0	0	7.75
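Both methods return five rows by default and accept an optional row count. Here is a small sketch on a made-up DataFrame (the name df and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical small table used only for illustration
df = pd.DataFrame({"id": range(10), "value": [x * x for x in range(10)]})

first_three = df.head(3)   # rows with index 0, 1, 2
last_two = df.tail(2)      # rows with index 8, 9

print(first_three)
print(last_two)
print(len(df.head()))      # 5: the default number of rows
```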

Basic Statistics¶

You can use Pandas to calculate basic statistics on your data, such as mean, median, and standard deviation.
The describe() method provides a summary of the basic statistics of each numeric column in the DataFrame.

In [8]:

titanic.describe()

Out[8]:

	Survived	Pclass	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
count	887.000000	887.000000	887.000000	887.000000	887.000000	887.00000
mean	0.385569	2.305524	29.471443	0.525366	0.383315	32.30542
std	0.487004	0.836662	14.121908	1.104669	0.807466	49.78204
min	0.000000	1.000000	0.420000	0.000000	0.000000	0.00000
25%	0.000000	2.000000	20.250000	0.000000	0.000000	7.92500
50%	0.000000	3.000000	28.000000	0.000000	0.000000	14.45420
75%	1.000000	3.000000	38.000000	1.000000	0.000000	31.13750
max	1.000000	3.000000	80.000000	8.000000	6.000000	512.32920
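The individual statistics are also available as separate methods on a column. The sketch below uses the five Age values from the first rows shown earlier, so it runs without downloading the dataset; the variable names are illustrative.

```python
import pandas as pd

# Ages of the first five passengers shown by titanic.head()
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0], name="Age")

mean_age = ages.mean()      # arithmetic mean
median_age = ages.median()  # middle value of the sorted data
std_age = ages.std()        # sample standard deviation (ddof=1 by default)
summary = ages.describe()   # count, mean, std, min, quartiles, max

print(mean_age, median_age, std_age)
print(summary)
```

On the full dataset the same calls would be written as titanic['Age'].mean(), titanic['Age'].median(), and so on.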

Indexing and Selection¶

You can use indexing and selection to retrieve specific data from a DataFrame.
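The loc[] indexer is used for label-based indexing, where you specify row and column labels; a label slice such as 2:5 includes both endpoints.
The iloc[] indexer is used for integer position-based indexing, where you specify row and column positions; a position slice such as 2:5 excludes the end, as in ordinary Python slicing.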

In [9]:

titanic.iloc[2:5]

Out[9]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
2	1	3	Miss. Laina Heikkinen	female	26.0	0	0	7.925
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	0	53.100
4	0	3	Mr. William Henry Allen	male	35.0	0	0	8.050

In [10]:

titanic.loc[2:5, ['Survived', 'Pclass']]

Out[10]:

	Survived	Pclass
2	1	3
3	1	1
4	0	3
5	0	3
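Notice that iloc[2:5] returned three rows while loc[2:5] returned four. The following sketch reproduces that difference on a small DataFrame built from the Survived and Pclass values of the first six passengers; the name df is illustrative.

```python
import pandas as pd

# Survived and Pclass for the first six passengers (index labels 0..5)
df = pd.DataFrame(
    {"Survived": [0, 1, 1, 1, 0, 0], "Pclass": [3, 1, 3, 1, 3, 3]}
)

by_position = df.iloc[2:5]                      # positions 2, 3, 4 (end excluded)
by_label = df.loc[2:5, ["Survived", "Pclass"]]  # labels 2, 3, 4, 5 (end included)

# Selecting a single cell works with both indexers
cell_by_label = df.loc[3, "Pclass"]   # row label 3, column label 'Pclass'
cell_by_position = df.iloc[3, 1]      # fourth row, second column

print(len(by_position), len(by_label))  # 3 4
print(cell_by_label, cell_by_position)  # 1 1
```

The two indexers agree here only because the row labels happen to equal the row positions; after filtering or sorting, labels and positions usually differ.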

Filtering Data¶

You can use Boolean indexing to filter data in a DataFrame based on a certain condition.
For example, you can filter the Titanic dataset to only show passengers who survived:

In [11]:

# Filter Titanic dataset to only show passengers who survived
survivors = titanic[titanic['Survived'] == 1]
survivors.head()

Out[11]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	0	71.2833
2	1	3	Miss. Laina Heikkinen	female	26.0	0	0	7.9250
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	0	53.1000
8	1	3	Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson	female	27.0	0	2	11.1333
9	1	2	Mrs. Nicholas (Adele Achem) Nasser	female	14.0	1	0	30.0708
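Conditions can be combined with & (and), | (or), and ~ (not), as long as each condition is wrapped in parentheses. Here is a minimal sketch on a small DataFrame built from the first five passengers; the variable names are illustrative.

```python
import pandas as pd

# Values taken from the first five rows of the Titanic table
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

# AND: female survivors older than 30
older_female_survivors = df[
    (df["Survived"] == 1) & (df["Sex"] == "female") & (df["Age"] > 30)
]

# OR: passengers who are male or younger than 30
male_or_young = df[(df["Sex"] == "male") | (df["Age"] < 30)]

# NOT: passengers who did not survive
non_survivors = df[~(df["Survived"] == 1)]

print(list(older_female_survivors.index))  # [1, 3]
print(list(male_or_young.index))           # [0, 2, 4]
print(list(non_survivors.index))           # [0, 4]
```

Python's and, or, and not keywords do not work element-wise on a Series, which is why the symbolic operators are used here.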

Grouping and Aggregation¶

Grouping and aggregation are powerful tools for summarizing and analyzing data in a DataFrame.
You can group data in a DataFrame based on one or more columns, and then apply an aggregation function like sum(), mean(), or count().
For example, you can group the Titanic dataset by ticket class and calculate the average age for each class:

In [12]:

# Group Titanic dataset by ticket class and calculate the average age for each class
age_by_class = titanic.groupby('Pclass')['Age'].mean()
age_by_class

Out[12]:

Pclass
1    38.788981
2    29.868641
3    25.188747
Name: Age, dtype: float64
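Several aggregations can be computed in one pass with agg(). The sketch below uses a small DataFrame built from the first five passengers, so the numbers differ from the full-dataset result above; the variable names are illustrative.

```python
import pandas as pd

# Values taken from the first five rows of the Titanic table
df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Survived": [0, 1, 1, 1, 0],
})

# Several statistics of Age per class, returned as a DataFrame
age_stats = df.groupby("Pclass")["Age"].agg(["mean", "min", "max", "count"])

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate
survival_rate = df.groupby("Pclass")["Survived"].mean()

print(age_stats)
print(survival_rate)
```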

Matplotlib¶

Matplotlib is a Python library used for creating data visualizations.
It provides a wide range of tools for creating line plots, bar plots, histograms, scatter plots, and more.

Plotting a Simple Line Graph with Matplotlib¶

Let's start with an example of how to use Matplotlib to create a simple line graph.
For this example, we will use a dataset of global CO2 emissions from fossil fuels, with yearly records starting in 1751.
Here's the code to load the dataset:

In [13]:

import pandas as pd

url = 'https://raw.githubusercontent.com/datasets/co2-fossil-global/master/global.csv'
co2 = pd.read_csv(url)

co2.head()

Out[13]:

	Year	Total	Solid Fuel	Per Capita
0	1751	3	3	NaN
1	1752	3	3	NaN
2	1753	3	3	NaN
3	1754	3	3	NaN
4	1755	3	3	NaN

In [14]:

import matplotlib.pyplot as plt

plt.plot(co2['Year'], co2['Total'])
plt.xlabel('Year')
plt.ylabel('CO2 Emissions (million metric tons)')
plt.title('Global CO2 Emissions from Fossil Fuels')
plt.show()
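Matplotlib also offers an object-oriented interface in which you work with explicit Figure and Axes objects. This becomes handy once a figure holds more than one plot. The sketch below uses made-up values (years and values are hypothetical, not taken from the CO2 dataset).

```python
import matplotlib.pyplot as plt

# Hypothetical values, used only to have something to draw
years = [2000, 2001, 2002, 2003, 2004]
values = [10, 12, 11, 15, 18]

# Create a figure and a single set of axes
fig, ax = plt.subplots(figsize=(6, 4))

# Draw the line with circular markers and a legend entry
ax.plot(years, values, marker="o", label="example series")
ax.set_xlabel("Year")
ax.set_ylabel("Value")
ax.set_title("Line plot with the Axes interface")
ax.legend()
fig.tight_layout()

# In a notebook the figure is displayed after the cell runs;
# in a script, call plt.show() to open the window.
```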

Creating a Bar Plot with Matplotlib¶

You can also use Matplotlib to create a bar plot.
For this example, we will reuse age_by_class, the mean age per ticket class computed in the grouping example above.
Here's the code to create the bar plot:

In [15]:

plt.bar(age_by_class.index, age_by_class.values)

plt.title('Mean Age by Class')
plt.xlabel('Class')
plt.ylabel('Mean Age (Years)')

plt.show()

NumPy¶

NumPy is a Python library used for numerical computing.
It provides a wide range of tools for working with arrays and matrices.
NumPy is used in many scientific computing applications.
Let's start with an example of how to use NumPy to create an array.

Creating a NumPy Array¶

To create a NumPy array, you can use the numpy.array() function.
Here's the code to create a NumPy array:

In [16]:

import numpy as np

data = [1, 2, 3, 4, 5]
arr = np.array(data)
arr

Out[16]:

array([1, 2, 3, 4, 5])
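Besides converting a list, NumPy has helper functions for building arrays directly, and every array carries its shape and data type. A short sketch (the variable names are illustrative):

```python
import numpy as np

zeros = np.zeros(3)                        # three zeros, stored as floats
evens = np.arange(0, 10, 2)                # start, stop (excluded), step
grid = np.array([[1, 2, 3], [4, 5, 6]])    # nested lists give a 2D array
floats = np.array([1, 2, 3], dtype=float)  # request a specific data type

print(zeros)       # [0. 0. 0.]
print(evens)       # [0 2 4 6 8]
print(grid.shape)  # (2, 3): two rows, three columns
print(grid.ndim)   # 2
print(floats)      # [1. 2. 3.]
```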

NumPy Array Operations¶

You can perform various operations on NumPy arrays.
For example, you can add, subtract, multiply, and divide arrays.
Here's the code to add two arrays:

In [17]:

import numpy as np

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr3 = arr1 + arr2
arr3

Out[17]:

array([5, 7, 9])
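The other arithmetic operators work the same way, element by element. The sketch below also shows multiplication by a single number and the dot product; the variable names are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

difference = b - a   # element-wise subtraction
product = a * b      # element-wise multiplication (not a matrix product)
quotient = b / a     # element-wise division, always returns floats
scaled = a * 10      # a single number is applied to every element
dot = a @ b          # dot product: 1*4 + 2*5 + 3*6

print(difference)  # [3 3 3]
print(product)     # [ 4 10 18]
print(quotient)    # [4.  2.5 2. ]
print(scaled)      # [10 20 30]
print(dot)         # 32
```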

NumPy Array Indexing and Slicing¶

You can also index and slice NumPy arrays.
Here's the code to create a NumPy array, pick out a single element by its index, and then slice it:

In [18]:

import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arr[2]

Out[18]:

3

In [19]:

arr[1:4]

Out[19]:

array([2, 3, 4])
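Indexing also supports negative positions, step sizes, Boolean masks, and multiple dimensions. A short sketch (the variable names are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

last = arr[-1]           # negative indices count from the end
every_other = arr[::2]   # start:stop:step with a step of 2
large = arr[arr > 2]     # Boolean mask keeps elements where the condition holds

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

element = grid[1, 2]       # row 1, column 2
first_column = grid[:, 0]  # all rows, column 0
top_left = grid[:2, :2]    # first two rows and first two columns

print(last)          # 5
print(every_other)   # [1 3 5]
print(large)         # [3 4 5]
print(element)       # 6
print(first_column)  # [1 4 7]
print(top_left)
```

The Boolean mask works on the same principle as the Pandas filtering shown earlier.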

NumPy Broadcasting¶

Broadcasting is a powerful NumPy feature that allows you to perform operations on arrays of different shapes.
Here's an example:

In [20]:

import numpy as np

# Create a 2D array of shape (3, 4)
arr1 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

# Create a 1D array of shape (4,)
arr2 = np.array([2, 2, 2, 2])

# Add the 1D array to each row of the 2D array using broadcasting
result = arr1 + arr2

print(result)

[[ 3  4  5  6]
 [ 7  8  9 10]
 [11 12 13 14]]

NumPy Broadcasting 2/3¶

In this example, we have a 2D NumPy array arr1 with shape (3, 4) and a 1D NumPy array arr2 with shape (4,). We want to add the values in arr2 to each row of arr1. Normally, this operation would not be possible because the two arrays have different shapes. However, NumPy broadcasting allows us to perform this operation by "stretching" or "broadcasting" the 1D array to match the shape of the 2D array.

In this case, NumPy treats the 1D array arr2 as if it were a 2D array of shape (3, 4), conceptually repeating its values along the first dimension (no actual copy is made). This allows us to perform element-wise addition between the two arrays.

NumPy Broadcasting 3/3¶

Note that broadcasting is not always possible or desirable, and certain conditions must be met for it to work. NumPy compares the two shapes starting from the last dimension and moving backward; each pair of dimensions must either be equal or one of them must be 1, and a missing leading dimension is treated as 1. It's important to understand these rules and use broadcasting judiciously to avoid errors and unexpected results.
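To make the compatibility rule concrete, here is a minimal sketch with one row-shaped array, one column-shaped array, and one incompatible array. All names are illustrative.

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)     # shape (3, 4), values 0..11
row = np.array([10, 20, 30, 40])         # shape (4,): matches the last dimension
column = np.array([[100], [200], [300]]) # shape (3, 1): the 1 is stretched to 4

plus_row = matrix + row        # row is added to every row of matrix
plus_column = matrix + column  # column is added to every column of matrix

print(plus_row[0])        # [10 21 32 43]
print(plus_column[:, 0])  # [100 204 308]

# Shape (3,) does not match the last dimension (4) and is not 1,
# so NumPy refuses to broadcast and raises a ValueError.
bad = np.array([1, 2, 3])
try:
    matrix + bad
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If the intent was to add bad to each column, reshaping it to (3, 1) with bad.reshape(3, 1) would make the shapes compatible.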

Seaborn¶

Seaborn is a Python data visualization library based on Matplotlib.
It provides a high-level interface for creating informative and attractive statistical graphics.
Seaborn works well with pandas dataframes and provides tools for visualizing relationships between variables.

In [21]:

import seaborn as sns
sns.countplot(data=titanic, x='Survived', hue='Sex')

Out[21]:

<AxesSubplot: xlabel='Survived', ylabel='count'>