We will be discussing three important Python libraries that are commonly used for data analysis: pandas, matplotlib, and numpy.
To import a library, use the import statement followed by the name of the library. For example, to import the math library, you would use the following statement:
import math
math.sqrt(25)
5.0
You can also import a library under an alias using the as keyword. For example, to import the math library with the alias m, you would use the following statement:
import math as m
m.sqrt(25)
5.0
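As a minimal sketch, the two import forms above can be compared side by side. The third form, from math import sqrt, is not covered in the text above and is included here only as an illustration of another common variant.

```python
import math                 # access names as math.<name>
import math as m            # the same module under a shorter alias
from math import sqrt       # illustrative addition: import a single name directly

print(math.sqrt(25))  # 5.0
print(m.sqrt(25))     # 5.0
print(sqrt(25))       # 5.0
print(m is math)      # True: the alias refers to the same module object
```

An alias only changes the name you type; it does not create a second copy of the library.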
%pip install library_name
| PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | 
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 | 20.0 | 
import pandas as pd
url = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
titanic = pd.read_csv(url) # Load Titanic dataset
titanic.shape # The dimension of the dataframe (the table)
(887, 8)
Once the data is loaded into a DataFrame, you can start exploring it using various pandas functions. The head() and tail() functions are useful for quickly viewing the first and last few rows of a DataFrame.
titanic.head() # Print the first few rows of the DataFrame
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 | 
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 | 
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 | 
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 | 
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 | 
titanic.tail() # Print the last few rows of the DataFrame
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 882 | 0 | 2 | Rev. Juozas Montvila | male | 27.0 | 0 | 0 | 13.00 | 
| 883 | 1 | 1 | Miss. Margaret Edith Graham | female | 19.0 | 0 | 0 | 30.00 | 
| 884 | 0 | 3 | Miss. Catherine Helen Johnston | female | 7.0 | 1 | 2 | 23.45 | 
| 885 | 1 | 1 | Mr. Karl Howell Behr | male | 26.0 | 0 | 0 | 30.00 | 
| 886 | 0 | 3 | Mr. Patrick Dooley | male | 32.0 | 0 | 0 | 7.75 | 
You can use Pandas to calculate basic statistics on your data, such as mean, median, and standard deviation.
The describe() function provides summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column in the DataFrame.
titanic.describe()
| | Survived | Pclass | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|
| count | 887.000000 | 887.000000 | 887.000000 | 887.000000 | 887.000000 | 887.00000 | 
| mean | 0.385569 | 2.305524 | 29.471443 | 0.525366 | 0.383315 | 32.30542 | 
| std | 0.487004 | 0.836662 | 14.121908 | 1.104669 | 0.807466 | 49.78204 | 
| min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.00000 | 
| 25% | 0.000000 | 2.000000 | 20.250000 | 0.000000 | 0.000000 | 7.92500 | 
| 50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.45420 | 
| 75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.13750 | 
| max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.32920 | 
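The statistics reported by describe() can also be computed one at a time on a single column. The sketch below uses a small hand-made DataFrame named sample; its values are illustrative assumptions, not the real Titanic data, so that the example runs without a network connection.

```python
import pandas as pd

# Hypothetical sample data for illustration only (not the real Titanic values)
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

mean_age = sample["Age"].mean()      # arithmetic mean
median_age = sample["Age"].median()  # middle value after sorting
std_age = sample["Age"].std()        # sample standard deviation (ddof=1 by default)

print(mean_age, median_age, std_age)
```

The same methods work on the titanic DataFrame loaded earlier, for example titanic['Age'].median().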
pandas provides two main ways to select rows and columns from a DataFrame. The loc[] indexer is used for label-based indexing, where you specify row and column labels. The iloc[] indexer is used for integer-based indexing, where you specify row and column positions.
titanic.iloc[2:5]
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.925 | 
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.100 | 
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.050 | 
titanic.loc[2:5, ['Survived', 'Pclass']]
| | Survived | Pclass |
|---|---|---|
| 2 | 1 | 3 | 
| 3 | 1 | 1 | 
| 4 | 0 | 3 | 
| 5 | 0 | 3 | 
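The two selections above return a different number of rows even though both use the slice 2:5. A minimal sketch of why, using a hypothetical six-row stand-in for the Titanic table (the values are assumptions chosen to resemble the output above):

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table, with the default 0..5 integer index
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3],
})

by_position = df.iloc[2:5]                      # positions 2, 3, 4 (end excluded)
by_label = df.loc[2:5, ["Survived", "Pclass"]]  # labels 2, 3, 4, 5 (end included)

print(len(by_position))  # 3
print(len(by_label))     # 4
```

iloc[] follows the usual Python slicing convention and excludes the end position, while loc[] slices by label and includes both endpoints.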
You can use Boolean indexing to filter data in a DataFrame based on a certain condition.
For example, you can filter the Titanic dataset to only show passengers who survived:
# Filter Titanic dataset to only show passengers who survived
survivors = titanic[titanic['Survived'] == 1]
survivors.head()
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 | 
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 | 
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 | 
| 8 | 1 | 3 | Mrs. Oscar W (Elisabeth Vilhelmina Berg) Johnson | female | 27.0 | 0 | 2 | 11.1333 | 
| 9 | 1 | 2 | Mrs. Nicholas (Adele Achem) Nasser | female | 14.0 | 1 | 0 | 30.0708 | 
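Boolean conditions can also be combined. The sketch below is one way to do it, using a small hypothetical DataFrame named passengers (the values are illustrative, not the real dataset).

```python
import pandas as pd

# Hypothetical sample data for illustration only
passengers = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
})

# Wrap each condition in parentheses; & means "and", | means "or"
first_class_survivors = passengers[
    (passengers["Survived"] == 1) & (passengers["Pclass"] == 1)
]
male_or_first_class = passengers[
    (passengers["Sex"] == "male") | (passengers["Pclass"] == 1)
]

print(first_class_survivors)
print(male_or_first_class)
```

The parentheses matter because & and | bind more tightly than == in Python, and the keywords and/or do not work element-wise on pandas Series.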
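You can group rows with the groupby() function and then summarize each group with an aggregate function such as sum(), mean(), or count().
# Group Titanic dataset by ticket class and calculate the average age for each class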
age_by_class = titanic.groupby('Pclass')['Age'].mean()
age_by_class
Pclass
1    38.788981
2    29.868641
3    25.188747
Name: Age, dtype: float64
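Several aggregates can be computed in one pass with agg(). The following is a sketch on a small hypothetical DataFrame; the values and the output column names mean_age, survivors, and n_passengers are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration only
passengers = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Age": [38.0, 35.0, 14.0, 22.0, 26.0, 35.0],
    "Survived": [1, 1, 1, 0, 1, 0],
})

# Named aggregation: output_column=(input_column, aggregate_function)
summary = passengers.groupby("Pclass").agg(
    mean_age=("Age", "mean"),
    survivors=("Survived", "sum"),
    n_passengers=("Survived", "count"),
)

print(summary)
```

The result is a DataFrame indexed by Pclass with one column per requested aggregate.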
import pandas as pd
url = 'https://raw.githubusercontent.com/datasets/co2-fossil-global/master/global.csv'
co2 = pd.read_csv(url)
co2.head()
| | Year | Total | Gas Fuel | Liquid Fuel | Solid Fuel | Cement | Gas Flaring | Per Capita |
|---|---|---|---|---|---|---|---|---|
| 0 | 1751 | 3 | 0 | 0 | 3 | 0 | 0 | NaN | 
| 1 | 1752 | 3 | 0 | 0 | 3 | 0 | 0 | NaN | 
| 2 | 1753 | 3 | 0 | 0 | 3 | 0 | 0 | NaN | 
| 3 | 1754 | 3 | 0 | 0 | 3 | 0 | 0 | NaN | 
| 4 | 1755 | 3 | 0 | 0 | 3 | 0 | 0 | NaN | 
import matplotlib.pyplot as plt
plt.plot(co2['Year'], co2['Total'])
plt.xlabel('Year')
plt.ylabel('CO2 Emissions (million metric tons)')
plt.title('Global CO2 Emissions from Fossil Fuels')
plt.show()
plt.bar(age_by_class.index, age_by_class.values)
plt.title('Mean Age by Class')
plt.xlabel('Class')
plt.ylabel('Mean Age (Years)')
plt.show()
You can create a NumPy array from a Python list using the numpy.array() function.
import numpy as np
data = [1, 2, 3, 4, 5]
arr = np.array(data)
arr
array([1, 2, 3, 4, 5])
import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr3 = arr1 + arr2
arr3
array([5, 7, 9])
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
arr[2]
3
arr[1:4]
array([2, 3, 4])
import numpy as np
# Create a 2D array of shape (3, 4)
arr1 = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])
# Create a 1D array of shape (4,)
arr2 = np.array([2, 2, 2, 2])
# Add the 1D array to each row of the 2D array using broadcasting
result = arr1 + arr2
print(result)
[[ 3  4  5  6]
 [ 7  8  9 10]
 [11 12 13 14]]
In this example, arr1 is a 2D NumPy array with shape (3, 4) and arr2 is a 1D NumPy array with shape (4,). We want to add the values in arr2 to each row of arr1. Element-wise addition normally requires both arrays to have the same shape, but NumPy broadcasting makes the operation possible by "stretching" the 1D array to match the shape of the 2D array.
Here, NumPy treats arr2 as if it were a 2D array of shape (3, 4) whose rows all repeat the same values, which allows element-wise addition between the two arrays. No copy of the data is actually made.
Broadcasting is not always possible, and certain conditions must be met for it to work. Shapes are compared from the trailing (rightmost) dimension backward, and each pair of dimensions must either be equal or one of them must be 1; a missing leading dimension is treated as 1. It's important to understand these rules and use broadcasting judiciously to avoid errors and unexpected results.
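A minimal sketch of the shape rule, showing one shape that broadcasts, one that fails, and one way to fix the failure. The arrays row and col are hypothetical names introduced for this illustration.

```python
import numpy as np

arr1 = np.arange(1, 13).reshape(3, 4)  # shape (3, 4), values 1..12

row = np.array([2, 2, 2, 2])           # shape (4,): matches the trailing dimension
col = np.array([10, 20, 30])           # shape (3,): does not match the trailing 4

print((arr1 + row).shape)              # (3, 4)

try:
    arr1 + col                         # shapes (3, 4) and (3,) are incompatible
except ValueError as err:
    print("broadcast failed:", err)

# Adding a length-1 trailing axis gives shape (3, 1), which broadcasts across columns
col_added = arr1 + col[:, np.newaxis]
print(col_added[0])                    # [11 12 13 14]
```

Reshaping col to (3, 1) works because the trailing dimensions are then 4 and 1, and a dimension of size 1 can be stretched to match the other array.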
import seaborn as sns
sns.countplot(data=titanic, x='Survived', hue='Sex')
<AxesSubplot: xlabel='Survived', ylabel='count'>