Data wrangling and statistics

This tutorial covers essential concepts in descriptive statistics and their implementation in Python. The tutorial includes:

  1. Initial Data Examination
  2. Descriptive Statistics
  3. Visualizations

To fulfill the goals of this tutorial, we will make use of the Titanic dataset, which provides information about passengers on board the RMS Titanic. Run the code cell below to load the dataset and the libraries needed for obtaining descriptive statistics.

# import numpy
import numpy as np
# library for plotting and importing the data
import seaborn as sns
# library for specific statistics
from scipy.stats import skew, kurtosis
# library for manipulating and analysing data 
import pandas as pd
# Load and filter the titanic dataset
titanic = sns.load_dataset('titanic')
titanic = titanic[['survived','age', 'sex', 'fare', 'deck', 'class', 'embark_town','alone']]
# for display clarity 
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Initial Data Examination

The first step is to get a quick overview of your dataset. The head() function shows the initial rows of the dataset, providing a brief overview of what the data looks like. This initial examination helps you understand how the data is organized (sometimes even revealing missing values) and make informed decisions about further approaches.

print(titanic.head())
   survived    age     sex   fare deck  class  embark_town  alone
0         0 22.000    male  7.250  NaN  Third  Southampton  False
1         1 38.000  female 71.283    C  First    Cherbourg  False
2         1 26.000  female  7.925  NaN  Third  Southampton   True
3         1 35.000  female 53.100    C  First  Southampton  False
4         0 35.000    male  8.050  NaN  Third  Southampton   True

As illustrated in the output, this version of the dataset consists of 8 variables:

  1. survived : Indicates whether the passenger survived (0 = No, 1 = Yes).

  2. age : Age of the passenger.

  3. sex : Gender of the passenger.

  4. fare : Fare paid for the ticket.

  5. class : Equivalent to pclass but as a categorical data type.

  6. deck : Deck where the passenger's cabin was located.

  7. embark_town : Town where the passenger embarked.

  8. alone : Indicates whether the passenger was traveling alone.

Additionally, the output reveals missing values (NaN), which will be relevant later in this tutorial.

Dimensionality

The code cell below uses the .shape attribute to return a tuple with the dimensions of the dataset (rows, columns):

print(titanic.shape)
(891, 8)
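Since .shape is an ordinary tuple, its elements can be unpacked directly, which is handy for later sanity checks; a small illustration:

# unpack the (rows, columns) tuple for later use
n_rows, n_cols = titanic.shape
print(f"{n_rows} rows, {n_cols} columns")
891 rows, 8 columns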

Data types and Structures

Data types refer to the format in which data is stored. In the cell below, the .dtypes attribute is used to access the data types of the features in the dataset:

print(titanic.dtypes)
survived          int64
age             float64
sex              object
fare            float64
deck           category
class          category
embark_town      object
alone              bool
dtype: object

The generic Python object data type is called object. Object columns can hold values of various types, such as strings, lists, dictionaries, or even custom objects.
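For instance, the sex column is stored as object (strings in this dataset); if memory or speed mattered, it could be converted to the category type, as a quick illustration (not applied in this tutorial):

# converting an object column of repeated labels to category (illustrative only)
print(titanic['sex'].astype('category').dtype)
category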

Descriptive Statistics

Descriptive statistics are a fundamental component in understanding a dataset, as they provide a concise summary of its main characteristics: central tendency, variability, and distribution.

Summary Statistics

When examining a dataset, it's common to encounter two types of variables: numerical and categorical. Numerical variables represent quantities or measurements, while categorical variables represent categories or labels. To gain meaningful insights, numerical and categorical variables are summarized separately.

Numerical summary

Statistics of numerical variables include quantities such as

  • mean (average)
  • median (middle value)
  • mode (most frequent value)
  • range (difference between the maximum and minimum values)
  • quartiles (for assessing spread and distribution)
  • variance and standard deviation (both are measures of data dispersion)
  • skewness (distribution asymmetry measure)
  • kurtosis (tail thickness of distributions).
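Each of these quantities can also be computed individually with pandas Series methods; a quick sketch on the fare column (pandas' kurt() reports excess kurtosis, like scipy's default):

fare = titanic['fare']
print(fare.mean())                  # mean
print(fare.median())                # median
print(fare.mode()[0])               # mode (most frequent value)
print(fare.max() - fare.min())      # range
print(fare.quantile([0.25, 0.75]))  # lower and upper quartiles
print(fare.var(), fare.std())       # variance and standard deviation
print(fare.skew(), fare.kurt())     # skewness and excess kurtosis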

The describe() function from the pandas library, together with the skew() and kurtosis() functions from the scipy library, is used in the cell below to calculate these statistics.

# Get summary statistics for numerical columns
numerical_summary = titanic.describe()

# Rename the "50%" row to "median" for clarity
numerical_summary = numerical_summary.rename(index={'50%': 'median'})

# Calculate the range and add it as a new row
numerical_columns = titanic.select_dtypes(include=[np.number])
range_values = numerical_columns.max(numeric_only=True) - numerical_columns.min(numeric_only=True)
numerical_summary.loc['range'] = range_values

# Find the mode 
numerical_summary.loc['mode'] = titanic.select_dtypes(include=[np.number]).mode().iloc[0]

# Calculate variance, skewness, and kurtosis and add as new rows to summary
numerical_columns = titanic.select_dtypes(include=[np.number])
variance_values = numerical_columns.var()
skewness_values = skew(numerical_columns, axis=0)
kurtosis_values = kurtosis(numerical_columns, axis=0)
numerical_summary.loc['variance'] = variance_values
numerical_summary.loc['skewness'] = skewness_values
numerical_summary.loc['kurtosis'] = kurtosis_values

# Reorder the rows for clarity
numerical_summary = numerical_summary.reindex(['count', 'mean', 'median', '25%', '75%','mode', 'min', 'max','range', 'std','variance','skewness','kurtosis'])

# Print the numerical summary statistics
print(numerical_summary)
          survived     age     fare
count      891.000 714.000  891.000
mean         0.384  29.699   32.204
median       0.000  28.000   14.454
25%          0.000  20.125    7.910
75%          1.000  38.000   31.000
mode         0.000  24.000    8.050
min          0.000   0.420    0.000
max          1.000  80.000  512.329
range        1.000  79.580  512.329
std          0.487  14.526   49.693
variance     0.237 211.019 2469.437
skewness     0.478     NaN    4.779
kurtosis    -1.772     NaN   33.204

Observe the skewness and kurtosis values of NaN for the age variable. As you progress through the tutorial, you will discover that this variable contains missing values; by default, scipy propagates them, so these statistics come out as NaN rather than as numbers.
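If the missing values should be ignored instead, both scipy functions accept a nan_policy argument; a minimal sketch:

# recompute skewness and kurtosis for age, skipping missing values
print(skew(titanic['age'], nan_policy='omit'))
print(kurtosis(titanic['age'], nan_policy='omit'))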

Categorical summary

Categorical variables require different types of summary statistics and include

  • counts
  • percentages
  • frequencies
  • most common values

for each category. The cell below uses the describe() function to obtain statistics of categorical variables:

# Get summary statistics for categorical columns
categorical_summary = titanic.describe(include='category')

# For clarity rename top to mode, as the two are equivalent
categorical_summary = categorical_summary.rename(index={'top': 'mode'})

print(categorical_summary)
       deck  class
count   203    891
unique    7      3
mode      C  Third
freq     59    491

The summary statistics of the categorical variables do not show the distribution of each category, and therefore offer no insight into class imbalances. The cell below calculates the frequency of occurrence of each unique category within the dataset:

# count the number of observations belonging to each category
category_counts = titanic.select_dtypes(include='category').apply(lambda col: col.value_counts())
# this is only for display purposes
category_counts = category_counts.reindex(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'First', 'Second', 'Third'])
print(category_counts)
         deck   class
A      15.000     NaN
B      47.000     NaN
C      59.000     NaN
D      33.000     NaN
E      32.000     NaN
F      13.000     NaN
G       4.000     NaN
First     NaN 216.000
Second    NaN 184.000
Third     NaN 491.000

The output shows that there is an imbalance in the categorical variables (a varying number of observations for each class). Issues with imbalance will become important later in the course, when discussing classification.
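Relative frequencies often make such imbalances easier to read; value_counts(normalize=True) returns proportions instead of raw counts:

# proportion of passengers in each class
print(titanic['class'].value_counts(normalize=True))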

Covariance

Covariance is a measure of how much random variables co-vary (vary together, like a passenger's age and the fare they paid). Mathematically, the covariance between two variables $x$ and $y$ is defined as:

$$ \text{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$

where $n$ is the number of samples, $\bar{x}$ and $\bar{y}$ are the means of the random variables, and $x_i$ and $y_i$ refer to the specific data values of the random variables. The code cell below contains an implementation of the covariance calculation:

def custom_covariance(x, y):
    # note: len(x) counts missing entries too, while the pandas sums below
    # skip NaNs, so columns with missing data (like age) bias the estimate
    n = len(x)
    mean_x = np.mean(x)
    mean_y = np.mean(y)

    # sum of products of deviations, divided by n - 1 (sample covariance)
    covariance = np.sum((x - mean_x) * (y - mean_y)) / (n - 1)
    return covariance


print(custom_covariance(titanic['age'], titanic['fare']))
59.162200289689366

A positive covariance value indicates that two variables tend to move in the same direction, while a negative covariance value indicates that they tend to move in opposite directions. The magnitude of the covariance value is directly influenced by the scales of the variables involved. For example, the variables age and fare have different units of measurement, which makes it challenging to compare covariances across pairs of variables.
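For comparison, pandas' built-in cov() excludes missing values pairwise before computing each entry, so its age/fare value may differ from the custom function above:

# pandas drops incomplete (NaN) pairs before computing each covariance
print(titanic[['age', 'fare']].cov())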

Correlation

Measures of correlation standardize the covariance by transforming it to the range between $-1$ and $1$. Pearson's $r$ is defined as:

$$ \text{r} = \frac{\text{Cov}(x, y)}{\sigma_x \cdot \sigma_y} $$

where $\sigma_x$ and $\sigma_y$ are the standard deviations of each random variable involved.

Covariance and correlation are similar statistical measures, as both quantify how two variables co-vary. Covariance values are hard to interpret when the two variables have different units of measurement; the standardization in correlation makes this easier. The absolute value of the correlation coefficient indicates the strength of the relationship, increasing as the value approaches 1 or -1. The sign of the correlation coefficient (+ or -) indicates the direction of the relationship: a positive correlation means that as one variable increases, the other tends to increase, while a negative correlation means that as one variable increases, the other tends to decrease. A correlation of 0 indicates no linear relationship between the two variables.

The code cell below calculates Pearson's correlation between survived and fare variables:

# covariance between survived and fare
num = custom_covariance(titanic['survived'], titanic['fare'])
# standard deviation of survived (x) and of fare (y);
# note np.std defaults to the population formula (ddof=0)
sigma_x = np.std(titanic['survived'])
sigma_y = np.std(titanic['fare'])

print(num/(sigma_x*sigma_y))
0.2575956308370798

Correlation between multiple variables results in a correlation matrix. The cell below calculates the pairwise correlations for the Titanic dataset using the corr() function from pandas:

# Calculate the pairwise correlation matrix
correlation_matrix = numerical_columns.corr()

print(correlation_matrix)
          survived    age  fare
survived     1.000 -0.077 0.257
age         -0.077  1.000 0.096
fare         0.257  0.096 1.000

Missing Data

Missing data refers to absent or incomplete values in a dataset. It's common to encounter missing data due to various reasons such as data collection errors, sensor failures, or simply because certain information was not collected for some observations. Dealing with missing data is a crucial step in data preprocessing, as it can significantly impact the accuracy and reliability of any analysis or modeling performed on the dataset. The code cell below identifies the number of missing values for each variable in the dataset, by calling the isna() function:

# Count missing data for each column
print(titanic.isna().sum())
survived         0
age            177
sex              0
fare             0
deck           688
class            0
embark_town      2
alone            0
dtype: int64

The variables age, deck and embark_town contain multiple missing values. For the current tutorial we will not remove or manipulate these values.
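For reference, two common strategies are dropping incomplete rows and imputing values; a minimal sketch of both (not applied in this tutorial):

# option 1: drop all rows containing missing values (costly here, given deck's 688 NaNs)
print(titanic.dropna().shape)
# option 2: impute a numerical column with its median (illustrative only)
age_imputed = titanic['age'].fillna(titanic['age'].median())
print(age_imputed.isna().sum())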

Duplicates

Duplicate rows are observations that are identical in all their attribute values across all columns. They can arise from missing values, data entry errors, system glitches, or issues during data collection. The duplicated() function in pandas identifies and marks duplicate rows:

# Count duplicate rows
print(titanic.duplicated().sum())
107

The Titanic dataset contains 107 duplicated rows. It's essential to exercise caution when deciding whether to keep or delete duplicates. Some duplicates may result from missing data. For instance, age and deck variables contain multiple missing observations, so identical rows could still originate from different passengers due to incomplete information. Blindly removing duplicates may result in the loss of valuable data. In this tutorial, we opted not to remove duplicates for this reason.
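If duplicates did need closer inspection or removal, duplicated(keep=False) marks every copy and drop_duplicates() removes them; a sketch (not applied here):

# show all rows that have at least one identical twin
print(titanic[titanic.duplicated(keep=False)].head())
# remove exact duplicates, keeping the first occurrence of each
print(titanic.drop_duplicates().shape)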

Outliers

Outliers are data points that deviate significantly from the majority of observations in a dataset. Outliers can have a disproportionate impact on statistical measures and modeling results, therefore it's crucial that they are correctly identified and handled.

$Z$-scores, also known as standard scores, are a valuable tool for detecting outliers. They quantify how many standard deviations a data point lies from the mean of the dataset, providing a standardized measure of deviation that is independent of the original data's scale. A $z$-score is calculated by subtracting the mean of a variable from a specific data point (indicating how far the data point is from the average) and dividing the difference by the standard deviation:
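$$ z = \frac{x - \bar{x}}{s} $$

where $\bar{x}$ is the mean and $s$ the standard deviation of the variable. The cell below applies this to each numerical column: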

# Calculate the mean and standard deviation for each numerical column
mean_values = numerical_columns.mean()
std_values = numerical_columns.std()

# Calculate z-scores for each data point in the numerical columns
z_scores = (numerical_columns - mean_values) / std_values

A $z$-threshold is set, and data points whose absolute $z$-scores exceed the threshold are considered outliers. The choice of threshold depends on the specific analysis and domain knowledge. For the current analysis a threshold of $2$ is used:

# Define the z-score threshold for identifying outliers
z_threshold = 2

# Identify and print outliers using z-scores
outliers = titanic[abs(z_scores) > z_threshold]

# Filter to exclude columns with all NaN values
outliers = outliers.dropna(axis=1, how='all')

print(outliers.count())
age     29
fare    38
dtype: int64

Overall, 29 outliers were identified for the age variable and 38 outliers for the fare variable. In the current tutorial none of the outliers are removed.

Note

When dealing with higher-dimensional, multivariate data, the Mahalanobis distance can be used for outlier detection; it measures the distance between a data point and the center of the dataset while accounting for the correlations between variables.
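A minimal sketch of how this could look for the numerical columns, assuming approximate multivariate normality and using a chi-squared quantile as the cutoff (the 0.975 level here is an arbitrary illustrative choice):

from scipy.stats import chi2

# drop rows with missing values so the covariance matrix is well defined
data = numerical_columns.dropna()
mean_vec = data.mean().values
cov_inv = np.linalg.inv(data.cov().values)

# squared Mahalanobis distance of every observation from the center
diff = data.values - mean_vec
md_sq = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# under approximate normality, md_sq follows a chi-squared distribution
# with degrees of freedom equal to the number of variables
threshold = chi2.ppf(0.975, df=data.shape[1])
print((md_sq > threshold).sum(), "potential multivariate outliers")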

Visualizations

While measures like mean, median, mode and standard deviation provide insights into the typical values and spread of data, visualizations offer a more comprehensive understanding by revealing patterns, relationships, and additional information within the dataset. Different data types necessitate the use of specific types of plots.

Histograms

A histogram is a graphical representation of the distribution of the data. Histograms are particularly useful for understanding the general tendencies, spread, and shape of data. The sns.histplot() function from the Seaborn library is used to plot a histogram of the fare prices paid by the passengers on the Titanic:

sns.histplot(titanic['fare'], bins=50, color='skyblue')

In this case, the histogram is positively skewed, as it has an elongated tail on the right side. Overall, the histogram shows that the majority of passengers paid relatively low fares, while a small number paid high fares, producing the elongated tail. The plot illustrates the findings from the numerical summary, where a skewness value of roughly 4.8 was observed for fare. This is considered large and indicates that the distribution is heavily skewed to the right.

Box Plots

Similar to histograms, box plots (also known as box-and-whisker plots) are used to visualize the distribution of a variable. Box plots are particularly useful for comparing the distribution of a variable across different categories or groups. In the cell below we make use of the sns.boxplot() function from the Seaborn library to plot the age of the passengers across the different classes in the Titanic dataset:

# Create a boxplot of age across the passenger classes
sns.boxplot(x=titanic['class'], y=titanic['age'])

Each box in the plot represents the age distribution of one of the three passenger classes: First Class (First), Second Class (Second), and Third Class (Third). The horizontal line inside each box marks the median (50th percentile) age of passengers in that class. The height of each box shows the interquartile range (IQR), which covers the middle 50% of the age distribution. The vertical lines extending from the boxes are called "whiskers" and reach the most extreme ages within a reasonable range (Quartile 1 – 1.5 * IQR to Quartile 3 + 1.5 * IQR). Any data points beyond the whiskers are considered outliers. The plot above shows that the median age differs between classes: first-class passengers have the highest median age and third-class passengers the lowest. The second and third classes have multiple outliers, with the second class having outliers at both ends.

Bar Charts

Bar charts are used to display the frequency of categories within a dataset. They are useful for comparing different categories and understanding their relative sizes. The cell below makes use of the sns.barplot() function from the Seaborn library to provide a visual representation of the class distribution among the passengers:

# Count the number of passengers in each class
class_counts = titanic['class'].value_counts()
sns.barplot(x=class_counts.index, y=class_counts.values, palette='Set2')

The third class has the highest number of passengers, while the second class has the lowest. The plot clearly visualizes the class imbalances.
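As a side note, seaborn's countplot() performs the counting step itself, so the value_counts() call can be skipped:

# equivalent bar chart computed directly from the raw column
sns.countplot(data=titanic, x='class')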

Heatmaps

Heatmaps are commonly used to visualize relationships, patterns, or distributions within complex datasets. Heatmaps are particularly useful for identifying multicollinearity. The Seaborn library contains the sns.heatmap() function, used in the cell below to create a heatmap of the correlation matrix:

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=.5)

Each cell in the heatmap represents the correlation between two numerical variables in the dataset. The color of each cell indicates the strength and direction of the correlation: blue cells represent negative correlations, red cells represent positive correlations, and the intensity of the color corresponds to the magnitude of the coefficient, with paler cells indicating correlations close to 0. The numerical values within the cells provide the precise correlation coefficients.

Scatter Plots

Scatter plots are used to visualize the relationship between two numerical variables. Sometimes, outliers can be spotted in scatterplots, as they appear as points far away from the remaining parts of the data. In the cell below the sns.scatterplot() function from the Seaborn library is used to visualize the relationship between passenger ages and the fare prices paid:

sns.scatterplot(x=titanic['age'], y=titanic['fare'], alpha=0.5, color='b')  # Plot age vs. fare

The scatterplot above shows no discernible pattern or trend between age and fare: no significant clustering of points is visible and no clear correlation can be observed, as also indicated by the values of the correlation matrix.

Pair Plots

Pair plots are a comprehensive way to visualize the relationships between multiple numerical variables simultaneously. Pair plots can be generated for the age and fare variables using the sns.pairplot() function from the Seaborn library:

sns.pairplot(titanic[['age', 'fare']])

The plot suggests that there is no correlation between the age and fare variables. The fare distribution (lower-right panel) shows distinct patterns, with a concentration of passengers paying lower fares and fewer passengers paying higher fares.

Missing Data Patterns

Heatmaps can also provide a quick and intuitive way to identify patterns of missing data. You can create a heatmap of missing data by combining the sns.heatmap() function from the Seaborn library with the isna() function:

sns.heatmap(titanic.isna(), cbar=False, cmap='viridis')

The heatmap shows missing values as yellow cells, allowing quick identification of which variables (x-axis) and observations (y-axis) have missing data.