Data analysis, manipulation and plotting

Info

The tutorials may contain small exercises; all of them are optional.

Introduction

This tutorial will cover the following topics:

  1. Introduction to arrays and vectors in numpy.

  2. Loading/Saving data.

  3. Essential methods for data analysis/manipulation.

  4. Elementary plotting using matplotlib.

Run the cell below to import Numpy and Matplotlib:

#Import necessary libraries 
import numpy as np
import matplotlib.pyplot as plt

Creating data in numpy

Numpy arrays

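Numpy has several convenient functions for creating arrays. The following are especially useful for this course: np.ones, np.zeros, np.linspace, np.arange, np.random.uniform and np.random.normal (the official Numpy documentation on array creation describes more).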

Don't worry about memorizing them for now.

The cell below shows a few samples of their use:

a_ones = np.ones((2, 3)) # 2 by 3 array of ones. 
a_zeros = np.zeros((3, 2)) # 3 by 2 array of zeros.
a_linspace = np.linspace(0, 10, 5) # creates an array of 5 numbers evenly spaced from 0 to 10 (both endpoints included).
a_arange = np.arange(0, 10, 2) # creates an array from 0 up to, but not including, 10 with a stride of 2, so the largest value is 8.
a_uniform = np.random.uniform(size= (2, 2)) # creates a 2 by 2 array of "random" numbers drawn from a uniform distribution. 
a_normal = np.random.normal(size=(2, 2))  # creates a 2 by 2 array of "random" numbers drawn from a normal/gaussian distribution. 

print('ones:\n', a_ones)
print('zeros:\n', a_zeros)
print('linspace:\n', a_linspace)
print('arange:\n', a_arange)
print('uniform:\n', a_uniform)
print('normal:\n', a_normal)
ones:
 [[1. 1. 1.]
 [1. 1. 1.]]
zeros:
 [[0. 0.]
 [0. 0.]
 [0. 0.]]
linspace:
 [ 0.   2.5  5.   7.5 10. ]
arange:
 [0 2 4 6 8]
uniform:
 [[0.369003   0.76433793]
 [0.73035621 0.46291684]]
normal:
 [[ 0.46521356  1.26446077]
 [-0.02335452  0.20548508]]
Note

There is no need for iteration (i.e. loops) when creating arrays in numpy!
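
To make the note above concrete, the following minimal sketch builds the same array of squares twice, once with a Python loop and once with a single Numpy expression. The variable names are illustrative only and are not used elsewhere in the tutorial.

```python
import numpy as np

# Loop version: build a list element by element, then convert it.
squares_loop = []
for i in range(5):
    squares_loop.append(i ** 2)
squares_loop = np.array(squares_loop)

# Vectorized version: one expression, no explicit loop.
squares_vec = np.arange(5) ** 2

print(squares_loop)  # [ 0  1  4  9 16]
print(squares_vec)   # [ 0  1  4  9 16]
```

Both versions give the same result; the vectorized form is shorter and is usually considerably faster for large arrays.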

Saving arrays with numpy

The following example shows how to save Numpy arrays. Numpy arrays can be stored in two formats: Numpy's own binary format (.npy, written with np.save) and plain text (written with np.savetxt):

a_normal_50 = np.random.normal(size=(50,2))
## Saving the array as an npy file (Numpy's binary format, which is not compressed). The ./Data directory must already exist.
np.save('./Data/RandomData.npy',a_normal_50) 

a_arange_50 = np.arange(0,100,2)
np.save('./Data/StructuredData.npy',a_arange_50)

# Numpy can also save arrays as plain text files.
a_linspace_50 = np.linspace((1,2),(10,20),10) # 10 rows, 2 columns
### Saving data as a regular txt file; pass delimiter=',' to write a csv file instead
np.savetxt('./Data/Txt_file.txt',a_linspace_50)

Loading data with numpy

Numpy arrays can be loaded with the Numpy functions np.load(path) and np.loadtxt(path) as shown below:

## Loading data stored as npy files (Numpy's binary format)
A = np.load('./Data/RandomData.npy') 
B = np.load('./Data/StructuredData.npy')

# Load data stored as a plain text file (for a csv file, pass delimiter=',')
C = np.loadtxt('./Data/Txt_file.txt')

# Note A[:N] is only a slice i.e. the first N elements of A
print('A:\n',A[:5])
print('B:\n',B[:10])
print('C:\n',C[:5])
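
The saving and loading steps can be combined into a round trip. The sketch below is one possible way to check that the data survives unchanged; the file names are hypothetical and live in a temporary directory, so nothing in ./Data is touched. It also uses np.savez_compressed, which is not used elsewhere in this tutorial, to show what a genuinely compressed Numpy file looks like.

```python
import os
import tempfile

import numpy as np

data = np.linspace((1, 2), (10, 20), 10)  # 10 rows, 2 columns

# Hypothetical file names inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, 'example.npy')
    npz_path = os.path.join(tmp_dir, 'example.npz')
    csv_path = os.path.join(tmp_dir, 'example.csv')

    np.save(npy_path, data)                    # binary, uncompressed
    np.savez_compressed(npz_path, data=data)   # binary, compressed archive
    np.savetxt(csv_path, data, delimiter=',')  # plain text, comma separated

    from_npy = np.load(npy_path)
    with np.load(npz_path) as archive:         # archives are indexed by name
        from_npz = archive['data']
    from_csv = np.loadtxt(csv_path, delimiter=',')

print(np.array_equal(data, from_npy))  # True
print(np.array_equal(data, from_npz))  # True
print(np.allclose(data, from_csv))     # True
```

The text file is compared with np.allclose rather than np.array_equal because text formats store a decimal representation of each number, which in general need not reproduce every bit.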

Operate along dimensions

Numpy arrays are often used to handle multidimensional data. In these instances you may want to perform operations along only one or some of the array axes.

Example: Mean

In this example, we calculate the average of $N$ random vectors.

The cell below defines an $N\times K$ matrix of random values:

N, K = 20, 10
r = np.random.uniform(size=(N, K))

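The Numpy function np.mean calculates averages over Numpy arrays. The axis argument specifies the axis that is averaged over: axis=0 averages over the rows (giving one value per column, here $K$ values), while axis=1 averages over the columns (giving one value per row, here $N$ values). This is demonstrated in the cell below: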

np.mean(r, axis=0)
Tip

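The axis argument is supported by most of Numpy's reduction functions, including np.sum , np.std , np.min and np.max . Elementwise functions such as np.sqrt do not take an axis argument.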
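
A quick way to see what the axis argument does is to look at the shape of the result. The following sketch reuses the $N\times K$ setup from above; the result names are illustrative.

```python
import numpy as np

N, K = 20, 10
r = np.random.uniform(size=(N, K))

mean_over_rows = np.mean(r, axis=0)     # collapses axis 0 (the N rows)
mean_over_columns = np.mean(r, axis=1)  # collapses axis 1 (the K columns)
mean_of_all = np.mean(r)                # no axis: a single number

print(mean_over_rows.shape)     # (10,)
print(mean_over_columns.shape)  # (20,)
print(np.ndim(mean_of_all))     # 0
```

The axis that is passed to the function is the one that disappears from the shape of the result.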

Essential Numpy array methods for data analysis and manipulation

The next section covers essential methods for data analysis and manipulation. The methods will be used abundantly throughout the course and are worth paying careful attention to.

  • np.mean(Array, axis) , np.std(Array, axis) : Calculate the mean and the standard deviation of a given Numpy array of numbers (floats or integers).
  • a.shape : Gives the shape (the size along each dimension) of a given array; len(list/Array) gives the length of the first list/Array dimension.
  • Slicing: the : operator creates slices of an array A as A[start:stop:step] . Read more in the official Numpy indexing guide.
  • Broadcasting: used in Numpy to perform operations between arrays of different shapes. Read more in the official Numpy broadcasting guide.
  • Elementwise addition and multiplication
  • np.concatenate(Array list, axis) : Join Numpy arrays along the existing axis given by axis .

Next, we consider a few examples to demonstrate the functionalities described above:

A = np.linspace(0,9,10)

B = np.array([
    [-16, 15, -14, 13],
    [-12, 11, -10, 9],
    [-8, 7, -6, 5],
    [-4, 3, -2, 1]
])

print('A:\n',A)
print('B:\n',B)
A:
 [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
B:
 [[-16  15 -14  13]
 [-12  11 -10   9]
 [ -8   7  -6   5]
 [ -4   3  -2   1]]
### Mean of an array 
# Calling the mean function from the Numpy library to determine the mean of A.
print('Mean A:\n',np.mean(A)) 

# Most of these Numpy functions are also available as methods on the array object
print('Mean of using Array method:\n',A.mean())

### Std of an array 
print('Std of A:\n',np.std(A))

### Sum of an array 
print('A sum:\n', np.sum(A))

### shape (size) of an array
print('A shape:\n',A.shape)
print('B shape:\n',B.shape)

## np.concatenate([A, B[0,:]]) example
print('Concatenation of A and Slice of B matrix:\n',np.concatenate([A,B[0,:]],axis=0))
Mean A:
 4.5
Mean of using Array method:
 4.5
Std of A:
 2.8722813232690143
A sum:
 45.0
A shape:
 (10,)
B shape:
 (4, 4)
Concatenation of A and Slice of B matrix:
 [  0.   1.   2.   3.   4.   5.   6.   7.   8.   9. -16.  15. -14.  13.]
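
The example above joins two one-dimensional arrays. For two-dimensional arrays the axis argument decides whether rows or columns are appended. A minimal sketch with made-up arrays:

```python
import numpy as np

top = np.zeros((2, 3))
bottom = np.ones((1, 3))
right = np.ones((2, 1))

# All dimensions except the one being joined must match.
stacked_rows = np.concatenate([top, bottom], axis=0)  # shapes (2, 3) and (1, 3)
stacked_cols = np.concatenate([top, right], axis=1)   # shapes (2, 3) and (2, 1)

print(stacked_rows.shape)  # (3, 3)
print(stacked_cols.shape)  # (2, 4)
```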

Slicing of arrays

### Slicing of array
print(B[:,0])

print(A[:5])
print('A[5:], A array except the first 5:\n',A[5:])

print('A[:-5], A array except the last 5:\n', A[:-5])

print('A[1::2] array of every second element of A starting from the second:\n',A[1::2])
[-16 -12  -8  -4]
[0. 1. 2. 3. 4.]
A[5:], A array except the first 5:
 [5. 6. 7. 8. 9.]
A[:-5], A array except the last 5:
 [0. 1. 2. 3. 4.]
A[1::2] array of every second element of A starting from the second:
 [1. 3. 5. 7. 9.]
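
Slices can be combined across dimensions, and a basic slice is a view of the original data rather than a copy. The sketch below illustrates both points with a small stand-in array (not the B defined earlier).

```python
import numpy as np

B = np.arange(16).reshape(4, 4)  # rows [0..3], [4..7], [8..11], [12..15]

block = B[:2, 1:3]         # rows 0-1, columns 1-2: [[1, 2], [5, 6]]
reversed_row = B[0, ::-1]  # first row, read backwards
print(reversed_row)        # [3 2 1 0]

# Basic slices are views: writing to the slice also changes B.
block[0, 0] = -1
print(B[0, 1])  # -1

# Use .copy() when an independent array is needed.
independent = B[:2, 1:3].copy()
independent[0, 0] = 100
print(B[0, 1])  # -1 (unchanged by the write to the copy)
```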

Array Arithmetic

### Addition of arrays
print('Adding a slice of A shape (4,) to B shape (4,4) using broadcasting:\n',A[:4]+B)

print('Adding constant to A (10,) using broadcasting:\n',A+10)
print('Adding single element array (shape (1,)) to B (shape (4,4)) using broadcasting:\n',B  + np.array([10]))

### Elementwise multiplication of arrays
print('Elementwise multiplication of a slice of A (shape (4,)) to B (shape (4,4)) using broadcasting:\n',A[:4]*B)

### Elementwise division of arrays
print('Elementwise division of a slice of B (shape (4,)) and A (shape (4,)):\n',B[0,:]/A[1:5])
Adding a slice of A shape (4,) to B shape (4,4) using broadcasting:
 [[-16.  16. -12.  16.]
 [-12.  12.  -8.  12.]
 [ -8.   8.  -4.   8.]
 [ -4.   4.   0.   4.]]
Adding constant to A (10,) using broadcasting:
 [10. 11. 12. 13. 14. 15. 16. 17. 18. 19.]
Adding single element array (shape (1,)) to B (shape (4,4)) using broadcasting:
 [[-6 25 -4 23]
 [-2 21  0 19]
 [ 2 17  4 15]
 [ 6 13  8 11]]
Elementwise multiplication of a slice of A (shape (4,)) to B (shape (4,4)) using broadcasting:
 [[ -0.  15. -28.  39.]
 [ -0.  11. -20.  27.]
 [ -0.   7. -12.  15.]
 [ -0.   3.  -4.   3.]]
Elementwise division of a slice of B (shape (4,)) and A (shape (4,)):
 [-16.           7.5         -4.66666667   3.25      ]
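
In the examples above, a shape (4,) array is matched against the last axis of a (4, 4) array, so it is applied to every row. To apply a vector along the other axis, it can be reshaped into a column first. The following sketch assumes illustrative names and uses np.newaxis for the reshape.

```python
import numpy as np

row = np.array([0.0, 1.0, 2.0, 3.0])  # shape (4,)
column = row[:, np.newaxis]           # shape (4, 1)
grid = np.zeros((4, 4))

added_per_column = grid + row  # row is repeated for every row of grid
added_per_row = grid + column  # column is repeated for every column of grid

print(added_per_column[1])  # [0. 1. 2. 3.]
print(added_per_row[1])     # [1. 1. 1. 1.]

# Incompatible shapes raise an error instead of being guessed.
try:
    grid + np.zeros(3)
except ValueError as error:
    print('Cannot broadcast:', error)
```

Two dimensions are compatible when they are equal or when one of them is 1; shapes are compared starting from the last axis.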

Comparison operators

Just like the elementwise arithmetic operators, Numpy implements elementwise comparison operators (see the official guide for additional detail). Applying one to an array returns a boolean array of the same shape. For example, to find all elements of vr larger than $98$, write:

vr = np.array([0, 99, 5, 70, 24, 1, 200]) # Create an array of example values
vr > 98

This boolean array can be used to select elements from a Numpy array:

comparison = vr > 98
vr[comparison]

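Boolean arrays can be combined by using the logical operators & and | . Each comparison must be wrapped in parentheses, because & and | bind more tightly than the comparison operators: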

vr[(vr < 2) | (vr > 98)]

Boolean indexing can also be used for assignment:

vr[vr > 50] = 0
vr
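
A few further mask operations are often useful. The sketch below starts from a fresh copy of the example values and shows the & and ~ operators, counting matches, and np.where, which is not covered above and selects between two alternatives without modifying the original array.

```python
import numpy as np

vr = np.array([0, 99, 5, 70, 24, 1, 200])

in_range = (vr >= 5) & (vr <= 99)  # both conditions must hold
print(vr[in_range])                # [99  5 70 24]
print(vr[~in_range])               # [  0   1 200]
print(np.count_nonzero(in_range))  # 4

# np.where picks from two alternatives and leaves vr untouched.
clipped = np.where(vr > 50, 50, vr)
print(clipped)                     # [ 0 50  5 50 24  1 50]
```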

Basic plotting with matplotlib

Matplotlib's pyplot module (imported as plt above) provides functions for creating and manipulating plots.

plot and scatter will be the most frequently used functions in this course:

  • plot is typically used for creating connected line segments described by x and y data.
  • scatter is used for plotting individual points, e.g. from a dataset.

Line plot

Take a look at the following sample plot code and output:

x_range = np.linspace(0, 5, 50) # Creates an array of linearly spaced elements
y_linear = x_range + 3 # adding a constant to the numpy array (broadcasting)
y_quadratic = x_range**2 # elementwise exponentiation
y_exp = np.exp(x_range) # exponential function applied elementwise to x_range

plt.plot(x_range, y_linear)
plt.plot(x_range, y_quadratic)
plt.plot(x_range,y_exp);

Scatter plot

Scatter plots are two-dimensional plots of individual points. The example below creates a quadratic function, adds normally distributed random noise to it, and plots both the original (with plt.plot ) and the noisy points (with plt.scatter ).

x_range = np.linspace(-10, 10, 50) # Create the x-values for the plot
y_values = x_range**2 # Calculate the y-values for the quadratic

noise = np.random.normal(scale=5, size=50) # Create random noise
y_noise = y_values + noise # Add the noise to the y-values

plt.plot(x_range, y_values) # Plot the quadratic function
plt.scatter(x_range, y_noise); # Plot the noisy points

Styling

Matplotlib allows customization of plots. Some useful functionality is described below:

  • plt.plot takes an optional third argument, a format string ( fmt ), which sets the styling of lines. Generally, a letter designating a color (e.g. r , g , b ) and a symbol designating line or point style (e.g. + , -- ) are combined to produce a format, e.g. r+ to create red crosses or r-- to create a red dashed line.
  • plt.scatter takes an argument c for the color (can be letter form or complete color names) and an argument marker for the marker style (e.g. + , o ).

Here is a basic example:

plt.plot(x_range, y_values, 'r--')
plt.scatter(x_range, y_noise, c='green', marker='d');

Advanced styling

Matplotlib automatically assigns colors to lines and point series using an internally defined style, but you can also change colors manually. The current style can be changed for the rest of the session using plt.style.use(style) or temporarily inside a with block using plt.style.context(style) . The Matplotlib documentation contains a reference of the built-in style sheets, and plt.style.available lists the style names available in your installation. The cell below shows an example:

# We create some normal and uniformly distributed noise. (random data i.e. not structured)
xs, ys = np.random.normal(size=(2, 100))
xu, yu = np.random.uniform(size=(2,100))

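# From Matplotlib 3.6 on, the seaborn styles are named 'seaborn-v0_8'; older versions use 'seaborn'.
with plt.style.context('seaborn-v0_8'):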
    plt.scatter(xs, ys, marker='+')
    plt.scatter(xu, yu, marker='x')

Labels, Title and Legend

Legend, title, and axis labels can be added to plots using the following functions:

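# From Matplotlib 3.6 on, the seaborn styles are named 'seaborn-v0_8'; older versions use 'seaborn'.
with plt.style.context('seaborn-v0_8'):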
    plt.scatter(xs, ys, marker='+')
    plt.scatter(xu, yu, marker='x')
    plt.legend(['normal', 'uniform'])
    
    plt.title('Comparison of distributions')
    plt.ylabel('Y')
    plt.xlabel('X')
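
An alternative to passing a list of names to plt.legend is to label each series where it is plotted, which keeps the name next to the data it describes. This is a sketch of that variant, recreating the random data so it runs on its own:

```python
import matplotlib.pyplot as plt
import numpy as np

xs, ys = np.random.normal(size=(2, 100))
xu, yu = np.random.uniform(size=(2, 100))

plt.scatter(xs, ys, marker='+', label='normal')
plt.scatter(xu, yu, marker='x', label='uniform')
plt.legend()  # picks up the label of each series

plt.title('Comparison of distributions')
plt.xlabel('X')
plt.ylabel('Y');
```

With this approach the legend stays correct even if the order of the plotting calls changes.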

Making subplots

Matplotlib makes it possible to combine multiple plots into one figure. The function plt.subplots creates a figure with multiple sub-plots. The function returns a figure object and an array of axes objects. The axes objects are used to make plots in each subplot, add titles, and so forth. An example is shown in the cell below:

figure, axes = plt.subplots(2, 2, figsize=(7, 5))

axes[0, 0].plot(x_range, y_linear)
axes[0, 1].plot(x_range, y_quadratic)
axes[1, 0].scatter(xs, ys)
axes[1, 1].plot(x_range, y_values)
axes[1, 1].scatter(x_range, y_noise);
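
Titles and axis labels for individual subplots are set through methods on the axes objects, which mirror the plt functions used earlier (for instance set_title instead of plt.title). The following sketch uses a single row of two subplots, in which case axes is a one-dimensional array; the data is illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)

figure, axes = plt.subplots(1, 2, figsize=(7, 3))

axes[0].plot(x, x + 3)
axes[0].set_title('Linear')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

axes[1].plot(x, x**2, 'r--')
axes[1].set_title('Quadratic')
axes[1].set_xlabel('x')

figure.tight_layout()  # adjust spacing so labels do not overlap
```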

Saving plots

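Use plt.savefig(output_path) to save the current figure; the file format is inferred from the file extension. In a notebook, call it in the same cell that creates the plot, since the figure is normally closed once the cell has finished. An example is provided below: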

plt.savefig('./outputs.pdf');
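
When several figures are in use, it can be clearer to call savefig on the figure object itself, so that there is no doubt about which plot is written. The sketch below assumes a hypothetical file name in a temporary directory; the dpi and bbox_inches arguments are optional.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)

figure, ax = plt.subplots()
ax.plot(x, x**2)

# Hypothetical output location in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    png_path = os.path.join(tmp_dir, 'quadratic.png')
    figure.savefig(png_path, dpi=150, bbox_inches='tight')
    file_size = os.path.getsize(png_path)

print(file_size > 0)  # True
```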