The basic usage is as follows:
from scipy.integrate import quad

def f(x):
    return x

x_lower = 0
x_upper = 1
val, abserr = quad(f, x_lower, x_upper)  # Integral value and absolute error estimate.
For simple functions, we can use a lambda function:
import numpy as np

val, abserr = quad(lambda x: np.exp(-x ** 2), -np.inf, np.inf)
Higher-dimensional integration works in the same way:
from scipy.integrate import dblquad

def integrand(x, y):
    # Note: dblquad passes the inner variable as the first argument;
    # this integrand is symmetric, so the order does not matter here.
    return np.exp(-x**2 - y**2)

x_lower = 0
x_upper = 10
y_lower = lambda x: x      # The inner bounds may depend on the outer variable.
y_upper = lambda x: x + 1
val, abserr = dblquad(integrand, x_lower, x_upper, y_lower, y_upper)
A system of ODEs is usually formulated in standard form before it is attacked numerically. The standard form is y' = f(y, t), where y = [y1(t), y2(t), ..., yn(t)] is the vector of unknowns and f is a function that returns the derivatives. To solve an ODE, we need to know the function f and an initial condition, y(0).
Note that higher-order ODEs can always be written in this form by introducing new variables for the intermediate derivatives. For example, the second-order equation y'' = -y becomes the first-order system y1' = y2, y2' = -y1, with y1 = y and y2 = y'.
SciPy provides two different ways to solve ODEs: an API based on the function odeint, and an object-oriented API based on the class ode. Usually, odeint is easier to get started with, but the ode class offers a finer level of control. Here we will use the odeint function. For more information about the class ode, try help(ode).
To use odeint, first import it from the scipy.integrate module:
from scipy.integrate import odeint, ode
Once we have defined the Python function f and the array y_0, we can use odeint as:
y_t = odeint(f, y_0, t)
where t is an array with the time coordinates for which to solve the ODE problem. y_t is an array with one row for each point in time in t, and one column for each solution variable.
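As a complete, minimal example (a damped harmonic oscillator, made up for illustration), rewritten in standard form and solved with odeint:
import numpy as np
from scipy.integrate import odeint

def f(y, t):
    # y = [position, velocity]; y'' = -y - 0.1*y' as a first-order system.
    return [y[1], -y[0] - 0.1 * y[1]]

y_0 = [1.0, 0.0]             # Initial position and velocity.
t = np.linspace(0, 10, 100)  # Times at which to evaluate the solution.
y_t = odeint(f, y_0, t)      # Shape (100, 2): one column per variable.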
SciPy's fft module enables easy computation of Fourier transforms, a critical tool in computational physics, signal processing, and data analysis.
from scipy.fft import fft, ifft, fftfreq
import numpy as np
N = 600 # Number of sample points.
T = 1.0 / 800.0 # Sample spacing.
x = np.linspace(0.0, N*T, N, endpoint=False)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = fft(y)
xf = fftfreq(N, T)[:N//2]
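To inspect the result, we can plot the single-sided amplitude spectrum; the 2.0/N factor is the usual normalization (this plotting step is an addition to the snippet above):
import matplotlib.pyplot as plt

# Peaks appear at the two input frequencies, 50 Hz and 80 Hz.
plt.plot(xf, 2.0 / N * np.abs(yf[:N // 2]))
plt.xlabel('Frequency [Hz]')
plt.ylabel('Amplitude')
plt.show()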
The linear algebra module contains various matrix-related functions, including linear equation solving, eigenvalue solvers, matrix functions (e.g., the matrix exponential), and decompositions (SVD, LU, Cholesky).
import numpy as np
from scipy.linalg import *

A = np.array([[1,2,3], [4,5,6], [7,8,10]])  # Note: with 9 instead of 10, the matrix would be singular.
b = np.array([1,2,3])
x = solve(A, b)  # Solve the linear system A x = b.
The eigenvalue problem for a matrix A is A v_n = λ_n v_n, where v_n is the n-th eigenvector and λ_n is the n-th eigenvalue.
To calculate the eigenvalues of a matrix, use eigvals; to calculate both eigenvalues and eigenvectors, use the function eig:
lam = eigvals(A)
lam, v = eig(A)
The eigenvector corresponding to the n-th eigenvalue (stored in lam[n]) is the n-th column of v, i.e., v[:,n].
There are also more specialized eigensolvers, like eigh for Hermitian (or real symmetric) matrices.
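A minimal sketch, assuming the imports above and a made-up symmetric matrix:
H = np.array([[2.0, 1.0], [1.0, 2.0]])  # Symmetric example matrix.
evals, evecs = eigh(H)                  # Real eigenvalues, in ascending order.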
inv(A)  # Matrix inverse.
det(A)  # Matrix determinant.
# Matrix norms of various orders.
norm(A, ord=1)
norm(A, ord=2)
norm(A, ord=np.inf)
Sparse matrices are often useful in numerical simulations dealing with large systems, where the problem can be described in matrix form, and matrices or vectors mostly contain zeros. SciPy has good support for sparse matrices, with basic linear algebra operations (e.g., equation solving, eigenvalue calculations, etc.).
There are many possible strategies for storing sparse matrices efficiently, such as coordinate (COO) form, list-of-lists (LIL) form, and compressed sparse column (CSC) and row (CSR) forms. Each format has some advantages and disadvantages. Most computational algorithms (equation solving, matrix-matrix multiplication, etc.) can be efficiently implemented using the CSR or CSC formats, but these are less intuitive and harder to initialize. So a sparse matrix is often initially created in COO or LIL format (where we can efficiently add elements to the sparse matrix data), and then converted to CSC or CSR before being used in real calculations.
When we create a sparse matrix, we have to choose which format it should be stored in. For example,
import numpy as np
from scipy.sparse import *

# Dense matrix.
M = np.array([[1,0,0,0], [0,3,0,0], [0,1,1,0], [1,0,0,1]])
# Convert from dense to sparse.
A = csr_matrix(M)
# Convert from sparse to dense.
A.todense()
A more efficient way to create sparse matrices is to create an empty matrix and populate it using matrix indexing, which avoids creating a potentially large dense matrix first.
A = lil_matrix((4,4)) # Empty 4x4 sparse matrix.
A[0,0] = 1
A[1,1] = 3
A[2,2] = A[2,1] = 1
A[3,3] = A[3,0] = 1
# Sparse matrix - dense vector multiplication.
v = np.array([1.0, 2.0, 3.0, 4.0])  # Example dense vector.
A * v
LIL stands for LIsts of Lists.
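As a sketch of sparse equation solving with the matrix A built above (spsolve lives in scipy.sparse.linalg; the right-hand side b is made up for illustration):
from scipy.sparse.linalg import spsolve

A_csr = A.tocsr()                   # Convert LIL to CSR before computing.
b = np.array([1.0, 2.0, 3.0, 4.0])
x = spsolve(A_csr, b)               # Solve the sparse system A x = b.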
Optimization (finding the minima or maxima of a function) is a large field in mathematics, and the optimization of complicated functions or functions of many variables can be rather involved.
We can use several algorithms to find the minima of a function:
from scipy import optimize

def f(x):
    return x**4 + 4*x**3 + (x-2)**2

x_min1 = optimize.fmin_bfgs(f, -2)     # BFGS, starting from x = -2.
x_min2 = optimize.brent(f)             # Brent's method.
x_min3 = optimize.fminbound(f, -4, 2)  # Bounded minimization on [-4, 2].
To find a root of an equation of the form f(x) = 0, we can use the fsolve function. It requires an initial guess:
import numpy as np
from scipy import optimize

def f(omega):
    return np.tan(2 * np.pi * omega) - 3.0 / omega

optimize.fsolve(f, 0.1)
The interp1d function, when given arrays describing X and Y data, returns an object that behaves like a function: it can be called for an arbitrary value of x (in the range covered by X) and returns the corresponding interpolated y value:
import numpy as np
from numpy.random import randn
from scipy.interpolate import *

def f(x):
    return np.sin(x)

n = np.arange(0, 10)                 # Measurement points.
x = np.linspace(0, 9, 100)           # Fine grid on which to evaluate the interpolants.
y_meas = f(n) + 0.1 * randn(len(n))  # Simulated noisy measurements.

linear_interpolation = interp1d(n, y_meas)
y_interp1 = linear_interpolation(x)

cubic_interpolation = interp1d(n, y_meas, kind='cubic')
y_interp2 = cubic_interpolation(x)
The scipy.stats module contains a large number of statistical distributions, statistical functions, and tests.
from scipy import stats
X = stats.poisson(3.5)
Y = stats.norm()
X.mean(), X.std(), X.var()
Y.mean(), Y.std(), Y.var()
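Frozen distribution objects like X and Y also expose the usual probability functions, for example:
X.pmf(2)        # Probability mass function of the Poisson distribution at k=2.
Y.pdf(0.5)      # Probability density function of the normal distribution.
Y.cdf(0.0)      # Cumulative distribution function; 0.5 for the standard normal.
X.rvs(size=10)  # Draw 10 random samples.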
SciPy includes functions for conducting statistical tests, like t-tests and ANOVA, aiding in hypothesis testing and data analysis.
Calculate the T-test for the means of two independent samples:
t_statistic, p_value = stats.ttest_ind(X.rvs(size=1000), X.rvs(size=1000))
Test if the mean of a single sample of data is 0.1:
stats.ttest_1samp(Y.rvs(size=1000), 0.1)
A low p-value means that we can reject the null hypothesis.
Data visualization plays a vital role in data analysis, enabling the effective communication of complex information. Matplotlib and Seaborn are two widely-used Python libraries for data visualization. Matplotlib offers a wide range of plotting tools, while Seaborn provides a high-level interface for drawing attractive statistical graphics.
Matplotlib is a widely-used 2D plotting library in Python that gives fine-grained control over figures and axes. Let's start with a simple example to create a basic line plot:
import matplotlib.pyplot as plt
import numpy as np
# Generate data.
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create a simple line plot.
plt.plot(x, y, label='sin(x)')
plt.title('Simple line plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.legend()
plt.show()
In this example, we use NumPy to generate data points for the x-axis and calculate the corresponding y-values. The plot function is then used to create a line plot. Finally, the title, xlabel, ylabel, and legend functions add a title, axis labels, and a legend to the plot.
Matplotlib supports various types of plots, including scatter plots, bar plots, histograms, and more. Explore the documentation for more plot types and customization options.
Let's create a 2D contour plot using Matplotlib.
import matplotlib.pyplot as plt
import numpy as np
# Generate data.
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))
# Create a 2D contour plot.
plt.contourf(X, Y, Z, cmap='viridis')
plt.colorbar(label='sin(sqrt(x^2 + y^2))')
plt.title('Contour plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()
Seaborn, built on Matplotlib, provides a more intuitive interface for creating statistical plots. It integrates well with pandas data structures and offers built-in themes for enhanced visual appeal.
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
# Generate data.
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)
# Create a scatter plot using Seaborn.
sns.scatterplot(x=x, y=y, color='blue')
plt.title('Scatter Plot with Seaborn')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()
Let's create a customized histogram with Seaborn, including specific bin edges, colors, and additional statistical annotations.
import seaborn as sns
import matplotlib.pyplot as plt
# Load a dataset.
tips = sns.load_dataset('tips')
# Create a histogram with customizations.
sns.histplot(tips['total_bill'], bins=[10, 20, 30, 40, 50], color='salmon')
plt.title('Customized histogram')
plt.xlabel('Total bill ($)')
plt.ylabel('Frequency')
# Annotate with mean and median.
plt.axvline(tips['total_bill'].mean(), color='blue', linestyle='dashed', linewidth=2, label='Mean')
plt.axvline(tips['total_bill'].median(), color='green', linestyle='dashed', linewidth=2, label='Median')
plt.legend()
plt.show()
Let's create a scatter plot with Seaborn that uses different colors for a categorical variable and markers with varied sizes.
import seaborn as sns
import matplotlib.pyplot as plt
# Load a dataset.
tips = sns.load_dataset('tips')
# Create a scatter plot.
sns.scatterplot(x='total_bill', y='tip', hue='day', size='size', sizes=(20, 200),
data=tips, palette='Set2', alpha=0.8)
plt.title('Advanced Scatter Plot')
plt.xlabel('Total bill ($)')
plt.ylabel('Tip ($)')
plt.legend(title='Day')
plt.show()
One of the strengths of Seaborn is its ability to work seamlessly with Matplotlib. You can use Matplotlib functions alongside Seaborn to customize your plots further. Here's an example combining Matplotlib and Seaborn to create a histogram with a kernel density estimate:
import seaborn as sns
import matplotlib.pyplot as plt
# Load a dataset.
tips = sns.load_dataset('tips')
# Create a histogram with a Kernel Density Estimate using Seaborn.
sns.histplot(tips['total_bill'], kde=True, color='skyblue')
# Customize Matplotlib features.
plt.title('Histogram with KDE')
plt.xlabel('Total bill ($)')
plt.ylabel('Frequency')
plt.show()
Matplotlib supports advanced plots like contour plots, 3D plots, and subplots.
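For instance, a minimal subplot sketch (with made-up sine and cosine data):
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))  # One row, two columns.
axes[0].plot(x, np.sin(x))
axes[0].set_title('sin(x)')
axes[1].plot(x, np.cos(x))
axes[1].set_title('cos(x)')
plt.show()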
Seaborn excels in creating complex statistical plots.
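For example, pairplot draws all pairwise relationships in a dataset, plus per-variable distributions, in a single figure (shown here for the tips dataset loaded above):
sns.pairplot(tips, hue='day')
plt.show()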
Matplotlib and Seaborn can plot directly from a pandas DataFrame, simplifying the workflow in data analysis tasks.
pandas is a powerful Python library for data manipulation and analysis. It provides fast, flexible data structures like Series and DataFrame, designed to work with structured data intuitively and efficiently.
In the pandas library, the standard import convention uses the aliases np for NumPy and pd for pandas:
import numpy as np
import pandas as pd
You can create a Series by providing a list of values. pandas will generate a default RangeIndex:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
Creating a DataFrame involves passing a NumPy array with a datetime index and labeled columns:
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
Alternatively, a DataFrame can be formed from a dictionary of objects:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
The resulting DataFrame has diverse data types.
To examine the top and bottom rows of a DataFrame, use head() and tail():
df.head()
df.tail(3)
Retrieve the DataFrame's index or column labels:
df.index
df.columns
df.to_numpy() # Convert the DataFrame to a NumPy array.
Note: NumPy arrays have one dtype for the entire array, while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. If the common data type is object, DataFrame.to_numpy() will require copying data.
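For example, with the frames defined above:
df.to_numpy()   # All-float frame: converts to a float64 array.
df2.to_numpy()  # Mixed dtypes: falls back to an object array and copies the data.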
pandas offers various methods for data selection. We'll explore both label-based and position-based approaches.
For a DataFrame, passing a single label selects a column and yields a Series, equivalent to df.A:
df["A"]
Passing a slice (with :) selects matching rows:
df[0:3]
df["20130102":"20130104"]
Use loc and at for label-based indexing:
df.loc[dates[0]] # Selecting a row by label.
df.loc[:, ["A", "B"]] # Selecting all rows for specific columns.
df.loc["20130102":"20130104", ["A", "B"]] # Both endpoints are included.
df.at[dates[0], "A"] # Fast scalar access.
For position-based indexing, employ iloc and iat:
df.iloc[3] # Selecting via position.
df.iloc[3:5, 0:2] # Slicing rows and columns.
df.iat[1, 1] # Fast scalar access.
Selecting values from a DataFrame where a boolean condition is met:
df[df > 0]
Select rows based on a condition:
df[df["A"] > 0]
Use the Series.isin method for filtering:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2[df2["E"].isin(["two", "four"])]
Generate quick summary statistics using describe():
df.describe()
Transpose the DataFrame with .T:
df.T
Sort the DataFrame by index or values:
df.sort_index(axis=1, ascending=False) # Sort by index.
df.sort_values(by="B") # Sort by values.
Compute the mean for each column or row:
df.mean()
df.mean(axis=1)
Perform operations with another Series or DataFrame:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
df.sub(s, axis="index")
Apply user-defined functions using agg and transform:
# Calculate the mean of each column and then multiply it by 5.6.
# The result will be a pandas Series with the aggregated value for each column in df.
df.agg(lambda x: np.mean(x) * 5.6)
# Apply a function to each element of the DataFrame.
df.transform(lambda x: x * 101.2)
Compute value counts for a Series:
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
pandas simplifies handling missing data with methods like dropna(), fillna(), and interpolate().
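A minimal sketch, using a made-up frame df3 with missing values:
df3 = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 5.0, 6.0]})
df3.dropna()         # Drop every row that contains a missing value.
df3.fillna(value=0)  # Replace missing values with 0.
df3.interpolate()    # Fill gaps by linear interpolation.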
Transform data using operations like pivoting, melting, and applying custom functions.
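For instance, a sketch of reshaping a made-up frame between wide and long formats:
wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})
long = wide.melt(id_vars="id")                              # Wide -> long.
long.pivot(index="id", columns="variable", values="value")  # Long -> wide.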
pandas excels in time series data analysis, offering functionalities for resampling, shifting, and window operations.
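A small sketch with a made-up one-second-resolution series:
rng = pd.date_range("1/1/2012", periods=100, freq="s")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample("1min").sum()     # Downsample from 1-second to 1-minute bins.
ts.shift(2)                   # Shift values forward by two periods.
ts.rolling(window=10).mean()  # 10-sample moving average.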
pandas supports categorical data types, which can be more efficient and expressive for certain types of data.
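For example, with a made-up column of grades:
df4 = pd.DataFrame({"grade": ["a", "b", "b", "a", "e"]})
df4["grade"] = df4["grade"].astype("category")  # Store as a categorical dtype.
df4["grade"].cat.categories                     # The distinct categories.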
pandas integrates with Matplotlib for easy data visualization. Here's a basic example:
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()
pandas can read and write to various file formats, including CSV, Excel, SQL, and more.
df = pd.DataFrame(np.random.randint(0, 5, (10, 5)))
df.to_csv("foo.csv")
pd.read_csv("foo.csv")
df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])
Quoting the official PyTorch tutorial introduction:
PyTorch is a Python-based scientific computing package serving two broad purposes: a replacement for NumPy that can use the power of GPUs, and an automatic differentiation library that is useful for implementing neural networks.
The package is imported in Python with import torch.
Tensors are the PyTorch equivalent of NumPy arrays, with the added support for GPU acceleration.
In multilinear algebra, a tensor is a generalization of the vector and matrix concepts.
A tensor of a given shape can be allocated with torch.Tensor:
x = torch.Tensor(2, 3, 4)
print(x)
# tensor([[[ 6.5254e+10, 3.0890e-41, 4.2039e-45, -6.3663e-15],
# [ 6.5246e+10, 3.0890e-41, 1.6367e-42, 4.5787e-41],
# [ 6.5254e+10, 3.0890e-41, 4.2039e-45, 4.5787e-41]],
# [[ 6.5255e+10, 3.0890e-41, 1.4013e-45, 0.0000e+00],
# [ 6.5255e+10, 3.0890e-41, 6.5246e+10, 3.0890e-41],
# [ 1.7404e-42, -5.0820e+26, 6.5255e+10, 3.0890e-41]]])
The function torch.Tensor allocates memory for the desired tensor, but reuses any values that are already in that memory. To directly assign values to the tensor during initialization, there are many alternatives, including:
torch.zeros: creates a tensor filled with zeros.
torch.ones: creates a tensor filled with ones.
torch.rand: creates a tensor with values sampled uniformly between 0 and 1.
torch.randn: creates a tensor with values sampled from a normal distribution with mean 0 and variance 1.
torch.arange: creates a tensor containing consecutive values.
torch.Tensor (input list): creates a tensor from the list elements you provide.
# Create a tensor from a (nested) list.
x = torch.Tensor([[1, 2], [3, 4]])
print(x)
# tensor([[1., 2.],
# [3., 4.]])
Tensors can be converted to NumPy arrays (.numpy()), and NumPy arrays back to tensors (torch.from_numpy()).
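For example:
import torch

t = torch.ones(3)
a = t.numpy()             # Tensor -> NumPy array (shares memory on the CPU).
t2 = torch.from_numpy(a)  # NumPy array -> tensor (also shares memory).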
Most operations that exist in NumPy also exist in PyTorch.
Another common operation changes the shape of a tensor. In PyTorch, this operation is called view:
x = torch.arange(6)
print("X", x)
#X tensor([0, 1, 2, 3, 4, 5])
x = x.view(2, 3)
print("X", x)
#X tensor([[0, 1, 2],
# [3, 4, 5]])
Other commonly used operations include matrix multiplications, which are essential for neural networks (see the sketch after this list):
torch.matmul: performs the matrix product of two tensors, where the specific behavior depends on the dimensions.
torch.mm: performs the matrix product of two 2D matrices, without broadcasting.
torch.bmm: performs the matrix product with an additional batch dimension.
torch.einsum: performs matrix multiplications and more (i.e., sums of products) using the Einstein summation convention.
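A short sketch of these operations with made-up tensors:
import torch

a = torch.arange(6, dtype=torch.float32).view(2, 3)
b = torch.arange(9, dtype=torch.float32).view(3, 3)
torch.matmul(a, b)               # Shape (2, 3): standard matrix product.
torch.mm(a, b)                   # Same result; strictly 2D inputs.
batch1 = torch.randn(4, 2, 3)
batch2 = torch.randn(4, 3, 5)
torch.bmm(batch1, batch2).shape  # torch.Size([4, 2, 5]).
torch.einsum('ij,jk->ik', a, b)  # matmul expressed in Einstein notation.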
One of the main reasons for using PyTorch in Deep Learning projects is that we can automatically get gradients/derivatives of functions that we define.
We will mainly use PyTorch for implementing neural networks, which are, in the end, just fancy functions. The weight matrices used in a function that we want to learn are called the parameters or simply the weights.
Given an input x
, we define our function by manipulating that input, usually by matrix-multiplications with weight matrices and additions with so-called bias vectors. As we manipulate our input, we are automatically creating a computational graph.
This graph shows how to arrive at our output from our input. PyTorch is a define-by-run framework; this means that we can just do our manipulations, and PyTorch will keep track of that graph for us. Thus, we create a dynamic computation graph along the way.
Example: let's compute the computational graph of the function y = (1/N) Σ_i [(x_i + 2)² + 3]:
x = torch.arange(3, dtype=torch.float32, requires_grad=True)  # Only float tensors can have gradients.
print("X", x)
# X tensor([0., 1., 2.], requires_grad=True)
a = x + 2
b = a ** 2
c = b + 3
y = c.mean()
print("Y", y)
# Y tensor(12.6667, grad_fn=<MeanBackward0>)
We can perform backpropagation on the computation graph by calling the function backward() on the last output, which effectively calculates the gradients for each tensor that has the property requires_grad=True.
After computing y.backward(), x.grad will contain the gradient of y with respect to x:
y.backward()
print(x.grad)
# tensor([1.3333, 2.0000, 2.6667])
PyTorch calculates the gradients using the chain rule: ∂y/∂x_i = (∂y/∂c_i)(∂c_i/∂b_i)(∂b_i/∂a_i)(∂a_i/∂x_i) = (1/3) · 1 · 2a_i · 1 = 2(x_i + 2)/3, which for x = [0, 1, 2] gives [4/3, 2, 8/3], matching the output above. Exercise: verify the gradient by hand!
A crucial feature of PyTorch is its support for GPUs (Graphics Processing Units). A GPU can perform many thousands of small operations in parallel, making it very well suited for the large matrix operations in neural networks.
To check if we have a GPU available:
gpu_avail = torch.cuda.is_available()
print(f"Is the GPU available? {gpu_avail}")
# Is the GPU available? True
You can write your code with respect to this device object, which allows you to run the same code on a CPU-only system and on one with a GPU. We can specify the device as follows:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Device", device)
# Device cuda
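For example, tensors are moved to the selected device with .to():
x = torch.zeros(2, 3)
x = x.to(device)              # Move the tensor to the selected device.
print("X device:", x.device)  # cuda:0 if a GPU is available, otherwise cpu.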
The torch.nn module makes building neural networks more convenient:
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # Some initialization for the module, e.g. creating layers.

    def forward(self, x):
        # Function performing the calculation of the module.
        pass
The DataLoader class
The class torch.utils.data.DataLoader represents a Python iterable over a dataset, with support for automatic batching, multi-process data loading, and many more features. The data loader communicates with the dataset using the function __getitem__, and stacks its outputs as tensors over the first dimension to form a batch. In contrast to the dataset class, we usually don't have to define our own data loader class; we can simply create an object of it with the dataset as input. Additionally, we can configure the data loader with the following input arguments (only a selection; see the PyTorch documentation for the full list):
batch_size: number of samples to stack per batch.
shuffle: if True, the data is returned in a random order. This is important during training for introducing stochasticity.
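A minimal usage sketch, assuming a made-up TensorDataset of random tensors:
import torch
from torch.utils.data import DataLoader, TensorDataset

# A made-up dataset: 100 examples with 2 features and a scalar label each.
dataset = TensorDataset(torch.randn(100, 2), torch.randn(100, 1))
data_loader = DataLoader(dataset, batch_size=8, shuffle=True)

for inputs, labels in data_loader:
    print(inputs.shape, labels.shape)  # torch.Size([8, 2]) torch.Size([8, 1])
    break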
After defining the model and the dataset, it is time to prepare the optimization of the model. During training, we will repeat the following steps: get a batch from the data loader, run the model on the batch to obtain predictions, calculate the loss, perform backpropagation, and update the parameters.
We can calculate the loss for a batch by simply performing a few tensor operations, as those are automatically added to the computation graph. For instance, for regression we can use the Mean Squared Error (MSE), which is defined as MSE = (1/N) Σ_i (ŷ_i − y_i)², where ŷ_i is the model prediction for the i-th example, y_i is the corresponding label, and N is the batch size.
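In PyTorch, this loss is available as nn.MSELoss; a sketch with made-up prediction and label tensors:
import torch
import torch.nn as nn

loss_module = nn.MSELoss()
preds = torch.randn(8, 1)          # Hypothetical model outputs.
labels = torch.randn(8, 1)         # Hypothetical ground-truth values.
loss = loss_module(preds, labels)  # Mean of the squared differences.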
For updating the parameters, PyTorch provides the package torch.optim.
The simplest optimizer is Stochastic Gradient Descent (SGD): torch.optim.SGD. SGD updates each parameter by multiplying its gradient with a small constant, called the learning rate, and subtracting the result from the parameter (hence minimizing the loss).
# The input to the optimizer is the set of model parameters: model.parameters().
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
Now we are ready to train the model (see notebook).
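For reference, the inner training loop typically looks like this (a sketch, assuming model, loss_module, and data_loader as defined above):
for inputs, labels in data_loader:
    optimizer.zero_grad()              # Reset gradients from the previous step.
    preds = model(inputs)              # Forward pass.
    loss = loss_module(preds, labels)  # Compute the loss for this batch.
    loss.backward()                    # Backpropagate to compute gradients.
    optimizer.step()                   # Update the parameters.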
UvA DL Notebooks: very nice tutorials from the DL group in Amsterdam