The basic usage is as follows:
def f(x):
return x
x_lower = 0
x_upper = 1
val, abserr = quad(f, x_lower, x_upper)
For simple functions, we can use a lambda function:
val, abserr = quad(lambda x: exp(-x ** 2), -Inf, Inf)
Higher-dimensional integration works in the same way:
def integrand(x, y):
return exp(-x**2 - y**2)
x_lower = 0
x_upper = 10
y_lower = lambda x: x
y_upper = lambda x: x + 1
val, abserr = dblquad(integrand, x_lower, x_upper, y_lower, y_upper)
A system of ODEs is usually formulated in standard form before it is attacked numerically. The standard form is
where
To solve an ODE, we need to know the function
Note that higher-order ODEs can always be written in this form by introducing new variables for the intermediate derivatives.
SciPy provides two different ways to solve ODEs: an API based on the function odeint
and an object-oriented API based on the class ode
. Usually, odeint
is easier to get started with, but the ode
class offers some finer level of control. Here we will use the odeint
functions. For more information about the class ode
, try help(ode)
.
To use odeint
, first import it from the scipy.integrate
module
from scipy.integrate import odeint, ode
Once we have defined the Python function f
and array y_0
, we can use odeint
as:
y_t = odeint(f, y_0, t)
where t
is an array with time-coordinates for which to solve the ODE problem. y_t
is an array with one row for each point in time in t
, where each column corresponds to a solution
SciPy's fft
module enables easy computation of Fourier transforms, a critical tool in computational physics, signal processing and data analysis.
from scipy.fft import fft, ifft, fftfreq
import numpy as np
N = 600 # Number of sample points.
T = 1.0 / 800.0 # Sample spacing.
x = np.linspace(0.0, N*T, N, endpoint=False)
y = np.sin(50.0 * 2.0*np.pi*x) + 0.5*np.sin(80.0 * 2.0*np.pi*x)
yf = fft(y)
xf = fftfreq(N, T)[:N//2]
The linear algebra module contains various matrix-related functions, including linear equation solving, eigenvalue solvers, matrix functions (e.g., matrix exponentiation), decompositions (SVD, LU, Cholesky), etc.
from scipy.linalg import *
A = array([[1,2,3], [4,5,6], [7,8,9]])
b = array([1,2,3])
x = solve(A, b)
The eigenvalue problem for a matrix
To calculate eigenvalues of a matrix, use the eigvals
, and for calculating both eigenvalues and eigenvectors, use the function eig
:
lam = eigvals(A)
lam, v = eig(A)
The eigenvectors corresponding to the lam[n]
) is the v
, i.e., v[:,n]
.
There are also more specialized eigensolvers, like the eigh
for Hermitian matrices.
inv(A) # Matrix inverse.
det(A) # Matrix determinant.
# Matrix norms of various orders.
norm(A, ord=1)
norm(A, ord=2)
norm(A, ord=Inf)
Sparse matrices are often useful in numerical simulations dealing with large systems, where the problem can be described in matrix form, and matrices or vectors mostly contain zeros. SciPy has good support for sparse matrices, with basic linear algebra operations (e.g., equation solving, eigenvalue calculations, etc.).
There are many possible strategies for storing sparse matrices efficiently, such as coordinate form (COO), list of lists (LIL) form, and compressed-sparse column CSC (and row, CSR). Each format has some advantages and disadvantages. Most computational algorithms (equation solving, matrix-matrix multiplication, etc.) can be efficiently implemented using CSR or CSC formats, but they are not so intuitive and not so easy to initialize. So often, a sparse matrix is initially created in COO or LIL format (where we can efficiently add elements to the sparse matrix data), and then converted to CSC or CSR before used in real calculations.
When we create a sparse matrix, we have to choose which format it should be stored in. For example,
from scipy.sparse import *
# Dense matrix.
M = array([[1,0,0,0], [0,3,0,0], [0,1,1,0], [1,0,0,1]])
# Convert from dense to sparse.
A = csr_matrix(M)
# Convert from sparse to dense.
A.todense()
More efficient way to create sparse matrices: create an empty matrix and populate it using matrix indexing (avoids creating a potentially large dense matrix).
A = lil_matrix((4,4)) # Empty 4x4 sparse matrix.
A[0,0] = 1
A[1,1] = 3
A[2,2] = A[2,1] = 1
A[3,3] = A[3,0] = 1
# Sparse matrix - dense array multiplication.
A * v
LIL stands for LIsts of Lists.
Optimization (finding minima or maxima of a function) is a large field in mathematics, and optimization of complicated functions or in many variables can be rather involved.
We can use several algorithms to find the minima of a function:
from scipy import optimize
def f(x):
return x**4 + 4*x**3 + (x-2)**2
x_min1 = optimize.fmin_bfgs(f, -2)
x_min2 = optimize.brent(f)
x_min3 = optimize.fminbound(f, -4, 2)
To find the root for a function of the form fsolve
function. It requires an initial guess:
from scipy import optimize
def f(omega):
return tan(2 * np.pi * omega) - 3.0 / omega
optimize.fsolve(f, 0.1)
The interp1d
function, when given arrays describing X and Y data, returns an object that behaves like a function that can be called for an arbitrary value of x (in the range covered by X), and it returns the corresponding interpolated y value:
from scipy.interpolate import *
def f(x):
return sin(x)
n = arange(0, 10)
x_meas = linspace(0, 9, 100)
y_meas = f(n) + 0.1 * randn(len(n))
linear_interpolation = interp1d(x_meas, y_meas)
y_interp1 = linear_interpolation(x)
cubic_interpolation = interp1d(x_meas, y_meas, kind='cubic')
y_interp2 = cubic_interpolation(x)
The scipy.stats
module contains a large number of statistical distributions, statistical functions, and tests.
from scipy import stats
X = stats.poisson(3.5)
Y = stats.norm()
X.mean(), X.std(), X.var()
Y.mean(), Y.std(), Y.var()
SciPy includes functions for conducting statistical tests, like t-tests and ANOVA, aiding in hypothesis testing and data analysis.
Calculate the T-test for the means of two independent samples:
t_statistic, p_value = stats.ttest_ind(X.rvs(size=1000), X.rvs(size=1000))
Test if the mean of a single sample of data is 0.1:
stats.ttest_1samp(Y.rvs(size=1000), 0.1)
A low p-value means that we can reject the hypothesis.
Data visualization plays a vital role in data analysis, enabling the effective communication of complex information. Matplotlib and Seaborn are two widely-used Python libraries for data visualization. Matplotlib offers a wide range of plotting tools, while Seaborn provides a high-level interface for drawing attractive statistical graphics.
Matplotlib is a widely-used 2D plotting library in Python. It provides a high-level interface for drawing attractive and informative statistical graphics. Let's start with a simple example to create a basic line plot:
import matplotlib.pyplot as plt
import numpy as np
# Generate data.
x = np.linspace(0, 10, 100)
y = np.sin(x)
# Create a simple line plot.
plt.plot(x, y, label='sin(x)')
plt.title('Simple line plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.legend()
plt.show()
In this example, we use NumPy to generate data points for the x-axis and calculate corresponding y-values. The plot
function is then used to create a line plot. Finally, title
, xlabel
, ylabel
, and legend
functions are used to add a title, axis labels, and a legend to the plot.
Matplotlib supports various types of plots, including scatter plots, bar plots, histograms, and more. Explore the documentation for more plot types and customization options.
Let's create a 2D contour plot using Matplotlib.
import matplotlib.pyplot as plt
import numpy as np
# Generate data.
x = np.linspace(-5, 5, 100)
y = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x, y)
Z = np.sin(np.sqrt(X**2 + Y**2))
# Create a 2D contour plot.
plt.contourf(X, Y, Z, cmap='viridis')
plt.colorbar(label='sin(sqrt(x^2 + y^2))')
plt.title('Contour plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()
Seaborn, built on Matplotlib, provides a more intuitive interface for creating statistical plots. It integrates well with pandas data structures and offers built-in themes for enhanced visual appeal.
import seaborn as sns
import numpy as np
# Generate data.
x = np.random.randn(100)
y = 2 * x + np.random.randn(100)
# Create a scatter plot using Seaborn.
sns.scatterplot(x=x, y=y, color='blue')
plt.title('Scatter Plot with Seaborn')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()
Let's create a customized histogram with Seaborn, including specific bin edges, colors, and additional statistical annotations.
import seaborn as sns
import matplotlib.pyplot as plt
# Load a dataset.
tips = sns.load_dataset('tips')
# Create a histogram with customizations.
sns.histplot(tips['total_bill'], bins=[10, 20, 30, 40, 50], color='salmon')
plt.title('Customized histogram')
plt.xlabel('Total bill ($)')
plt.ylabel('Frequency')
# Annotate with mean and median.
plt.axvline(tips['total_bill'].mean(), color='blue', linestyle='dashed', linewidth=2, label='Mean')
plt.axvline(tips['total_bill'].median(), color='green', linestyle='dashed', linewidth=2, label='Median')
plt.legend()
plt.show()
Create a scatter plot with Seaborn that includes a regression line, different colors based on a categorical variable, and markers with varied sizes.
import seaborn as sns
import matplotlib.pyplot as plt
# Load a dataset.
tips = sns.load_dataset('tips')
# Create a scatter plot.
sns.scatterplot(x='total_bill', y='tip', hue='day', size='size', sizes=(20, 200),
data=tips, palette='Set2', alpha=0.8)
plt.title('Advanced Scatter Plot')
plt.xlabel('Total bill ($)')
plt.ylabel('Tip ($)')
plt.legend(title='Day')
plt.show()
One of the strengths of Seaborn is its ability to work seamlessly with Matplotlib. You can use Matplotlib functions alongside Seaborn to customize your plots further. Here's an example combining Matplotlib and Seaborn to create a histogram with a kernel density estimate:
import seaborn as sns
import matplotlib.pyplot as plt
# Load a dataset.
tips = sns.load_dataset('tips')
# Create a histogram with a Kernel Density Estimate using Seaborn.
sns.histplot(tips['total_bill'], kde=True, color='skyblue')
# Customize Matplotlib features.
plt.title('Histogram with KDE')
plt.xlabel('Total bill ($)')
plt.ylabel('Frequency')
plt.show()
Matplotlib supports advanced plots like contour plots, 3D plots, and subplots:
Seaborn excels in creating complex statistical plots:
Matplotlib and Seaborn can directly plot from pandas DataFrame, simplifying the workflow in data analysis tasks.
pandas is a powerful Python library for data manipulation and analysis. It provides fast, flexible data structures like Series and DataFrame, designed to work with structured data intuitively and efficiently.
In the pandas library, the standard import convention involves using the aliases np
for NumPy and pd
for pandas:
import numpy as np
import pandas as pd
You can create a Series by providing a list of values. pandas will generate a default RangeIndex:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
Creating a DataFrame involves passing a NumPy array with a datetime index and labeled columns:
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
Alternatively, a DataFrame can be formed from a dictionary of objects:
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
The resulting DataFrame has diverse data types.
To examine the top and bottom rows of a DataFrame, use head()
and tail()
:
df.head()
df.tail(3)
Retrieve the DataFrame's index or column labels:
df.index
df.columns
df.to_numpy() # Convert the DataFrame to a NumPy array.
Note: NumPy arrays have one dtype for the entire array while pandas DataFrames have one dtype per column. When you call DataFrame.to_numpy()
, pandas will find the NumPy dtype that can hold all of the dtypes in the DataFrame. If the common data type is object
, DataFrame.to_numpy
will require copying data.
pandas offers various methods for data selection. We'll explore both label-based and position-based approaches.
For a DataFrame, passing a single label selects a columns and yields a Series equivalent to df.A
:
df["A"]
Passing a slice :
selects matching rows:
df[0:3]
df["20130102":"20130104"]
Use loc
and at
for label-based indexing:
df.loc[dates[0]] # Selecting a row by label.
df.loc[:, ["A", "B"]] # Selecting all rows for specific columns.
df.loc["20130102":"20130104", ["A", "B"]] # Both endpoints are included.
df.at[dates[0], "A"] # Fast scalar access.
For position-based indexing, employ iloc
and iat
:
df.iloc[3] # Selecting via position.
df.iloc[3:5, 0:2] # Slicing rows and columns.
df.iat[1, 1] # Fast scalar access.
Selecting values from a DataFrame where a boolean condition is met:
df[df > 0]
Select rows based on a condition:
df[df["A"] > 0]
Use Series.isin
method for filtering:
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2[df2["E"].isin(["two", "four"])]
Generate quick statistics using describe()
:
df.describe()
Transpose the DataFrame with .T
:
df.T
Sort the DataFrame by index or values:
df.sort_index(axis=1, ascending=False) # Sort by index.
df.sort_values(by="B") # Sort by values.
Compute the mean for each column or row:
df.mean()
df.mean(axis=1)
Perform operations with another Series or DataFrame:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
df.sub(s, axis="index")
Apply user-defined functions using agg
and transform
:
# Calculate the mean of each column and then multiply it by 5.6.
# The result will be a pandas Series with the aggregated value for each column in df.
df.agg(lambda x: np.mean(x) * 5.6)
# Apply a function to each element of the DataFrame.
df.transform(lambda x: x * 101.2)
Compute value counts for a Series:
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts()
pandas simplifies handling missing data using methods like dropna()
, fillna()
, and interpolate()
.
Transform data using operations like pivoting, melting, and applying custom functions.
pandas excels in time series data analysis, offering functionalities for resampling, shifting, and window operations.
pandas supports categorical data types, which can be more efficient and expressive for certain types of data.
pandas integrates with Matplotlib for easy data visualization. Here's a basic example:
import matplotlib.pyplot as plt
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot()
plt.show()
pandas can read and write to various file formats, including CSV, Excel, SQL, and more.
df = pd.DataFrame(np.random.randint(0, 5, (10, 5)))
df.to_csv("foo.csv")
pd.read_csv("foo.csv")
df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])