Introduction
In this blog post, we will explore key mathematical concepts that are fundamental to data science and quantitative finance, including:
- Vector Spaces
- Gram-Schmidt Process
- Fundamental Subspaces of a Matrix
- Applications of the Null Space
- Orthogonal Projections
- The Least-Squares Problem
By deepening your understanding of these concepts, you can build better intuition for the capabilities and limitations of various tools and techniques, troubleshoot and optimize models more effectively, communicate your work more clearly, and stay competitive in the job market. A strong mathematical foundation will also empower you to read and understand research papers, keeping you up to date with the latest advancements in your field.
I created this blog post while studying the MITx MicroMasters Program in Finance. I return to this blog post from time to time to keep these topics fresh in my mind. This post is not meant to be a complete "course" on the topic. It is more in the form of my notes that I (and you) can review to stay sharp on these topics, or as a guide for further study.
If you are interested in taking a course on these topics, I recommend:
- Professor Jeff Chasnov (HKUST) Matrix Algebra Course.
- Professor Gilbert Strang (MIT) Linear Algebra Course.
Vector Spaces
Vector spaces are mathematical structures that provide a framework for analyzing objects such as vectors and functions in a systematic way. They consist of a set of vectors along with two operations: vector addition and scalar multiplication. The set of vectors and the operations must satisfy certain axioms, which allow us to manipulate and study them more easily.
Zero Vector
The zero vector is a unique element in a vector space, denoted by \(0\) or \(\mathbf{0}_V\), that possesses the property of not changing any vector when added to it. Formally, for any vector v in the vector space:
\[v + 0 = v \] \[0 + v = v \]Moreover, the zero vector is the only vector that yields itself when multiplied by every scalar.
Linear Independence/Dependence
A set of vectors is said to be linearly independent if none of the vectors can be written as a linear combination of the others. Conversely, if at least one vector in the set can be expressed as a linear combination of the others, the set is linearly dependent. Linear independence is an important concept in vector spaces, as it ensures that each vector contributes unique information.
Span, Basis and Dimension
The span of a set of vectors is the collection of all possible linear combinations of those vectors. A basis of a vector space is a set of linearly independent vectors that span the entire vector space. Every vector in the space can be uniquely represented as a linear combination of the basis vectors. The dimension of a vector space is the number of vectors in any of its bases. It is a fundamental property that gives us insight into the structure of the space.
Orthonormal Basis
An orthonormal basis is a special type of basis where each vector has unit length (norm equal to 1) and the inner product of any two distinct vectors is zero, meaning they are orthogonal. Orthonormal bases are particularly useful in various applications, such as simplifying calculations and providing a more intuitive understanding of the geometry of the vector space.
Gram-Schmidt Process
Overview
The Gram-Schmidt Process is a powerful and widely used algorithm for converting a set of linearly independent vectors into an orthonormal basis for the same vector space. The process involves two main steps:
- Find an orthogonal basis by iteratively subtracting from each vector its projections onto the previously computed orthogonal vectors.
- Normalize each orthogonal vector to obtain a set of orthonormal vectors.
For a set of linearly independent vectors v1, v2, ..., vn, the algorithm can be represented as:
Step 1: Find an orthogonal basis
\[ \begin{aligned} \mathbf{u_1} &= \mathbf{v_1} \\ \mathbf{u_2} &= \mathbf{v_2} - \frac{\mathbf{v_2} \cdot \mathbf{u_1}}{\|\mathbf{u_1}\|^2} \mathbf{u_1} \\ &\;\;\vdots \\ \mathbf{u_n} &= \mathbf{v_n} - \sum_{i=1}^{n-1} \frac{\mathbf{v_n} \cdot \mathbf{u_i}}{\|\mathbf{u_i}\|^2} \mathbf{u_i} \end{aligned} \]
Step 2: Normalize
\[ \mathbf{e_i} = \frac{\mathbf{u_i}}{\|\mathbf{u_i}\|}, \quad i = 1, 2, \dots, n \]
Example: Gram-Schmidt Process
Let's say we have a set of three linearly independent 3x1 vectors:
\[ \mathbf{v_1} = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix},\, \mathbf{v_2} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix},\, \mathbf{v_3} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} \]Now, we'll apply the Gram-Schmidt Process to obtain an orthonormal basis:
Step 1: Find an orthogonal basis
\[ \begin{aligned} \mathbf{u_1} &= \mathbf{v_1} = \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \\ \mathbf{u_2} &= \mathbf{v_2} - \frac{\mathbf{v_2} \cdot \mathbf{u_1}}{\|\mathbf{u_1}\|^2} \mathbf{u_1} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} - \frac{2}{2} \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\ \mathbf{u_3} &= \mathbf{v_3} - \frac{\mathbf{v_3} \cdot \mathbf{u_1}}{\|\mathbf{u_1}\|^2} \mathbf{u_1} - \frac{\mathbf{v_3} \cdot \mathbf{u_2}}{\|\mathbf{u_2}\|^2} \mathbf{u_2} = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} - \frac{1}{1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} -\frac{1}{2} \\ 0 \\ \frac{1}{2} \end{bmatrix} \end{aligned} \]
Step 2: Normalize to obtain the orthonormal basis vectors \(e_1\), \(e_2\), and \(e_3\) for the given vector space.
\[ \begin{aligned} \mathbf{e_1} &= \frac{\mathbf{u_1}}{\|\mathbf{u_1}\|} = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} \\ \mathbf{e_2} &= \frac{\mathbf{u_2}}{\|\mathbf{u_2}\|} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\ \mathbf{e_3} &= \frac{\mathbf{u_3}}{\|\mathbf{u_3}\|} = \frac{1}{\sqrt{2}} \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} \end{aligned} \]
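A minimal NumPy sketch of the same two-step procedure (the function name `gram_schmidt` and the structure of the loop are my own choices, not from the original derivation):

```python
import numpy as np

def gram_schmidt(vectors):
    """Return an orthonormal basis for the span of linearly independent vectors."""
    orthogonal = []
    for v in vectors:
        u = v.astype(float)
        # Step 1: subtract the projections onto the previously found orthogonal vectors.
        for q in orthogonal:
            u = u - (np.dot(u, q) / np.dot(q, q)) * q
        orthogonal.append(u)
    # Step 2: normalize each orthogonal vector.
    return [u / np.linalg.norm(u) for u in orthogonal]

v1, v2, v3 = np.array([1, 0, 1]), np.array([1, 1, 1]), np.array([0, 1, 1])
e1, e2, e3 = gram_schmidt([v1, v2, v3])
print(e1, e2, e3)                     # ~[0.707, 0, 0.707], [0, 1, 0], [-0.707, 0, 0.707]
print(np.isclose(np.dot(e1, e2), 0))  # True: distinct basis vectors are orthogonal
```

Fundamental Subspaces of a Matrix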
Overview
There are four fundamental subspaces associated with a given matrix A:
- Column space (range)
- Row space
- Null space (kernel)
- Left null space
Each of these subspaces provides valuable information about the properties and structure of the matrix.
Column Space
The column space of a matrix A is the subspace of ℝm (where A is an m×n matrix) that is formed by the linear combinations of the columns of A. It is also called the range of A. The column space represents the set of all possible output vectors that can be produced by applying the linear transformation represented by the matrix A to input vectors.
To find the column space, we identify the pivot columns in the Reduced Row Echelon Form (RREF) of the matrix and then take the corresponding columns of the original matrix. These columns of A (not of the RREF) form a basis for the column space of A.
The dimension of the column space is called the rank of the matrix. The rank tells us the number of linearly independent columns of the matrix, which is equal to the number of linearly independent rows as well.
As a related concept, let's represent a matrix product as a linear combination of the columns of the matrix:
\[ \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} a x_1 + b x_2 \\ c x_1 + d x_2 \end{bmatrix} = x_1 \begin{bmatrix} a \\ c \end{bmatrix} + x_2 \begin{bmatrix} b \\ d \end{bmatrix} \]As we can see, the result of the matrix multiplication is indeed a linear combination of the columns of the original matrix, with the unknowns \(x_1\) and \(x_2\) acting as the weights for each column vector. This interpretation of matrix multiplication as a linear combination of columns is a powerful concept in linear algebra, as it provides a geometric interpretation of the operations and offers insights into various applications, such as systems of linear equations, transformations, and projections.
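As a quick numerical check of this column-combination view (the matrix and weights below are arbitrary values chosen for illustration):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([5.0, 6.0])

# Matrix-vector product ...
lhs = A @ x
# ... equals the weighted sum of A's columns, with the entries of x as weights.
rhs = x[0] * A[:, 0] + x[1] * A[:, 1]
print(lhs, rhs, np.allclose(lhs, rhs))  # [17. 39.] [17. 39.] True
```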
Row Space
The row space of a matrix A is the subspace of ℝn (where A is an m×n matrix) that is formed by the linear combinations of the rows of A. The row space contains all possible linear combinations of the rows and is useful for understanding the properties and structure of the matrix.
To find the row space, we can use the non-zero rows of the Reduced Row Echelon Form (RREF) of the matrix. These rows form a basis for the row space of A.
The dimension of the row space is also equal to the rank of the matrix. This means that the row space and the column space have the same dimension, even though they may be subspaces of different vector spaces (ℝm and ℝn).
Null Space
The null space of a matrix A is the set of all vectors x that satisfy the equation Ax = 0. In other words, it is the set of vectors that, when multiplied by the matrix, results in the zero vector. The null space can be used to analyze the linear dependence of the columns of the matrix, as well as to determine the existence and uniqueness of solutions to a system of linear equations.
Left-Null Space
The left-null space, also known as the cokernel or the left kernel, consists of all vectors that, when multiplied with the matrix on the left, result in the zero vector; for an \(m \times n\) matrix it is a subspace of \(R^m\). Mathematically, for a given matrix \(A\), the left-null space is defined as \(N(A^T)\), where \(A^T\) denotes the transpose of matrix \(A\). The left-null space provides valuable information about the properties of a linear system, such as its consistency and the dependency of its rows: a system \(Ax = b\) is consistent exactly when \(b\) is orthogonal to every vector in the left-null space. Furthermore, the left-null space is orthogonal to the column space of the original matrix \(A\), and it plays a crucial role in the study of the four fundamental subspaces in linear algebra.
Dimensions and Relationships between Col(A), Row(A), Null(A) and Leftnull(A)
In this section, we will explore the dimensions and relationships between the four fundamental subspaces in linear algebra: Null(A), Col(A), Row(A), and Leftnull(A). We will also discuss the orthogonality between these spaces, the connection to pivot columns in the Reduced Row Echelon Form (RREF), and the concept of matrix rank.
Definitions and Dimensions
Consider a matrix \(A\) of dimensions \(m \times n\) and rank r. We define the following subspaces:
- Col(A): The column space of matrix \(A\) is the span of its columns. It is a subspace of \(R^m\).
- Row(A): The row space of matrix \(A\) is the span of its rows. It is a subspace of \(R^n\).
- Null(A): The null space of matrix \(A\) contains all vectors \(x\) such that \(Ax = 0\). It is a subspace of \(R^n\).
- Leftnull(A): The left-null space of matrix \(A\) consists of all vectors \(y\) such that \(y^TA = 0\). It is a subspace of \(R^m\).
The following diagram depicts the four fundamental spaces in matrix algebra:
\[ \begin{matrix} & \boldsymbol{Row(A)} & \\ \text{dimension:} & r & \\ \text{basis:} & \{ \boldsymbol{r}_1, \boldsymbol{r}_2, \ldots, \boldsymbol{r}_r \} \\ \end{matrix} \quad \begin{matrix} & \boldsymbol{Col(A)} & \\ \text{dimension:} & r & \\ \text{basis:} & \{ \boldsymbol{c}_1, \boldsymbol{c}_2, \ldots, \boldsymbol{c}_r \} \\ \end{matrix} \] \[ \begin{matrix} & \boldsymbol{Null(A)} & \\ \text{dimension:} & n-r & \\ \text{basis:} & \{ \boldsymbol{v}_1, \boldsymbol{v}_2, \ldots, \boldsymbol{v}_{n-r} \} \\ \end{matrix} \quad \begin{matrix} & \boldsymbol{Leftnull(A)} & \\ \text{dimension:} & m-r & \\ \text{basis:} & \{ \boldsymbol{w}_1, \boldsymbol{w}_2, \ldots, \boldsymbol{w}_{m-r} \} \\ \end{matrix} \]In this diagram, we have four spaces: row space, column space, null space, and left-null space. The row space is the span of the rows of a matrix, while the column space is the span of the columns of a matrix. The row space and the column space have the same dimension, which is the rank of the matrix. The null space is the set of all vectors that are mapped to the zero vector by the matrix and has dimension \(n-r\), while the left-null space is the set of all vectors that multiply the matrix from the left to produce the zero vector; it has dimension \(m-r\).
Matrix Rank
The rank of a matrix is the number of linearly independent rows (equivalently, the number of linearly independent columns), and it equals the dimension of both the row space and the column space. The dimension of the null space equals the number of linearly independent solutions to the homogeneous system \(Ax = 0\), and the dimension of the left-null space equals the number of linearly independent solutions to \(y^T A = 0\).
Orthogonal Complements
The row space and the null space are orthogonal complements of each other in \(R^n\), meaning that every vector in the row space is orthogonal to every vector in the null space. Similarly, the column space and the left-null space are orthogonal complements of each other in \(R^m\): every vector in the column space is orthogonal to every vector in the left-null space.
Overall, this diagram helps to illustrate how the dimensions of the four fundamental spaces in matrix algebra are related to each other and how they relate to the rank of the matrix.
Orthogonality
Vectors in the row space of \(A\) are orthogonal to the vectors in the null space of \(A\). Mathematically, if \(v \in Row(A)\) and \(w \in Null(A)\), then their dot product is zero:
\[ v \cdot w = 0 \]Similarly, vectors in the column space of \(A\) are orthogonal to the vectors in the left-null space of \(A\). If \(x \in Col(A)\) and \(y \in Leftnull(A)\), then:
\[ x \cdot y = 0 \]
Relationship to Pivot Columns and Matrix Rank
The dimensions of these subspaces are closely related to the number of pivot columns in the Reduced Row Echelon Form (RREF) of the matrix \(A\). The number of pivot columns is also known as the rank of the matrix (\(rank(A)\)).
The rank of the matrix determines the dimensions of the column space and the row space:
\[ dim(Col(A)) = dim(Row(A)) = rank(A) \]As a consequence of the Rank-Nullity Theorem, the dimensions of the null space and the left-null space are given by:
\[ dim(Null(A)) = n - rank(A) \] \[ dim(Leftnull(A)) = m - rank(A) \]By understanding these relationships between the dimensions of the subspaces, the rank of a matrix, and the orthogonality between the subspaces, we gain a deeper insight into the structure of a linear system and its solutions. These concepts form the foundation for many applications in linear algebra, such as solving systems of linear equations, linear transformations, and least squares problems.
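These dimension relationships are easy to check numerically. The sketch below uses the 3x5 matrix from the example that follows and NumPy's `matrix_rank`:

```python
import numpy as np

A = np.array([[1, 20, 0, 41, 0],
              [0,  2, 0,  4, 0],
              [2,  4, 0,  9, 1]])
m, n = A.shape
r = np.linalg.matrix_rank(A)

print("dim Col(A)  = dim Row(A)  =", r)       # 3
print("dim Null(A) = n - rank    =", n - r)   # 2
print("dim Leftnull(A) = m - rank =", m - r)  # 0
```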
Example: Find Null Space of 3x5 Matrix
Let's consider a 3x5 matrix A: \[ A = \begin{bmatrix} 1 & 20 & 0 & 41 & 0 \\ 0 & 2 & 0 & 4 & 0 \\ 2 & 4 & 0 & 9 & 1 \end{bmatrix} \]
A then has the following Reduced Row Echelon Form (RREF):
\[ \textbf{RREF}(A) = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 2 \\ 0 & 0 & 0 & 1 & -1 \end{bmatrix} \]We can then set up the equations for \(x_1, x_2, x_3, x_4, x_5\) using the RREF.
The null space of the matrix is found by solving the following:
\[ \begin{bmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 2 \\ 0 & 0 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \]It can be seen that the algebraic form of the system of equations represented in the above matrix form are:
\[ \begin{aligned} 1x_1 + 0x_2 + 0x_3 + 0x_4 + 1x_5 &= 0\\ 0x_1 + 1x_2 + 0x_3 + 0x_4 + 2x_5 &= 0\\ 0x_1 + 0x_2 + 0x_3 + 1x_4 - 1x_5 &= 0\\ \end{aligned} \]These equations can be simplified to:
\[ \begin{aligned} x_1 &= -x_5 \\ x_2 &= -2x_5 \\ x_4 &= x_5 \\ \end{aligned} \]It can then be seen that \(x_3, x_5\) are free variables that can take on any value and still satisfy the above system of equations.
It may be helpful to then write the system of equations in this form:
\[ \begin{aligned} x_1 &= -x_5 \\ x_2 &= -2x_5 \\ x_3 &= x_3 \\ x_4 &= x_5 \\ x_5 &= x_5 \\ \end{aligned} \] We can then collect, for each free variable, its coefficients on the right-hand side (rhs) into a column vector; these columns are the basis vectors of the null space. The coefficient of \(x_3\) on the rhs is 1 in the third equation and 0 in the first, second, fourth, and fifth equations. The rhs coefficients of \(x_5\) are -1 in the first equation, -2 in the second, 0 in the third, and 1 in the fourth and fifth: \[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = x_3 \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} + x_5 \begin{bmatrix} -1 \\ -2 \\ 0 \\ 1 \\ 1 \end{bmatrix} \]Finally, factor out the basis vectors that form the null space:
\[ \text{Null}(A) = \text{span} \left\{ \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ -2 \\ 0 \\ 1 \\ 1 \end{bmatrix} \right\} \]In this example, we have found the basis vectors for the null space of the given 3x5 matrix A. These basis vectors can be used to describe all vectors in the null space through linear combinations.
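The same basis can be obtained symbolically with SymPy; a short sketch (SymPy's `nullspace()` returns the free-variable basis, which matches the hand computation above):

```python
import sympy as sp

A = sp.Matrix([[1, 20, 0, 41, 0],
               [0,  2, 0,  4, 0],
               [2,  4, 0,  9, 1]])

rref, pivot_cols = A.rref()
print(rref)        # matches the RREF shown above
print(pivot_cols)  # (0, 1, 3): the pivot columns

for v in A.nullspace():
    print(v.T)     # [0, 0, 1, 0, 0] and [-1, -2, 0, 1, 1]
```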
Underdetermined System of Linear Equations
An underdetermined system of linear equations, like the above example, occurs when the number of equations is less than the number of unknowns (more columns than rows in the matrix). This typically means that there are infinitely many solutions or no solution at all. In such cases, it is often necessary to apply additional constraints or use optimization techniques to find a unique solution.
Consider a system of linear equations represented by the matrix equation \(Ax = b\), where \(A\) is an \(m \times n\) matrix, \(x\) is a column vector with \(n\) unknowns, and \(b\) is a column vector with \(m\) elements. An underdetermined system occurs when \(m < n\).
Portfolio management involves selecting a combination of financial instruments, such as stocks, bonds, and derivatives, to meet specific investment objectives while balancing risk and return. A portfolio is constructed by determining the optimal allocation of investments based on a set of constraints or conditions.
These constraints may include factors such as the investor's risk tolerance, liquidity requirements, and desired return on investment. In portfolio management, a system of linear equations is often used to model these constraints and find an optimal allocation of investments.
An underdetermined system of linear equations refers to a situation where there are more variables (financial instruments) than there are equations (constraints) to solve for. In other words, there are more investment options available than there are constraints to limit those options.
This can be interpreted to mean that there is flexibility in the portfolio construction process, and the investor has more freedom to choose from a wider range of investments. However, it also means that the optimal solution is not unique and there may be multiple combinations of investments that satisfy the constraints.
In such cases, the portfolio manager must carefully consider the various trade-offs between risk and return and make informed decisions based on their investment objectives, market conditions, and the characteristics of the available financial instruments.
Example: General Solution of 3x5 Matrix
Continuing the example above, suppose we want to solve the system of equations for a particular vector \(b\) in \(Ax=b\).
\[ \begin{bmatrix} 1 & 20 & 0 & 41 & 0 \\ 0 & 2 & 0 & 4 & 0 \\ 2 & 4 & 0 & 9 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 4 \end{bmatrix} \]where: \(b = \begin{bmatrix} 2 \\ 1 \\ 4 \end{bmatrix} \)
We can then form the augmented matrix:
\[ \begin{bmatrix} 1 & 20 & 0 & 41 & 0 & 2 \\ 0 & 2 & 0 & 4 & 0 & 1 \\ 2 & 4 & 0 & 9 & 1 & 4 \end{bmatrix} \]And then solve for the RREF of the augmented matrix:
\[ \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 10 \\ 0 & 1 & 0 & 0 & 2 & \frac{73}{2} \\ 0 & 0 & 0 & 1 & -1 & -18 \end{bmatrix} \]This reveals the following system of equations:
\[ \begin{aligned} 1x_1 + 0x_2 + 0x_3 + 0x_4 + 1x_5 &= 10 \\ 0x_1 + 1x_2 + 0x_3 + 0x_4 + 2x_5 &= \frac{73}{2} \\ 0x_1 + 0x_2 + 0x_3 + 1x_4 - 1x_5 &= -18 \\ \end{aligned} \]Which simplifies to:
\[ \begin{aligned} x_1 + x_5 &= 10 \\ x_2 + 2x_5 &= \frac{73}{2} \\ x_4 - x_5 &= -18 \\ \end{aligned} \]Recall that the free variables above were \(x_3\) and \(x_5\). This means we can set them to anything we want for this particular solution, depending on what we are trying to model with the system of equations. Therefore, we can choose to set \(x_3\) and \(x_5\) each equal to zero and simplify to:
\[ \begin{aligned} x_1 &= 10 \\ x_2 &= \frac{73}{2} \\ x_4 &= -18 \\ \end{aligned} \]Therefore, the general solution to this underdetermined system of linear equations is the sum of the null space basis vectors, each multiplied by an arbitrary constant, plus the particular solution:
\[ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = c_1 \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} + c_2 \begin{bmatrix} -1 \\ -2 \\ 0 \\ 1 \\ 1 \end{bmatrix} + \begin{bmatrix} 10 \\ \frac{73}{2} \\ 0 \\ -18 \\ 0 \end{bmatrix} \]
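This general solution can be checked with SymPy by row reducing the augmented matrix and combining the result with the null-space basis found earlier (a sketch):

```python
import sympy as sp

A = sp.Matrix([[1, 20, 0, 41, 0],
               [0,  2, 0,  4, 0],
               [2,  4, 0,  9, 1]])
b = sp.Matrix([2, 1, 4])

# RREF of the augmented matrix [A | b] gives the particular solution
# (with the free variables x3 and x5 set to zero) ...
aug_rref, _ = A.row_join(b).rref()
print(aug_rref)            # last column: 10, 73/2, -18

# ... and the null space of A supplies the homogeneous part of the solution.
for v in A.nullspace():
    print(v.T)             # [0, 0, 1, 0, 0] and [-1, -2, 0, 1, 1]
```

Orthogonal Projections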
Overview
In linear algebra, orthogonal projections play a crucial role in various applications, such as optimization, computer graphics, and machine learning. An orthogonal projection is the process of projecting a vector onto a subspace such that the difference between the original vector and the resulting projection vector is orthogonal to the subspace.
In other words, the difference between the original vector and the projection vector is orthogonal to every vector in the subspace. This means that the projection vector is as close as possible to the original vector while being entirely contained within the subspace.
The notion of orthogonality is important in linear algebra because orthogonal vectors have a number of desirable properties. For example, a set of nonzero orthogonal vectors is always linearly independent, so \(k\) orthogonal vectors span a \(k\)-dimensional subspace. Additionally, orthogonality simplifies many calculations, such as finding projections, decompositions, and solutions to linear systems.
In the context of optimization, the concept of orthogonal projections can be used to solve problems where the constraints can be represented as a subspace. By finding the orthogonal projection of a given vector onto the subspace, one can ensure that the resulting solution satisfies the constraints and is as close as possible to the original vector.
Example 2D Vector Projection
Given a vector \(v=\begin{bmatrix}1\\1\end{bmatrix}\) and a subspace \(W\) spanned by the vector \(w=\begin{bmatrix}2\\0\end{bmatrix}\), the orthogonal projection of \(v\) onto \(W\), denoted as \(Proj_W(v)\), is the vector in \(W\) that is closest to \(v\). It can be calculated using the following formula:
\[ Proj_W(\textbf{v}) = \frac{\textbf{v} \cdot \textbf{w}}{\textbf{w} \cdot \textbf{w}} \textbf{w} =\frac{1}{2}\begin{bmatrix}2\\0\end{bmatrix} = \begin{bmatrix}1\\0\end{bmatrix} = \textbf{p} \]Where \(\textbf{u} = \textbf{v} - \textbf{p}\) is perpendicular to \(\textbf{w}\).
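In code, this projection is a one-liner; a small NumPy sketch using the vectors from the example:

```python
import numpy as np

v = np.array([1.0, 1.0])
w = np.array([2.0, 0.0])

# Orthogonal projection of v onto the line spanned by w.
p = (np.dot(v, w) / np.dot(w, w)) * w
u = v - p                            # residual component

print(p)                             # [1. 0.]
print(np.isclose(np.dot(u, w), 0))   # True: the residual is orthogonal to w
```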
The Least-Squares Problem
The least squares problem is a common optimization technique in linear algebra, statistics, and machine learning that seeks to find the best-fitting curve or line to a set of data points by minimizing the sum of the squared residuals. Residuals are the differences between the actual data points and their corresponding estimates on the fitted curve or line. In the context of linear regression, the least squares problem aims to find the coefficients of a linear function that minimize the sum of the squared differences between the observed data points and the predicted values based on the linear function.
Orthogonal projection plays a crucial role in solving the least squares problem, as it helps find the point in the subspace spanned by the predictor variables that is closest to the dependent/response variable. The orthogonal projection of the dependent variable onto the subspace of independent/predictor variables results in a vector that minimizes the sum of squared residuals. This projected vector represents the optimal linear combination of the predictor variables, and its coefficients provide the best-fitting solution for the least squares problem.
Setting Up The Least-Squares Problem
Suppose we have n data points, denoted as \((x_i, y_i)\). We want to find the best-fitting linear function of the form \(y = \beta_1 x + \beta_0\), where \(\beta_1\) is the slope and \(\beta_0\) is the intercept. This can be represented as a system of linear equations:
\[ \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \]Let \(\boldsymbol{X}\) be the matrix of coefficients, \(\boldsymbol{\beta}\) be the vector of unknowns, and \(\textbf{y}\) be the vector of the dependent variable. Then the above system can be written as \(\boldsymbol{X}\boldsymbol{\beta} = \textbf{y}\).
Analytical Solution to the Least-Squares
Since the system of equations is typically overdetermined, i.e., there are more equations than unknowns and in general no exact solution, we want to find the best approximation by minimizing the sum of the squared differences between the actual data points and the predicted points on the function.
To find the least-squares solution, we project the vector \(\textbf{y}\) onto the column space of \(\boldsymbol{X}\), denoted as \(Col(\boldsymbol{X})\). The projection is denoted as \(Proj_{Col(X)}(\textbf{y})\). Then the least-squares solution, denoted as \(\boldsymbol{\beta}_{LS}\), can be found using the following formula:
\[ \boldsymbol{\beta}_{LS} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} \]
Step-by-step derivation
- Define the error vector
Let \(\boldsymbol{\epsilon} = \textbf{y} - \textbf{X}\boldsymbol{\beta}\), the difference between the actual values and the predicted values. The goal is to minimize the sum of squared errors \(E = \boldsymbol{\epsilon}^T \boldsymbol{\epsilon}\), which can be expanded as:
\[ E = (\textbf{y} - \textbf{X}\boldsymbol{\beta})^T (\textbf{y} - \textbf{X}\boldsymbol{\beta}) \]
- Compute the gradient
To minimize \(E\), we need to find the critical points, which are the points where the gradient of \(E\) with respect to \(\boldsymbol{\beta}\) is zero. So, we compute the gradient:
\[ \nabla_{\boldsymbol{\beta}} E = -2 \textbf{X}^T (\textbf{y} - \textbf{X}\boldsymbol{\beta}) \]Note: This is derived using the chain rule and the fact that \(\frac{\partial (\textbf{u}^T \textbf{v})}{\partial \textbf{u}} = \textbf{v}\) for any vectors \(\textbf{u}\) and \(\textbf{v}\).
- Set the gradient to zero and solve for \(\boldsymbol{\beta}\)
Now, set the gradient to zero and solve for \(\boldsymbol{\beta}\):
\[ 0 = -2 \textbf{X}^T (\textbf{y} - \textbf{X}\boldsymbol{\beta}) \]First, divide both sides by \(-2\):
\[ 0 = \textbf{X}^T (\textbf{y} - \textbf{X}\boldsymbol{\beta}) \]Then, expand the expression:
\[ 0 = \textbf{X}^T \textbf{y} - \textbf{X}^T \textbf{X} \boldsymbol{\beta} \]Now, isolate the term \(\textbf{X}^T \textbf{X} \boldsymbol{\beta}\) on one side of the equation:
\[ \textbf{X}^T \textbf{X} \boldsymbol{\beta} = \textbf{X}^T \textbf{y} \]Finally, multiply both sides by the inverse of \((\textbf{X}^T \textbf{X})\), assuming it exists:
\[ (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{X} \boldsymbol{\beta} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} \]As the inverse of a matrix times the matrix itself results in the identity matrix, we can simplify the equation:
\[ \textbf{I} \boldsymbol{\beta} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} \]
- Obtain the least-squares solution
Since the identity matrix times any matrix is just the matrix itself, we have:
\[ \boldsymbol{\beta}_{LS} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} \]
The analytical least squares solution, denoted as \(\boldsymbol{\beta}_{LS}\), is a mathematical method to find the optimal set of parameters for a linear model that minimizes the sum of the squared differences between the predicted values and the actual values in the dataset. In other words, it finds the best-fitting line or hyperplane that minimizes the squared residuals.
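As a sketch of how this formula is used in code (the synthetic data below is generated purely for illustration; in practice `np.linalg.lstsq` or a QR-based solver is preferred over forming the inverse of \(\textbf{X}^T \textbf{X}\) explicitly):

```python
import numpy as np

def least_squares_beta(X, y):
    """Solve the normal equations (X^T X) beta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data: y ~ 2 + 0.5 x + noise.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 + 0.5 * x + rng.normal(scale=0.1, size=50)

beta_normal = least_squares_beta(X, y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_normal)                           # close to [2.0, 0.5]
print(np.allclose(beta_normal, beta_lstsq))  # True: both solvers agree
```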
Alternative Derivation
- Decompose \(y\) into the sum \(y_{proj_{col(X)}} + (y - y_{proj_{col(X)}})\)
An alternative derivation starts from the observation that, since \(X\boldsymbol{\beta}=y\) is an overdetermined system of equations, we want to project \(y\) onto the column space of \(X\). We can then write:
\[y = y_{proj_{col(X)}} + (y - y_{proj_{col(X)}})\]
We should note here that the difference between \(y\) and its projection onto the column space of \(X\) is orthogonal to the column space of \(X\), and is therefore also orthogonal to the row space of \(X^T\), because the rows of \(X^T\) are the same as the columns of \(X\).
We also know that a vector that is orthogonal to the rows of a matrix is in the null space of that matrix.
Therefore we know that \(y - y_{proj_{col(X)}}\) is in the null space of \(X^T\).
- Use Properties of Orthogonality
Because we decomposed \(y\) as \(y = y_{proj_{col(X)}} + (y - y_{proj_{col(X)}})\), and we now know that the second term in the decomposition is in the null space of \(X^T\), we can multiply both sides of \(X\boldsymbol{\beta}=y\) by \(X^T\) to eliminate the component that lies in the null space of \(X^T\). What remains is the part of \(y\) that is a projection onto the column space of \(X\). This result is known as the normal equation.
\[X^TX\boldsymbol{\beta}=X^Ty\]
- Solve for \(\boldsymbol{\beta}\) and multiply both sides by \(X\)
We can now solve for \(\boldsymbol{\beta}\) and could stop here:
\[ \boldsymbol{\beta} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} \]But it is educational to observe that if we multiply both sides by \(X\), we can see that we have also solved for the projection of \(y\) onto the column space of \(X\):
\[ X\boldsymbol{\beta} = X(\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} = y_{proj_{col(X)}}\]Where the projection matrix that projects \(y\) onto the column space of \(X\) is:
\[ X(\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \]
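A quick numerical check of this projection (hat) matrix on synthetic data (a sketch; the matrix should be symmetric, idempotent, and should map \(y\) to \(X\boldsymbol{\beta}_{LS}\)):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20)
X = np.column_stack([np.ones_like(x), x])
y = rng.normal(size=20)

P = X @ np.linalg.inv(X.T @ X) @ X.T        # projection (hat) matrix
beta = np.linalg.solve(X.T @ X, X.T @ y)    # least-squares coefficients

print(np.allclose(P @ y, X @ beta))  # True: P y is the projection of y onto Col(X)
print(np.allclose(P @ P, P))         # True: projecting twice changes nothing
print(np.allclose(P, P.T))           # True: orthogonal projection matrices are symmetric
```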
Using the least squares solution in linear regression
In the context of linear regression, the goal is to fit a line or hyperplane that best describes the relationship between an independent variable (or variables) and a dependent variable. The least squares solution can be used to find the optimal parameters for the linear model, which will result in the best-fitting line.
Let's assume we have a simple linear regression problem with a single independent variable \(x\) and a dependent variable \(y\). We can represent the linear relationship as:
\[ y = \beta_0 + \beta_1 x \]Here, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope of the line. To apply the least squares solution, we first need to set up the matrix \(\textbf{X}\) and the vector \(\textbf{y}\). The matrix \(\textbf{X}\) will have two columns: the first column consists of all 1s (for the intercept term), and the second column contains the independent variable values. The vector \(\textbf{y}\) contains the dependent variable values.
For example, given the dataset \((x,y) : \{(1, 1), (2, 3), (3, 2)\}\), we have:
\[ \textbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad \textbf{y} = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} \]Now, we can use the least squares solution to find the optimal values for \(\beta_0\) and \(\beta_1\):
\[ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = (\textbf{X}^T \textbf{X})^{-1} \textbf{X}^T \textbf{y} \] \[ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \left(\begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix} \] \[ \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \frac{1}{2} \end{bmatrix} \]
Plot of Linear Regression
With the optimal values of \(\beta_0\) and \(\beta_1\), we can construct the best-fitting line for the given dataset. This line minimizes the sum of squared residuals, which measures the overall discrepancy between the predicted and actual values. The fitted linear model can then be used to make predictions for new data points or to analyze the relationship between the independent and dependent variables.
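A plot like the one described above can be reproduced with a short script (a sketch; the styling choices are mine):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0])
X = np.column_stack([np.ones_like(x), x])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                                  # [1.  0.5], i.e. beta0 = 1, beta1 = 1/2

x_line = np.linspace(0, 4, 100)
plt.scatter(x, y, label="data")
plt.plot(x_line, beta[0] + beta[1] * x_line, label="least-squares fit")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```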
In the case of multiple linear regression, where there are multiple independent variables, the same least squares solution can be applied, with the matrix \(\textbf{X}\) having additional columns for each independent variable.
It's important to note that the analytical least squares solution assumes that the matrix \(\textbf{X}^T \textbf{X}\) is invertible. If this assumption does not hold, alternative methods, such as regularization techniques or pseudo-inverse, can be used to find a solution.
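For example, the Moore-Penrose pseudo-inverse returns the minimum-norm least-squares solution even when \(\textbf{X}^T \textbf{X}\) is singular; a sketch with a deliberately rank-deficient design matrix (the duplicated column is an illustrative choice of mine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0])

# Duplicate the predictor column so that X^T X is singular.
X = np.column_stack([np.ones_like(x), x, x])

beta = np.linalg.pinv(X) @ y   # minimum-norm least-squares solution
fitted = X @ beta
print(beta)
print(np.allclose(fitted, X @ np.linalg.lstsq(X, y, rcond=None)[0]))  # same fitted values
```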
Conclusion: Mastering Mathematical Concepts for Success in Data Science and Quantitative Finance
Throughout this blog post, we have explored essential mathematical concepts that underpin the fields of data science and quantitative finance, including:
- Vector Spaces
- Gram-Schmidt Process
- Fundamental Subspaces of a Matrix
- Applications of the Null Space
- Orthogonal Projections
- The Least-Squares Problem
By understanding and applying these vital mathematical concepts, you are now better equipped to tackle complex challenges, design innovative solutions, and stand out in the competitive landscape of data science and quantitative finance. As you continue to refine your skills and knowledge, remember that the pursuit of mastery is an ongoing effort. Stay curious, keep learning, and embrace the opportunities that come with being a well-rounded professional in these ever-evolving fields.