Mathematics of Machine Learning: Introduction to Linear Algebra

In this article we introduce the first step in the mathematical foundation of machine learning: linear algebra.

5 years ago   •   7 min read

By Peter Foy

We all know that machine learning is transforming nearly every industry on the planet, and in this article we're going to look at the mathematical foundations of this revolution.

In the field of machine learning, linear algebra notation is used to describe the parameters, weights, and structure of different algorithms. As a result, it is essential for any machine learning practitioner to develop a foundation in linear algebra.

This article is based on a course on the Mathematical Foundation for Machine Learning and AI, and is organized as follows:

  1. Scalars, Vectors, Matrices, and Tensors
  2. Vector and Matrix Norms
  3. Special Matrices and Vectors
  4. Eigenvalues, Eigenvectors, & Eigendecomposition
  5. Summary: Mathematics of Machine Learning

Having a grasp of these foundations is essential to building and deploying machine learning algorithms that solve real-world problems.

1. Scalars, Vectors, Matrices, and Tensors

The notation that follows is very important in machine learning, as these objects are used in every deep learning algorithm today.

The inputs of neural networks are typically vectors or matrices, and the outputs are scalars, vectors, matrices, or tensors.

Let's start with a few basic definitions:


A scalar is a single number or value and is typically denoted $x$.

A vector is an array of numbers, arranged in either a row or a column, where each element is identified by a single index. Vectors are typically denoted with a bold $\mathbf{x}$.

A matrix is a 2-dimensional array of numbers, where each element is identified by two indices. A matrix is denoted with a capital and bold $\mathbf{X}$.

Let's look at a few examples of scalars, vectors, and matrices:

Scalar: 5

Vector: $[1 \ 5 \ 0]$ or

$$\begin{bmatrix}1 \\ 5 \\ 0\end{bmatrix}$$

Matrix:

$$\begin{bmatrix} 5 & 8 \\ 1 & 2 \\ 2 & 3 \end{bmatrix}$$

It's important to note that a matrix can have any number of rows and columns.

In code, the elements of vectors and matrices are typically indexed starting from 0, although mathematical texts often start from 1.

In terms of dimensions, vectors are 1-dimensional and matrices are 2-dimensional. The dimensions of a matrix are reported in (Row, Column) format. For example, a matrix with 3 rows and 5 columns is a 3 x 5 matrix.
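As a minimal sketch of these definitions in Python, we can let plain nested lists stand in for vectors and matrices. This is an assumption made for illustration (the variable names are hypothetical, and in practice a numerical library would normally be used):

```python
# A scalar, a vector, and a matrix represented with plain Python values.
scalar = 5
vector = [1, 5, 0]
matrix = [[5, 8],
          [1, 2],
          [2, 3]]

# Dimensions are reported as (rows, columns).
shape = (len(matrix), len(matrix[0]))
print(shape)         # (3, 2)

# Indexing starts from 0: one index for a vector, two for a matrix.
print(vector[1])     # 5
print(matrix[2][0])  # 2 (row 2, column 0)
```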

Now that we understand the basics, let's move on to some of the operations we can perform.


Matrix Operations

Matrix operations are frequently used in machine learning and can include addition, subtraction, and multiplication.

Matrix addition is an entrywise sum: if $C = A + B$, then each entry of C is the sum of the corresponding entries of A and B, that is $C_{i,j} = A_{i,j} + B_{i,j}$. A and B must have the same dimensions, and C has the same dimensions as A and B.

For example: $A + B = C$

$$\begin{bmatrix} 1 & 2 \\ 3& 4 \\ 5 & 6 \end{bmatrix} + \begin{bmatrix} 1 & 2 \\ 3& 4 \\ 5 & 6 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\ 6 & 8 \\ 10 & 12 \end{bmatrix}$$

Matrix subtraction works the same way as matrix addition - it is performed elementwise and the matrices must have the same dimensions.
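Here is a small sketch of elementwise addition and subtraction in Python, assuming matrices are stored as nested lists. The helper names `mat_add` and `mat_sub` are hypothetical and chosen only for this example:

```python
def _check_same_shape(A, B):
    """Raise an error unless A and B have identical dimensions."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("A and B must have the same dimensions")

def mat_add(A, B):
    """Entrywise sum: C[i][j] = A[i][j] + B[i][j]."""
    _check_same_shape(A, B)
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    """Entrywise difference: C[i][j] = A[i][j] - B[i][j]."""
    _check_same_shape(A, B)
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[1, 2], [3, 4], [5, 6]]
B = [[1, 2], [3, 4], [5, 6]]

C = mat_add(A, B)
print(C)              # [[2, 4], [6, 8], [10, 12]]
print(mat_sub(C, B))  # [[1, 2], [3, 4], [5, 6]]
```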

Matrix multiplication is a bit different - the matrix product of A and B is a third matrix C. A must have the same number of columns as B has rows: if A is $m \times n$ and B is $n \times p$, then C is $m \times p$.

Here is the equation for the entries of C:

\[C_{i, j} = \sum_k A_{i, k} B_{k, j}\]

For example:

$$\begin{bmatrix} 1 & 2 \\ 3& 4 \\ 5 & 6 \end{bmatrix} * \begin{bmatrix} 1 & 2 \\ 3& 4  \end{bmatrix} = \begin{bmatrix} 7 & 10 \\ 15 & 22 \\ 23 & 34 \end{bmatrix}$$
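The example above can be reproduced with a short sketch in Python. It assumes nested lists as the matrix representation, and `mat_mul` is a hypothetical helper that applies the summation formula directly:

```python
def mat_mul(A, B):
    """Matrix product C = AB, where C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n = len(B)  # number of rows of B
    if any(len(row) != n for row in A):
        raise ValueError("A must have as many columns as B has rows")
    p = len(B[0])  # number of columns of B
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(p)]
            for row in A]

A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2
B = [[1, 2], [3, 4]]          # 2 x 2

C = mat_mul(A, B)             # 3 x 2
print(C)  # [[7, 10], [15, 22], [23, 34]]
```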

Here are a few properties of matrix multiplication:

  • Matrix multiplication is distributive: $A(B + C) = AB + AC$
  • Matrix multiplication is associative: $A(BC) = (AB)C$
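We can check both properties numerically on small integer matrices. This is only a sketch: the three matrices are arbitrary values picked for illustration, and the helpers `mat_add` and `mat_mul` are hypothetical names defined here. The last line also shows that the order of the factors matters, since $AB$ and $BA$ generally differ.

```python
def mat_add(A, B):
    """Entrywise sum of two matrices with the same dimensions."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_mul(A, B):
    """Matrix product: each entry is a row of A dotted with a column of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [1, 3]]

# Distributive: A(B + C) == AB + AC
print(mat_mul(A, mat_add(B, C)) == mat_add(mat_mul(A, B), mat_mul(A, C)))  # True

# Associative: A(BC) == (AB)C
print(mat_mul(A, mat_mul(B, C)) == mat_mul(mat_mul(A, B), C))  # True

# Order matters: AB and BA are different matrices here.
print(mat_mul(A, B) == mat_mul(B, A))  # False
```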

Matrix Transpose

One of the most important matrix operations in machine learning is the transpose.

The transpose of a matrix is an operator which flips a matrix over its diagonal.

The indices of the rows and columns are switched in the transpose, for example:

$$\begin{bmatrix} a_{00} & a_{01} \\ a_{10} & a_{11} \end{bmatrix}^T = \begin{bmatrix} a_{00} & a_{10} \\ a_{01} & a_{11} \end{bmatrix}$$

We aren't restricted to matrices: we can also take the transpose of a vector, in which case a row vector becomes a column vector and vice versa.
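As a quick sketch in Python (assuming nested lists, with a hypothetical `transpose` helper), the transpose simply swaps the row and column indices:

```python
def transpose(A):
    """Swap rows and columns, so that result[j][i] == A[i][j]."""
    return [list(col) for col in zip(*A)]

A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2
print(transpose(A))           # [[1, 3, 5], [2, 4, 6]], which is 2 x 3

row_vector = [[1, 5, 0]]      # 1 x 3
print(transpose(row_vector))  # [[1], [5], [0]], a 3 x 1 column vector
```

Transposing twice returns the original matrix.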


If we take a matrix one step further we get a tensor.

Matrices have two axes, but sometimes we need more than two axes.

For example, where a 2-D matrix has indices $(i, j)$, a 3-D tensor has indices $(i, j, k)$.

It gets harder to imagine, but we can have any number of dimensions in a tensor, although it does get more computationally expensive to handle.
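To make the extra axis concrete, here is a sketch of a 3-D tensor built from nested lists in Python. The shape (2, 3, 4) and the entry values are arbitrary choices for illustration; each entry encodes its own indices so that the lookup is easy to verify:

```python
# A 3-D tensor with shape (2, 3, 4): the entry at (i, j, k) is 100*i + 10*j + k.
tensor = [[[100 * i + 10 * j + k for k in range(4)]
           for j in range(3)]
          for i in range(2)]

shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)            # (2, 3, 4)

# Three indices identify a single element.
print(tensor[1][2][3])  # 123
```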

2. Vector and Matrix Norms

Now that we're familiar with vectors and matrices, let's look at how they're actually used in machine learning.

The first way that vectors and matrices are used is with norms:

The magnitude of a vector can be measured using a function called a norm.

Norms are used in many ways in machine learning. One example is a loss function, which can use a norm to measure the distance between a predicted point and the actual point.

Here are a few important points about norms:

  • Norms map vectors to non-negative values
  • The norm of a vector $x$ measures the distance from the origin to the point $x$

The general formula for a norm is the $L^p$ norm:

\[||x||_p = (\sum_i |x_i|^p)^{1/p}\]

When we fill in different values of $p$ in this equation we're going to get very different norms.
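The formula translates almost directly into Python. The sketch below assumes $p \geq 1$ and a vector stored as a list; `lp_norm` is a hypothetical helper name:

```python
def lp_norm(x, p):
    """L^p norm of a vector x, assuming p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

x = [3, -4]
print(lp_norm(x, 1))  # 7.0
print(lp_norm(x, 2))  # 5.0
```

The same vector has a different length depending on which value of $p$ we choose.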

The Euclidean Norm

The most common value is $p = 2$, in which case we get the Euclidean norm, otherwise known as the $L^2$ norm:

\[||x||_2 = (\sum_i |x_i|^2)^{1/2}\]

The $L^1$ Norm

Another common choice is $p = 1$, which gives the $L^1$ norm:

\[||x||_1 = \sum_i |x_i|\]

The $L^1$ norm increases linearly as the elements of $x$ move away from 0, and it grows at the same rate everywhere.

This makes the $L^1$ norm useful in machine learning when it is important to discriminate between elements that are exactly zero and elements that are small but nonzero.
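A small numerical sketch can show why this matters near the origin. The comparison with the squared $L^2$ norm is an illustrative choice that is not discussed above, and the helper names and test vectors are hypothetical:

```python
def l1_norm(x):
    """Sum of the absolute values of the elements."""
    return sum(abs(xi) for xi in x)

def squared_l2_norm(x):
    """Sum of the squared elements (the L2 norm without the square root)."""
    return sum(xi * xi for xi in x)

zero = [0.0, 0.0]
tiny = [0.01, 0.0]  # one element is slightly nonzero

# The L1 norm changes by the full size of the small element...
print(l1_norm(tiny) - l1_norm(zero))                  # 0.01

# ...while the squared L2 norm barely moves (roughly 0.0001).
print(squared_l2_norm(tiny) - squared_l2_norm(zero))
```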

The Max Norm

The max norm, or the $L^\infty$ norm, is also frequently used in machine learning.

The max norm simplifies to the largest absolute value among the elements of the vector.

\[||x||_\infty = \max_i |x_i|\]
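As a sketch in Python (with hypothetical helper names and an arbitrary example vector), the max norm is a one-liner, and we can also see that the $L^p$ norm approaches it as $p$ grows:

```python
def max_norm(x):
    """L-infinity norm: the largest absolute value in the vector."""
    return max(abs(xi) for xi in x)

def lp_norm(x, p):
    """L^p norm of a vector x, assuming p >= 1."""
    return sum(abs(xi) ** p for xi in x) ** (1 / p)

x = [1, -7, 3]
print(max_norm(x))     # 7

# For a large p the L^p norm is dominated by the largest element,
# so this prints a value very close to 7.
print(lp_norm(x, 50))
```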

The Frobenius Norm

So far we've been measuring the size of vectors. To measure the size of a matrix we can use the Frobenius norm, which is analogous to the $L^2$ norm of a vector:

\[||A||_F = \sqrt{\sum_{i,j}A^2_{i,j}}\]

The Frobenius norm comes up frequently in machine learning because it extends the idea of the $L^2$ norm from vectors to matrices.
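Here is a minimal sketch of the Frobenius norm in Python, assuming a matrix stored as nested lists. The helper name `frobenius_norm` and the example matrix are hypothetical:

```python
import math

def frobenius_norm(A):
    """Square root of the sum of the squares of all entries of A."""
    return math.sqrt(sum(a * a for row in A for a in row))

A = [[1, 2],
     [2, 4]]
print(frobenius_norm(A))  # 5.0, since 1 + 4 + 4 + 16 = 25
```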

Now let's look at how we can use these norms to normalize a vector or a matrix and produce what are called "unit vectors".

3. Special Matrices & Vectors

The matrices and vectors we're going to discuss occur more commonly than others and are particularly useful in machine learning.

The reason most of them are particularly useful in machine learning is that they're computationally efficient.

The matrices & vectors we're going to cover include:

  • Diagonal matrices
  • Symmetric matrices
  • Unit vectors
  • Normalization
  • Orthogonal vectors

Diagonal Matrices

A matrix is diagonal if the following condition is true:

\[D_{i,j} = 0 \text{ for all } i \neq j\]

Here's an example of a diagonal matrix, where all the entries are 0 except along the main diagonal:

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 3\\ \end{bmatrix}$$

Diagonal matrices are useful in machine learning because multiplying by a diagonal matrix is computationally efficient.
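To see where the efficiency comes from, here is a sketch in Python. Multiplying a diagonal matrix by a vector only requires scaling each element by the matching diagonal entry, so we never need to store or touch the zeros. The helper name and values are hypothetical:

```python
def diag_times_vector(d, x):
    """Multiply diag(d) by x using only the n diagonal entries in d."""
    return [di * xi for di, xi in zip(d, x)]

d = [1, 2, 3]  # diagonal entries of the matrix shown above
x = [4, 5, 6]
print(diag_times_vector(d, x))  # [4, 10, 18]

# The full 3 x 3 product gives the same answer but uses n * n multiplications.
n = len(d)
D = [[d[i] if i == j else 0 for j in range(n)] for i in range(n)]
full = [sum(D[i][k] * x[k] for k in range(n)) for i in range(n)]
print(full == diag_times_vector(d, x))  # True
```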

Symmetric Matrices

A symmetric matrix is any matrix that is equal to its transpose:

\[A = A^T\]
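A sketch of this check in Python, assuming nested lists and a hypothetical `is_symmetric` helper, compares each entry with its mirror image across the main diagonal:

```python
def is_symmetric(A):
    """Return True if A is square and A[i][j] == A[j][i] for all i, j."""
    n = len(A)
    if any(len(row) != n for row in A):
        return False  # only square matrices can equal their transpose
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(i + 1, n))

S = [[1, 7, 3],
     [7, 4, 5],
     [3, 5, 6]]
print(is_symmetric(S))                 # True
print(is_symmetric([[1, 2], [3, 4]]))  # False
```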

The Unit Vector

A unit vector is a vector with unit norm:

\[||x||_2 = 1\]

Vector Normalization

Normalization is the process of dividing a vector by its magnitude, which produces a unit vector:

\[\dfrac{x}{||x||_2} = \text{unit vector}\]

Normalization is a very common step in data preprocessing, and in many cases it can substantially improve the performance of machine learning algorithms.
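A minimal sketch of vector normalization in Python might look like the following. The helper name `normalize` is hypothetical, and the zero-vector check is an added safeguard, since a vector with zero magnitude cannot be normalized:

```python
import math

def normalize(x):
    """Divide x by its L2 norm to produce a unit vector."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [xi / norm for xi in x]

u = normalize([3, 4])
print(u)  # [0.6, 0.8]

# The result has unit norm (up to floating-point rounding).
print(math.isclose(math.sqrt(sum(ui * ui for ui in u)), 1.0))  # True
```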

Orthogonal Vectors

A vector $x$ and a vector $y$ are orthogonal to each other if $x^Ty = 0$.

If two vectors are orthogonal and both vectors have a nonzero magnitude, they will be at a 90 degree angle to each other.

If two vectors are orthogonal and are also unit vectors, they are called orthonormal.
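These definitions can be checked with a short sketch in Python. The two vectors are arbitrary examples, and `dot` and `normalize` are hypothetical helper names:

```python
import math

def dot(x, y):
    """Dot product, i.e. x^T y for two column vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def normalize(x):
    """Divide x by its L2 norm to produce a unit vector."""
    norm = math.sqrt(dot(x, x))
    return [xi / norm for xi in x]

x = [2, 1]
y = [-1, 2]
print(dot(x, y))  # 0, so x and y are orthogonal

# Normalizing keeps the vectors orthogonal and gives both unit length,
# which makes the pair orthonormal.
x_hat, y_hat = normalize(x), normalize(y)
is_orthonormal = (math.isclose(dot(x_hat, y_hat), 0.0, abs_tol=1e-12)
                  and math.isclose(dot(x_hat, x_hat), 1.0)
                  and math.isclose(dot(y_hat, y_hat), 1.0))
print(is_orthonormal)  # True
```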

4. Eigenvalues, Eigenvectors, & Eigendecomposition

Now let's move on to the concept of eigendecomposition by breaking it down into eigenvalues and eigenvectors.


Many mathematical objects are easier to understand once they are broken into their constituent parts.

For example, integers can be decomposed into prime factors.

Similarly, we can decompose matrices in ways that reveal information about their functional properties that is not immediately obvious.

Eigendecomposition is one such decomposition: we take a matrix and decompose it into its eigenvectors and eigenvalues.

Eigenvectors & Eigenvalues

An eigenvector of a square matrix $A$ is a nonzero vector $v$ such that multiplication by $A$ alters only the scale of $v$:

\[Av = \lambda v\]


  • $v$ is the eigenvector
  • $\lambda$ is a scalar, the eigenvalue corresponding to $v$
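To make the definition concrete, here is a sketch in Python that checks $Av = \lambda v$ directly. The matrix and the eigenpairs are chosen by hand for illustration, and `mat_vec` is a hypothetical helper:

```python
def mat_vec(A, v):
    """Matrix-vector product Av."""
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

A = [[2, 1],
     [1, 2]]

# v is an eigenvector with eigenvalue 3: A only rescales it.
v, lam = [1, 1], 3
print(mat_vec(A, v))           # [3, 3]
print([lam * vi for vi in v])  # [3, 3]

# w is not an eigenvector: A changes its direction, not just its scale.
w = [1, 0]
print(mat_vec(A, w))           # [2, 1]
```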

Let's come back to eigendecomposition.


If a matrix $A$ has $n$ linearly independent eigenvectors, we can form a matrix $V$ with one eigenvector per column, and a vector $\lambda$ of all the eigenvalues.

The eigendecomposition of $A$ is then given by:

\[A = V \, \text{diag}(\lambda) \, V^{-1}\]
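We can verify this formula on a small example in Python. This is only a sketch under stated assumptions: the eigenvectors and eigenvalues below are supplied by hand rather than computed, the matrix is 2 x 2 so the inverse has a simple closed form, and the helper names are hypothetical.

```python
def mat_mul(A, B):
    """Matrix product: each entry is a row of A dotted with a column of B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def inverse_2x2(M):
    """Closed-form inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

V = [[1, 1],
     [1, -1]]            # one eigenvector per column: [1, 1] and [1, -1]
diag_lam = [[3, 0],
            [0, 1]]      # matching eigenvalues 3 and 1 on the diagonal

A = mat_mul(mat_mul(V, diag_lam), inverse_2x2(V))
print(A)  # [[2.0, 1.0], [1.0, 2.0]]
```

Multiplying the three factors back together recovers the original matrix, whose eigenvectors are the columns of $V$.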

One important caveat is that not every matrix can be decomposed into eigenvalues and eigenvectors: the decomposition above requires $A$ to be a square matrix with $n$ linearly independent eigenvectors.

The main motivation for understanding eigendecomposition in the context of machine learning is that it is used in principal component analysis (PCA).

5. Summary: Mathematics of Machine Learning - Linear Algebra

In this article we reviewed the mathematical foundation of machine learning: linear algebra.

We first defined scalars, vectors, matrices, and tensors. As discussed, the input of a neural network is typically in the form of a vector or matrix, and the output is either a scalar, vector, matrix, or tensor.

After that we looked at vector and matrix norms, and as mentioned:

The magnitude of a vector can be measured using a function called a norm.

Along the way we covered indexing, matrix operations, and the matrix transpose.

Next we looked at a few special matrices and vectors that occur more commonly and are particularly useful in machine learning. These include:

  • Diagonal matrices
  • Symmetric matrices
  • Unit vectors
  • Normalization
  • Orthogonal vectors

Finally we looked at eigenvalues, eigenvectors, and eigendecomposition.

As discussed:

Eigendecomposition breaks a matrix into its constituent parts: its eigenvectors and eigenvalues.

Although each one of these topics could be expanded on significantly, the goal of the article is to provide an introduction to concepts that frequently come up in machine learning.

If you want to learn more about any of these subjects, check out the resources below.

