In our previous guide on Quantum Computation, we introduced a classical-quantum hybrid algorithm called the Quantum Approximate Optimization Algorithm (QAOA).
In this article, we're going to build on this and look at classical-quantum learning algorithms.
This guide is based on this Quantum Machine Learning course from U of T, and in particular, we're going to look at:
- Encoding Classical Information into Quantum Systems
- Ensemble Learning & Discrete Optimization
- Variational Methods in Unsupervised Learning
- Kernel Methods
- Probabilistic Graphical Models
- Summary of Quantum Learning Algorithms
If you want to read more, check out our other introductory guides on the subject.
1. Encoding Classical Information into Quantum Systems
As we discussed previously, there are several paradigms for quantum computation.
Regardless of which paradigm we're working with, we're always going to have to ask the question: how do I encode classical information into a quantum system?
There are several ways of achieving this, and here is an overview of the main approaches.
Basis Encoding
Basis encoding is the easiest way to encode classical information into a quantum system because it is the same as what you would do on a digital computer.
Let's say that \(x = 3\) and you want to encode it: you would write it in binary as \(11\).
You would then just take this binary and write it as a quantum state: \(|1 1\rangle\).
If we have a vector \(x = \begin{bmatrix}3 \\ 2\end{bmatrix}\), we can convert each entry to a binary string: \(3 \rightarrow 11\) and \(2 \rightarrow 10\).
We can then concatenate the two strings into a single key, a 4-qubit state in this example: \(|1110\rangle\).
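As a quick illustration, here is a minimal Python sketch of basis encoding; the `basis_encode` helper is our own name, and on real hardware you would prepare the state by applying X gates to the corresponding qubits:

```python
# A minimal sketch of basis encoding: classical integers become a label for
# a computational basis state. The basis_encode helper is a made-up name;
# on hardware you would apply X gates to flip the corresponding qubits.

def basis_encode(values, n_bits):
    """Concatenate fixed-width binary strings, one per value."""
    return "".join(format(v, f"0{n_bits}b") for v in values)

bitstring = basis_encode([3, 2], n_bits=2)
print(f"|{bitstring}>")  # prints |1110>
```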
The advantage of this, as we can see, is that it's very easy to prepare.
The big disadvantage of this approach is that it's very wasteful of the qubits. You need many qubits to describe something simple, like a floating point number for example.
Amplitude Encoding
The next approach we'll discuss is amplitude encoding.
Let's say we have a vector \(x=\begin{bmatrix}x_0\\x_1\end{bmatrix}\), and we have the assumption that the vector is already normalized, meaning its length is 1.
What we can do is write the same vector with its original coefficients as probability amplitudes, indexing each coefficient by a basis vector:
\[|x\rangle = x_0|0\rangle + x_1|1\rangle\]
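Here is a minimal NumPy sketch of the classical side of amplitude encoding, normalizing a vector so its entries can serve as probability amplitudes (the values are illustrative):

```python
import numpy as np

# Normalize a classical vector so its entries can serve as probability
# amplitudes; the values here are illustrative.
x = np.array([3.0, 2.0])
amplitudes = x / np.linalg.norm(x)   # now |x> = x0|0> + x1|1> with norm 1

# Measuring in the computational basis yields |0> or |1> with these probabilities:
probs = amplitudes ** 2
print(amplitudes)   # approx [0.832, 0.555]
print(probs)        # approx [0.692, 0.308], summing to 1
```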
The advantage with amplitude encoding is that it requires far fewer qubits, and (in theory) has infinite precision.
The disadvantage is that it's unclear how you would prepare this state and read out the actual values.
This means you'd have to come up with a state preparation protocol and perform tomography to understand what the probability amplitudes are at the end of the calculation.
Hamiltonian Encoding
There are two ways of encoding a problem into a Hamiltonian.
The first, which we've seen before, is to map the problem to the Ising model.
This is the approach we use for quantum annealing.
The advantage of this is that it's quite easy to implement: quantum annealing systems already have thousands of qubits.
The disadvantage is that it's limited in scope, although you can solve both optimization and sampling problems with the Ising formulation.
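To make the Ising mapping concrete, here is a toy sketch of the Ising energy function \(E(s) = \sum_{ij} J_{ij} s_i s_j + \sum_i h_i s_i\) over spins \(s_i \in \{-1, +1\}\); the couplings and fields are arbitrary illustrative values, and we brute-force the ground state that an annealer would search for:

```python
import numpy as np
from itertools import product

# Toy Ising instance: E(s) = sum_ij J_ij s_i s_j + sum_i h_i s_i over spins
# s_i in {-1, +1}. The couplings J and fields h are arbitrary illustrative values.
J = {(0, 1): -1.0, (1, 2): 0.5}
h = np.array([0.1, -0.2, 0.0])

def ising_energy(s):
    coupling = sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())
    return coupling + float(h @ np.array(s))

# Brute-force the lowest-energy spin configuration (what an annealer searches for):
best = min(product([-1, 1], repeat=3), key=ising_energy)
print(best, ising_energy(best))
```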
The second way of Hamiltonian encoding is with Hamiltonian simulation.
This is not to be confused with simulating it on a classical computer. Instead, this is a quantum computer simulating a quantum system.
An example of this is what the QAOA algorithm does when it approximates the adiabatic pathway.
One of the most important subroutines in quantum machine learning algorithms is quantum matrix inversion, which relies on the same idea: the matrix to be inverted is encoded in the Hamiltonian.
The advantage of this is that it's quite natural to encode a matrix with this formalism.
The disadvantage is that many caveats apply to this paradigm, and what we can do with Hamiltonian simulation on current quantum computers is still very limited.
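As a rough illustration of what Hamiltonian simulation computes, here is a sketch of the time evolution \(|\psi(t)\rangle = e^{-iHt}|\psi(0)\rangle\) done classically with a matrix exponential; the Hamiltonian is an arbitrary single-qubit example:

```python
import numpy as np
from scipy.linalg import expm

# Time evolution |psi(t)> = exp(-iHt)|psi(0)>, computed classically with a
# matrix exponential. H is an arbitrary single-qubit example (the Pauli-X matrix);
# for n qubits the matrix is 2^n x 2^n, which is what makes this hard classically.
H = np.array([[0, 1], [1, 0]], dtype=complex)
psi0 = np.array([1, 0], dtype=complex)           # start in |0>

psi_t = expm(-1j * H * (np.pi / 2)) @ psi0
print(np.abs(psi_t) ** 2)   # probabilities of measuring |0> and |1>: [0, 1]
```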
2. Ensemble Learning & Discrete Optimization
Now that we know how to encode classical information on quantum computers, we can discuss how we do learning on the encoded information.
First, let's discuss a few important concepts in machine learning.
Machine Learning
When we refer to optimization in machine learning, this is typically not the type of optimization you would perform on a quantum computer.
In machine learning, a typical optimization example would be as follows:
- You have data points in a high dimensional space
- Each data point has some label that belongs to a finite number of classes
What we're trying to do in machine learning is fit the data to a conditional probability distribution.
So when we see new data, the machine learning model is trying to approximate what label it should predict.
This is referred to as supervised learning - we have training data with labels that we train our model on.
This is exactly where deep learning techniques have excelled in recent years.
To achieve this, you define a loss function over the parameters of the machine learning model, and this loss is what you optimize.
Each of these parameters is represented on a digital computer as a 32 or 64-bit precision floating point number.
Particularly in deep learning, the parameter space becomes extremely high dimensional, with up to millions of weights.
So we're using at least 32 bits per weight, and there are millions of them.
Recall that the largest quantum computer we have has ~2000 qubits - so you can see how there's a difference between this type of continuous optimization in machine learning, and what we can solve on a quantum computer.
Clearly, we need to think about optimization differently.
Ensemble Learning & Discrete Optimization
In ensemble learning, you can take several large neural networks and combine them into a single predictor.
How to combine neural networks is still a big question, but the objective is to weight each of the neural networks so the overall performance increases.
You can see this is a discrete optimization problem, as you have a discrete set of neural networks and you want to find the relative importance of each.
As we can see from Wikipedia:
Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
One of the first algorithms that made ensemble learning very useful is the AdaBoost boosting algorithm.
What AdaBoost does is keep expanding the ensemble sequentially, meaning it adds a new learning model to the ensemble one at a time.
Other important ensemble methods include gradient boosting and its popular implementation XGBoost.
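For reference, here is a quick illustration of AdaBoost in scikit-learn on synthetic data; the ensemble grows sequentially, adding one weak learner (a decision stump by default) at a time:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost grows the ensemble sequentially, adding one weak learner at a time
# and reweighting the training data toward previously misclassified points.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(model.score(X, y))   # training accuracy of the combined ensemble
```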
Qboost
Let's now look at what we can do with a quantum computer in terms of ensemble learning.
We mentioned AdaBoost, which has an exponential loss and no regularization term.
Let's look at how we can create a different objective function that maps to a quantum computer, and provides an improvement in performance over classical methods.
As mentioned, we have a sample of data points that live in some high dimensional space. In this example, let's say they have binary labels.
Let's say we also have a few large neural networks that we have already trained on the dataset.
What we do is measure the squared loss between the prediction of the ensemble, and the actual label \(y_i\). This contrasts with the exponential loss we use in AdaBoost.
In contrast to AdaBoost, we also have another part that performs regularization.
Formally, Qboost can be written as follows:
\[\mathrm{argmin}_{w} \left(\frac{1}{N}\sum_{i=1}^{N}\left(\sum_{k=1}^{K}w_kh_k(x_i)-y_i\right)^2+\lambda\|w\|_0\right)\]
where \(h_k(x_i)\) is the prediction of weak learner \(k\) for training instance \(i\), and \(\lambda\) controls the strength of the regularization.
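To see how this objective maps to quantum hardware, here is a sketch of turning it into a QUBO over binary weights \(w_k \in \{0, 1\}\), so that \(\|w\|_0 = \sum_k w_k\); the predictions and labels below are random stand-ins for real trained learners:

```python
import numpy as np
from itertools import product

# Random stand-ins for K trained weak learners evaluated on N instances;
# H[i, k] plays the role of h_k(x_i) and y holds the binary labels.
rng = np.random.default_rng(0)
N, K = 100, 5
H = rng.choice([-1, 1], size=(N, K))
y = rng.choice([-1, 1], size=N)
lam = 0.1

# Expanding (sum_k w_k h_k(x_i) - y_i)^2 gives quadratic terms (H^T H)/N and
# linear terms -2 (H^T y)/N; with w_k in {0, 1}, w_k^2 = w_k, so linear terms
# (including the lambda ||w||_0 penalty) live on the diagonal of the QUBO matrix.
Q = (H.T @ H) / N
Q[np.diag_indices(K)] += lam - 2 * (H.T @ y) / N

# Brute-force the optimal weights for this tiny K (an annealer would sample this):
best = min(product([0, 1], repeat=K), key=lambda w: np.array(w) @ Q @ np.array(w))
print(best)
```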
You can read more about Qboost in a paper published by Google and D-Wave: Training a Binary Classifier with the Quantum Adiabatic Algorithm.
You can also check out a demonstration of Qboost on the D-Wave system on this GitHub page.
3. Variational Methods in Unsupervised Learning
We just discussed a supervised learning problem, which can be solved by mapping to an Ising model.
One thing to note is that it's difficult to compete with supervised learning on classical computers because this is exactly where deep learning excels.
Instead, we want to use quantum computers for machine learning problems that are hard to solve classically.
Unsupervised Learning
An example of a hard problem in machine learning is unsupervised learning.
While there have been advances in unsupervised learning, it remains a difficult domain.
In unsupervised learning, again we have data points in some high dimensional space, but the difference is that these points don't have any labels.
Instead, we're trying to find structure in our data.
We don't have any prior information about whether the data points belong to a certain class (i.e. cat or dog) or how they cluster in the structure.
In unsupervised learning we want to figure out what the labels would be, and how to assign those labels to the data points.
This is an example of a discriminative, unsupervised learning problem.
Another type of unsupervised learning is generative, in which we figure out the distribution of the points and use it to generate new ones.
In this article, we're going to focus on clustering with unsupervised learning.
With clustering you want to identify groups of data points that somehow belong together.
One of the most common learning algorithms for clustering is called k-means clustering.
In k-means you identify center points (centroids) and assign each data point to a cluster based on its proximity to these centroids. Here's a definition from AWS:
Another way you can define k-means is that it is a clustering problem that finds k cluster centroids for a given set of records, such that all points within a cluster are closer in distance to their centroid than they are to any other centroid.
Unsupervised Learning with a Quantum Computer
Here's one way you could do k-means with a quantum computer.
In order to map this to the Ising model, we can calculate the Gram matrix, filled with the distances between every pair of points in the sample; this defines a weighted graph.
We can then look for the max-cut, that is, the highest value of a cut through the graph. This, in turn, would identify the clusters that we're looking for in our data points.
This is a well-known NP-hard problem, but it also naturally maps to an Ising model.
We can now use quantum annealing, QAOA, or another quantum optimization technique to solve this problem.
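Here is a small sketch of the idea, with random two-dimensional points standing in for real data; we fill a matrix with pairwise distances and brute-force the max-cut that a quantum optimizer would search for:

```python
import numpy as np
from itertools import product

# Two well-separated blobs of random 2D points stand in for real data.
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.5, (4, 2)), rng.normal(3, 0.5, (4, 2))])
n = len(points)

# Pairwise distances define the weighted graph we cut.
W = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

def cut_value(s):
    # Edges whose endpoints get opposite spins contribute their weight to the cut.
    return sum(W[i, j] for i in range(n) for j in range(i + 1, n) if s[i] != s[j])

# Brute-force max-cut for this tiny instance; spins +/-1 are the cluster labels.
best = max(product([-1, 1], repeat=n), key=cut_value)
print(best)   # the first four points land in one cluster, the rest in the other
```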
4. Kernel Methods
Let's now move on to kernel methods, which are a class of machine learning algorithms for pattern analysis.
What kernel methods do is a type of nonlinear embedding. This means that we introduce a new way of measuring the distance between high-dimensional data points.
Kernel methods do this by replacing the inner product \((x_i, x_j)\) with a function called a kernel. A kernel retains many properties of the inner product, except it is nonlinear.
The best-known kernel method is the support vector machine (SVM).
Kernel methods were very popular from the mid-90s until about 10 years ago, when they went out of fashion because they are quite shallow.
This means they are not good at automatically extracting features, which you can do with deep learning.
That said, quantum computers can naturally calculate some interesting kernels.
An Interference Circuit
Instead of looking at a quantum algorithm that can speed up machine learning, we're now going to look at the hardware of a quantum computer and the types of calculations it can naturally do.
If we look at the current capabilities of quantum hardware and their imperfections, we can see why we need to develop completely new learning algorithms.
One of the early examples of this was a type of kernel that can be executed on a shallow circuit gate model quantum computer.
In this example, we start with a simple state preparation. Then we apply a Hadamard gate, which makes the encoded amplitudes interfere.
Instead of using a Hamiltonian encoding like we've seen previously, in this example we use the amplitude encoding.
To recap: if we have some vector in a dataset, we normalize it to length one and can then encode its entries in the probability amplitudes of a superposition.
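As a sketch of the kind of kernel such a circuit estimates, here is the squared overlap \(|\langle x_1|x_2\rangle|^2\) between two amplitude-encoded vectors, computed classically; the interference circuit estimates this from measurement statistics rather than computing it directly:

```python
import numpy as np

# With amplitude encoding, the squared overlap |<x1|x2>|^2 of two encoded
# states acts as a similarity measure between the classical vectors.
def quantum_kernel(x1, x2):
    a1 = x1 / np.linalg.norm(x1)
    a2 = x2 / np.linalg.norm(x2)
    return np.abs(a1 @ a2) ** 2

print(quantum_kernel(np.array([3.0, 2.0]), np.array([1.0, 4.0])))  # approx 0.548
```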
If you want to learn more about this subject, check out these papers:
- Unsupervised Machine Learning on a Hybrid Quantum Computer
- Implementing a distance-based classifier with a quantum interference circuit
5. Probabilistic Graphical Models
We know that deep learning excels at supervised learning, and has also been making advances in other more difficult problems.
One class of models that is a more natural fit for quantum hardware is probabilistic graphical models.
As we'll see, probabilistic graphical models can also be trained efficiently with quantum computers.
Markov Networks
Our main focus will be on Markov networks, but to get there we first need to discuss a few concepts.
Discriminative Models
In machine learning a discriminative problem is one in which the task is to estimate the conditional probability distribution.
For example, given a dataset of images, figure out what is in the images. This is where deep learning excels.
Generative Models
Another type of task in machine learning is to learn the probability distribution of the actual data instances.
These are generally harder to solve than discriminative problems, although there are several ways to tackle them with deep learning.
Probabilistic Graphical Models
Probabilistic graph models excel at sparse modeling, or capturing the sparsity structure between random variables.
The two main types of probabilistic graph models include:
Bayesian Networks
In Bayesian networks the underlying graph is directed:
Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor.
Markov Networks
In Markov networks the graph is undirected:
A Markov network or MRF is similar to a Bayesian network in its representation of dependencies; the differences being that Bayesian networks are directed and acyclic, whereas Markov networks are undirected and may be cyclic.
Let's now look at how sparse modeling happens.
Probabilistic Graphical Models
Probabilistic graphical models capture a compact representation of a joint probability distribution.
Complexity is dealt with using graph theory: "the study of graphs, which are mathematical structures used to model pairwise relations between objects."
We are looking for what is called conditional independence.
From this Quantum Machine Learning course:
The definition is that \(X\) is conditionally independent of \(Y\) given \(Z\) \((X\perp Y|Z)\), if \(P(X=x, Y=y|Z=z) = P(X=x|Z=z)P(Y=y|Z=z)\) for all \(x\in X,y\in Y,z\in Z\).
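As a quick numeric check of this definition, here is a toy joint distribution constructed so that \(X\) and \(Y\) are conditionally independent given \(Z\); all the probability tables are made up for illustration:

```python
import numpy as np

# Build P(x, y, z) = P(x|z) P(y|z) P(z), which is conditionally independent
# by construction; all probability tables are made up. Array axes are (x, y, z).
pz = np.array([0.4, 0.6])
px_given_z = np.array([[0.2, 0.7], [0.8, 0.3]])   # columns sum to 1
py_given_z = np.array([[0.5, 0.1], [0.5, 0.9]])

P = px_given_z[:, None, :] * py_given_z[None, :, :] * pz[None, None, :]

# Check P(x, y | z) == P(x | z) P(y | z) for all x, y, z:
p_xy_given_z = P / P.sum(axis=(0, 1), keepdims=True)
factorized = px_given_z[:, None, :] * py_given_z[None, :, :]
print(np.allclose(p_xy_given_z, factorized))   # True
```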
Graphical models are typically generative in that we explicitly model a probability distribution.
Training these models requires massive computational power, and these requirements largely remain out of reach for classical hardware.
The models need to sample a distribution, and often it is the Boltzmann distribution.
Quantum computers are a good fit for this, as they can provide samples from the distribution.
The hope is that quantum hardware will unlock the power of these models, the same way that GPUs enabled deep learning.
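To make the sampling task concrete, here is a sketch that enumerates the exact Boltzmann distribution \(p(s) = e^{-E(s)/T}/Z\) for a tiny Ising-style Markov network; the couplings are illustrative, and exact enumeration like this scales exponentially, which is exactly where quantum samplers could help:

```python
import numpy as np
from itertools import product

# Exact Boltzmann distribution p(s) = exp(-E(s)/T) / Z for a tiny Ising-style
# Markov network; the couplings are illustrative. Enumerating all states to
# get Z scales exponentially, which is the bottleneck quantum samplers target.
J = {(0, 1): -1.0, (1, 2): -1.0}
T = 1.0

def energy(s):
    return sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())

states = list(product([-1, 1], repeat=3))
weights = np.array([np.exp(-energy(s) / T) for s in states])
probs = weights / weights.sum()   # dividing by Z = weights.sum()

rng = np.random.default_rng(0)
samples = rng.choice(len(states), size=5, p=probs)
print([states[i] for i in samples])
```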
If you want to learn more about probabilistic graph models and quantum computing, check out these papers:
- A quantum machine learning algorithm based on generative models
- A probabilistic graphical model of quantum systems
- Quantum-Assisted Learning of Hardware-Embedded Probabilistic Graphical Models
6. Summary of Quantum Learning Algorithms
In this guide we discussed several approaches to using quantum hardware to enhance learning algorithms.
In particular, we looked at:
- Encoding Classical Information into Quantum Systems
- Ensemble Learning & Discrete Optimization
- Variational Methods in Unsupervised Learning
- Kernel Methods
- Probabilistic Graphical Models
We saw that we want to use quantum computers for machine learning problems that are hard to solve classically, one example of which is unsupervised learning.
We also discussed that given the current capabilities of quantum hardware and their imperfections, we often need to develop completely new learning algorithms.
The goal of these new algorithms is to tackle the problems that remain hard to solve classically.
The idea is that quantum hardware will unlock the power of these hard to solve problems, the same way that GPUs have enabled advances in deep learning.
In the next article we'll look at coherent quantum machine learning protocols and estimate their resource requirements.