### How Kalman Filters Work, Part 1 | An Uncommon Lab

[web search]*www.anuncommonlab.com*)

# How Kalman Filters Work, Part 1

## Intro

### Table of Contents

#### Basic Architectures

- Intro
- The Particle Filter
- The Sigma-Point Filter
- Interlude: Expectation and Covariance
- The Extended Kalman Filter
- The Kalman Filter
- Wrap-Up
- Resources

#### Getting Serious >>

#### Useful Little Things >>

Let's suppose you've agreed to a rather odd travel program, where you're going to be suddenly transported to a randomly selected country, and your job is to figure out where you end up. So, here you are in some new country, and all countries are equally likely. You make a list of places and probabilities that you're in those places (all equally likely at about 1/200 for 200 countries).

Place | Probability |
---|---|

Afghanistan | 0.005 |

Albania | 0.005 |

… | |

Mauritius | 0.005 |

Mexico | 0.005 |

… | |

Zambia | 0.005 |

Zimbabwe | 0.005 |

You look around and appear to be in a restaurant. Some countries have more restaurants (per capita/per land area) than others, so you decrease the odds that you're in Algeria or Sudan and increase the odds that you're in Singapore or other high-restaurant-density places.

Place | Prior Prob. | Prob. of Restaurant, given Country | Result |
---|---|---|---|

… | |||

Algeria | 0.005 | 0.01 | 0.00005 |

… | |||

Singapore | 0.005 | 0.3 | 0.0015 |

… | |||

… | |||

Sudan | 0.005 | 0.01 | 0.00005 |

… |

That is, you just multiply the probability that you were in a country with the probability of finding oneself in a restaurant in that country, given that one were already in the country, to obtain the new probability. (The last column isn't actually a probability until it's scaled so that it sums to 1, but only the relative values matter, so it's a probability as far as anyone cares.)

After a few moments, the waitress brings you sushi, so you decrease the odds for Tajikistan and Paraguay and correspondingly increase the odds on Japan, Taiwan, and such places where sushi restaurants are relatively common.

Place | Prior Prob. | Prob. of Sushi given Restaurant, Country | Result |
---|---|---|---|

… | |||

Japan | 0.0015 | 0.3 | 0.00045 |

… | |||

Tajikistan | 0.0005 | 0.005 | 0.0000025 |

… |

You pick up the chopsticks and try the sushi, discovering that it's excellent. Japan is now by far the most likely place, and though it's still possible that you're in the United States, it's not nearly as likely (sadly for the US).

Place | Prior Prob. | Prob. of Good Sushi | Result |
---|---|---|---|

… | |||

Japan | 0.00045 | 0.75 | 0.0003375 |

… | |||

United States | 0.00005 | 0.1 | 0.000005 |

… |

Those “probabilities” are getting really hard to read with all those zeros in front. All that matters is the relatively likelihood, so perhaps you scale that last column by the sum of the whole column. Now it's a probability again, and it looks something like this:

Place | Probability |
---|---|

… | |

Japan | 0.7 |

… | |

United States | 0.01 |

… | |

Taiwan | 0.1 |

… |

Now that you're pretty sure it's Japan, you make a new list of places inside Japan to see if you can continue to narrow it down. You write out Fukuoka, Osaka, Nagoya, Hamamatsu, Tokyo, Sendai, Sapporo, etc., all equally likely (and maybe keep Taiwan too, just in case).

Place | Probability |
---|---|

… | |

Fukuoka | 1/44 |

… | |

Hamamatsu | 1/44 |

… | |

Osaka | 1/44 |

… | |

Taiwan | 1/44 |

… |

Now the waitress brings unagi. You can get unagi anywhere, but it's much more common in Hamamatsu, so you increase the odds on Hamamatsu and slightly decrease the odds everywhere else.

Place | Prior Prob. | Prob. of Unagi | Result |
---|---|---|---|

… | |||

Fukuoka | 1/44 | 0.1 | 0.0022 |

… | |||

Hamamatsu | 1/44 | 0.3 | 0.0068 |

… | |||

Osaka | 1/44 | 0.1 | 0.0022 |

… | |||

Taiwan | 1/44 | 0.05 | 0.0011 |

… |

By continuing in this manner, you may eventually be able to find that you're eating at a delicious restaurant in Hamamatsu Station — a rather lucky random draw.

There's nothing especially odd about the process you've gone through (except that you had such good restaurant data), and everyone would agree that it was pretty reasonable. It was also basically a particle filter, a method for determining unknown things, like location, from measurements of other things, like cuisine, using different probabilities about what you'd observe under different hypotheses. In fact, it's sort of like an automated application of the scientific method, isn't it?

When performed as part of an algorithm, this type of thing is called recursive state estimation. Unfortunately, only a small fraction of mechanical engineers, electrical engineers, and data scientists receive any formal education on the subject, and even fewer develop an intuitive understanding for the process or have any knowledge about practical implementation. While there are very many good books on the math behind it and the details of how to apply it to certain specific problems, this article will take a different approach. We’ll focus on developing:

- an intuition for recursive state estimation,
- a broad knowledge of the strongest and most general types,
- and a good idea of the implementation details.

Once one has a good foundation for these things, it becomes much easier to look up specific algorithms in the filter literature, customize them, adapt them to code, correct problems, and explain performance to others.

Interestingly, the most intuitive forms of recursive estimation are only recently becoming popular, so we’re going to look at their history entirely backwards: starting from the most recent types, like particle filters, and working back into the ancient past (the 1960s) for the breakthrough that enabled the Apollo navigation algorithms to keep a spacecraft on a course to the moon: the Kalman filter.

The target audience is engineers, scientists, makers, and programmers who need to design a system to understand a process, along with the systems engineers and team leads who need a better understanding of what their teams are building. To approach the topic, we'll begin by creating a rough sketch of filter ideas. Then, we'll fill in the sketch with more and more detail. The first part might take a couple of hours, and the second part will take about half that, depending on how deeply one chooese to read. Then there are some appendices for extra implementation notes and exercise with these topics. Anyone can follow the basic ideas here, but a little probability theory and basic linear algebra will help to follow the math and the reference material as we get more detailed. We’ll point out several good resources along the way. With that, let’s get started.

The four filters we'll present here, along with the code that generated all of the images, can be downloaded for MATLAB here.

## The Particle Filter

Particle filters can be pretty easy, and there are about as many forms as there are problems to solve. We'll look at a simple type of particle filter called a bootstrap filter to build an understanding of the basics.

### One Iteration of a Particle Filter

Particles filters work like this: First, we ask, “What things could be happening, and how likely is each thing?” Then, “For each thing that could be happening, what would I expect to observe?” Then, “How likely is each thing, given what I've now actually observed?” We then repeat.

For example, suppose we approximately know the original position of a bouncing ball, but don't know much about the velocity. Let's create a bunch of “hypothesis” balls, randomly dispersed around where we think the real ball is and with random velocities. And let's say that, as far as we know, these are all equally likely candidates for the true ball's position and velocity (its state).

This list of hypothetical balls is like the list of countries.

Each hypothetical ball's state is called a particle, and the likelihood that the particle best represents the true state is the particle weight. So, for each particle, we have a hypothetical ball's position and velocity, as well as a measure of the likelihood, from 0 to 1, that the ball describes the truth, where 1 is complete certainty and 0 represents no possibility at all. E.g., the particle weight for South Georgia Island goes to exactly 0 given that there are no sushi restaurants on South Georgia Island (despite all of the fish that are consumed there).

Now, a moment later, we get a measurement of the ball's position.

How did the ball arrive at this position? Did it go up and slowly arc back down? Did it go down, bounce off the floor, and come back up? We don't know yet. Either path is equally likely.

We also know that the measurement isn't perfect; it has some noise. So, let's take each hypothesis, propagate it forward to the time of the measurement, and see how well it predicts the real measurement. Let's call x̂ i,k−1x^i,k−1 the state (the two position terms and the velocity terms) of the iith particle at the last sample time, and let's call x̂ i,kx^i,k the state at the current sample time. We use hats to denote estimates and kk for the sample. We know how bouncing balls work, so we can propagate from the previous time to the current time using a propagation function, ff.

For the simulation techniques used in these types of propagation functions, see “How Simulations Work”.

This updates the position and velocity for each of the nn particles by starting with the previous position and velocity and calculating the parabola the ball makes due to gravity, when/if it intersects the floor, and the trajectory from the floor back upwards, etc., until it reaches the current time. The propagation function is therefore a tiny simulation.

Given the updated position, we predict what we'd expect to measure for each particle, using a known observation function, hh.

where the hat denotes a predicted measurement, to help us distinguish it from the real measurement, zkzk. For our problem, the observation function simply returns the position of the ball.

Some particles predict the measurement quite well, and others do poorly.

We can take the difference between each ball's new position and the measurement (zk−ẑ i,kzk−z^i,k) and say, “What are the odds of seeing a measurement error this big, given what we know about how noisy the measurements can be?” The really bad predictions have low odds, whereas the good predictions will have high odds.

So, we have a probability for the measurement error for each hypothetical ball. We can multiply each particle's weight by this probability. The result is that particles that best predict the measurement end up with the highest weight. Let's show that by shading the trajectories according to their new weights:

(Note that, after doing this multiply, the sum of the weights won't be 1. We can renormalize by dividing all of the weights by the sum of all of the weights.)

Using these weights, we can calculate the weighted average of all of the hypothetical balls. This might provide the best estimate of the state (there are also many other ways we could use the probabilities to calculate a best estimate, with the weighted average being the most simplistic).

At this point, we have an updated state estimate and associated uncertainty, here represented as a finite collection of particles. We could now wait for another measurement to come in and run through this process once more. We'd propagate each ball to the time of the measurement, calculate the probability of the error between the measurement and the propagated ball, and update that ball's probability. So, let's go through one more cycle.

### Another Iteration, Adding in Resampling and Regularization

Propagate.

Some particles predict the new measurement well, and some no longer predict well. Update the probability of each.

Notice that some of our original hypothetical balls are useless now. There's just no way they represent the truth. Plus, they're all starting to spread out pretty far. Eventually, they'll all be too far away from the truth, and our filter will fall apart. The key that makes the particle filter work is that the set of particles is recalculated every once in a while (or even after each update), just like when we switched from a list of potential worldwide locations to a finer list of locations in Japan. There are a lot of ways to do this, but here's an easy one:

- Create a new set of particles by randomly choosing particles from the current set according to the particle weight. So, in general, particles with a weight of 0.04 will be drawn twice as often as particles with a weight of 0.02.
- Once the new set of particles is created, set all of the weights back to 1/N. The dispersion of the uncertainty is now represented in the presence of different particles, and not by the particle weights. Just to be clear: there will be many duplicated particles at this point.
- Calculate the sample covariance of the new set of particles, or some other measure of their dispersion. The sample covariance is particularly easy:
Pk=1n−1∑in(x̂ i,k−x¯)(x̂ i,k−x¯)T(3)(3)Pk=1n−1∑in(x^i,k−x¯)(x^i,k−x¯)Twhere x¯x¯ is the average of all particles and TT means “transpose”.
- Make a bunch of Gaussian random draws from the sample covariance matrix (see the appendix). Multiply these draws by some small tuning parameter, giving a set of perturbations. Add a perturbation to each particle.

This creates a set of new particles scattered around the high-probability areas of the state space. The first two steps are called resampling, and the final two are called regularizing the particles. Let's do that for the current set of particles:

### Outline of a Bootstrap Filter

We can now continue with the next measurement, and then the next, and the next, keeping with the process we've set up:

To see the (relatively) simple bootstrap filter used in this example, click here or download the sample files here.

- Propagate the particles.
- Determine how likely each particle is given the new measurement.
- Update their weights (and renormalize so that the sum of all of the weights comes to 1).
- Calculate the state estimate, e.g. with a weighted average.
- If resampling the particles this time, randomly select particles according to their weights, and then return the weights to 1/N.
- If regularizing the particles this time, perturb the new set of particles with small random draws.

Going back to the ball example, here's the result converging over time, with the uncertainty clearly clustering around the true state (blue):

After ten seconds:

*Probabilistic Robotics* by Thrun, Burgard, and Fox contains entry-level material for particle filters and sigma-point filters and discusses their use in robot location-finding (a multimodal problem where particle filters have proven useful, even in embedded applications).

### Assumptions, Advantages, and Disadvantages

What's required for all of this to work? Just a couple of simple things:

- You can approximately describe how the states change over time.
- You can approximately determine the likelihood of a hypothetical measurement error.

The advantages of particle filtering are pretty clear: you can coarsely estimate a state with very little theoretical work and for many different types of problems. The uncertainty (set of particles) can even be spread across multiple regions of the state space, representing a multimodal probability distribution. Further, the particle filtering process is easily customizable to different problems (e.g., when and how to resample, how to evaluate the measurement error, how to position particles in the first place). This makes particle filtering flexible and broadly useful.

The disadvantages are also pretty clear:

- Particle filters generally require a large number of particles, which can take substantial runtime. It's not uncommon for even the simplest particle filter to use 1000 particles, which requires 1000 simulations per measurement. The necessary number of particles becomes enormous as the dimension of the state grows. (Our problem would have benefitted substantially from 1000 particles, but it would have been harder to understand the plots with such a large number of particles.)
- Representing the uncertainty as a set of particles and weights — a discrete probability distribution — means that the best estimate of the state is usually pretty coarse, and so particle filters work poorly for problems that require high accuracy.
- When better performance is required, particle filters must often be tailored significantly to fit each individual state estimation problem, and this can take a long time, not least because testing requires one to run the filter, which can itself take a long time. For the same reason, it's hard to find a useful general-purpose particle filter, though the bootstrap filter is fine for simple problems.

**Bonus**: Particle filters are much like genetic algorithms, in that they can be put together quickly and often work well enough, given a long time to run. In fact, a particle filter essentially *is* a genetic algorithm where each particle is an individual in the population, fitness is the degree to which a particle predicts the observation, and the primary form of procreation is mutation.

Particle filters are only gaining popularity relatively recently, owing to the fact that processing power and distributed computing is now so cheap and easy to wield. Further, there are many new techniques for particle filters that can reduce runtime and increase accuracy. However, for applications that need speed, we generally need a different filtering architecture.

## The Sigma-Point Filter

### Uncertainty as a Covariance Matrix

In sigma-point filters (also called unscented filters), we don't represent uncertainty with a big bunch of scattered particles, but instead we assume that the uncertainty has a Gaussian (normal) distribution and is centered on the current best estimate:

We can therefore represent the uncertainty with a covariance matrix, like we calculated for the particles, above. We're visualizing the covariance as an ellipse around the state estimate, where the ellipse is drawn at the 3σ boundary (the true state is therefore inside this ellipse about 99.7% of the time). The 1000 particles are drawn only for the sake of comparison.

The eigenvectors of the covariance matrix are the principle axes of the ellipsoid, and the square roots of the eigenvalues are the semi-axis lengths of the 1σ ellipsoid along the corresponding principle axes. From this, we can construct the relevant ellipses for the sake of visualization.

### Propagation

When a new measurement comes in, we'll need to propagate this uncertainty forward to the time of the measurement. How do we do that? Sigma-point filters do this by first placing a few sigma points around the current estimate.

Sigma points are the same as particles, but we are particular about where we place them, and we don't need to keep track of individual weights. The sigma points will sit on the principle axes of the uncertainty ellipsoid around the current estimate (usually, a much smaller ellipse than the one we've drawn). Note that we're only looking at two dimensions of the problem; there are sigma points in the two velocity dimensions too, as well as in the noise dimensions, which we'll get to more below.

The estimate from the last sample, x̂ +0,k−1x^0,k−1+, and each sigma point, x̂ +i,k−1x^i,k−1+, are then propagated forward to the time of the new measurement.

(The plus sign denotes that the estimate was corrected with the measurements available at the time; we'll cover this more below.)

Just like for particle filters, the sigma points tend to spread out (uncertainty usually grows) during propagation. And, just like for particle filters, we can calculate a new state estimate as a weighted average of the sigma points.

Similar to before, the propagation function is:

but there's something new here: we've added qi,k−1qi,k−1, which represents process noise — unknowns that affect the propagation, like perhaps a random acceleration due to changes in the wind. What value would we use for it? Just like we have sigma points spread around in state space, so too we have sigma points spread around in process noise space. So, if there are 3 separate things that randomly affect the propagation, then qi,k−1qi,k−1 is a 3 dimensional vector. In sigma-point filters, it's assumed that the process noise is Gaussian, so the uncertainty is specified as a covariance matrix, QQ. Even when it's not Gaussian, this is usually a good approximation for any unimodal distribution. So, the sigma points in process-noise-space are just like sigma points in state-space, and taken together are just the big set of sigma points.

Once we've calculated the sigma points and propagated each, we calculate the mean predicted state as a weighted average:

where x̂ −0,kx^0,k− is the “middle” sigma point. All sigma points use the same weight, WW, except for the middle point, which usually uses a larger W0W0 because we trust it more. We now have the best prediction of the state at sample kk, and we marked it with a minus sign to denote “before correction” with the new measurement, or a priori.

We can then calculate the sample covariance around this new estimate, giving the covariance of the predicted state (again, the middle point has a larger weight, WcWc, than the others, but the idea is the same).

We now have an idea of what the state is when the measurement comes in, as well as the uncertainty associated with our prediction.

### Correcting the Estimate

Now of course, we need to use the measurement information to somehow correct the state estimate and covariance. Just like we have a predicted state for each sigma point, x̂ −i,kx^i,k−, and the overall predicted state, x̂ −kx^k−, we make an overall predicted measurement, ẑ kz^k by making a prediction for each sigma point:

(where ri,kri,k is the measurement noise with covariance RR, which is analogous to the process noise) and then calculating the weighted mean:

We'll use this predicted measurement to correct the predicted state like so:

where the plus sign denotes “after measurement correction” or a posteriori. For historical reasons, ykyk is called the innovation vector, measurement prediction error, or residual. It's a vector pointing from our predicted measurement to the actual measurement, and KK determines how a vector in measurement space maps to a correction in state space.

For those who don't know of Aesop's boy who cried, “Wolf”: A boy was watching sheep high atop a cliff overlooking his village. He became bored and yelled down to the village, “Wolf! Wolf!” The villagers came running up the arduous path with pitchforks in hand only to discover the boy laughing and no wolf in sight. A week later, the boy tried the same trick, and some villagers still came. A week later, the boy saw a very real pack of wolves approaching. “Wolf! Wolf!” he cried, but the villagers ignored him. “Wolf!” again, but in vain.

When the boy failed to bring the sheep home that night, the villagers went up the path to search. They were saddened to find their flock had been eaten and the boy had gone missing.

The moral of the story is that people update their internal probabilities by only small amounts when there is new information from a noisy source.

Let's think of some desirable properties for KK. If our knowledge of the state were really good already, and the measurements were so noisy that they weren't that useful, then KK should be small — we probably don't want much correction to the prediction. For example, if someone cries, “Wolf!”, but that guy is always crying wolf, and nobody's seen a wolf around here for ages, then you ignore him, or at least mostly ignore him.

On the other hand, if our knowledge of the state were pretty bad, and the measurements were known to be pretty good (wolves are often seen around here, and the guy who cried wolf is usually right about these types of things), then we'd want to trust the measurements a lot more. In that case, we'd want KK to be big (we might not have predicted that there's a wolf, but there certaintly seems to be one now).

Turning this into math, we represent the relationship between the uncertainty in the state (is there a wolf?) and the measurement error (someone cries wolf erroneously) with a covariance matrix, PxyPxy. We represent the uncertainty of the measurement error (someone cries wolf erroneously) with another covariance matrix, PyyPyy. If we're really sure that there is or isn't a wolf, PxyPxy will be quite small. If there are no pranksters around, PyyPyy will be small too, but no smaller than our fundamental uncertainty in whether there's a wolf around or not. If there are pranksters about, then PyyPyy will be large — lots of error.

We calculate these as weighted sample covariances, similar to what we did for the particle filter.

Switching back to the bouncing ball, our measurement is the ball's position, so we can draw both PxyPxy and PyyPyy for the position error in the same plot:

Using these covariance matrices, let's try a simple ratio: the uncertainty in our prediction (x̂ −kx^k− and ẑ kz^k) “over” the uncertainty in our predicted measurement error (yk=zk−ẑ kyk=zk−z^k), in a matrix sense:

It turns out that this “ratio” is a really good choice. For linear systems, it will produce the estimate with the least squared error in general. It's called the Kalman gain.

Note how, in our case, the ellipsoids for PxyPxy and PyyPyy are pretty close together. This means we have a lot of uncertainty about where the ball will go, and we have a little bit more uncertainty about what we'll measure. Since the Kalman gain is like a ratio, and our “numerator” and “denominator” terms are pretty close together, our gain will be close to the identity matrix for the position part of the state. A corresponding factor will update the velocity part of the state. The actual Kalman gain for this example is:

where the two position terms are the top part of the state, and the two velocity terms are on bottom. This says that the update will be most of the difference (0.8) from the predicted position to the measured position, and correspondingly, velocity will be increased in the same direction (1.2).

We now have the final state estimate for the current measurement:

### Correcting the Covariance

We're not done quite yet. The correction also affects the *uncertainty* in our new estimate, and we haven't updated the sigma points, so we haven't represented how the correction reduces our uncertainty. We can account for this by correcting the sigma points too, in the same way we corrected the state.

Then, we can calculate the new covariance matrix as a sample covariance of the corrected sigma points, P+kPk+.

However, there's no need to literally correct every sigma point, because the definition for the Kalman gain results in a faster way:

where RkRk is the covariance of the measurement error, such as we might have from the specification sheet for our position sensors. We don't work out the proof here, but we can interpret the result pretty well: the Kalman gain determines how much of the measurement noise we can subtract from the propagated covariance matrix. Recall that, if the measurement noise were large, then KK would be small (measurement noise is in the “denominator”), and thus very little of the measurement noise would be subtracted off. If the measurement noise were small, then KK would be larger, and we'd subtract a larger part of the (smaller) measurement noise. The result is that the right amount of RR is removed over time.

That means large Kalman gains subtract more, leaving a small covariance matrix, which reflects more certainty in the estimate.

Here are the results. Notice how the 3σ boundary is smaller than for our starting estimate.

That completes one cycle of the sigma-point filter! When a new measurement comes in, we can do it all over again. Going forward 10 seconds, here are the results:

Look at how the uncertainty in the up-down element gets really small at the top of the bounce, because we know the ball will slow down there, whereas the uncertainty in the left-right element is pretty well constant by the end, since the rightward velocity doesn't change.

To see the sigma-point filter used in this example, click here or download the sample files here.

By the end, the estimate is well-aligned with the truth, despite the really terrible measurements. In fact, the measurements for this example were actually allowed to be *worse* than those used for the particle filter, and yet the accuracy is clearly far *better*, owing to the fact that particles create a coarse representation of the distribution, whereas sigma-point filters are nice and smooth.

### Outline of a Sigma-Point Filter

*Kalman Filtering and Neural Networks* provides great information about the unscented Kalman filter (sigma-point filter) and is frequently cited in the literature. Though the relevant section is short, it includes numerous practical forms, with accessible discussion and very good pseudocode.

- Set up the initial sigma points from the last state estimate and covariance matrix. We'll need a positive and negative sigma point for each dimension of the state, for each dimension of the process noise, and for each dimension of the measurement noise, plus 1 more for the middle sigma point.
- Propagate each of the sigma points forward to the time of the new measurement.
- Create the predicted measurement for each sigma point, including sensor noise.
- Use a weighted sum of the sigma points as the mean predicted state and mean predicted measurement.
- Calculate the sample covariance between the propagated sigma points about the mean state and the predicted measurements about the mean predicted measurement, PxyPxy.
- Calculate the sample covariance of the measurement errors about the mean measurement, PyyPyy.
- Calculate the Kalman gain, KK.
- Correct the state estimate and covariance matrix.

### Assumptions, Advantages, and Disadvantages

When the various sources of uncertainty (previous uncertainty, process noise, and measurement noise) are unimodal and uncorrelated, sigma-point filters are a powerful option. Some advantages:

- Within their assumptions, they are generally more accurate than particle filters, since they don't rely on random particles.
- They're far, far faster than particle filters. Where a particle filter may need 1000 points, a sigma-point filter may need only 9 or so.
- Their assumptions apply to many different realistic problems, and seting up a sigma-point filter requires only defining the propagation function, measurement function, process noise covariance, and measurement noise covariance, all of which are necessary for particle filters too.
- There are standard forms for sigma-point filters, so it's relatively easy to find a good reference in a book or journal.

We could list a few disadvantages though.

- Odd problems can result in a sigma-point filter “falling apart”. For instance, in our ball problem, if the time step were larger, the sigma points would become very “confused” across one or two bounces, and may result in sample covariance matrices that make no sense. It can be difficult to avoid this, whereas particle filters wouldn't have this problem.
- While they're much faster than particle filters, they are also much slower than extended Kalman filters, which we'll get to in a moment.

## Interlude: Expectation and Covariance

We're going to be very informal with probability theory throughout this article in order to focus on the high-level ideas. Here we'll go over just a couple of ideas that are both useful in practice and in understanding the filters in the first place.

Before we move on to the next filter, let's take a moment to review some probability theory that will help out. First is the expectation operator. If aa is some random number, its expected value, E(a)E(a), is its mean, which we might call a¯a¯ for brevity. So, if aa represents rolls of a 6-sided die, then the expected value is 3.5, despite that we can't actually roll a 3.5.

The second important idea is covariance. We looked at *sample* covariance above, where we used a bunch of points to calculate a covariance, but now we'll talk about the *idea* of covariance. The covariance of aa, let's say PaaPaa, is:

When aa is a scalar, there's no “co-” in there; it's just called variance:

If we call a−a¯a−a¯ the error or residual, then the above tells us that the variance is simply the mean squared error of aa about its mean. Further, the standard deviation is the square root of the variance:

and so the standard deviation is also the square root of the mean squared error, sometimes called root-mean-square (RMS) error.

If bb represents rolls from a totally different die, and we're rolling aa and bb together, then what's the covariance between aa and bb, Pab=E((a−a¯)(b−b¯)T)Pab=E((a−a¯)(b−b¯)T)? Sometimes a−3.5a−3.5 would be positive and sometimes negative, and likewise for b−3.5b−3.5. Their product would be sometimes positive and sometimes negative, and averaging a bunch of these would come to 0, because the rolls are uncorrelated. That is, if aa and bb are uncorrelated, their covariance is 0.

Nothing much is different when we upgrade from a random number to a random vector. Let's consider a two-element vector c=[ab]Tc=[ab]T, where aa and bb are the same dice rolls as above.

When the elements of cc are correlated, some of those off-diagonal terms will be non-zero. For instance, let's reassign