Introducing Albatross: Data Science in C++

Over the past couple of years at Swift Navigation, we've been developing an open source project for statistical modeling in C++ called albatross. We use albatross in several of the core models behind our Global Navigation Satellite System (GNSS) correction services: things such as real-time estimates of the ionosphere, troposphere, and satellite orbits.

Fear not: the goal of this post is not to bore you with equations or software details (for those, see the documentation). In fact, we won't get into anything related to GNSS, though we couldn't help but include some elements of positioning. The problem will involve a hypothetical treasure hunt. But first, a brief comment on why we started building albatross in the first place.

Why Albatross?

First, let's start with a bigger question: why write anything from scratch when you can use external packages? Opinions may vary, but some reasonable answers are:

  • The external packages don't quite do what you want.
  • You think you can do it better.
  • You don't want any dependencies.

In our situation it's mostly the first: we want to build statistical models in a way that accommodates the research phase of development (rapid model iteration, evaluation, comparison, and tuning) but also runs fast in a production environment. So while just about everything in albatross could also be done with Python packages such as pandas, scikit-learn, emcee, and george, using them directly just wasn't practical. Instead we started developing albatross, which draws heavily on paradigms we liked from those packages but with an emphasis on compile-time safety and speed. In short, you could say albatross is:

"A package containing some of the stats modeling tools from python that are missing in C++"

Temperature Data

We decided the best way to introduce albatross was through an example, and since the package was built with geospatial (ionosphere) modeling in mind, we chose to work with temperature data.

The data we'll use comes from the Global Summary of the Day (GSOD) dataset, which contains measurements from hundreds of locations throughout the U.S. Here is an example of the GSOD temperature measurements for a single day:

We can then build an albatross model which interpolates the temperatures. Here is what our model thinks the temperature is for that same day:

We can even ask the model how confident it is:

Here, blue indicates regions where the model is confident, while red indicates larger uncertainty. You may notice that, as expected, the model becomes less confident the farther it gets from a temperature station.

If you're curious, you can find more details on how we built this model in the albatross examples.
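If you're wondering what the fit and predict steps actually compute: under the hood, a model like this is a Gaussian process. Below is a minimal, standalone sketch of Gaussian process interpolation written directly against Eigen (the linear algebra library albatross builds on). To be clear, this is not the albatross API, and the stations, kernel, and parameter values are made-up placeholders:

```cpp
#include <Eigen/Dense>
#include <cmath>
#include <iostream>
#include <vector>

struct Station {
  double lon;
  double lat;
};

// Squared exponential covariance between two locations. The variance and
// length scale are made-up values, not the tuned parameters from the example.
double covariance(const Station &a, const Station &b) {
  const double sigma = 5.0;         // prior standard deviation (deg F)
  const double length_scale = 3.0;  // correlation length (degrees lon/lat)
  const double dx = a.lon - b.lon;
  const double dy = a.lat - b.lat;
  return sigma * sigma *
         std::exp(-(dx * dx + dy * dy) / (2.0 * length_scale * length_scale));
}

int main() {
  // A few hypothetical stations with temperature measurements (deg F).
  const std::vector<Station> stations = {
      {-88.0, 41.9}, {-90.2, 38.6}, {-87.9, 43.0}};
  Eigen::VectorXd temps(3);
  temps << 68.0, 74.0, 61.0;

  // Model the temperatures as a constant mean plus a zero-mean GP.
  const double prior_mean = temps.mean();
  const Eigen::VectorXd centered = (temps.array() - prior_mean).matrix();

  // Covariance between all pairs of stations, plus measurement noise.
  const double meas_noise = 1.0;
  const int n = static_cast<int>(stations.size());
  Eigen::MatrixXd K(n, n);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) K(i, j) = covariance(stations[i], stations[j]);
  K.diagonal().array() += meas_noise * meas_noise;

  // Posterior mean and variance at a new location.
  const Station query{-89.0, 40.0};
  Eigen::VectorXd k_star(n);
  for (int i = 0; i < n; ++i) k_star(i) = covariance(query, stations[i]);

  const Eigen::LLT<Eigen::MatrixXd> llt(K);
  const Eigen::VectorXd alpha = llt.solve(centered);
  const Eigen::VectorXd v = llt.solve(k_star);
  const double mean = prior_mean + k_star.dot(alpha);
  const double variance = covariance(query, query) - k_star.dot(v);

  // The posterior variance is what lets the model report its own confidence.
  std::cout << "predicted: " << mean << " +/- " << std::sqrt(variance)
            << " deg F\n";
}
```

The advantage of a library is that the covariance function becomes a composable, reusable piece rather than a hand-rolled loop like the one above.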

As a sanity check, we can compare our model to actual temperature measurements. To do so without cheating, we use a technique called cross-validation, in which we hold a station out, fit a model without using that station, and then compare the result to the truth. Here's the out-of-sample prediction for DuPage Airport in Illinois:

The blue prediction distribution shows how likely the model thinks a particular temperature measurement would be at our held-out location. For this example, the model thinks 70 degrees is most likely, but 60 or 80 degrees would not be surprising. The red line indicates the actual temperature measurement recorded at that station. In this particular case you can see there is strong agreement, but if we did this for all locations we should expect to see more variation.
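To make the hold-one-out procedure concrete, here is a minimal sketch of the loop with hypothetical station values. A trivial average-of-the-remaining-stations predictor stands in for the actual temperature model; the structure of the loop is the point:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
  // Hypothetical temperatures at five stations (deg F).
  const std::vector<double> temps = {68.0, 74.0, 61.0, 70.0, 66.0};

  for (std::size_t held_out = 0; held_out < temps.size(); ++held_out) {
    // "Fit" on every station except the held-out one. Here the fit is just
    // an average; in the real example it's the full spatial model.
    double sum = 0.0;
    for (std::size_t i = 0; i < temps.size(); ++i)
      if (i != held_out) sum += temps[i];
    const double prediction = sum / static_cast<double>(temps.size() - 1);

    // Compare the out-of-sample prediction against the held-out truth.
    std::cout << "station " << held_out << ": predicted " << prediction
              << ", actual " << temps[held_out] << ", residual "
              << (temps[held_out] - prediction) << "\n";
  }
}
```

Running this for every station yields a set of honest out-of-sample residuals, which is the basis of comparisons like the one above.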

Treasure Hunt

Now for the fun part. What if you were given a treasure map, but instead of the location being marked with an X, you were provided with a year of temperature measurements at the unknown location? Could you find the treasure?

Here is the alternative "map", the temperature for every day of 2019:

Our approach can be broken into a few parts. First, we use the temperature model from the previous section to fit a model for each day of the year. Next, we guess the unknown location and compare the model's predictions for the full year against the temperatures on the "map". By iteratively improving our guess, we can then home in on the treasure.

In the previous section we convinced ourselves that the model works for a single day's temperatures, so we can now fit a separate model for each day of the year and use the collection to predict an entire year of temperature.
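Structurally, a year of models is just a container of fitted models indexed by day. The sketch below uses a deliberately toy per-day "model" (a constant, with a made-up seasonal cycle as stand-in data) in place of the spatial models above; every name and number in it is hypothetical:

```cpp
#include <cmath>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a fitted per-day model. A real version would
// hold a fitted spatial model like the one sketched earlier; this one just
// returns the day's network-wide average regardless of location.
struct FitDayModel {
  double mean_temp;
  double predict(double /*lon*/, double /*lat*/) const { return mean_temp; }
};

int main() {
  constexpr double kPi = 3.14159265358979323846;

  // Fit one model per day of the year (toy seasonal cycle as stand-in data).
  std::vector<FitDayModel> models;
  models.reserve(365);
  for (int day = 0; day < 365; ++day) {
    const double daily_average =
        50.0 + 25.0 * std::sin(2.0 * kPi * day / 365.0);
    models.push_back(FitDayModel{daily_average});
  }

  // Predict an entire year of temperature at a proposed location.
  const double lon = -89.0;
  const double lat = 40.0;
  for (int day = 0; day < 365; day += 91)
    std::cout << "day " << day << ": " << models[day].predict(lon, lat)
              << " deg F\n";
}
```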

At this point, we're able to propose a location, ask our model what the temperature would have been at that guess for each day, compare with the "map", and come up with the likelihood that the location holds the treasure.
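Since each prediction comes with both a mean and a variance, a natural way to score a proposed location is to sum the Gaussian log densities of the "map" temperatures under the per-day predictions, treating the days as independent. That independence is a simplifying assumption on our part, as are all the numbers below:

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

// Log density of x under a normal distribution with the given mean/variance.
double gaussian_log_pdf(double x, double mean, double variance) {
  constexpr double kLog2Pi = 1.8378770664093453;  // log(2 * pi)
  const double r = x - mean;
  return -0.5 * (kLog2Pi + std::log(variance) + r * r / variance);
}

int main() {
  // Hypothetical per-day predictions at a proposed location, and the
  // corresponding "map" temperatures, for a handful of days.
  const std::vector<double> predicted_mean = {31.0, 55.2, 78.9, 54.1};
  const std::vector<double> predicted_var = {9.0, 4.0, 6.3, 5.5};
  const std::vector<double> map_temps = {33.5, 54.0, 80.1, 50.0};

  // Treating the days as independent, the log likelihoods simply sum.
  double log_likelihood = 0.0;
  for (std::size_t day = 0; day < map_temps.size(); ++day)
    log_likelihood += gaussian_log_pdf(map_temps[day], predicted_mean[day],
                                       predicted_var[day]);

  std::cout << "log likelihood of this location: " << log_likelihood << "\n";
}
```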

We can then propose a large number of locations and iteratively refine our guesses, resulting in a distribution of possible locations for the treasure.
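One of the simplest ways to implement that proposal loop is a random-walk Metropolis sampler: propose a nearby location, then keep it with a probability tied to how much it improves the likelihood. Here is a single-chain sketch (the run shown below evolves many guesses at once, in the spirit of ensemble samplers like emcee); the log_likelihood function is a hypothetical stand-in for the per-day comparison above:

```cpp
#include <cmath>
#include <iostream>
#include <random>

// Hypothetical stand-in for the treasure-map log likelihood: peaked near
// (-88.9, 41.3), as if that were the location whose climate best matches
// the "map". A real version would call the per-day scoring sketched above.
double log_likelihood(double lon, double lat) {
  const double dlon = lon - (-88.9);
  const double dlat = lat - 41.3;
  return -0.5 * (dlon * dlon + dlat * dlat) / 0.25;
}

int main() {
  std::mt19937 rng(0);
  std::normal_distribution<double> step(0.0, 0.5);  // proposal width (deg)
  std::uniform_real_distribution<double> unif(0.0, 1.0);

  // Initialize the guess somewhere in the Midwest (the real run started
  // many guesses at once).
  double lon = -93.0;
  double lat = 44.0;
  double ll = log_likelihood(lon, lat);

  for (int iter = 0; iter < 5000; ++iter) {
    // Propose a nearby location and evaluate it.
    const double new_lon = lon + step(rng);
    const double new_lat = lat + step(rng);
    const double new_ll = log_likelihood(new_lon, new_lat);

    // Metropolis rule: accept with probability min(1, exp(new_ll - ll)).
    if (std::log(unif(rng)) < new_ll - ll) {
      lon = new_lon;
      lat = new_lat;
      ll = new_ll;
    }
  }
  std::cout << "final sample near: (" << lon << ", " << lat << ")\n";
}
```

That process (called sampling, or inference) looks like this: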

You can see that we initialized the sampler with guesses spread across the Midwest. As the process evolves, our guesses begin to cluster in one location. Eventually the sampler reaches a steady state, at which point we can produce a heat map showing where we think the treasure is located:

Not quite centimeter-level accuracy, but not bad!