From GPU to RNA: Oxford Nanopore’s Mission to Master the Genome

Authored by

Derrick Harris, Technology Storyteller

While much of the AI and machine learning world spent the past few years embracing “the bitter lesson,” Oxford Nanopore was focused on optimization. After all, the company’s custom hardware for genome sequencing can only fit so many GPUs. And although there’s technically no shortage of DNA data hanging around, its homemade AI models require very high-quality training data if they’re going to produce high-quality results across a wide range of lifeforms.

In this episode of the Shared Everything podcast, Oxford Nanopore VP of Machine Learning Mike Vella discusses the company’s unusual approach to doing AI in a fast-moving field. From hand-programming CUDA kernels to dealing with the gigabyte per second of data coming off its sequencers, Vella explains how his team balances the promise of fast, inexpensive sequencing with the realities of edge computing.

Overcoming hardware constraints through optimization

We do a lot of research into machine learning and deep learning models that are amenable to optimization: machine learning architectures, factorizing layers, and that kind of thing, so we can make them fast.

And another thing we do, which is relatively unusual, is once we settle on a model, once we’re happy with an architecture, we will typically implement it by hand in CUDA. We will actually write our own implementations for the specific GPUs we’re targeting. Most people don’t care about a 10% or 20% speedup, because if you want 20% more throughput, you just buy 20% more GPUs. Problem solved. We can’t really do that.

We operate within a power budget and even space constraints. So for us, 20% is meaningful. We do a lot of optimization by hand.
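To make the idea of factorizing layers concrete, here is a minimal sketch of one common form, low-rank factorization, in which a single dense weight matrix is replaced by two thinner ones. The interview does not say which factorization Oxford Nanopore uses, so the function names, the layer sizes, and the rank below are all illustrative assumptions.

```python
# Illustrative sketch only: low-rank factorization of one dense layer.
# All names and sizes are hypothetical, not taken from Oxford Nanopore's models.

def dense_cost(d_in, d_out):
    """Multiply-accumulates (and weights) for one dense layer, per input vector."""
    return d_in * d_out


def factorized_cost(d_in, d_out, rank):
    """Cost when the d_in x d_out matrix is replaced by (d_in x rank) @ (rank x d_out)."""
    return d_in * rank + rank * d_out


def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_layer(x, w):
    """Apply a bias-free linear layer: row vector x times weight matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


if __name__ == "__main__":
    # Hypothetical 512 -> 512 layer factorized through a rank-64 bottleneck.
    full = dense_cost(512, 512)
    slim = factorized_cost(512, 512, 64)
    print(full, slim, full / slim)

    # Applying the two thin factors in sequence matches applying their product.
    u = [[1], [2], [3]]   # 3 x 1
    v = [[4, 5, 6]]       # 1 x 3
    x = [1, 0, 2]
    print(apply_layer(apply_layer(x, u), v) == apply_layer(x, matmul(u, v)))
```

With these assumed sizes the factorized layer needs a quarter of the multiply-accumulates. The trade-off is that a low-rank product can only approximate a general weight matrix, so a scheme like this has to be checked against model accuracy as well as speed.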

Achieving data diversity in a world awash with DNA

We’re always looking for new sources of data. Especially if you take things like bacteria, where there’s enormous diversity, the hardest part is not sourcing DNA data. That’s easy because DNA is all around you. You could go to your local pond, take a glass of pond water, and there’s enough DNA in there to train GPT-size models.

The problem isn’t the data quantity. The problem is actually high-quality labels for that data. I might sequence some E. coli. That’s great. How do I get a high-quality reference sequence? How do I know exactly what that is?

Understanding proteins is understanding life

AlphaFold was such a big breakthrough because you could get the protein sequence and you could say, “Given the sequence, here’s what it looks like, and I think it’s going to bind to this thing or it’s going to behave in this particular way.” Protein sequencing is determining what that sequence is in the first place.

So I might take a sample of blood or something, and I want to know what are the actual proteins inside it and if I can get a machine to actually work out what those proteins are. There are ways of doing that right now, but they’re not very good and they’re kind of indirect. We’re working on repurposing our technology to use it for actually working out what those sequences are.

The benefits to understanding what proteins are present in a sample are potentially huge. A particular disease that a person is expressing, which you’re trying to understand, will often be explainable through the proteins that are present in their blood, for example. Understanding proteins is understanding life, to some extent.

Working with VAST

Some time ago, we found that a lot of our machine learning training was being impacted by data-access issues. Our GPUs were being starved. Our GPU model training was being impacted because GPUs were just waiting on data to load from the file system. So we said, “OK, we need faster storage.”

We evaluated a bunch of different providers, and VAST really stood out to us for the maturity of the platform. And we were really thrilled and very impressed by the level of support we received.

Now, as we’re in the life sciences and healthcare space, there are a lot of things that we’re starting to get concerned with, like auditability and traceability. Regulations are starting to get a lot more strict around how you train machine learning models that can be used in healthcare. And, of course, VAST has a lot of offerings around auditability, access control, and that kind of thing. We think it is going to make our life a lot easier in that respect.
