Addressing the Thin Data Problem in National Security

Published 5 May 2021

When a data set is too small to be used to make a decision, the solution is usually obvious: Get more data! That’s the cry of analysts everywhere, whether the need is to confirm the safety of a vaccine or to pinpoint an annoying knock in a car’s engine.

But one team of artificial intelligence (AI) researchers is moving in another direction. Instead of just fighting for more data, the team at Pacific Northwest National Laboratory (PNNL) accepts the shortcoming and develops new ways around the problem. The approach is paying off, leading to faster, more accurate conclusions.

The alternative approach is a necessity in national security, said Angie Sheffield, a senior program manager with the U.S. Department of Energy’s National Nuclear Security Administration (NNSA).

“In national security, oftentimes there is not better data. There is not more data. We need new techniques to understand the data we do have, to extract more meaning from the information already in hand,” said Sheffield, who manages the data science portfolio in NNSA’s Office of Defense Nuclear Nonproliferation Research and Development, also known as DNN R&D.

Such techniques are being developed by researchers like Tom Grimes, a PNNL scientist working on the project along with colleagues Luke Erickson and Kate Gibb, with support from Sheffield’s office.

“If you have the opportunity to get more data, adding more to the analysis is almost always a good step, of course,” said Grimes. “For instance, doubling the size of a data set from 100,000 to 200,000 images generally makes a large impact on how well your network is able to separate signal from background. But when data are in short supply, sometimes you need to try another approach.”

The national security arena is one where data are not always abundant. That’s especially true for DNN R&D, which drives research and development of new capabilities to improve the nation’s ability to detect and monitor nuclear material production and movement, weapons development, and nuclear detonations across the globe.

For example, some isotopes are adrift only when an extremely rare nuclear explosion occurs. Other signals might occur only when exceedingly uncommon materials processing steps take place. There isn’t much signal of interest—and certainly, no one wants more of a signal that signifies danger—and there is an abundance of background signals that are decidedly uninteresting. These factors can throw off results.