Identifying Types of Cyberattacks That Manipulate Behavior of AI Systems

Published 6 January 2024

AI systems can malfunction when exposed to untrustworthy data, and attackers are exploiting this weakness through techniques known collectively as “adversarial machine learning.” New guidance documents these types of attacks, along with approaches to mitigate them. No foolproof method yet exists for protecting AI from misdirection, and AI developers and users should be wary of anyone who claims otherwise.

Adversaries can deliberately confuse or even “poison” artificial intelligence (AI) systems to make them malfunction — and there’s no foolproof defense that their developers can employ. Computer scientists from the National Institute of Standards and Technology (NIST) and their collaborators identify these and other vulnerabilities of AI and machine learning (ML) in a new publication.

Their work, titled Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST.AI.100-2), is part of NIST’s broader effort to support the development of trustworthy AI, and it can help put NIST’s AI Risk Management Framework into practice. The publication, a collaboration among government, academia and industry, is intended to help AI developers and users get a handle on the types of attacks they might expect along with approaches to mitigate them — with the understanding that there is no silver bullet.

“We are providing an overview of attack techniques and methodologies that consider all types of AI systems,” said NIST computer scientist Apostol Vassilev, one of the publication’s authors. “We also describe current mitigation strategies reported in the literature, but these available defenses currently lack robust assurances that they fully mitigate the risks. We are encouraging the community to come up with better defenses.”   

AI systems have permeated modern society, working in capacities ranging from driving vehicles to helping doctors diagnose illnesses to interacting with customers as online chatbots. To learn to perform these tasks, they are trained on vast quantities of data: An autonomous vehicle might be shown images of highways and streets with road signs, for example, while a chatbot based on a large language model (LLM) might be exposed to records of online conversations. This data helps the AI predict how to respond in a given situation. 
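As a rough illustration of that training step, the sketch below fits a classifier to labeled examples and then uses it to predict labels for inputs it has not seen. The synthetic dataset and the scikit-learn model are stand-ins chosen for brevity; they are illustrative assumptions, not anything described in the NIST publication.

```python
# Minimal sketch of supervised training: a model fits labeled examples,
# then predicts labels for new inputs. Dataset and model are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a large labeled training set
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Accuracy on clean held-out data: {model.score(X_test, y_test):.3f}")
```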

One major issue is that the data itself may not be trustworthy. Its sources may be websites and interactions with the public. There are many opportunities for bad actors to corrupt this data — both during an AI system’s training period and afterward, while the AI continues to refine its behaviors by interacting with the physical world. This can cause the AI to perform in an undesirable manner. Chatbots, for example, might learn to respond with abusive or racist language when their guardrails get circumvented by carefully crafted malicious prompts.
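One of the simplest ways to picture such corruption is label flipping, where an attacker alters the labels on part of the training set. The toy sketch below shows that idea on synthetic data; it is a hedged illustration under assumed conditions, not one of the attacks described in NIST.AI.100-2, which catalogues far more varied poisoning and evasion techniques.

```python
# Hedged sketch of a simple poisoning attack (label flipping): an attacker
# who can corrupt part of the training data can degrade the trained model.
# The dataset, model and 30% corruption rate are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def train_and_score(labels):
    """Train on (possibly corrupted) labels, evaluate on clean test data."""
    model = LogisticRegression(max_iter=1000).fit(X_train, labels)
    return model.score(X_test, y_test)

# Attacker flips the labels of 30% of the training examples
rng = np.random.default_rng(0)
poisoned = y_train.copy()
idx = rng.choice(len(poisoned), size=int(0.3 * len(poisoned)), replace=False)
poisoned[idx] = 1 - poisoned[idx]

print(f"Clean training data:    {train_and_score(y_train):.3f}")
print(f"Poisoned training data: {train_and_score(poisoned):.3f}")
```

The flipped labels pull the learned decision boundary away from the one the clean data would produce, which is the same basic failure mode the publication describes at much larger scale.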