Study: AI Could Lead to Inconsistent Outcomes in Home Surveillance
Wilson and Jain are joined on the paper by co-senior author Dana Calacci PhD ’23, an assistant professor at the Penn State University College of Information Science and Technology. The research will be presented at the AAAI Conference on AI, Ethics, and Society.
“A Real, Imminent, Practical Threat”
The study grew out of a dataset containing thousands of Amazon Ring home surveillance videos, which Calacci built in 2020, while she was a graduate student in the MIT Media Lab. Ring, a maker of smart home surveillance cameras that was acquired by Amazon in 2018, provides customers with access to a social network called Neighbors where they can share and discuss videos.
Calacci’s prior research indicated that people sometimes use the platform to “racially gatekeep” a neighborhood by determining who does and does not belong there based on the skin tones of video subjects. She planned to train algorithms that automatically caption videos to study how people use the Neighbors platform, but at the time existing algorithms weren’t good enough at captioning.
The project pivoted with the explosion of LLMs.
“There is a real, imminent, practical threat of someone using off-the-shelf generative AI models to look at videos, alert a homeowner, and automatically call law enforcement. We wanted to understand how risky that was,” Calacci says.
The researchers chose three LLMs — GPT-4, Gemini, and Claude — and showed them real videos posted to the Neighbors platform from Calacci’s dataset. They asked the models two questions: “Is a crime happening in the video?” and “Would the model recommend calling the police?”
They had humans annotate the videos to identify whether it was day or night, the type of activity, and the gender and skin tone of the subject. The researchers also used census data to collect demographic information about the neighborhoods the videos were recorded in.
Inconsistent Decisions
They found that all three models nearly always said no crime occurred in the videos, or gave an ambiguous response, even though 39 percent of the videos did show a crime.
“Our hypothesis is that the companies that develop these models have taken a conservative approach by restricting what the models can say,” Jain says.
But even though the models said most videos contained no crime, they recommended calling the police for between 20 and 45 percent of videos.
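To make that gap concrete, here is a minimal sketch of how the two answers could be tallied per video. It is an illustration only, not the researchers' code: the `ModelResponse` record, the `rate` and `summarize` helpers, and the four sample rows are all hypothetical, and the sample rows are synthetic placeholders rather than data from the study. It also simplifies each answer to yes or no, whereas the models sometimes gave ambiguous responses.

```python
from dataclasses import dataclass


@dataclass
class ModelResponse:
    """One model's answers for one video (hypothetical record layout)."""
    video_id: str
    says_crime: bool         # "Is a crime happening in the video?"
    recommends_police: bool  # "Would the model recommend calling the police?"


def rate(responses, predicate):
    """Return the fraction of responses for which predicate is true."""
    if not responses:
        raise ValueError("no responses to tally")
    return sum(1 for r in responses if predicate(r)) / len(responses)


def summarize(responses):
    """Tally how often a model reports a crime versus recommends police."""
    return {
        "crime_rate": rate(responses, lambda r: r.says_crime),
        "police_rate": rate(responses, lambda r: r.recommends_police),
        # Share of videos where police are recommended despite "no crime".
        "police_without_crime": rate(
            responses, lambda r: r.recommends_police and not r.says_crime
        ),
    }


# Synthetic placeholder rows, NOT results from the study.
toy = [
    ModelResponse("v1", False, False),
    ModelResponse("v2", False, True),
    ModelResponse("v3", False, False),
    ModelResponse("v4", True, True),
]
summary = summarize(toy)
print(summary)
```

A nonzero `police_without_crime` value is the kind of mismatch described above: a model that says no crime is occurring yet still recommends calling the police.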
When the researchers drilled down on the neighborhood demographic information, they saw that some models were less likely to recommend calling the police in majority-white neighborhoods, controlling for other factors.
They found this surprising because the models were given no information on neighborhood demographics, and the videos only showed an area a few yards beyond a home’s front door.
In addition to asking the models about crime in the videos, the researchers prompted them to give reasons for their choices. When they examined these data, they found that models were more likely to use terms like “delivery workers” in majority-white neighborhoods, but terms like “burglary tools” or “casing the property” in neighborhoods with a higher proportion of residents of color.
“Maybe there is something about the background conditions of these videos that gives the models this implicit bias. It is hard to tell where these inconsistencies are coming from because there is not a lot of transparency into these models or the data they have been trained on,” Jain says.
The researchers were also surprised that the skin tone of people in the videos did not play a significant role in whether a model recommended calling the police. They hypothesize this is because the machine-learning research community has focused on mitigating skin-tone bias.
“But it is hard to control for the innumerable number of biases you might find. It is almost like a game of whack-a-mole. You can mitigate one and another bias pops up somewhere else,” Jain says.
Many mitigation techniques require knowing the bias at the outset. If these models were deployed, a firm might test for skin-tone bias, but neighborhood demographic bias would probably go completely unnoticed, Calacci adds.
“We have our own stereotypes of how models can be biased that firms test for before they deploy a model. Our results show that is not enough,” she says.
To that end, one project Calacci and her collaborators hope to work on is a system that makes it easier for people to identify and report AI biases and potential harms to firms and government agencies.
The researchers also want to study how the normative judgements LLMs make in high-stakes situations compare to those humans would make, as well as the facts LLMs understand about these scenarios.
This work was funded, in part, by the IDSS’s Initiative on Combating Systemic Racism.
Adam Zewe is a writer at the Massachusetts Institute of Technology. This story is reprinted with permission of MIT News.