New surveillance software knows -- and comments on -- what a camera sees

Published 1 June 2010

Newly developed software offers a running commentary on CCTV images to ease video searching and analysis; the system might help address the fact that there are more and more surveillance cameras — on the streets and in military equipment, for instance — while the number of people working with them remains about the same

Good news for those in charge of surveillance cameras who have to sift through a great deal of irrelevant information to find the relevant nuggets. A prototype computer vision system can generate a live text description of what is happening in a feed from a surveillance camera. Tom Simonite writes in Technology Review that the system is not yet ready for commercial use, but it demonstrates how software could make it easier to skim or search through video or image collections. It was developed by researchers at the University of California, Los Angeles, in collaboration with ObjectVideo of Reston, Virginia.

“You can see from the existence of YouTube and all the other growing sources of video around us that being able to search video is a major problem,” says Song-Chun Zhu, lead researcher and professor of statistics and computer science at UCLA.

“Almost all search for images or video is still done using the surrounding text,” he says. Zhu and UCLA colleagues Benjamin Yao and Haifeng Gong developed a new system, called I2T (Image to Text), which is intended to change that.

It puts a series of computer vision algorithms into a system that takes images or video frames as input, and spits out summaries of what they depict. “That can be searched using simple text search, so it’s very human-friendly,” says Zhu.

The team applied the software to surveillance footage in collaboration with Mun Wai Lee of ObjectVideo to demonstrate the strength of I2T. Systems like it might help address the fact that there are more and more surveillance cameras — on the streets and in military equipment, for instance — while the number of people working with them remains about the same, says Zhu.

The first part of I2T is an image parser that decomposes an image — separating the background from foreground objects such as vehicles, trees, and people.

Simonite notes that some objects can be broken down further; for example, the limbs of a person or wheels of a car can be separated from the object they belong to.
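To illustrate the idea behind this parse-and-summarize approach — not the actual I2T code, whose internals are not described here — the decomposition can be thought of as a tree of objects and parts that is flattened into plain text, which an ordinary text search can then index. All the labels and the `describe` helper below are hypothetical:

```python
# Illustrative sketch only: a scene parse represented as a tree of
# objects and parts, flattened into a searchable text summary.

def describe(node, depth=0):
    """Recursively render a parse-tree node as indented text lines."""
    lines = [("  " * depth) + node["label"]]
    for part in node.get("parts", []):
        lines.extend(describe(part, depth + 1))
    return lines

# A toy parse of one video frame: background plus two foreground
# objects, with the car and the person broken down into their parts,
# as the article describes (wheels, limbs).
frame = {
    "label": "frame",
    "parts": [
        {"label": "background: road, trees"},
        {"label": "car", "parts": [{"label": "wheels"}]},
        {"label": "person", "parts": [{"label": "limbs"}]},
    ],
}

summary = "\n".join(describe(frame))
print(summary)

# A simple text query like "car" now matches this frame directly.
print("car" in summary)  # → True
```

The point of the sketch is only that once a frame has been reduced to labeled parts, finding footage becomes a text-search problem rather than a computer-vision one — which is what Zhu means by “very human-friendly.”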

Next, the meaning of that collection of shapes is determined. “This knowledge representation step is the most important part of the system,” says Zhu, explaining that this knowledge comes from human smarts. In 2005, Zhu established the nonprofit Lotus Hill Institute in Ezhou, China, and, with some support from the Chinese government, recruited about