New surveillance software knows — and comments on — what a camera sees

The Lotus Hill Institute employs twenty graduates of local art colleges to work full-time annotating a library of images to aid computer vision systems. The result is a database of more than two million images containing objects that have been identified and classified into more than 500 categories.

To ensure that workers annotate images in a standard way, software guides them as they work. It uses versions of the algorithms that will eventually benefit from the final data to pick out the key objects for a person to classify, and it suggests how they might be classified based on previous data. The objects inside images are classified into a hierarchy of categories based on Princeton’s WordNet database, which organizes English words into groups according to their meanings. “Once you have the image parsed using that system that also includes the meaning, transcription into the natural language is not too hard,” says Zhu, who makes some of the data available for free to other researchers. “It is high-quality data and we hope that more people are going to use this,” he says.
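The article doesn't show the institute's tooling, but the idea of attaching a chain of increasingly general categories to each detected object can be sketched with NLTK's WordNet interface; the helper below and the choice of the first word sense are illustrative assumptions, not Zhu's pipeline:

```python
# A minimal sketch of hierarchy-aware labeling, assuming NLTK's WordNet
# corpus as a stand-in for the Lotus Hill Institute's own taxonomy.
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def hypernym_chain(label: str) -> list[str]:
    """Return the chain of increasingly general categories for a label."""
    synsets = wn.synsets(label, pos=wn.NOUN)
    if not synsets:
        return [label]  # unknown label: no hierarchy available
    chain = []
    synset = synsets[0]  # assumption: take the most common sense
    while True:
        chain.append(synset.lemma_names()[0])
        hypernyms = synset.hypernyms()
        if not hypernyms:
            break
        synset = hypernyms[0]
    return chain

# e.g. ['boat', 'vessel', 'craft', ..., 'entity']
print(hypernym_chain("boat"))
```

Because every label sits on such a chain, a parse that identifies a "boat" also implicitly carries the more general meanings ("vessel", "craft"), which is what makes the later transcription into natural language straightforward.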

The video-processing system also uses algorithms that can describe the movement of objects in successive frames. It generates sentences like “boat1 follows boat2 between 35:56 and 37:23” or “boat3 approaches maritime marker at 40:01.” “Sometimes it can do a match on an object that has left and reentered a scene,” says Zhu, “and say, for example, this is probably a certain car again.” It is also possible to define virtual “trip wires” to help it describe certain events, like a car running a stop sign.
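The article doesn't give implementation details, but event sentences of the kind quoted above can be sketched from per-frame object tracks. The track IDs, the "follows" heuristic, and the trip-wire test below are all illustrative assumptions, not I2T's actual algorithms:

```python
# A hedged sketch of turning object tracks into event sentences,
# in the spirit of the examples quoted above.
from dataclasses import dataclass

@dataclass
class Observation:
    frame: int   # frame index
    x: float     # object centre, image coordinates
    y: float

def timestamp(frame: int, fps: float = 25.0) -> str:
    """Format a frame index as mm:ss (assumes a 25 fps camera)."""
    seconds = frame / fps
    return f"{int(seconds // 60):02d}:{int(seconds % 60):02d}"

def follows(a: list[Observation], b: list[Observation],
            max_dist: float = 80.0) -> bool:
    """Crude 'a follows b' test: a stays close behind b over the clip."""
    return all(abs(oa.x - ob.x) + abs(oa.y - ob.y) < max_dist
               for oa, ob in zip(a, b))

def crosses_trip_wire(track: list[Observation], wire_x: float) -> int | None:
    """Return the frame at which the track crosses a vertical trip wire."""
    for prev, cur in zip(track, track[1:]):
        if (prev.x - wire_x) * (cur.x - wire_x) < 0:  # sign change = crossing
            return cur.frame
    return None

# Two synthetic tracks standing in for the tracker's output.
boat1 = [Observation(f, 100 + 2 * f, 50) for f in range(100)]
boat2 = [Observation(f, 140 + 2 * f, 55) for f in range(100)]

if follows(boat1, boat2):
    print(f"boat1 follows boat2 between {timestamp(boat1[0].frame)} "
          f"and {timestamp(boat1[-1].frame)}")

frame = crosses_trip_wire(boat1, wire_x=251.0)
if frame is not None:
    print(f"boat1 crosses trip wire at {timestamp(frame)}")
```

A real system would sit these rules on top of a tracker that handles occlusion and re-identification, which is what Zhu's remark about matching an object that has left and reentered the scene refers to.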

Although the system demonstrates a step toward what Zhu calls a “grand vision in computer science,” I2T is not yet ready for commercialization. Simonite writes that processing surveillance footage is relatively easy for the software because the camera — and hence the background in a scene — is static; I2T is far from able to recognize the variety of objects or situations a human can. Set loose on random images or videos found online, for example, I2T would not perform nearly as well.

Adding images to the Lotus Hill Institute training set should improve the system’s ability to identify objects and scenes, says Zhu.

The I2T system underlying the surveillance prototype is powerful, says Zu Kim, a researcher at the University of California, Berkeley, who studies the use of computer vision to aid traffic surveillance and vehicle tracking. “It’s a really nice piece of work,” he says, even if it can’t come close to matching human performance.

Kim explains that better image parsing is relevant to artificial intelligence work of all kinds. “There are very many possibilities for a good image parser — for example, allowing a blind person to understand an image on the Web.”

Kim can see other uses for generating text from video, pointing out that it could be fed into a speech synthesizer. “It could be helpful if someone was driving and needed to know what a surveillance camera was seeing.” But humans are visual creatures, he adds, and in many situations would prefer to judge for themselves what is happening in an image or a video.
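As a rough illustration of Kim's suggestion, the generated sentences could be handed to an off-the-shelf speech synthesizer; pyttsx3 below is one assumed choice, not anything the researchers describe using:

```python
# A minimal sketch of speaking generated scene descriptions aloud,
# assuming the pyttsx3 text-to-speech library (one possible choice).
import pyttsx3

engine = pyttsx3.init()
for sentence in ["boat1 follows boat2", "boat3 approaches maritime marker"]:
    engine.say(sentence)   # queue each generated sentence
engine.runAndWait()        # blocks until all queued sentences are spoken
```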