Writing style identity tool easily fooled

Published 19 August 2009

Writing style was thought to be almost as unique to a person as a fingerprint or DNA, and literary historians and courts have relied on it to identify authors; a new study suggests that some of these so-called stylometry techniques are easily fooled, even by people without linguistic or literary training.

Historians, literary detectives, and even courts of law rely on methods that identify the author of a text by their writing style. A new study suggests that some of these so-called stylometry techniques are easily fooled, even by people without linguistic or literary training.

Colin Barras writes that as well as being used to answer literary questions, such as who wrote Shakespeare’s plays, modern law courts accept evidence from stylometry on the authorship of written material, including suicide notes and threatening letters. Stylometry even helped to convict “Unabomber” Theodore Kaczynski in 1998.

The features that stylometry techniques rely on, however, can be easy to imitate, say Michael Brennan and Rachel Greenstadt at Drexel University in Philadelphia, Pennsylvania. They have shown that people can successfully confuse stylometry software and hide their identity by imitating the writing style of another person. Until now there had been little research into the weaknesses of these techniques, they say.

Writing attack

The researchers asked fifteen people who were not professional writers to submit a 5,000-word “signature” text as a sample of their personal writing style. Each volunteer was then asked to write a description of their neighborhood in a way that masked their personal style, before writing a further passage in the style of novelist and playwright Cormac McCarthy.

Various stylometry techniques were pitted against this deception in an attempt to correctly reveal the true authors of each “masked” passage. They ranged from simple techniques, such as measuring word length and analyzing punctuation, to more complex methods, such as working out the lexical density, a measure that divides the number of unique words in the document by the total word count.
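As a rough illustration of the kinds of measurements described above, the following Python sketch computes average word length, a punctuation rate, and the ratio of unique words to total words. It is not the software used in the study: the function name, the tokenizer, and the sample sentence are all assumptions made for this example.

```python
import re
import string


def stylometric_features(text: str) -> dict[str, float]:
    """Compute three simple style measurements for a text (illustrative only)."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    # Count ASCII punctuation characters only.
    punctuation = [ch for ch in text if ch in string.punctuation]
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "punctuation_per_word": len(punctuation) / len(words),
        # Number of unique words divided by the total word count.
        "unique_word_ratio": len(set(words)) / len(words),
    }


# Hypothetical sample text, not taken from the study.
sample = "The fork is on the left. The knife is on the right!"
features = stylometric_features(sample)
```

A real system would compare such measurements across many long samples per author; a single sentence is far too short to be meaningful.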

Barras writes that the methods could identify the author of extracts from signature texts with at least 80 percent accuracy. They were, however, no better than random guessing at identifying who wrote a passage when people attempted to hide their writing style. The techniques also consistently named Cormac McCarthy, rather than the true writer, as the author of the imitations of his work. “We would strongly suggest that courts examine their methods of stylometry against the possibility of adversarial attacks,” says Greenstadt.

In the dock

“It’s a great paper,” says Patrick Juola, a computer scientist and text analyst at Duquesne University in Pittsburgh, Pennsylvania. “When you read a paper and say ‘well now I know what I’m studying for the next five years’, they did something right.”

Because the study attacked only some of the less complex stylometry techniques, probing the vulnerabilities of others is a “huge line of research,” says Juola.

He gives the example of describing a table setting: is the fork placed “on,” “at,” or “to” the left of the plate? “Most people don’t necessarily notice which preposition gets used - and it’s harder to imitate what you haven’t noticed.”

Unhelpful filter

Some of the techniques tested by Brennan and Greenstadt discard prepositions because they are deemed to have no information content, says Michael Oakes, a computational linguist at the University of Sunderland, United Kingdom. This filters out the words that could have helped most, he says.
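Oakes's point can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the preposition list is a tiny hypothetical one, the two sentences are invented, and neither function reflects any specific tool tested in the study. It shows two writers who differ only in their choice of preposition becoming indistinguishable once prepositions are discarded.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny preposition list for illustration only.
PREPOSITIONS = {"on", "at", "to", "in", "of", "by", "with", "from"}


def preposition_profile(text: str) -> Counter:
    """Count how often each listed preposition appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w in PREPOSITIONS)


def drop_prepositions(text: str) -> list[str]:
    """Mimic a filter that discards prepositions before analysis."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in PREPOSITIONS]


# Invented sentences that differ only in one preposition.
author_a = "The fork is on the left of the plate."
author_b = "The fork is to the left of the plate."

# The preposition counts tell the two writers apart...
profiles_differ = preposition_profile(author_a) != preposition_profile(author_b)
# ...but after filtering, the remaining words are identical.
filtered_equal = drop_prepositions(author_a) == drop_prepositions(author_b)
```

Here both `profiles_differ` and `filtered_equal` evaluate to `True`: the only distinguishing signal sits in the words the filter throws away.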

Brennan and Greenstadt agree that there are more stylometry techniques they could test in the future. “However, it’s worth noting that our attack methods are not as sophisticated as they could be,” says Greenstadt.

Their volunteer “attackers” lacked formal training in linguistics and had no access to stylometry software. With additional expertise, even the more sophisticated stylometry techniques might be vulnerable.

Brennan and Greenstadt presented a paper on their experiments at the Hacking at Random 2009 conference in Vierhouten, the Netherlands, last week.