Spotting potential targets of nefarious e-mail attacks

identify itself, or says it is one thing but behaves like another, and trolls Web sites in which the average visitor shows little interest.

Visiting an Internet site leaves a record in the site's logs. Sandia traffic is about evenly divided between Web crawlers and browsers. Crawlers tend to go all over the site; browsers concentrate on one area, such as the jobs pages. Crawlers, also known as bots, are automated programs, like those Google or Bing use, that follow links from page to page.

“When we get crawled by a Google bot, we aren’t being crawled by one visitor, we’re being crawled by several hundreds or thousands of different IP addresses,” Wendt said. An IP or Internet Protocol address is a numerical label assigned to devices on a computer network, identifying the machine and its location.

Distinguishing bots and browsers
Since Wendt wanted to distinguish bots from browsers without having to trust that they are what they claim to be, he looked for ways to measure behavior.

The first measurement rests on the fact that bots try to index a Web site. When you type in search words, the search engine looks for pages whose text matches those words, disregarding how the text is arranged on the page. To build that index, a bot pulls down HTML files far more often than anything else.

Wendt first looked at the percentage of each visitor's downloads that were HTML. Bots should show a high percentage; browsers pull down smaller ones.

More than 90 percent of the nulls pulled down nothing but HTML — typical bot behavior.
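That first measurement can be computed directly from ordinary Web server access logs. The Python sketch below is only an illustration under assumptions the article does not spell out: the parsed record format, the list of extensions treated as HTML, and the 90 percent cutoff are placeholders rather than Wendt's actual values.

```python
from collections import defaultdict

# Hypothetical parsed log records: (visitor_id, requested_path) pairs.
HTML_LIKE = (".html", ".htm", "/")  # treat bare directory paths as page requests

def html_fraction(records):
    """Return, for each visitor, the fraction of requests that were HTML pages."""
    totals = defaultdict(int)
    html_hits = defaultdict(int)
    for visitor, path in records:
        totals[visitor] += 1
        if path.lower().endswith(HTML_LIKE):
            html_hits[visitor] += 1
    return {v: html_hits[v] / totals[v] for v in totals}

def crawler_like(fraction, cutoff=0.9):
    """A visitor pulling down almost nothing but HTML behaves like a crawler."""
    return fraction >= cutoff
```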

A single measurement wasn’t enough, so Wendt devised a second based on another marker of bot behavior: politeness.

Bots could suck down Web pages from a server so fast they would effectively shut the server off to everyone else, he said.

That might prompt the site administrator to block them.

So bots take turns. “They say, ‘Hey, give me a page,’ then they may crawl a thousand other sites taking one page from each,” Wendt said. “Or they might just sit there spinning their wheels for a second, waiting, and then they’ll say, ‘Hey, give me another page.’”

Some behavior is “bursty”
Browsers go after only one page at a time but want all of its images, code, and layout files instantly. “I call that a burst,” he said. “A browser is bursty; a crawler is not bursty.” A burst is a certain number of requests within a certain number of seconds.
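Detecting that pattern from request timestamps is a simple sliding-window check. The sketch below is illustrative only: the two-second window and five-request minimum are placeholder thresholds, since the article does not report the values Wendt actually used.

```python
def has_burst(timestamps, window=2.0, min_requests=5):
    """Return True if any stretch of `window` seconds contains at least
    `min_requests` requests, the bursty pattern of a browser grabbing a
    page plus all of its images, code, and layout files at once.
    Both thresholds are illustrative placeholders."""
    times = sorted(timestamps)
    start = 0
    for end, t in enumerate(times):
        # Shrink the window from the left until it spans at most `window` seconds.
        while t - times[start] > window:
            start += 1
        if end - start + 1 >= min_requests:
            return True
    return False
```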

Ninety percent of declared bots had no bursts and none had a high burst ratio. Sixty percent of nulls also had no bursts, lending credence to Wendt’s identification of them as bots.

The other 40 percent of nulls, however, showed some bursty behavior, making them hard to separate from browsers on that measure alone. But normal browser behavior also falls within set parameters, and when Wendt combined both metrics, most nulls fell outside those parameters.
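Combining the two measurements then reduces to checking each visitor against both behavioral thresholds at once, as in this rough sketch; the cutoff and field names are again placeholders rather than the study's real parameters.

```python
def behaves_like_bot(visitor, html_cutoff=0.9):
    """Combined behavioral check on a per-visitor summary.

    `visitor` is a hypothetical dict with precomputed fields:
    `html_frac` (share of requests that were HTML pages) and
    `bursts` (number of bursts detected). The cutoff is illustrative."""
    return visitor["html_frac"] >= html_cutoff and visitor["bursts"] == 0

# Example: a null visitor that pulled down only HTML and never burst.
print(behaves_like_bot({"html_frac": 0.97, "bursts": 0}))  # True
```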

That left browsers who behaved like bots. “Now, are all these people lying to me? No. There could be reasons somebody would fall into this category and still be a browser,” he said. “But it distinctly increases suspicions.”

He also looked at IP addresses. Unlike physical addresses, IP addresses can change. Say you plug your laptop into the Internet at a coffee shop, which assigns you an IP address. After you leave, someone else shows up and gets the same IP address. So an IP address alone doesn’t necessarily distinguish users.

There is another identifier: the combination of a particular browser running on a particular operating system, which is reported in what is called a user agent string. There are thousands of distinct strings.

IP addresses can collide, and so can user agent strings, but Wendt said the odds are dramatically lower that two people will collide on both an IP address and a user agent string within a short period such as a day. So visits that differ in that combination probably come from different people.
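In log-analysis terms, that amounts to grouping requests by the combination of IP address and user agent string inside a bounded window such as a day. The sketch below assumes pre-parsed log records; the tuple layout and example data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def apparent_visitors(records):
    """Group requests by (day, IP address, user agent string).

    Each record is a hypothetical (timestamp, ip, user_agent) tuple.
    Requests sharing all three are treated as one person, on the
    reasoning that two people rarely collide on both IP address and
    user agent string within the same day."""
    groups = defaultdict(list)
    for timestamp, ip, user_agent in records:
        groups[(timestamp.date(), ip, user_agent)].append(timestamp)
    return groups

# Illustrative log entries: two from one apparent visitor, one "null".
logs = [
    (datetime(2013, 4, 1, 9, 0), "198.51.100.7", "Mozilla/5.0 (Windows NT 6.1)"),
    (datetime(2013, 4, 1, 9, 5), "198.51.100.7", "Mozilla/5.0 (Windows NT 6.1)"),
    (datetime(2013, 4, 1, 9, 7), "198.51.100.7", ""),  # null user agent string
]
print(len(apparent_visitors(logs)))  # 2 apparent visitors
```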

Now he needs to bridge the gap between sorting visitors into groups and identifying the targets of ill-intentioned emails. He has submitted proposals to further the research after the current funding ends this spring.

“The problem is significant,” he said. “Humans are one of the best avenues for entering a secure network.”