Google’s search results barely scratch the surface of the internet, even in those final pages of links where few of us bother to look. Search engines like Google, Yahoo!, and Bing catalog only five percent of what exists online, and this small portion—the surface web—is where almost everyone stays.
The rest of the internet is known as the deep web. Search services don’t index the deep web because temporary websites pop in and out, and the content just doesn’t follow the same rules.
You need to download special browsers to reach an even deeper pocket of the deep web called the dark web, a place governed by no rules at all. Websites in the dark web—Silk Road is the best-known example—offer all kinds of illicit material: pirated media, drugs, weapons, pornography, prostitution, and more. It’s a safe spot for criminals to lurk, but recent research has made hiding harder.
To access the dark web, all you need is the Tor (short for "The Onion Router") browser, downloadable for free. The Tor network allows anonymous access to the web as shown above. Computer users browse and communicate on the internet by sending and receiving pieces of information called data packets. A data packet comprises both data (whatever’s being sent or received) and a header, which routes the data to its destination. The Tor network is crucial for anonymity because even encrypted data isn’t fully secret: the attached headers carry information such as origin, destination, data size, and timestamp.
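To make the header-versus-data distinction concrete, here is a minimal sketch in Python. It is a toy model, not how Tor or any real network stack is implemented: the class names, the sample addresses, and the XOR "encryption" are all assumptions chosen for illustration. The point is only that scrambling the data leaves the header readable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    # The four header fields named in the article.
    origin: str
    destination: str
    size: int
    timestamp: float


@dataclass(frozen=True)
class Packet:
    header: Header
    data: bytes


def toy_encrypt(data: bytes, key: int) -> bytes:
    # XOR with a one-byte key (0-255): a stand-in for real encryption, NOT secure.
    # Applying it twice with the same key restores the original bytes.
    return bytes(b ^ key for b in data)


def make_packet(origin: str, destination: str, payload: bytes,
                key: int, timestamp: float) -> Packet:
    # Only the payload is scrambled; the header is attached in the clear.
    body = toy_encrypt(payload, key)
    return Packet(Header(origin, destination, len(body), timestamp), body)


def observer_view(packet: Packet) -> dict:
    # What someone watching the traffic can read without the key.
    h = packet.header
    return {
        "origin": h.origin,
        "destination": h.destination,
        "size": h.size,
        "timestamp": h.timestamp,
    }
```

Calling `observer_view(make_packet("10.0.0.5", "203.0.113.7", b"hello", 0x5A, 1700000000.0))` returns the origin, destination, size, and timestamp even though the payload itself is unreadable without the key.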
Working with DARPA (the Defense Advanced Research Projects Agency, part of the Department of Defense), Computer Science and Engineering Professor Mike Cafarella, who also teaches in LSA's Computer Science Program, is bringing light to the dark web with a project called Memex. Memex uses search functions that sidestep the limitations of the text-based search engines that most of us use, making the dark web scrutable. Since 2014, Memex has focused on human trafficking not only because it’s a particularly grim industry, but also because money from the sex trade often funds other illegal activities, such as drug and weapons trafficking.
“We’re trying to build data tools for addressing crime—in particular human trafficking—where the internet might provide a lot of data,” says Cafarella. He’s analyzed something like 80 million sex ads so far, using automated methods like machine learning and image recognition to uncover who’s behind those shady business deals.
Web Crawling for Clues
The hard part of Cafarella’s work is information extraction. Ads for sex workers in the dark web contain price, location, and service details. Earlier technology, Cafarella says, suffered from mistakes and imprecision. For example, it might interpret a 48103 zip code as the price $48,103. Earlier methods missed important data, too. “Just like if a reader were being really fast and sloppy, they might skip over a lot of things,” he says.
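The zip-code-versus-price mix-up can be illustrated with a short Python sketch. This is not Cafarella's extraction system, which the article does not describe in detail; it is a deliberately simple rule set, and the function name and the rules themselves are assumptions. A number with a dollar sign is treated as a price, a bare five-digit number as a likely zip code, and anything else is left unlabeled.

```python
import re

# An optional dollar sign (with optional spaces) followed by a run of digits.
NUMBER_RE = re.compile(r"(\$\s*)?(\d+)")


def classify_numbers(text: str) -> list[tuple[str, str]]:
    """Label each number in the text as 'price', 'zip', or 'unknown'."""
    results = []
    for match in NUMBER_RE.finditer(text):
        has_dollar = match.group(1) is not None
        digits = match.group(2)
        if has_dollar:
            label = "price"      # explicit currency marker
        elif len(digits) == 5:
            label = "zip"        # bare five-digit number: likely a zip code
        else:
            label = "unknown"    # not enough context to decide
        results.append((digits, label))
    return results
```

For the text `"Ann Arbor 48103, $150 per hour"`, the sketch labels 48103 as a zip code and 150 as a price. Real ads are far messier, which is why rules this simple produce the mistakes and gaps Cafarella describes.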
Cafarella’s work with collaborators has improved dark web search tools. “So far, we’ve made the whole process of building a high-quality data set faster and easier.”
Their program can do things like search for similar elements in separate images, such as furniture from the same hotel room that appears in multiple photos. If different sex workers pose for ads in front of the same hotel furniture, they’re more likely to be part of the same trafficking operation.
His program also scans ads for phone numbers—even those numbers that human traffickers write by hand or otherwise obscure to dodge automated searches. “They’ll put in an image instead of text, like you would your email address if you want to avoid spammers,” Cafarella says. “Or they’ll spell it in a crazy way, or use a mixture of words and numbers to spell it out.”
A slew of phone numbers with consecutive final digits, listed in related sex ads, may indicate that someone bought a suspiciously large number of cell phones for the purpose. Another clue is the same phone number showing up in ads that switch locations frequently: Human traffickers disorient their victims by moving them from place to place, minimizing access to social ties and local resources, and making it tough for captives to escape.
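The first clue, runs of consecutive phone numbers, is easy to sketch in Python once the numbers have been extracted. The function name, the minimum run length of three, and the ten-digit format are assumptions made for this example, not details from the research.

```python
def consecutive_runs(numbers: list[str], min_run: int = 3) -> list[list[str]]:
    """Find groups of 10-digit phone numbers that differ by exactly 1."""
    # Convert to integers, drop duplicates, and sort so neighbors are adjacent.
    values = sorted({int(n) for n in numbers})
    runs: list[list[int]] = []
    current: list[int] = []
    for value in values:
        if current and value == current[-1] + 1:
            current.append(value)          # extends the current run
        else:
            if len(current) >= min_run:
                runs.append(current)       # keep the finished run if long enough
            current = [value]              # start a new run
    if len(current) >= min_run:
        runs.append(current)
    # Format back to 10-digit strings (assumes 10-digit input).
    return [[f"{v:010d}" for v in run] for run in runs]
```

With the made-up inputs `7345550101`, `7345550102`, `7345550103`, and `3135559999`, the first three come back as one run and the fourth is ignored. A run like that proves nothing alone, but it marks a cluster of ads worth a closer look.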
Dark Web Discount
Computer code and web crawlers can’t, of course, perfectly translate the human reality captured in sex ad data. Here’s where social science comes in.
“The goal is to understand economic models of how people do the pricing,” says Cafarella. “If we can understand the market and how and why people price, then we can look at a sex worker who’s much cheaper in her advertisement than you might expect, which suggests that the worker is not pricing her own services.” He says law enforcement can flag and prioritize such outliers who don’t follow expected economic incentives. Those sex workers may create a trail toward a human trafficker who’s actually making the rules and setting the price.
The whole idea is based on economic rationality. Take, for example, out-calls, when a sex worker travels to a client, and in-calls, when a sex worker stays in place and the client travels. “Sex workers charge more for out-calls across every city in the data set,” Cafarella explains. “Why would that be? One hypothesis is that traveling takes time and money—you have to get in your car and go somewhere. So part of the higher price might reflect commuting time. And that’s true: Commuting time accounts for between half and two-thirds of the price premium.” When sex workers charge rates below the market value—on out-calls or otherwise—Cafarella raises a red flag.
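The red-flag idea can be sketched as a comparison against a local benchmark. The Python below is an illustration only: the article does not say how Cafarella's team models the market, so the median benchmark, the 60 percent threshold, and the field names (`id`, `city`, `call_type`, `price`) are all assumptions.

```python
from collections import defaultdict
from statistics import median


def flag_underpriced(ads: list[dict], threshold: float = 0.6) -> list[str]:
    """Return ids of ads priced below threshold x the median for their group."""
    # Group prices by city and call type, since out-calls cost more than in-calls.
    groups: dict[tuple[str, str], list[float]] = defaultdict(list)
    for ad in ads:
        groups[(ad["city"], ad["call_type"])].append(ad["price"])

    # The median stands in for the "market value" of each group.
    medians = {key: median(prices) for key, prices in groups.items()}

    return [
        ad["id"]
        for ad in ads
        if ad["price"] < threshold * medians[(ad["city"], ad["call_type"])]
    ]
```

Comparing within the same city and call type matters: an in-call price that looks low next to out-call prices may be perfectly ordinary among other in-calls.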
Although Memex research currently revolves around the sex trade, Cafarella’s innovations can help solve other problems in science, industry, and government.
His work can help identify systems of terrorist recruitment; bust money-laundering operations; build fossil databases from a century’s worth of paleontology publications; identify the genetic basis of diseases by drawing from thousands of biomedical studies; and generally find hidden connections among people, places, and things.
“I would never have thought a few years ago that database and data-mining research could have such an impact, and it’s really exciting,” says Cafarella. “Our data has been shipped to law enforcement, and we hear that it’s been used to make real arrests. That feels great.”