The Washington Post had a barn-burner of a story on the data used to train AI chatbots and generative text apps, based on a paper out of the Allen Institute for Artificial Intelligence. The paper describes the C4 linguistic corpus, which has been used to train major components of the current generation of AI products.
The exact weighting of this corpus in the final products, relative to other training data and mitigation efforts, is not totally clear, but we know that AI applications have learned to be racist and sexist, to such an extent that their creators have had to equip them with “guardrails” to keep them at least somewhat in check.
How do these AIs learn to be racist and sexist? We already pretty much knew the answer. They learned it from watching us—on the Internet. The new Post story adds some important granularity to that answer, however, and I am about to add some more. The C4 dataset is a giant bucket of words, scraped from the Internet and then filtered for toxic content. One of the Cs in C4 stands for “clean,” meaning it’s been vetted for racism and other objectionable material. In theory.
UPDATE: Another C is for “crawled.” To clarify a point I have seen some confusion about in replies, no one picked the 15 million websites in the dataset; it was generated by a blind web crawl. The sites discussed herein are likely the product of negligence, rather than malicious intent. END UPDATE
The Post story describes the C4 dataset in some detail and includes a tool for searching the URLs that were scraped to produce it.
“Meanwhile, The Post found that the filters failed to remove some troubling content, including the white supremacist site stormfront.org No. 27,505, the anti-trans site kiwifarms.net No. 378,986, and 4chan.org No. 4,339,889, the anonymous message board known for organizing targeted harassment campaigns against individuals.”
Sounds bad, right? Well, it’s worse than it sounds.
The rankings cited in the quote above are based on the number of tokens (words and/or short phrases) scraped from each source. To oversimplify greatly, a large language model contains statistical data about how tokens typically follow one another. For instance, if you look up “best,” it might know that “regards” often comes next. Generative AIs owe a debt to autocomplete, but they’re much more sophisticated, able to string together thousands of words in a credible manner in response to a prompt. A generative AI also draws on contextual cues, so if you mention a movie, it might guess that “best” is more likely to be followed by “picture.” When an AI is trained on a website, it doesn’t just learn the words on that site, it learns the relationships among those words—for instance, which adjectives are applied to which racial groups.
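To make the “best regards” versus “best picture” point concrete, here is a minimal sketch of next-word counting conditioned on a crude context label. It is a toy of my own for illustration, not how C4 or any production model actually works (those learn neural representations rather than raw counts), and every sample sentence in it is made up.

```python
# Toy next-word model: count which word follows which, per context label.
# Illustration only; real generative models learn these relationships as
# weights in a neural network, not as literal lookup tables.
from collections import defaultdict

def train(samples):
    counts = defaultdict(lambda: defaultdict(int))
    for context, text in samples:
        words = text.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[(context, prev)][nxt] += 1
    return counts

def most_likely_next(counts, context, prev):
    followers = counts.get((context, prev), {})
    return max(followers, key=followers.get) if followers else None

samples = [
    ("email",  "best regards from the whole team"),
    ("email",  "best regards and many thanks"),
    ("movies", "best picture went to an indie film"),
    ("movies", "best picture of the year"),
]

model = train(samples)
print(most_likely_next(model, "email",  "best"))   # regards
print(most_likely_next(model, "movies", "best"))   # picture
```

Feed the same counting a hate site and it absorbs which adjectives tend to follow which group names, which is exactly the relationship-learning described above.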
The Washington Post, a generalist publication, noted a few well-known extremist and toxic sites that appear in the dataset. As a non-generalist, I immediately used the tool provided with the story to look up sites that are less well known. Since the rankings reflect token counts, I theorized that sites with very large quantities of content would rank higher than even Stormfront or Kiwi Farms, which are better known and more consistently used by extremists.
Unfortunately, I was right. The table below consists only of sites that ranked within the top 2 percent of C4 sources by the number of tokens scraped. Most of them, those with rank numbers below 150,000, are in the top 1 percent. Odds are you won’t recognize all these names unless you work in the field. I picked these sites for various reasons, but they don’t represent the full spectrum of potentially problematic content. This is just a small sample.
This is a deeply problematic list.
At the top of the list is Global Research, a website stuffed with a wide range of conspiracy theories, including anti-vax and pro-Russia content. Here’s what its front page looked like yesterday:
Global Research ranked 117th out of about 15,000,000 sites crawled to build the C4 dataset. It ranked higher than ABC News. Higher than NBC News, higher than the New Yorker, higher than Vanity Fair, higher than Popular Mechanics, higher than Variety, higher than the recently deceased BuzzFeed, much higher than the Wall Street Journal, and vastly higher than Bloomberg News. When I say it ranks higher, I mean it contributed many more tokens than any one of these more reputable sites.
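For anyone who wants to sanity-check the “top 1 percent” and “top 2 percent” framing, the arithmetic is simple. The sketch below assumes the roughly 15,000,000 ranked sources cited above; the script and its cutoffs are mine, not output from the Post’s lookup tool.

```python
# Convert a C4 source rank into a rough percentile,
# assuming roughly 15 million ranked sources in the dataset.
TOTAL_SOURCES = 15_000_000

def top_percent(rank: int) -> float:
    return 100 * rank / TOTAL_SOURCES

for label, rank in [
    ("Global Research", 117),
    ("top-1% cutoff",   150_000),
    ("top-2% cutoff",   300_000),
]:
    print(f"{label:<15} rank {rank:>7,} = top {top_percent(rank):.4f}%")
```

A rank of 117 out of roughly 15 million puts Global Research in about the top 0.001 percent of sources by token contribution.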
But wait, there’s more.
Extremists and conspiracy theorists tend to be verbose, and some extremists are also obsessive archivists. Thus, Christogenea, a massive but obscure archive of books and lectures from the violent, virulently racist and anti-Semitic Christian Identity movement, ranks 7,727th, putting it near the top of the top 1 percent of sources. Compare that, for instance, to the Catholic website New Advent, at 14,394.[1] Thanks solely to the size of the Christogenea web archive, the toxic Christian Identity movement is disproportionately represented in C4.
But wait. It gets worse.
Natural News is a website peddling fake medical findings wrapped in right-wing conspiracies. It ranked 634th in the C4 dataset, much higher than the Cleveland Clinic (4,473), the Mayo Clinic (3,359) and the government-run Medline website (7,571). Good thing no one is using this dataset for medical purposes!
The virulently anti-immigrant Center for Immigration Studies comes in at 4,070, far higher than the United States Citizenship and Immigration Services website. Infowars comes in at 6,562, neck and neck with Snopes at 6,540, and that’s not counting a ton of lower-ranked Infowars mirrors and related sites. The sovereign citizen Freedom School website ranks 23,136, compared to the authentic legal resource Justia at 23,532.
As you can see in the table above, the list goes on and on, and this is just my first run at the problem.
Unless you work in the field, you may not even have heard of half these sites. But they’re part of the Rube Goldberg machine that powers your Bing, your Bard and your Whatever-dot-AI. If you’re using AI to draft legal contracts for your company, you don’t want it spewing sovereign citizen nonsense, and if you’re using AI for medical diagnosis, you don’t want it telling you 5G causes cancer. But those fringe sources are weighted as heavily as, or more heavily than, many important mainstream sources.
According to the Post, the cleaning of the C4 corpus focused mainly on a blacklist of words and constructions, but the inclusion of these high-volume sites nevertheless gives them disproportionate weight in the generative text process. If you delete all the n-words from the rancid neo-Nazi site VNNForum, the text that remains is still wildly toxic and infused with hate.
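To see why word-level cleaning leaves the underlying problem intact, here is a minimal sketch of blacklist-style filtering, assuming, as the Post describes, that cleaning means dropping pages that contain listed words. The blocklist and sample pages below are placeholders of my own, not the actual C4 lists or pipeline.

```python
# Sketch of blacklist-style cleaning: drop any page containing a listed word.
# Placeholder blocklist and pages; the real C4 pipeline is more involved.
BLOCKLIST = {"slur1", "slur2", "obscenity"}

def keep_page(text: str) -> bool:
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    return words.isdisjoint(BLOCKLIST)

pages = [
    "A recipe for lentil soup.",
    "A screed that repeats slur1 in every sentence.",
    "A screed that pushes the same hateful ideas without any listed word.",
]

for page in pages:
    print(keep_page(page), "-", page)
# Only the second page is removed. The third page, hateful content written
# with a clean vocabulary, passes the filter and keeps feeding the token
# statistics described above.
```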
If such sites contribute as many tokens to the corpus as comparable non-extremist sites, or vastly more, you have a problem.
At the risk of repeating myself, the current state of generative AI and its guardrails is a lot like putting manacles on Frankenstein’s monster. The manacles don’t really solve the central problem, which is that you’ve created a monster.
[1] New Advent actually appears at least twice under different URLs. The other clearly relevant URL, ranked 13,754, currently redirects to the active site at newadvent.org. It might have hosted different content when it was scraped. ¯\_(ツ)_/¯
Question: What role do you think paywalls play in this? As in: are paywalls pushing reputable sites down the rank list because they aren’t contributing as many tokens?