By: Scott Rosenberg
The AI boom is built on data, the data comes from the internet, and the internet came from us.
Driving the news: A Washington Post analysis of one public data set widely used for training AIs shows how broadly today’s AI industry has sampled the 30-year treasury of web publishing to tutor its neural networks.
Why it matters: Ever written a blog? Built a web page? Participated in a Reddit thread? Chances are your words have contributed to the education of AI chatbots everywhere.
The big picture: While this massive verbal repurposing is triggering an important legal brawl over whether it should be treated as fair use or theft, it’s also inspiring a personal reckoning for many of the millions whose postings built today’s online world.
We thought we were sharing our hearts and minds, and of course we were.
- But without realizing it we were also creating a database, incomplete but rich, of human expression.
- That database makes the uncannily adept sentence-completion gymnastics of ChatGPT and its competitors possible.
Because visual AI tools like DALL-E, Midjourney and Stable Diffusion got popular before verbal chatbots like ChatGPT took off, visual creators (photographers, illustrators and fine artists) were the first to grapple with this realization.
- Musicians face the same kind of epiphany, as they encounter multiplying AI-conjured facsimiles of their works — like last week’s (never-happened) collaboration between Drake and the Weeknd, “Heart on My Sleeve.”
But far more of us have typed a few words on the internet than have ever recorded songs or drawn pictures.
- The Washington Post project lets you enter any internet domain name to see whether and how much it contributed to one AI training database. (This isn’t the same one OpenAI used for ChatGPT or its other projects; OpenAI has not disclosed its training-data sources.)
- “The data set contained more than half a million personal blogs, representing 3.8 percent” of the total “tokens,” or discrete language chunks, in the data, the Post team found. (Postings on proprietary social media platforms like Facebook, Instagram and Twitter don’t show up — those companies have kept access to their data to themselves.)
Of note: These training databases are enormous but hardly representative. Some cultures, groups and subjects are oversampled; many others are unfairly neglected. And all the biases, limitations and toxic aspects of internet culture show up in the AI training data.
My thought bubble: The personal blog I wrote fairly consistently for 15 years is well represented in the Post data set — along, it seems, with most of the other writing I contributed over 10 years to the web magazine I helped create.
- If you have any kind of online history, the self-lookup opportunity the Post’s research provides is irresistible, like Googling your own name. (There’s a similar lookup tool called “Have I Been Trained?” for visuals.)
- When you do find your work listed, you’re probably going to ask yourself, as I did, “Is this what I wanted?” and “Why wasn’t I consulted?” and “What if I’d known this was coming?”
Be smart: AI’s hunger for training data casts the entire 30-year history of the popular internet in a new light.
- Today’s AI breakthroughs couldn’t happen without the availability of the digital stockpiles and landfills of info, ideas and feelings that the internet prompted people to produce.
- But we produced all that stuff for one another, not for AI.
From this vantage, the existence of these vast “corpuses” of data was a profoundly important unintended consequence of the rise of the web itself.
- In 1995, when a generation fell in love with the “www” and the browser, or ten years later, when another generation celebrated the advent of blogs and the “wisdom of the crowd,” this outcome was hidden from view.
- By the early 2010s, the stirrings of the machine-learning revolution began to make some far-seeing experts uneasy. But it took a very long gaze to sense that the entire web might be about to turn into AI training fodder.
Today, this unintended consequence is front and center in our online experience — reminding us that everything we’re doing right now with, and to, AI will in turn shape the future in ways we can’t foresee.
- For instance: If we unleash a flood of simulacra on our public networks, we risk discouraging people from continuing to share, or even make, their own original work.
- That might leave future AI models stuck forever with the frozen output of humanity circa 2000-2020, with nothing newer to learn from.