What’s missing from the AI workflow: incentives for content creators to provide training data
My general argument
Here’s my general argument in this post:
AI bots require training data to be effective. Content creators generate part of the training data source.
Content creators require ROI (e.g., recognition, attention, revenue) for the content they supply.
AI bots will need to provide ROI to content creators to sustain a long-term content source in the future.
Since backlinks to creator sites aren’t possible, content creators might create logins and paywalls for content access. In the best scenario, AI will force writers to raise their content to inimitable levels.
AI bots require training data to be effective
This premise needs little explanation. What would AI bots be without all the internet sources they use for training data? Not much. The more data AI bots have access to, the better they become. So they consume data sources like hungry trolls: Stack Overflow, Reddit, Twitter, Wikipedia, news sites, etc. These sources provide the mass of needed data for training AI systems.
Content creators require some form of ROI for the training data they supply
Many people are floored by what appears to be a pure fountain of information: no ads, no slogging through hit-or-miss sources, no glut of lightweight say-nothing articles. For an example, read this article by Vipul Shekhawat: GPT makes learning fun again, which was recently trending on Hckrnews.com. Shekhawat is struck by how much better it is to learn from GPT-4 than by using search engines and sorting through human-written pages.
Shekhawat explains how the AI bot adapts to his question level, providing him with an interactive learning experience tailored to his exact interests. He provides a compelling step-by-step description of his learning journey (about LED light). Like Shekhawat, yes, I agree that the AI chat experience can exceed the search engine experience. I wrote about this in my post, AI chat interfaces could become the primary user interface to read documentation.
Shekhawat concludes, “After getting used to learning with GPT-4, I can’t imagine going back to the old way. The difference feels almost as profound as going back to flipping through a textbook.”
Apart from the interactivity, he says not being confronted with ads contributed to the delight of the experience:
One of the biggest reasons learning has become cumbersome is that the web is littered with advertisements and content marketing that obstruct your ability to actually delve into a topic. When learning about topics that overlap with things people want to sell you, Google will sometimes include 3-4 ads before actual results. Even if you use an alternative search engine, the actual content on the web is often low-quality or written to sell you a product. Of course, GPT itself may start contributing to making it worse soon, since it’ll be used to generate more of it.
In other words, GPT-4 didn’t try to sell him anything or hijack his attention. It gave him exactly the information he asked for, and nothing more. He could completely control the information level, direction, and flow.
He says there’s a lot of substantial information on the web, but it’s trapped under a layer of billboards, ads, and other commercialization (“cruft”):
Over the past 20 years or so, humanity has poured itself onto the web. All the world’s information, yes, but also all the world’s anxieties, fears, and superficiality. The web has become commercialized: an information highway plastered with billboards.
But underneath it all, the substantive information is out there. It’s just scattered and hard to find. GPT lets people tap into the treasure trove the web represents, without all the cruft. It enables anyone who can formulate and ask questions to learn so much more about any subject, so much faster.
GPT-4 removes all the cruft and delivers the raw information that was sitting there underneath all along. By allowing users to connect with this source, exactly adapted to their interests, it “makes learning fun again.”
Problems with this model
The learning model Shekhawat presents is appealing, but there is a challenge to consider. It is important to understand how the vast amount of learning content on the internet came to be. Content creators did not simply decide to create and freely distribute content without any consideration for their own benefit.
Content creation is a give-and-take process, in which creators expect some benefit in return. This often comes in the form of attention or recognition, which can help establish your name or brand, or just make you more visible in the community. Regardless of any monetary benefit, the reward can simply consist of being read by others and receiving feedback (in other words, interactive knowledge sharing).
I’ve been blogging for 16 years. I started in 2006. Why haven’t I faded after so many years? My motivation to write does involve an ROI. Primarily, I write to satisfy my curiosity, to sort out my thoughts and evaluate my experiences. I write to spin my thinking wheels. I like to feel the “bite” when wrestling with a question, and then articulate my response through a post. But my motivation to write also includes the following:
Knowing people will read my writing.
Knowing that the content I publish will bring people to my site.
Knowing that what I write will create a certain reputation/brand/persona about me.
Knowing that what I write might influence future companies who want to hire me.
Knowing that the content I write will drive some pocket-change advertising revenue my way.
Knowing that what I write allows me to engage in a conversation with other like-minded people who intelligently respond to my ideas.
Knowing that these conversations I start online will help me move to the next level, whatever that might be.
One reason people stop creating content, beyond lack of time or resources, is that they feel their work goes unnoticed. It can be discouraging to receive no feedback or engagement from your audience. Imagine a book author who publishes a book that no one reads or buys. In both cases, blogger or book author, the lack of response can be demotivating and lead to a decrease in productivity. Why write, if no one reads it?
Now shift gears back to AI. Imagine a world where people stop reading individual sites and authors. Instead, they mostly turn to an AI chat that has been trained on all the content ever produced, so it’s the most intelligent bot in the world. What happens to the individual content creators, knowing that what they write:
won’t be read by anyone directly. Instead, it will be mixed, infused, synthesized, etc. in combination with millions of other sources into an intelligent collective output.
won’t bring people to their site.
won’t establish their reputation/brand/persona.
won’t influence future employers.
won’t drive revenue their way.
won’t lead to engaging, interactive conversations.
And so on. Where’s the motivation for content creators? The genius of social media is that in exchange for posting on the social platform, the platform rewards you by making you feel important, by allowing you to interact with others and receive validation or responses from others. Sometimes social media makes you a little famous for a while. That’s why social media took off—there is an ROI for creating the content. When someone likes or comments on your post, it validates you (most of the time), or at least intrigues you to know what others think. Even if it challenges you, that interaction is an experience, philosophically, with the Other that makes life adventurous and worthwhile.
Content created by mainstream users is what helped the internet upstage other sources. In Kevin Kelly’s The Inevitable, Kelly says when the internet was just starting off, big media didn’t take the internet seriously. They were skeptical about where all the engaging content would come from. Newspapers, magazines, books, movies—that’s the content people wanted. Surely the internet’s chat forums and random online sites couldn’t compete with mainstream content outside the internet, right?
They didn’t foresee that users themselves would create the tsunami of needed content to drive online engagement. Social media tapped into our psychological side (our desire to be seen, recognized, part of a community) and compelled us to contribute. The psychological pull motivated mainstream people like me to devote thousands upon thousands of hours to writing content on blogs, with little financial reward. The only reward most bloggers receive is the reward of being read. Kelly writes:
We all missed the big story. Neither old ABC nor startup Yahoo! created the content for 5,000 web channels. Instead billions of users created the content for all the other users. There weren’t 5,000 channels but 500 million channels, all customer generated. The disruption ABC could not imagine was that this “internet stuff” enabled the formerly dismissed passive consumers to become active creators. (19)
Web 2.0 saw the proliferation of amateur bloggers, vloggers, and other content creators filling every niche with an abundant stream of content, eventually displacing traditional media. That’s because the average Joe and Sally suddenly realized that by blogging about a specific topic, they could brand themselves as experts. They had readers, comments, and engaging conversations. They were asked to speak at conferences. They were making some money off of ads. They were suddenly visible in the world. Even if you didn’t become a micro-influencer, writing was just dang fun and rewarding. So they produced endless amounts of content.
With AI, where’s the reward for content creation? What will motivate individual content creators if they no longer are read, but rather feed their content into a massive AI machine?
By hiding sources and not linking back to sites used for training data, AI bots deny ROI to content creators. As a result, the content well will dry up. Content creators will not participate in a system that provides no ROI for them.
Citing sources isn’t feasible
I would be happy to see sites listed next to AI chat responses, similar to the responses on Phind.com. However, providing sources for outputs from LLMs like ChatGPT or Bard might not be technically possible. I’m no expert, but apparently that’s not how LLMs work. They don’t synthesize content from a list of relevant sources. After training across billions of parameters, the AI might no longer know the source, or might lose track of it along the way as it learns patterns, contexts, dependencies, relationships, and so on across thousands of iterations.
Jiang Chen explains why source citation is problematic:
Quoting from different sources can’t solve the problem either, since the model often takes sentences out of context and reassembles them into a paragraph to create an incorrect answer. For example, you could ask the model, “How much does AWS charge for a g4dn.16xlarge GPU instance?” To which it might respond, “The g4dn.16xlarge GPU instance from AWS costs $0.526 per hour.” In this instance, the pricing of g4dn.xlarge is grafted to g4dn.16xlarge. (Large Language Models Aren’t the Silver Bullet for Conversational AI).
Imagine there are not just a handful of sources but thousands of sources used for training data on a topic, and the AI derives the answer from myriad statistical calculations about predicted words and relationships, inferences and contexts, such that the resulting output has no recognizable tie back to any specific source. Each source contributes or influences only an infinitesimal smidge towards the AI model’s outcome. In other words, even if the models were to somehow include a list of sources at the end, the list might be so long (500+?) that any reference becomes invisible anyway.
Regardless of the technical details, in this post I’ll assume that listing sources for AI chat responses isn’t possible. Without links back to sites, how do content creators who contributed the original training data, even just a brief page among thousands of other sources, get any ROI?
Directions content creators might take
In the following sections, I’ll outline four directions that content creators might take in a world where AI tools consume their information for free without providing ROI to the content creators.
(1) Implement logins and paywalls
Without ROI from AI, I imagine content creators might begin protecting their content by putting up paywalls. Forcing users to log in to continue reading content, even without a paywall, would prevent AI tools from gobbling up their site’s content for free. If they have particularly large sites, they could offer an API with access to the content (similar to Reddit and Twitter), charging AI companies for their training data.
Charging for API access only seems feasible for large platforms with hundreds of thousands of pages, though. Smaller creators could resort to blogging on platforms like Substack and Medium and configure gated access. (This assumes the platform provider wouldn’t simply sell access to the content to AI systems.)
For users who consume this content, this scenario is a nightmare. Imagine if browsing from site to site required you to log in each time, with many sites requiring $5/month paywalls. Users would have so many subscriptions, it would be ridiculous. Instead of the open flow of information on the internet, the internet would become a long hallway of locked doors.
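To put rough numbers on that scenario, here is a back-of-the-envelope sketch. The $5/month fee comes from the example above; the site counts (5, 20, 50) and the function name are hypothetical values I picked purely for illustration, not data about real reading habits.

```python
def yearly_paywall_cost(num_sites: int, monthly_fee: float = 5.0) -> float:
    """Total yearly cost of subscribing to num_sites sites at monthly_fee each.

    The 5.0 default mirrors the $5/month example in the text.
    """
    return num_sites * monthly_fee * 12


# Hypothetical numbers of paywalled sites a reader might rely on.
for n in (5, 20, 50):
    print(f"{n} sites: ${yearly_paywall_cost(n):,.0f} per year")
```

Even under these made-up assumptions, a reader who regularly visits 20 paywalled sites would pay $1,200 a year, which is why this outcome looks so unappealing for users.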
(2) Build restricted courses
When confronted with endless logins and paywalls, users might think, “Fine, I’ll just get all my information from AI. ChatGPT or Bard has most of the answers to my questions in the first place, right? I don’t need other sites.”
Because of this, content creators will gravitate towards content that AIs can’t create. AIs are great at providing short answers and Wikipedia-like sections for targeted questions, but what if you want to learn something larger, like C++ programming? Or how to use Adobe Illustrator? Imagine asking an AI every single question you have. Or more relevant to our domain, suppose you want to learn API documentation. Try asking an AI bot 500 questions for all the information you want to know. I’m guessing the experience would be dizzying.
Just as reading a book, which has a linear progression to the content and logic, can be a pleasant experience (compared to the fragmented endless clicking across a jumble of sites), so too can a well-constructed course guide a user through a topic from beginner to intermediate to expert levels. Courses are just one type of content that I imagine will become more common online. In writing this post and thinking about these outcomes, I’m wondering, should I put my API doc course behind a login? Behind a paywall? Should I put it on a courseware site like Udemy? These are all questions I’m considering precisely because the AI bot model doesn’t provide ROI for content contributors.
(Of course, I’m just speculating here freely. This is a blog, not an academic publication, so I haven’t researched this topic in depth. I’m just speaking from experience as a content creator and wondering how things will play out. The AI interfaces are still finding their form, so it’s likely too early to draw conclusions.)
(3) Use AI tools to do mass content production
Another scenario could be that content creators themselves leverage AI tools in advanced ways to generate content quickly and prolifically. For example, I mentioned this article by Thomas Smith in my previous newsletter: “How I Built WritingGPT, a Fully Automated AI Writing Team: It writes articles that rank on Google for about $1 each.” In this scenario, content creators might produce the equivalent of a new API documentation course each week simply by using these tools in more intelligent ways, such as setting up research agents and workflows through AutoGPT to automatically create resources overnight. The mass of generated content might bring in enough traffic to compensate for the lost traffic going instead to AI chat interfaces.
In this scenario, the internet becomes flooded with so much content, the pace of information wears everyone out. In this model, content suffers from mediocrity and predictability, which only gets worse as more content is created. AI bots will pull from this same content for training data, creating a reinforcing loop.
(4) Create content inimitable by AI
The final trajectory I’ll explore is for content creators to create content inimitable by AI. As an example, consider this post. Can AI write it? Not currently. Sure, AI can create explanatory sections of parts of it, but the forceful, trenchant analysis with interwoven personal experiences is beyond AI to achieve (currently). I have to put “currently” here because many believe that if AI continues on its current trajectory, its intelligence and analysis will soon surpass human intelligence. But for now, AI-written articles are mediocre and bland. At best, they resemble explanatory sections in Wikipedia or high-school-level B essays. Most of the penetrating insights about AI are written by humans using their minds to reason, analyze, infer, and argue.
In the best of all outcomes, AI tools wipe out the glut of lightweight spammy articles written by SEO content farms—the articles that essentially say nothing but just try to rank for keywords.
Imagine if writers could use AI tools to take content to the next level—to augment their insights and analysis in impressive ways, to use AI to find unthought comparisons between disparate domains, to trace lengthy logical arguments with precision, to parse through a scholar’s evidence and identify gaps and biases, and so on at a deeper level. In this world, AI forces writers to step up their game, moving from Joe blogger to a level typically found in The New Yorker or The Atlantic magazines—sources whose content typically cannot be reproduced by AI. Just like any good competition, AI might force writers to elevate their content.
Given that we don’t merely use the internet to find answers, but also read to be entertained, to keep up with the latest around us, to connect with other humans and learn about their experiences, I’m optimistic that this inimitability outcome might be a real possibility. For example, if you’re reading this article, you probably didn’t search for it specifically. You read to connect with other human experiences from professionals in the same field as you. You read to get insights that you couldn’t have received by asking ChatGPT. If I can’t deliver that experience, my site will disappear into oblivion.
Overall, in the long term, I doubt AI chat can keep giving users a free stream of endlessly detailed, accurate information drawn from content providers without returning some benefit to those providers. What does the reward look like without revenue, links, or other visibility? That picture remains to be seen. I outlined four options for the direction content creators might take: (1) creating logins and paywalls, (2) creating course-like content that AI can’t easily create, (3) leveraging AI tools to automate content creation in a mass-production way, and (4) elevating content quality to levels that AI can’t imitate.
Given technology’s trajectory of plurality, most likely all of these options will unfold, but the last trajectory is the most interesting. Follow me on this journey toward elevated content.
About Tom Johnson
I'm a technical writer / API doc specialist based in the Seattle area. In this blog, I write about topics related to technical writing and communication, such as software documentation, API documentation, visual communication, information architecture, writing techniques, plain language, tech comm careers, and more. Note that the opinions I express on my blog are my own points of view, not those of my employer.