AI ‘gold rush’ for chatbot training data could run out of human-written text

News Channel 3-12


As OpenAI begins work on training the next generation of its GPT large language models, CEO Sam Altman told the audience at a United Nations event last month that the company has already experimented with “generating lots of synthetic data” for training. But he also expressed reservations about relying too heavily on synthetic data over other technical methods to improve AI models. “I think what you need is high-quality data. There is low-quality synthetic data. There’s low-quality human data,” Altman said.


If real human-crafted sentences remain a critical AI data source, those who are stewards of the most sought-after troves — websites like Reddit and Wikipedia, as well as news and book publishers — have been forced to think hard about how they’re being used. The researchers first made their projections two years ago — shortly before ChatGPT’s debut — in a working paper that forecast a more imminent 2026 cutoff of high-quality text data.


Artificial intelligence systems like ChatGPT could soon run out of what keeps making them smarter — the tens of trillions of words people have written and shared online. A new study released Thursday by research group Epoch AI projects that tech companies will exhaust the supply of publicly available training data for AI language models by roughly the turn of the decade — sometime between 2026 and 2032.


Training on AI-generated data is “like what happens when you photocopy a piece of paper and then you photocopy the photocopy. You lose some of the information,” Papernot said. Not only that, but Papernot’s research has also found it can further encode the mistakes, bias and unfairness that’s already baked into the information ecosystem.
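The photocopy analogy can be made concrete with a toy simulation (purely illustrative, not from the Epoch study or Papernot’s research): treat each generation’s “model” as a simple Gaussian fitted to samples drawn from the previous generation’s model. Each round of fitting to its own outputs loses a little information to sampling noise, and the fitted distribution drifts away from the original data.

```python
import random
import statistics

def train_on_own_outputs(generations=20, n_samples=25, seed=0):
    """Toy illustration of 'model collapse': each generation's 'model'
    is a Gaussian fitted to samples drawn from the previous one.
    Sampling noise compounds across generations, and the fitted spread
    tends to shrink, losing the original distribution's tails."""
    random.seed(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real" human data
    history = [(mu, sigma)]
    for _ in range(generations):
        # Each generation only ever sees the previous model's outputs.
        samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append((mu, sigma))
    return history

history = train_on_own_outputs()
print(f"gen 0 sigma: {history[0][1]:.3f}, gen 20 sigma: {history[-1][1]:.3f}")
```

Real language models are vastly more complex than a two-parameter Gaussian, but the mechanism is the same one the photocopy quote describes: each copy is made from a copy, never from the original.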


But how much it’s worth worrying about the data bottleneck is debatable.


“Maybe you don’t lop off the tops of every mountain,” jokes Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation, which runs Wikipedia. “It’s an interesting problem right now that we’re having natural resource conversations about human-created data. I shouldn’t laugh about it, but I do find it kind of amazing.” AI companies should be “concerned about how human-generated content continues to exist and continues to be accessible,” she said.

“There’d be something very strange if the best way to train a model was to just generate, like, a quadrillion tokens of synthetic data and feed that back in,” Altman said. Papernot, who was not involved in the Epoch study, said building more skilled AI systems can also come from training models that are more specialized for specific tasks. But he has concerns about training generative AI systems on the same outputs they’re producing, leading to degraded performance known as “model collapse.”

Epoch is a nonprofit institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism — a philanthropic movement that has poured money into mitigating AI’s worst-case risks. The team’s latest study is peer-reviewed and due to be presented at this summer’s International Conference on Machine Learning in Vienna, Austria. From the perspective of AI developers, Epoch’s study says paying millions of humans to generate the text that AI models will need “is unlikely to be an economical way” to drive better technical performance.

Besiroglu said AI researchers realized more than a decade ago that aggressively expanding two key ingredients — computing power and vast stores of internet data — could significantly improve the performance of AI systems. The amount of text data fed into AI language models has been growing about 2.5 times per year, while computing has grown about 4 times per year, according to the Epoch study. Facebook parent company Meta Platforms recently claimed the largest version of their upcoming Llama 3 model — which has not yet been released — has been trained on up to 15 trillion tokens, each of which can represent a piece of a word. While some have sought to close off their data from AI training — often after it’s already been taken without compensation — Wikipedia has placed few restrictions on how AI companies use its volunteer-written entries. Still, Deckelmann said she hopes there continue to be incentives for people to keep contributing, especially as a flood of cheap and automatically generated “garbage content” starts polluting the internet.
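A back-of-the-envelope sketch shows how growth rates like these produce a mid-decade exhaustion window. All the specific numbers below are illustrative assumptions, not Epoch’s actual estimates: demand starts at 15 trillion tokens (the reported Llama 3 training size), grows 2.5x per year (the rate the study cites for training data), and the stock of public human-written text is capped at an assumed 300 trillion tokens.

```python
def year_data_runs_out(start_year=2024, demand=15e12,
                       growth=2.5, supply=300e12):
    """Project the year training-data demand first exceeds supply.
    Illustrative only: demand, growth, and supply are assumed values,
    not figures from the Epoch study."""
    year = start_year
    while demand < supply:
        demand *= growth  # data appetite grows ~2.5x per year
        year += 1
    return year

print(year_data_runs_out())
```

With these assumed numbers the crossover lands in 2028, inside the study’s 2026–2032 window; doubling the assumed supply to 600 trillion tokens pushes it out by only one year, which is why fast exponential growth makes the projection fairly insensitive to the exact size of the stock.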

