Around 5% of New Wikipedia Articles in August Were AI Generated

Research from Princeton University, authored by Creston Brooks, Samuel Eggert, and Denis Peskoff, points to the growing presence of AI-generated content on Wikipedia and its implications for content quality, accountability, and bias amplification. The researchers used GPTZero (a proprietary AI detector) and Binoculars (an open-source alternative) to measure the extent of AI-generated content on Wikipedia.

They concluded that recent pages contain markedly more AI-generated content than pages from before the release of GPT-3.5.

The upside of these models is that they boost productivity, but unchecked reuse of AI-generated content as training data can degrade model performance and, with it, output quality.

Types of AI-Generated Content on Wikipedia

Researchers found that 4.36% of 2,909 English Wikipedia articles from August 2024 contained significant AI-generated content. GPTZero flagged 156 articles and Binoculars flagged 96, with 45 articles flagged by both tools. The flagged content was generally of lower quality, featuring fewer references and weaker integration into Wikipedia’s knowledge network. Some flagged articles were self-promotional, advertising businesses or individuals and often including only superficial citations, such as personal YouTube videos.

Others pushed political bias: eight articles advocated specific viewpoints on contentious topics, and in one case a banned user manipulated Albanian history entries and engaged in edit wars. Additionally, several users leveraged large language models (LLMs) to create well-structured content on niche topics like fungi, cuisine, and sports, as well as chapter-by-chapter book summaries.
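To make the detector counts in this section easier to compare, here is a minimal sketch that turns them into shares of the 2,909-article sample and uses inclusion-exclusion to count articles flagged by at least one tool. Only the four counts come from the article; the variable names and the `share` helper are illustrative assumptions, not anything taken from the paper.

```python
# Counts quoted in the article; all names below are illustrative assumptions.
TOTAL_ARTICLES = 2909      # English Wikipedia articles from August 2024
GPTZERO_FLAGGED = 156      # flagged by GPTZero
BINOCULARS_FLAGGED = 96    # flagged by Binoculars
FLAGGED_BY_BOTH = 45       # flagged by both tools


def share(count, total=TOTAL_ARTICLES):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Inclusion-exclusion: articles flagged by at least one of the two detectors.
flagged_by_either = GPTZERO_FLAGGED + BINOCULARS_FLAGGED - FLAGGED_BY_BOTH

summary = {
    "GPTZero": GPTZERO_FLAGGED,
    "Binoculars": BINOCULARS_FLAGGED,
    "Both tools": FLAGGED_BY_BOTH,
    "Either tool": flagged_by_either,
}

for label, count in summary.items():
    print(f"{label}: {count} articles ({share(count):.2f}%)")
```

The article does not state how the 4.36% figure was derived from these counts, so the sketch does not try to reproduce it; it only shows how far the two detectors agree on the same sample.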

Reddit Has the Lowest Share of AI-Generated Content Compared to Wikipedia and Others

Wikipedia, a publicly curated reference source, has become a key training dataset for LLMs due to its extensive content, curation standards, and open licensing. The study examines Wikipedia pages from August 2024 and compares them with a pre-GPT-3.5 dataset from before March 2022. The results show that the 2024 articles are flagged as AI-generated more often than the earlier ones.

The study also examines AI-generated content in Reddit comments and UN press releases, revealing varied usage and detection challenges. Among 3,000 Reddit comments, less than 1% were flagged as AI-generated, suggesting such content is rare, censored, or hard to detect. In contrast, the share of press releases from 60 UN country teams flagged as AI-generated surged from under 1% before 2022 to 20% in 2024.

AI Detection Still Remains a Challenge

The paper concludes by noting that AI detection tools are advancing alongside generative LLMs, but that evaluating these detectors across different contexts (such as text length, domain, and human-AI integration) remains a challenge.

The paper underscores the need for individuals, educators, companies, and governments to actively seek reliable methods to verify human-authored content. It’s high time regulators across the world came up with ways to tackle the menace of AI-generated content.

For instance, China is taking steps to increase transparency around AI-generated information published on the internet. The Cyberspace Administration of China (CAC), the country’s national internet regulator, recently released a draft regulation that includes labeling instructions for AI-generated content.

In India, earlier this year, MeitY issued an advisory (which was later revised) on the labeling of AI-related content online, which came after the Gemini AI fiasco. The original advisory also required companies to seek government approval before launching new models, a provision that drew heavy criticism on the grounds that it would hinder innovation.

The post Around 5% of New Wikipedia Articles in August Were AI Generated appeared first on AIM.
