Think Twice Before Joining Bluesky

Since election day in the US, Bluesky, a microblogging alternative to X (formerly Twitter), has been rapidly gaining popularity. Its user base has doubled since September, reaching 20 million users by November 20.

The platform is competing against Elon Musk’s X, which has approximately 611 million monthly active users, and Meta’s Threads, which boasts 275 million monthly active users.

Musk’s ownership of X and his close alliance with President-elect Donald Trump have made many users uncomfortable, which could be one reason people are leaving the platform. One report estimates that around 115,000 X accounts were deactivated in the US the day after the election.

Trump is reportedly considering appointing an AI czar under Musk’s guidance to oversee federal policies and the government’s use of artificial intelligence.

Unlike X, Bluesky offers an open API, which allows its data to be used to train AI models. Daniel van Strien, a machine learning engineer at Hugging Face, recently released a dataset containing one million public posts sourced from Bluesky’s Firehose API. The dataset included text, metadata, and language predictions.
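To illustrate what that openness means in practice, here is a minimal sketch of streaming public Bluesky posts. It uses Bluesky’s Jetstream JSON firehose rather than the raw Firehose API mentioned above; the endpoint URL and message fields are assumptions based on Bluesky’s public Jetstream documentation, not details from this article.

```python
# Minimal sketch: streaming public Bluesky posts from the Jetstream JSON firehose.
# The endpoint URL and message fields below are assumptions based on Bluesky's
# public Jetstream documentation, not details taken from this article.
import asyncio
import json

import websockets  # pip install websockets

JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)

async def stream_posts(limit: int = 10) -> None:
    async with websockets.connect(JETSTREAM_URL) as ws:
        seen = 0
        while seen < limit:
            event = json.loads(await ws.recv())
            commit = event.get("commit", {})
            # Only newly created posts carry a record with text.
            if commit.get("operation") == "create":
                record = commit.get("record", {})
                text = record.get("text")
                if text:
                    print(f"[{event.get('did')}] {text!r}")
                    seen += 1

if __name__ == "__main__":
    asyncio.run(stream_posts())
```

Anything a loop like this yields is public by design, which is what made van Strien’s dataset possible in the first place.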

However, he faced backlash over the lack of user consent.

“Hi, I do not consent for my posts or content to be used for AI purposes in any way, shape, or form for ethical reasons. Can you withdraw my account from your data scraping, please?” posted a user on Bluesky. Another user posted, “You’ve started a social trend of bad actors using the API to deliberately create antagonistic bsky datasets on Hugging Face (i.e., ‘two-million-bluesky-posts’ repo).” These were just a few of many such posts.

Van Strien eventually deleted the dataset and issued a public apology. “I’ve removed the Bluesky data from the repo. While my goal was to aid tool development for the platform, I understand this violated principles of transparency and consent. I sincerely apologise for this mistake,” he said on Bluesky.

With the situation escalating, Clem Delangue, CEO of Hugging Face, responded on X: “Surprisingly (or maybe not), it looks like there are a lot of toxic users on Bluesky. One of our team members made a mistake, and the reactions we’re getting are just awful (but also funny tbh). Let’s keep working on more positive public conversation spaces maybe?”

Bluesky itself, however, does not use user content to train AI models. “A number of artists and creators have made their home on Bluesky, and we hear their concerns with other platforms training on their data. We do not use any of your content to train generative AI, and have no intention of doing so,” the company said in a post.

After the Hugging Face incident, Bluesky clarified that it is an open and public social network, much like the web itself. Websites, however, can specify whether they consent to outside companies crawling their data through a robots.txt file, and Bluesky said it is working to introduce a similar mechanism.
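As a rough illustration of how that consent signal works on the web today, here is a small sketch using Python’s standard-library robots.txt parser. The rules and crawler names are illustrative, and honouring robots.txt is voluntary on the crawler’s part.

```python
# Minimal sketch: how robots.txt expresses crawl consent, and how a
# well-behaved crawler would check it. The site and rules are illustrative.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".strip().splitlines()

rp = RobotFileParser()
rp.parse(ROBOTS_TXT)

# An AI crawler that respects robots.txt would skip this site entirely...
print(rp.can_fetch("GPTBot", "https://example.com/post/123"))        # False
# ...while other crawlers remain free to index it.
print(rp.can_fetch("SomeOtherBot", "https://example.com/post/123"))  # True
```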

Controversy Over Data Scraping and User Consent

On November 15, X updated its terms of service. The new terms state that when users upload content (such as text or images), they permit X to use it for analysis, including to help train machine learning and artificial intelligence models.

This change was one of the factors that led users to migrate to Bluesky. Interestingly, Musk’s xAI is planning to launch its own Grok standalone app in December.

Similarly, Meta’s updated privacy policy specifies that Meta trains its models using users’ posts, photos, and captions. “We do not use the content of your private messages with friends and family to train our AIs unless you or someone in the chat chooses to share those messages with our AIs,” the company states.

Microsoft-owned LinkedIn recently introduced a new privacy setting that automatically enrols users in AI model training. On September 18, LinkedIn updated its privacy policy to state that user data could be used to develop and train AI models.

However, users can opt out by going to the data privacy tab in their account settings and disabling the ‘Data for Generative AI Improvement’ toggle. This opt-out only applies to future data use and does not affect any training already conducted.

Does It Matter?

Startups like OpenAI and Anthropic have already exhausted human-generated content to train their models and now rely on synthetic data for their upcoming frontier models. For instance, in India, Sarvam AI is using synthetic data created by Meta Llama 3.1 405B to train its model. However, asking for users’ consent before using their data is still essential, and there is no excuse for bypassing it.

OpenAI reportedly uses Strawberry (o1) to generate synthetic data for GPT-5. This sets up a ‘recursive improvement cycle,’ where each GPT version (e.g., GPT-5 or GPT-6) is trained on higher-quality synthetic data created by the previous model.
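As a purely conceptual illustration of that cycle, the loop looks roughly like the sketch below. The ToyModel class and helper function are hypothetical stand-ins, not OpenAI’s actual pipeline or any disclosed training code.

```python
# Purely conceptual sketch of a 'recursive improvement cycle': each model
# generation produces synthetic data used to train the next. ToyModel and
# train_next_model are hypothetical stand-ins, not real training code.
from dataclasses import dataclass


@dataclass
class ToyModel:
    generation: int

    def generate_synthetic_data(self, num_examples: int) -> list[str]:
        # Stand-in for a frontier model producing synthetic examples
        # (e.g. reasoning traces or question-answer pairs).
        return [f"synthetic example {i} from gen {self.generation}"
                for i in range(num_examples)]


def train_next_model(corpus: list[str], previous: ToyModel) -> ToyModel:
    # Stand-in for training a new model on the synthetic corpus.
    return ToyModel(generation=previous.generation + 1)


model = ToyModel(generation=1)   # conceptually, the current model
for _ in range(2):               # each pass yields the next generation
    corpus = model.generate_synthetic_data(num_examples=1_000)
    model = train_next_model(corpus, model)

print(model)  # ToyModel(generation=3)
```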
