KDnuggets News, July 19: ChatGPT Dethroned??? • Docker for Data Scientists • Reasoning with Tree of Thought Prompting

Features

  • ChatGPT Dethroned: How Claude Became the New AI Leader by Ignacio de Gregorio Noblejas
  • Docker Tutorial for Data Scientists by Bala Priya C
  • Exploring Tree of Thought Prompting: How AI Can Learn to Reason Through Search by Matthew Mayo

From Our Partners

  • Unlock DataOps Success with DataOps.live – Featured in Gartner Market Guide! by DataOps.live
  • Data access is severely lacking in most companies, and 71% believe synthetic data can help by Mostly AI
  • Neural Networks and Deep Learning: A Textbook (2nd Edition) by Charu Aggarwal
  • Data storytelling — the art of telling stories through data by Manning

This Week's Posts

  • Will 300 million Jobs really be Exposed or Lost to AI Replacement? by Nate Rosidi
  • 7 Steps to Mastering Data Science Project Management with Agile by Nisha Arya
  • 2023 Data Scientists Salaries by Benjamin O. Tayo
  • Where Does AI Happen? by KDnuggets
  • Mistakes That Newbie Data Scientists Should Avoid by Nisha Arya
  • Database Optimization: Exploring Indexes in SQL by Aryan Garg
  • A Practical Approach To Feature Engineering In Machine Learning by Nahla Davies
  • The First Half of 2023: Data Science and AI Developments by Nisha Arya
  • Automating the Chain of Thought: How AI Can Prompt Itself to Reason by Matthew Mayo
  • ChatGPT-Powered Data Exploration: Unlock Hidden Insights in Your Dataset by Bala Priya C
  • H1 2023 Analytics & Data Science Spend & Trends Report by All Things Insights
  • Ensuring Reliable Few-Shot Prompt Selection for LLMs by Chris Mauck

From Around The Web

  • Pandas Crash Course for Data Scientists by Data Science Horizons
  • Advanced Techniques for Research with ChatGPT by Kanwal Mehreen
  • Level Up Your Python Code with Type Hints by Python Power Programming
  • GenAIOps: Evolving the MLOps Framework by David Sweenor
  • Data Version Control Tools for Machine Learning by Nisha Arya

More On This Topic

  • Exploring Tree of Thought Prompting: How AI Can Learn to Reason Through…
  • Unraveling the Power of Chain-of-Thought Prompting in Large Language Models
  • KDnuggets News, July 5: A Rotten Data Science Project • 10 AI Chrome…
  • ChatGPT Dethroned: How Claude Became the New AI Leader
  • KDnuggets™ News 22:n09, Mar 2: Telling a Great Data Story: A…
  • Orca LLM: Simulating the Reasoning Processes of ChatGPT

DSC Weekly 25 July 2023

Announcements

  • With numerous cyber threats lurking and many available attack vectors, organizations must have a comprehensive view of what they are up against and how best to face possible attacks. Join the Enabling Threat Detection and Response summit to hear from leading experts about the most common, pervasive threats striking companies, the best monitoring and analytics strategies out there to quell them, and the most effective methods for stopping threats.
  • With more data at your disposal than ever, data management and analytics have never been more critical to defining long-term success. Join the Managing Hybrid and Multi-Cloud Environments summit to explore how AI and ML are shaping the future of data analytics. You’ll discover strategies to implement deep learning, neural networks, RPA, NLP and more while harnessing dashboards and visualizations to give teams access to valuable, easy-to-digest, real-time data insights.

Top Stories

  • The AI content + data mandate and personal branding
    July 25, 2023
    by Alan Morrison
    Fair Data Forecast interview with Andreas Volpini, CEO of WordLift. Andreas Volpini believes every user who wants to build a personal brand online has to proactively curate their online presence first. He sees structured data (semantic entity and attribute metadata, such as Schema.org) as key to building a cohesive, disambiguated personal presence online. Volpini has…
  • AI is a child: How do we raise it?
    July 24, 2023
    by Dan Allen
    In October 2022, the White House Office of Science and Technology Policy published “The Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People”. This attention from our government to what could be called an AI EQ (emotional quotient) is reminiscent of how to parent or raise a child.
  • Generative AI megatrends: Are companies using the excuse of AI to get rid of jobs?
    July 24, 2023
    by ajitjaokar
    In this blog, I will now focus on generative AI megatrends. By that, I mean trends, and the underlying trends behind them, that could be big in the future – focusing on the technology of LLMs but also the wider impact of LLMs on the economy and society. I will hence identify and follow some key trends –…

In-Depth

  • From automation to optimization: How AI is revolutionizing digital marketing campaigns
    July 25, 2023
    by Erika Balla
    Welcome to the exciting world of digital marketing! In this blog, we’ll delve into this thrilling frontier where optimization meets automation and Artificial Intelligence is at the center. No longer must manual labor and guesswork play an essential part in developing effective marketing strategies; with AI’s capabilities now at their disposal, marketers with digital presence…
  • Sentience: Consciousness is inessential for LLMs, AI
    July 24, 2023
    by David Stephen
    There is a recent paper in Synthese, Qualia share their correlates’ locations, where the abstract stated that “This paper presents the location-sharing argument, which concludes that qualia must share the locations of their physical correlates. The first premise is a consequence of relativity: If something shares a time with a physical event in all reference…
  • Innovations in predictive analytics: ML and generative AI
    July 24, 2023
    by Prasanna Chitanand
    With the introduction of ChatGPT and DALL-E 2, the majority of investors started showing interest in businesses building generative AI. Moreover, generative AI alone is not enough to meet the needs of the AI revolution. The success of predictive models is relevant to the science-fiction future that the majority of customers want…
  • How to manage real-time data in the digital age
    July 21, 2023
    by Anas Baig
    In today’s tech-driven world, data is like gold. It’s becoming more and more common for companies to use real-time, or live, data to make informed decisions, improve the service they give to customers, and get a leg up on the competition. But handling real-time data can be tricky because there’s so much of it, it’s…
  • Difference Between Modern and Traditional Data Quality – DQLabs
    July 20, 2023
    by Edwin Walker
    Modern data quality practices make use of new technology, automation, and machine learning to handle a variety of data sources, ensure real-time processing, and stimulate stakeholder collaboration. Data governance, continuous monitoring, and proactive management are prioritized to ensure accurate, reliable, and fit-for-purpose data for informed decision-making and corporate success. Modern data quality practices differ from…
  • How much coding is needed in a data science career?
    July 20, 2023
    by Aileen Scott
    The most common question among people who are not from a technical background is how much coding is required to ace a data science career path. If you also have the same question, you are not alone. But the surprising answer is “it depends”. Unarguably, coding is a crucial aspect and vital tool for…
  • DSC Weekly 18 July 2023
    July 18, 2023
    by Scott Thompson
    Read more of the top articles from the Data Science Central community.

I read the news today, oh bot! AI-generated anchors are making headlines in India


Meet 'Lisa', an AI-generated news anchor for OTV News in India.

Roll over, Walter Cronkite.

The TV news anchor is a time-tested tradition for providing some degree of comfort to the business of delivering our daily dose of headlines. We wake up each day to a familiar morning news anchor and go to bed each night with the evening news delivered by another familiar face. But as AI threatens to take over many of today's jobs, how safe is the news reader's position? One network in India is trying to answer that.

Odisha TV, a news channel and digital platform from India, recently tested out Lisa, an AI-generated news anchor. With a monotone voice and eyes that don't quite close when they blink, Lisa reads the news headlines for the network periodically, and she's not alone.

Also: Bing AI chat expands to Chrome and Safari for select users

According to the South China Morning Post, Lisa is one of two multilingual chatbots that have been added to news networks in India in the past three months. Sana, the other AI-generated news anchor, 'works' for the network Aaj Tak, owned by the India Today group.

AI-generated news anchor Sana reads the headlines for Aaj Tak.

Though the developers employ some subtleties to make the anchors appear more human, the result tends to trigger uncanny-valley reactions. Sana often shifts from one foot to the other, and Lisa folds her hands and fidgets with her fingers, gestures that, on their own, would feel "normal" in a human being. Still, the repetitiveness of the AI bots' movements, combined with their monotone voices and stiff facial expressions, creates the eerie feeling that you're watching something not quite human.

But Lisa and Sana are always available, never get sick or tired, don't go on strike or take PTO, and won't age. Still, India Today and Odisha TV claim that they have added these AI anchors not to replace their human counterparts but to complement them by taking over repetitive and mundane tasks.

Also: You can now chat with a famous AI character on Viber. Here's how

Currently, Sana and Lisa are tasked with reading the headlines during a broadcast or news program and handing them over to a human presenter. Sana, however, is being trained to conduct debates with human and AI panelists.

Reception of the AI-generated news anchors has been mixed. Supporters encourage the networks' embrace of new technologies, the ability to provide news faster during elections and other critical times, and the language diversity. In contrast, naysayers oppose artificial intelligence replacing people and the lack of human nuances.

Also: Singapore looks for generative AI use cases with sandbox options

Here's another ethical conundrum: The racial and sexist discrimination that can result when human beings create AI bots in their own image. As network executives decide every physical aspect of an anchor's appearance, there's a real possibility for the arbitrary exclusion of different ethnic groups or physical features.

Artificial Intelligence

How Google’s latest AI model is generating music from your brain activity


Google isn't new to using AI for creating music, launching its MusicLM in January to generate music from text. Now Google has upped the ante and is using AI to read your brain — and produce sound based on your brain activity.

In a new research paper, Brain2Music, Google uses AI to reconstruct music from brain activity as seen through functional magnetic resonance imaging (fMRI) data.

Also: How I used ChatGPT to write a custom JavaScript bookmarklet

Researchers studied the fMRI data collected from five test subjects who listened to the same 15-second music clips across different genres, including blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, and rock.

Then they used that data to train a deep neural network to learn about the relationship between brain activity patterns and different elements of music, such as rhythm and emotion.

Once trained, the model could reconstruct music from fMRI data with the help of MusicLM. Since MusicLM generates music from text, it was conditioned to create music similar to the original musical stimuli on a semantic level.

When put to the test, the generated music resembled the musical stimuli that the participant initially listened to in features such as genre, instrumentation, mood, and more.

On the research project's site, you can listen to several clips of the original music stimuli and compare them to the reconstructions that MusicLM generated. The results are pretty incredible.

Also: You can now chat with a famous AI character on Viber. Here's how

For one clip, the stimulus was a 15-second clip of the iconic "Oops!…I Did It Again" by Britney Spears. The three reconstructions were poppy and upbeat in nature, like the original.

The audio, of course, did not resemble that of the original since the study focuses on the different elements of the music, not the lyrical component.

Essentially, the model can read your mind (technically your brain patterns) to produce music similar to what you were listening to.

OpenAI scuttles AI-written text detector over ‘low rate of accuracy’

Devin Coldewey, TechCrunch

OpenAI has shut down its AI classifier, a tool that claimed to determine the likelihood a text passage was written by another AI. While many used and perhaps unwisely relied on it to catch low-effort cheats, OpenAI has retired it over its widely criticized “low rate of accuracy.”

The theory that AI-generated text has some identifying feature or pattern that can be detected reliably seems intuitive, but so far it has not really been borne out in practice. Although some generated text may have an obvious tell, the differences between large language models, and the rapidity with which they have developed, have made those tells all but impossible to rely on.

TechCrunch’s own test of a gaggle of AI-writing-detection tools concluded that they are at best hit or miss and at worst totally worthless. Of seven generated text snippets given to a variety of detectors, GPTZero identified five correctly and OpenAI’s classifier only one. And that was with a language model that was not cutting-edge even at the time.

But some took the claims of detection at face value, or rather well above it, since OpenAI shipped the classifier tool with a list of limitations significant enough that one wondered why they put the thing out at all. People who were worried that their students, job applicants, or freelancers were submitting generated text would put it into the classifier to test it, and while the results should not have been trusted, they sometimes were.

Given that language models have only improved and proliferated, it seems someone at the company decided it was time they took this fickle tool offline. “We are working to incorporate feedback and are currently researching more effective provenance techniques for text,” reads a July 20 addendum to the classifier announcement post. (Decrypt seems to have been the first to notice the change.)


I asked about the timing and reasoning behind shuttering the classifier and will update if I hear back. But it’s curious that it should happen around the time OpenAI joined several other companies in a White House–led “voluntary commitment” to develop AI ethically and transparently.

Among the commitments made by the companies is that of developing robust watermarking and/or detection methods. Or attempting to do so, anyway: Despite every company making noise to this effect over the last six months or so, we have yet to see any watermark or detection method that is not trivially circumvented.

No doubt the first to accomplish this feat will be richly rewarded (any such tool, if truly reliable, would be invaluable in countless circumstances), so it is probably superfluous to make it a part of any AI accords.

The AI Feedback Loop: Maintaining Model Production Quality In The Age Of AI-Generated Content


Production-deployed AI models need a robust and continuous performance evaluation mechanism. This is where an AI feedback loop can be applied to ensure consistent model performance.

Take it from Elon Musk:

“I think it’s very important to have a feedback loop, where you’re constantly thinking about what you’ve done and how you could be doing it better.”

For all AI models, the standard procedure is to deploy the model and then periodically retrain it on the latest real-world data so that its performance doesn't deteriorate. But with the meteoric rise of generative AI, AI model training has become unreliable and error-prone. That’s because online data sources (the internet) are gradually becoming a mixture of human-generated and AI-generated data.

For instance, many blogs today feature AI-generated text powered by LLMs (Large Language Models) like ChatGPT or GPT-4. Many data sources contain AI-generated images created using DALL-E 2 or Midjourney. Moreover, AI researchers are using synthetic data generated with generative AI in their model training pipelines.

Therefore, we need a robust mechanism to ensure the quality of AI models. This is where the need for AI feedback loops has become more amplified.

What is an AI Feedback Loop?

An AI feedback loop is an iterative process where an AI model's decisions and outputs are continuously collected and used to enhance or retrain the same model, resulting in continuous learning, development, and model improvement. In this process, the AI system's training data, model parameters, and algorithms are updated and improved based on input generated from within the system.

Mainly there are two kinds of AI feedback loops:

  1. Positive AI Feedback Loops: When AI models generate accurate outcomes that align with users’ expectations and preferences, the users give positive feedback via a feedback loop, which in turn reinforces the accuracy of future outcomes. Such a feedback loop is termed positive.
  2. Negative AI Feedback Loops: When AI models generate inaccurate outcomes, the users report flaws via a feedback loop, which in turn helps improve the system’s stability by fixing those flaws. Such a feedback loop is termed negative.

Both types of AI feedback loops enable continuous model development and performance improvement over time. And they are not used or applied in isolation. Together, they help production-deployed AI models know what is right or wrong.
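
As a concrete illustration, positive and negative feedback can be logged as simple rating events and split into a reinforcement set and a correction set for the next retraining round. The sketch below is a minimal, hypothetical example; `FeedbackEvent` and the labeling scheme are illustrative assumptions, not any specific product's API.

```python
# Minimal sketch: turning user ratings into retraining sets.
# All names here are illustrative assumptions, not a real product's API.
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    prompt: str
    model_output: str
    rating: int  # +1 = positive (thumbs up), -1 = negative (thumbs down)

def build_retraining_sets(events):
    """Split logged feedback into reinforcement and correction sets."""
    positives = [(e.prompt, e.model_output) for e in events if e.rating > 0]
    negatives = [(e.prompt, e.model_output) for e in events if e.rating < 0]
    return positives, negatives

events = [
    FeedbackEvent("translate 'hola'", "hello", +1),
    FeedbackEvent("capital of France", "Lyon", -1),
]
pos, neg = build_retraining_sets(events)
# pos reinforces good behaviour; neg is queued for correction or SME review.
```

In practice, the positive set reinforces behaviour the model should keep, while the negative set feeds the correction side of the loop.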

Stages Of AI Feedback Loops

A high-level illustration of the feedback mechanism in AI models. Source

Understanding how AI feedback loops work is key to unlocking the full potential of AI development. Let's explore the various stages of AI feedback loops below.

  1. Feedback Gathering: Gather relevant model outcomes for evaluation. Typically, users give their feedback on the model outcome, which is then used for retraining. Or it can be external data from the web curated to fine-tune system performance.
  2. Model Re-training: Using the gathered information, the AI system is re-trained to make better predictions, provide answers, or carry out particular activities by refining the model parameters or weights.
  3. Feedback Integration & Testing: After retraining, the model is tested and evaluated again. At this stage, feedback from Subject Matter Experts (SMEs) is also included for highlighting problems beyond data.
  4. Deployment: The model is redeployed after verifying changes. At this stage, the model should report better performance on new real-world data, resulting in an improved user experience.
  5. Monitoring: The model is monitored continuously using metrics to identify potential deterioration, like drift. And the feedback cycle continues.
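
The five stages above can be sketched as a single loop. The example below is a toy illustration that uses a running-mean "model" so the loop structure stays visible; the function names and the error threshold are assumptions made for the sketch, not part of any real pipeline.

```python
# Toy sketch of the five feedback-loop stages with a trivial "model"
# (a running mean). Names and thresholds are illustrative assumptions.
import statistics

def retrain(feedback_values):
    """Stage 2: 'retrain' by refitting the model parameter (here, a mean)."""
    return statistics.fmean(feedback_values)

def evaluate(model_param, holdout):
    """Stage 3: mean absolute error on a held-out evaluation set."""
    return statistics.fmean(abs(model_param - y) for y in holdout)

def feedback_cycle(initial_param, feedback_batches, holdout, max_error=1.0):
    param = initial_param
    for batch in feedback_batches:                     # Stage 1: gather feedback
        candidate = retrain(batch)                     # Stage 2: retrain
        if evaluate(candidate, holdout) <= max_error:  # Stage 3: test
            param = candidate                          # Stage 4: deploy
        # Stage 5: monitoring happens on the next batch of feedback
    return param

final_param = feedback_cycle(0.0, [[1, 2, 3], [2, 3, 4]], holdout=[2, 3])
```

A candidate model is only "deployed" when it passes evaluation on the holdout set, mirroring stages 3 and 4 above.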

The Problems in Production Data & AI Model Output

Building robust AI systems requires a thorough understanding of the potential issues in production data (real-world data) and model outcomes. Let’s look at a few problems that become a hurdle in ensuring the accuracy and reliability of AI systems:

  1. Data Drift: Occurs when the model starts receiving real-world data from a different distribution compared to the model's training data distribution.
  2. Model Drift: The model’s predictive capabilities and efficiency decrease over time due to changing real-world environments. This is known as model drift.
  3. AI Model Output vs. Real-world Decision: AI models produce inaccurate output that doesn’t align with real-world stakeholder decisions.
  4. Bias & Fairness: AI models can develop bias and fairness issues. For example, in a TED talk by Janelle Shane, she describes Amazon’s decision to stop working on a résumé sorting algorithm due to gender discrimination.
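
Data drift (problem 1) is often quantified with the Population Stability Index: bin the training distribution, then compare bin frequencies on live data. The sketch below is a generic illustration, not tied to any particular tool; the 0.2 alert threshold is a common rule of thumb rather than a universal standard.

```python
# Population Stability Index (PSI) sketch for detecting data drift.
# The 0.2 alert threshold is a widely used rule of thumb, not a standard.
import numpy as np

def psi(expected, actual, bins=10):
    """PSI between a reference sample (training) and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) in empty bins.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
same = rng.normal(0.0, 1.0, 10_000)     # live data from the same distribution
shifted = rng.normal(1.5, 1.0, 10_000)  # live data whose mean has drifted
```

Here `psi(train, same)` stays near zero, while `psi(train, shifted)` exceeds the alert threshold and would trigger the monitoring stage of the feedback loop.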

Once the AI models start training on AI-generated content, these problems can increase further. How? Let’s discuss this in more detail.

AI Feedback Loops in the Age of AI-generated Content

In the wake of rapid generative AI adoption, researchers have studied a phenomenon known as Model Collapse. They define model collapse as:

“Degenerative process affecting generations of learned generative models, where generated data end up polluting the training set of the next generation of models; being trained on polluted data, they then misperceive reality.”

Model Collapse consists of two special cases,

  • Early Model Collapse happens when “the model begins losing information about the tails of the distribution,” i.e., the extreme ends of the training data distribution.
  • Late Model Collapse happens when the “model entangles different modes of the original distributions and converges to a distribution that carries little resemblance to the original one, often with very small variance.”
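
The tail-loss behaviour of early Model Collapse can be reproduced with a toy experiment: repeatedly fit a Gaussian to a small sample, then draw the next "generation" of training data from the fitted model. This is a deliberately simplified illustration of the quoted definition, not the paper's experimental setup.

```python
# Toy simulation of model collapse: each generation is trained (fit)
# on the previous generation's synthetic output. Simplified illustration,
# not the cited paper's actual setup.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=10)  # generation 0: "real" data
stds = [data.std()]
for _ in range(300):  # each generation trains on the previous one's output
    mu, sigma = data.mean(), data.std()      # fit a Gaussian "model"
    data = rng.normal(mu, sigma, size=10)    # synthetic data replaces real data
    stds.append(data.std())
```

Because each generation re-estimates the spread from a small synthetic sample, the estimated standard deviation drifts toward zero over generations; the tails of the distribution vanish first, just as the early-collapse description says.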

Causes Of Model Collapse

For AI practitioners to address this problem, it is essential to understand the reasons for Model Collapse, grouped into two main categories:

  1. Statistical Approximation Error: This is the primary error, caused by the finite number of samples; it disappears as the sample count approaches infinity.
  2. Functional Approximation Error: This error arises when models, such as neural networks, fail to capture the true underlying function that has to be learned from the data.
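
The statistical approximation error (cause 1) can be seen directly in a toy experiment: the error of a sample mean shrinks roughly as 1/sqrt(n) as the sample grows, vanishing in the infinite-sample limit. The helper below is illustrative, averaged over repeated trials to smooth out noise.

```python
# Illustration of statistical approximation error: the sample mean's
# error shrinks as the sample size n grows (roughly as 1/sqrt(n)).
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_error(n, trials=200):
    """Average |sample mean - true mean| over repeated draws (true mean = 0)."""
    samples = rng.normal(0.0, 1.0, size=(trials, n))
    return float(np.mean(np.abs(samples.mean(axis=1))))

err_small, err_large = mean_abs_error(100), mean_abs_error(10_000)
# err_large is far smaller than err_small: the error vanishes as n grows.
```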

A sample of model outcomes across multiple model generations affected by Model Collapse. Source

How AI Feedback Loop Is Affected Due To AI-Generated Content

When AI models train on AI-generated content, it has a destructive effect on AI feedback loops and can cause many problems for the retrained AI models, such as:

  • Model Collapse: As explained above, Model Collapse is a likely possibility if the AI feedback loop contains AI-generated content.
  • Catastrophic Forgetting: A typical challenge in continual learning is that the model forgets previous samples when learning new information. This is known as catastrophic forgetting.
  • Data Pollution: It refers to feeding manipulative synthetic data into the AI model to compromise performance, prompting it to produce inaccurate output.

How Can Businesses Create a Robust Feedback Loop for Their AI Models?

Businesses can benefit by using feedback loops in their AI workflows. Follow the three main steps below to enhance your AI models' performance.

  • Feedback From Subject Matter Experts: SMEs are highly knowledgeable in their domain and understand the use of AI models. They can offer insights to increase model alignment with real-world settings, giving a higher chance of correct outcomes. Also, they can better govern and manage AI-generated data.
  • Choose Relevant Model Quality Metrics: Choosing the right evaluation metric for the right task and monitoring the model in production based on these metrics can ensure model quality. AI practitioners also employ MLOps tools for automated evaluation and monitoring to alert all stakeholders if model performance starts deteriorating in production.
  • Strict Data Curation: As production models are re-trained on new data, they can forget past information, so it is crucial to curate high-quality data that aligns well with the model’s purpose. This data can be used to re-train the model in subsequent generations, along with user feedback to ensure quality.
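
The metric-monitoring advice above can be sketched as a small rolling-accuracy check that flags retraining when production quality falls below the offline baseline. The class name, window size, and tolerance are illustrative assumptions, not a specific MLOps tool's API.

```python
# Sketch of production quality monitoring: flag retraining when rolling
# accuracy drops below the offline baseline. Names and thresholds are
# illustrative assumptions.
from collections import deque

class QualityMonitor:
    def __init__(self, baseline_accuracy, window=100, tolerance=0.05):
        self.baseline = baseline_accuracy
        self.tolerance = tolerance
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, correct: bool):
        self.outcomes.append(1 if correct else 0)

    def needs_retraining(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        rolling = sum(self.outcomes) / len(self.outcomes)
        return rolling < self.baseline - self.tolerance

monitor = QualityMonitor(baseline_accuracy=0.90, window=10, tolerance=0.05)
for correct in [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]:  # rolling accuracy = 0.5
    monitor.record(bool(correct))
# monitor.needs_retraining() is now True, since 0.5 < 0.90 - 0.05.
```

A real deployment would wire this check into alerting so that all stakeholders are notified when performance deteriorates, as described above.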

To learn more about AI advancements, go to Unite.ai.

ChatGPT Android App Now Available For Indian Users


OpenAI has announced that ChatGPT for Android is now available for download on the Google Play Store. As per the company’s tweet, availability starts with the US, India, Bangladesh, and Brazil, followed by other countries in a staged rollout, similar to what we saw for the iOS version.

ChatGPT for Android is now available for download in the US, India, Bangladesh, and Brazil! We plan to expand the rollout to additional countries over the next week. https://t.co/NfBDYZR5GI

— OpenAI (@OpenAI) July 25, 2023

Last week, OpenAI announced that the Android version of its chatbot would be released in the last week of July and that users could pre-order it. Interestingly, this release coincides with reports of ChatGPT facing a slump in traffic and experiencing slower response times.

With its availability in India, where it supports Hindi responses (although proficiency may not match that of English), the ChatGPT app is poised to attract a large user base and further solidify its position as a leading AI chatbot platform.

OpenAI’s ChatGPT app offers users a versatile and interactive platform to access information and solutions on the go. Its availability on mobile devices provides convenience and flexibility, and the app’s intuitive interface ensures a smooth user experience. Interestingly, users can now give prompts through voice in the app.

ChatGPT launched in November 2022, and the app has seen remarkable success since its initial release in the United States. Within a mere six days, it garnered over half a million downloads, solidifying its position as one of the top-performing new apps. Impressively, the app outperformed other AI and chatbot applications, as well as popular apps like Microsoft Edge and Bing, in terms of download numbers.

The post ChatGPT Android App Now Available For Indian Users appeared first on Analytics India Magazine.

Esperanto Merging HPC and ML in Upcoming RISC-V Processor

July 25, 2023 by Agam Shah

Esperanto Technologies has ambitious plans for its next RISC-V processor: to undo the accelerator model and build a chip that has both CPU and GPU capabilities for machine learning and high-performance computing.

"X86 is just too heavyweight to serve as both main CPU. The accelerators and the GPUs are just too hard to program and they can't really serve as your main CPU, right? RISC-V really has the ability to do both things," said Dave Ditzel, CEO of Esperanto, which designs RISC-V processors, during a presentation at last month's RISC-V Summit held in Barcelona, Spain.

Ditzel also shared some details about its next-generation chip that he hopes would serve that purpose: do double-precision computing for HPC, and lower-precision computations for machine learning applications.

The ET-SoC-2 will include new high-performance CPU cores with the RISC-V vector extensions. The RISC-V standards-setting organization, RISC-V International, is in the process of ratifying new vector and floating-point specifications to be included in the base instructions. The full list is available here.

"It's going to have pretty substantial performance for one low-power chip," Ditzel said.

RISC-V is an instruction set architecture that can be licensed for free, and server chip makers are now building processors for AI and enterprise applications. Esperanto is currently selling the ET-SoC-1 chip, and will advance those high-performance computing capabilities in the next chip.

Ditzel did not provide benchmarks for ET-SoC-2 but said it could provide in excess of 10 teraflops of double-precision performance in a single chip.

Dave Ditzel, CEO of Esperanto, presented at last month's RISC-V Summit in Barcelona, Spain.

"This system is meant to put hundreds or 1000s of these chips cooperatively working together in the future," Ditzel said.

Esperanto's focus is more on power efficiency than raw performance, which has been Ditzel's focus for decades. In 1995, he co-founded Transmeta, which had a software-defined chip that emulated x86 processors and was focused on power efficiency.

Ditzel said the current ET-SoC-1 chip provides 32 CPU cores on a single chip at up to 40 watts, depending on the application.

"Within the next five years, a RISC-V-based system will win what's called the Green500 award," Ditzel said.

Green500 is a separate ranking system for the most energy-efficient systems of the Top500 supercomputers in the world.

Ditzel said racks of ET-SoC-1 chips can provide petaflops of performance. But RISC-V lacks a coherent software ecosystem, with very limited application and OS support.

RISC-V processors lag x86 and ARM processors in performance, though companies are closing the gap.

"When people say 'Oh, RISC-V is 10 years behind ARM,' the answer is yeah, but it is not going to take 10 years to catch up. It will take maybe a couple of years to catch up," Ditzel said.

Esperanto’s competitors include Ventana, which makes the Veyron V1 chip, which is based on a chiplet design. The Veyron chip can scale up to 16 cores and has L1, L2 and L3 cache, and is being offered as a chip or for licensing.

Many European researchers are also testing out RISC-V development boards as microservers for application development and testing.

Related

10 Brilliant Datasets Based on ChatGPT Outputs

By now, no one on the internet has remained untouched by the power of ChatGPT (based on GPT-3.5 and GPT-4), Silicon Valley’s favourite chatbot. With over 100 million users, the OpenAI model has also captivated the research community. Since the release of GPT-4, AI researchers have been using the model’s outputs to train their own language models and to build datasets for benchmarking.

Here are 10 datasets built on GPT-4 outputs, handpicked for GPT-4 enthusiasts!

LIMA

Researchers at Meta AI have unveiled ‘LIMA: Less Is More for Alignment’, a small dataset containing 1,000 examples (available on Hugging Face). The study suggests that LIMA can push forward the research for developing proficient LLMs. Notably, the researchers demonstrated that a 65B LLaMA model, fine-tuned only on these 1,000 examples using a supervised approach, achieved competitive performance compared to ChatGPT.

Find the repository here.

MiniGPT-4

Researchers from Vision-CAIR introduced MiniGPT-4, pre-trained and aligned with Vicuna-7B. The updated model shows a significant reduction in GPU memory consumption – as low as 12GB. The researchers propose a novel approach in which quality image-text pairs are generated by the model itself together with ChatGPT. This methodology allows for the creation of a compact yet high-quality dataset of 3,500 pairs in total.

Find the GitHub repository here.

Dolly

Dolly, a groundbreaking open source project by Databricks, shows the capability of transforming a pre-existing, outdated open-source LLM into a ChatGPT-like system to follow instructions swiftly. This is made possible by a mere 30-minute training process on a single machine, utilising high quality training data.

Notably, the underlying model in Dolly comprises only 6 billion parameters, compared to other models with far larger parameter counts. The researchers also released a successor to the model, Dolly 2.0, which was lauded by the open-source community.

Find the GitHub repository here.

Code Alpaca

The Code Alpaca project aims to construct and distribute an instruction-following model based on Meta AI’s LLaMA, designed specifically for code generation. This repository is built upon Stanford’s Alpaca, with the only modification being the data used for training. The training method remains the same as the original approach.

To produce the Code Alpaca models, 7B and 13B LLaMA models were fine-tuned on a dataset of 20,000 instruction-following examples, generated through techniques inspired by the Self-Instruct paper, with certain adaptations for better outputs.
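The Self-Instruct idea mentioned above can be sketched as a bootstrapping loop: seed tasks are sampled into a prompt, a model proposes a new instruction, and the model's completion of that instruction becomes a training pair. The sketch below uses a stub `ask_model` function in place of a real LLM API call; the prompt wording and de-duplication are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a Self-Instruct-style bootstrapping loop.
# `ask_model` is a stand-in stub; a real pipeline would query an LLM here.

import random

def ask_model(prompt):
    # Stub response; in practice this would be an LLM API call.
    return "Write a function that reverses a string."

def self_instruct(seed_tasks, rounds=3):
    dataset = []
    pool = list(seed_tasks)
    for _ in range(rounds):
        # Sample a few existing tasks as in-context examples.
        examples = random.sample(pool, k=min(2, len(pool)))
        prompt = "Propose a new coding task similar to:\n" + "\n".join(examples)
        new_task = ask_model(prompt)
        if new_task not in pool:  # crude de-duplication of proposed tasks
            pool.append(new_task)
            dataset.append({"instruction": new_task,
                            "output": ask_model(new_task)})
    return dataset

data = self_instruct(["Sort a list.", "Parse a CSV file."])
```

Each resulting `{"instruction", "output"}` pair is then formatted and used for supervised fine-tuning, exactly as with Alpaca's original 52k dataset.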

Find the GitHub repository here.

Instruction Tuning with GPT4

GPT-4-LLM’s primary objective is to facilitate the sharing of data produced by GPT-4, which can be used for building instruction-following LLMs through supervised and reinforcement learning techniques.

This project pushes the boundaries of instruction tuning in the LLM world, as it is one of the first initiatives to leverage OpenAI’s GPT-4 for generating instruction-following data specifically tailored for LLM fine-tuning. Notably, the development holds the potential to advance the state of the art in language model training.

Find the GitHub repository here.

LLaVA-Instruct-150K

LLaVA Visual Instruct 150K is a collection of multimodal instruction-following data, generated using GPT. The dataset is curated for visual instruction tuning, to enhance the development of large multimodal models with advanced vision and language capabilities, geared towards the GPT-4 vision/language framework. The dataset holds great promise for research in the intersection of vision and language for creating capable multimodal models.

Find the GitHub repository here.

UltraChat

UltraChat offers valuable open-source, large-scale, and multi-round dialogue data powered by ChatGPT Turbo APIs. To prioritise privacy protection, the data collection process does not directly use any internet-based prompts. Furthermore, to maintain high standards of generation quality, a dual API approach is used.

One API operates as the user, generating queries, while the other API assumes the role of generating responses. This approach ensures a reliable dialogue generation process, promoting advancements in conversational AI while also prioritising privacy and data integrity.
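The dual-API setup described above can be pictured as two model instances talking to each other: one prompted to act as a user, the other as an assistant. The sketch below uses `user_api` and `assistant_api` stubs in place of real ChatGPT Turbo API calls, so only the turn-taking structure is shown; the function names and canned outputs are assumptions for illustration.

```python
# Sketch of UltraChat's dual-API dialogue generation: one stub plays the
# user (issuing queries), the other plays the assistant (answering them).
# Both stubs stand in for real ChatGPT Turbo API calls.

def user_api(history):
    # Stand-in: a real system would prompt ChatGPT to act as a curious user.
    return f"Question {len(history) // 2 + 1} about machine learning?"

def assistant_api(history):
    # Stand-in: a real system would prompt ChatGPT to answer the last query.
    return f"Answer to: {history[-1]['content']}"

def generate_dialogue(turns=3):
    history = []
    for _ in range(turns):
        history.append({"role": "user", "content": user_api(history)})
        history.append({"role": "assistant", "content": assistant_api(history)})
    return history

dialogue = generate_dialogue()
```

Because the "user" side is itself a model, no human prompts ever enter the pipeline, which is how the design sidesteps the privacy concerns mentioned above.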

Find the GitHub repository here.

GPTeacher

GPTeacher is a compilation of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer. Each dataset serves a specific purpose, and together they form a valuable resource for researchers. With GPT-4’s data generation prowess, these datasets showcase the model’s versatility and contribute to the landscape of language modelling.

Find the GitHub repository here.

ShareGPT

This collection of 70k user-shared conversations, gathered through public APIs, has served as the foundational dataset for Vicuna-13B, an open-source chatbot. The data was collected via ShareGPT’s open-source Chrome extension, which users employed to share their ChatGPT conversations before OpenAI introduced a sharing feature in the chatbot itself.

Find the Hugging Face repository here.

HC3

The HC3 (Human ChatGPT Comparison Corpus) dataset is an extensive collection of approximately 40k questions paired with responses from both human experts and ChatGPT.

The primary aim of this dataset is to analyse and compare ChatGPT’s responses against human-generated answers. The questions span a range of subjects, including open-domain, financial, medical, legal, and psychological areas.

Find the Hugging Face repository here.

The post 10 Brilliant Datasets Based on ChatGPT Outputs appeared first on Analytics India Magazine.

Advance your Career with the 3rd Best Online Master’s in Data Science Program

Sponsored Post

Go beyond business analytics with Bay Path University's MS in Applied Data Science. Data science teams need general industry experts who understand data science and technical specialists who can make it happen. Bay Path University will provide you with a career path in data science, regardless of your background and experience. We were one of the first institutions to develop two tracks for completing the Master of Science (MS) in Applied Data Science degree. Which one is right for you?

Generalist Track — This track prepares students to be well-rounded, collaborative, and skilled data scientists and analysts regardless of their background or area of expertise. Coursework in this track provides the foundation needed for breaking into the fast-growing field of data science.

Specialist Track — This track prepares students to take on more technical roles on data science teams, such as data modeler, data mining engineer, or data warehouse architect.

Our MS in Applied Data Science Degree Program Provides:

  • Small class settings, led by an extraordinary team of faculty who teach and mentor students throughout the program
  • Hands-on application using essential programming languages such as Python, SAS, R, and SQL
  • A project-based curriculum teaching students to solve real-world business challenges, using both "small" and "big" data and cutting-edge practices in statistical modeling, machine learning, and data mining
  • A project-oriented capstone that will harness the skills gained throughout the program
  • Flexibility for working professionals with convenient one- and two-year schedules

LEARN MORE

More On This Topic

  • Maximize Your Value With The 3rd Best Online Master’s In Data Science…
  • Advance your career in Data Science with HSE Master in Data Science
  • Advance your data science career to the next level
  • Online Master’s in Data Science from Northwestern
  • Northwestern Online Master's in Data Science
  • Start a career in Computer Science with Penn’s Master in Computer Science…