What Happened to Multimodal GPT 4?

On March 14, 2023, OpenAI released GPT-4 amid much fanfare, showcasing its multimodal features. Months have passed since then, and there seems to be little buzz or interest around them anymore.

OpenAI said that GPT-4 can generate text while accepting both image and text inputs, an improvement over its predecessor, GPT-3.5, which accepted only text. Yet, as of now, ChatGPT Plus is still not multimodal.

Surprisingly, OpenAI recently filed for the GPT-5 trademark with the United States Patent and Trademark Office (USPTO). Trademark attorney Josh Gerben took to Twitter on July 31 to point out that the filing hints at the possibility that the company is working on a fresh iteration of its language model.

Before moving on to GPT-5, however, OpenAI has yet to deliver on its promises for GPT-4. Users were expecting to interact with the chatbot easily using images, but this multimodal functionality hasn't been fully realised, and conversations across the internet have been buzzing with questions about its status.

GPT-5 seems like it will be heavily multimodal, which makes sense.
Yet I can't help but wonder what happened to GPT-4's multimodal capabilities?
We were told this was going to be revolutionary and then…… it kind of fizzled out? Or it's still in stealth…. https://t.co/FGgtzdOsXM

— Benjamin De Kraker (@BenjaminDEKR) August 1, 2023

During the GPT-4 demo livestream, several impressive capabilities of the model were showcased. It was able to interpret a funny image and accurately describe what made it humorous. Additionally, Greg Brockman, President and Co-Founder of OpenAI, demonstrated how he could effortlessly create a website by simply feeding in a photo of an idea sketched in his notebook, with GPT-4 providing the necessary assistance.

However, he did mention that these features would take time to roll out, yet the wait has stretched on. Right now, only Bing Search, which is based on GPT-4, lets you search using images, and even that needs refinement and falls short in its responses. So what exactly is holding OpenAI back from exploring multimodal features and shipping a product of its own?

Yeah, I too get the impression that the multimodal capability in Bing is less refined than the one described in the GPT-4 technical report. Possibly because they're still using an early version of GPT-4?

— Michael P. Frank 💻🔜♻ e/acc (@MikePFrank) July 30, 2023

Multimodal features aren’t available in the API

While introducing GPT-4, OpenAI said it was releasing GPT-4's text input capability through ChatGPT and the API, and that it was working to make the image input capability more widely available by collaborating closely with Be My Eyes. As of now, this collaboration is in a closed beta being tested for feedback with a small subset of Be My Eyes users, and no official update has been released since.

Also, as of now, the multimodal features of GPT-4 are not accessible through the API. OpenAI's blog post mentions that users can currently only make text-only requests to the gpt-4 model, and the capability to input images is still in a limited alpha stage.
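For context, here is a minimal sketch of what a text-only request to the gpt-4 model looks like with OpenAI's pre-1.0 Python client; the prompt is purely illustrative, and, as the blog post notes, there is no supported way to attach an image to such a request.

```python
# Minimal sketch of a text-only request to the gpt-4 model using the
# pre-1.0 `openai` Python package. Assumes OPENAI_API_KEY is set in the
# environment and that the account has gpt-4 API access. There is no
# documented parameter here for image input; that remains in limited alpha.
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Summarise the GPT-4 technical report in one sentence."}
    ],
)
print(response["choices"][0]["message"]["content"])
```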

However, OpenAI assures users that API requests will automatically move to the recommended stable model as new versions are released over time. This suggests that more advanced features and capabilities may become available as the model continues to evolve and improve.

OpenAI recently introduced Code Interpreter in ChatGPT Plus. Many termed it a GPT-4.5 moment, but interestingly, for image recognition it relies on old-school OCR from Python libraries rather than GPT-4's multimodal capability.

Even besides the API, it's nowhere in GPT 4.
Even in Code Interpreter, it's not using GPT 4 multimodal for image recognition. It's doing old-school OCR from Python libraries.
GPT 4 multimodal just kind of………. faded away

— Benjamin De Kraker (@BenjaminDEKR) August 2, 2023
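For illustration, the kind of "old-school OCR" referred to above typically looks like the sketch below; pytesseract is our assumption of a representative library, since OpenAI has not disclosed which one Code Interpreter actually uses.

```python
# Illustrative sketch of traditional OCR with a common Python library
# (pytesseract wrapping the Tesseract engine), the "old-school OCR" approach
# described above, as opposed to GPT-4's multimodal vision.
# Assumes: pip install pytesseract pillow, plus a local Tesseract install;
# "screenshot.png" is a placeholder filename.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("screenshot.png"))
print(text)
```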

GPU Scarcity

Due to a shortage of GPUs, OpenAI is struggling to let users push more data through its large language models like ChatGPT. The shortage has also delayed its plans to introduce new features and services on their original schedule.

A month back, Sam Altman acknowledged this concern and explained that most of the issue stems from GPU shortages, according to a blog post by Raza Habib, CEO and co-founder of Humanloop, which was later taken down at OpenAI's request. The post specifically mentioned that the multimodality demoed as part of the GPT-4 release can't be extended to everyone until more GPUs come online.

GPT-4 was probably trained on around 10,000 to 25,000 of Nvidia's A100 GPUs. For GPT-5, Elon Musk suggested it might require 30,000 to 50,000 H100s, while in February 2023 Morgan Stanley predicted GPT-5 would use 25,000 GPUs. With such numbers required and Nvidia the only reliable supplier in the market, it all boils down to GPU availability.

Focus on DALL-E 3?

Going by these developments, we can say that OpenAI is presently focusing on text-to-image generation. Recently, YouTuber MattVidPro shared details of OpenAI's next project, which is likely to be DALL-E 3.

OpenAI's plans for the rumoured model, including its public release and official name, remain uncertain. For now, the unreleased model is in testing, available to a select group of around 400 people worldwide on an invite-only basis, as per Matt's information.

Ultimately, only time will tell whether OpenAI improves GPT-4 or moves straight to GPT-5. Whatever the name, we are hopeful that OpenAI will soon deliver the much-awaited multimodality to its users, in an improved and more advanced form.
