Microsoft Drops OmniParser, its New AI Model

Earlier this month, Microsoft quietly announced the release of its new AI model, OmniParser, on its AI Frontiers blog. OmniParser is an entirely vision-based graphical user interface (GUI) agent, launched on Hugging Face under an MIT licence. It is similar to Anthropic’s recently released ‘Computer use’ feature.

With this, Microsoft has solidified its presence in the AI agent industry, building on its existing work in autonomous AI agents. In September, Microsoft also joined Oracle and Salesforce in the ‘Super League’ of the AI agentic workforce.

This move was a long time coming. The first research paper, released in March 2024 by Jianqiang Wan and others from Alibaba Group and Huazhong University of Science and Technology, described OmniParser as a unified framework for text spotting, key information extraction, and table recognition.

Following that research, Microsoft released a detailed paper in August presenting OmniParser as a pure vision-based GUI agent. The paper was written by Yadong Lu and two others from Microsoft Research in collaboration with Yelong Shen of Microsoft GenAI. It concluded that OmniParser outperforms GPT-4V baselines, even when using only screenshot inputs without any additional information.

Hugging Face describes OmniParser as a versatile tool that translates UI screenshots into structured data, enhancing LLMs’ understanding of interfaces. The launch includes two datasets: one for detecting clickable icons (gathered from popular websites) and another that pairs each icon with a description of its function, so the model learns what each UI element does.
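For readers who want to experiment, the weights are published on Hugging Face under microsoft/OmniParser. The sketch below is a minimal, illustrative example: it downloads the repository with huggingface_hub and defines a simple record for the kind of parsed output the model family is designed to produce. The ParsedElement fields and the repository’s internal layout are assumptions for illustration, not the project’s official API.

```python
from dataclasses import dataclass
from huggingface_hub import snapshot_download

# Download the OmniParser weights from Hugging Face.
# The repo id is the published one; everything below it is an illustrative assumption.
local_dir = snapshot_download(repo_id="microsoft/OmniParser")
print(f"Model files downloaded to: {local_dir}")

@dataclass
class ParsedElement:
    """Hypothetical record for one UI element recovered from a screenshot:
    a bounding box from the interactable-region detector plus a short
    functional description from the icon-captioning model."""
    box: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    interactable: bool                      # is this region clickable?
    description: str                        # e.g. "settings gear icon"
```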

OmniParser outperforms GPT-4V

OmniParser has been tested on benchmarks such as SeeClick, Mind2Web, and AITW. Across these tests, it outperformed the GPT-4V (GPT-4 with vision) baseline.

For compatibility with current vision-based LLMs, OmniParser was combined with recent models such as Phi-3.5-V and Llama-3.2-V. Results show that the fine-tuned interactable region detection (ID) model significantly improved task performance across all categories compared with a Grounding DINO model that was not fine-tuned (without ID).

A further boost in performance came from the “local semantics” (LS), short functional descriptions attached to each detected icon, which improved results across GPT-4V, Phi-3.5-V, and Llama-3.2-V.

In the paper’s results, LS refers to the icons’ local semantics and ID to the fine-tuned interactable region detection model.
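In practice, the two components the ablation isolates fit together as a simple pipeline: the ID model proposes interactable regions, and the LS captioner attaches a functional description to each one. The sketch below is a hedged illustration of that combination; detect_regions and caption_icon are hypothetical stand-ins for the fine-tuned detector and captioning model, not OmniParser’s actual interfaces.

```python
from typing import Callable

# Hypothetical stand-ins for OmniParser's two fine-tuned components:
#   detect_regions(screenshot) -> list of (x1, y1, x2, y2) boxes   (ID model)
#   caption_icon(screenshot, box) -> short functional description  (LS captioner)
def annotate_screenshot(
    screenshot: bytes,
    detect_regions: Callable[[bytes], list[tuple[float, float, float, float]]],
    caption_icon: Callable[[bytes, tuple[float, float, float, float]], str],
) -> list[dict]:
    """Combine ID detections with LS captions into one structured list
    that a downstream vision LLM can reason over."""
    elements = []
    for idx, box in enumerate(detect_regions(screenshot)):
        elements.append({
            "id": idx,                                     # stable reference for the LLM
            "box": box,                                    # where the element is
            "description": caption_icon(screenshot, box),  # what it does
        })
    return elements
```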

Integration with GPT-4V

With the surge in the use of various LLMs, demand has also grown for better AI agents that can operate user interfaces. Though models like GPT-4V show great promise, their potential to act as general agents across operating systems is often underestimated due to inadequate screen-parsing techniques.

On the ScreenSpot benchmark, OmniParser greatly boosts GPT-4V’s ability to generate actions that align correctly with the relevant areas of the interface.
This claim is supported by another paper released in September 2024, written by researchers at Microsoft in collaboration with Carnegie Mellon University and Columbia University. That paper introduces the ‘Windows Agent Arena’, which evaluates multi-modal OS agents at scale, and reports its best performance from an agent using OmniParser integrated with GPT-4V.
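To make the grounding step concrete, the sketch below shows one way a parsed element list and the original screenshot could be handed to a GPT-4V-class model through the OpenAI chat API, so the model answers with an element id rather than free-form coordinates. The prompt format and the gpt-4o model name are assumptions for illustration, not the setup used in the papers.

```python
import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def choose_element(screenshot_png: bytes, elements: list[dict], task: str) -> str:
    """Ask a vision-capable model which parsed element to click.
    `elements` is the structured list produced by the parsing step above."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Task: {task}\n"
                    f"Interactable elements (id, box, description):\n"
                    f"{json.dumps(elements, indent=2)}\n"
                    "Reply with the id of the element to click."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```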
