Stop Paying for GPT-4o—This YC Startup Offers 4x the Savings

Floworks, the cloud-based enterprise automation startup, recently released ThorV2, a novel architecture that allows LLMs to perform function calls with better accuracy and reliability. The YC-backed company collaborated with IIT Bombay and IIT Kharagpur to build the architecture.

In an interview with AIM earlier this year, Floworks claimed that its AI agent, Alisha, is 100% reliable for tasks involving API calls. Sudipta Biswas, the co-founder of Floworks, said, “Our model, which we are internally calling ThorV2, is the most accurate and the most reliable model out there in the world when it comes to using external tools right now.”

Further, he claimed that ThorV2 was 36% more accurate than OpenAI’s GPT-4o, 4x cheaper, and almost 30% faster in terms of latency.

These striking claims were recently backed by an in-depth research paper that details the ThorV2 architecture and explains how its novel features address several crucial challenges in the agentic workflows of today’s market-leading LLMs.

Edge-of-Domain Modelling is ThorV2’s Hero Technique

Edge-of-domain modelling, as used in the ThorV2 architecture, involves providing minimal instructions upfront, letting the agent begin the task, and then supplying the remaining information through error corrections after the attempt.

This approach differs from providing knowledge of all possible scenarios regarding the function calling.

Edge-of-domain modelling reduces the need for extensive instructions, which in turn shrinks the prompt’s token count and lowers deployment costs.

The authors mentioned that “Function schemas can be lengthy, leading to large prompt sizes. This increases deployment costs, time consumption, and can result in decreased accuracy on reasoning tasks.”

The additional instructions supplied during this error-correction process come from a static agent, implemented through the agent-validator architecture inside ThorV2.
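As a rough sketch of the idea (the function names and prompt format here are hypothetical, not Floworks’ actual implementation), the prompt starts minimal and grows only by the corrections the validator reports, rather than carrying every function schema from the outset:

```python
# Sketch of edge-of-domain modelling: start with a minimal prompt and feed
# back targeted corrections only when the validator flags an error.
# All names and formats below are illustrative assumptions.

def build_minimal_prompt(task: str) -> str:
    # Only the task and a one-line hint -- no exhaustive function schemas.
    return f"Task: {task}\nRespond with a JSON API call."

def correct(prompt: str, errors: list[str]) -> str:
    # Append targeted corrections instead of front-loading all knowledge.
    fixes = "\n".join(f"Correction: {e}" for e in errors)
    return f"{prompt}\n{fixes}"

prompt = build_minimal_prompt("Create a HubSpot contact for Jane Doe")
# Suppose the model's first attempt omitted a required field; the static
# validator (not another LLM) reports it, and we retry with the fix:
prompt = correct(prompt, ["'email' is a required property of 'contact'"])
print(prompt)
```

The prompt only ever grows by the corrections actually needed, which is where the token (and cost) savings come from.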

You Don’t Need an LLM to Evaluate Another LLM

The Agent Validator architecture overcomes several limitations of agentic workflows, where the primary LLM agent performing a task receives feedback from other LLMs that act as critics.

The authors argue that using an additional LLM not only increases deployment costs but also decreases accuracy.

Instead, ThorV2 introduces a static agent written entirely in code. It includes a component called the Domain Expert Validator (DEV), which inspects the output generated by the LLM for errors. The DEV contains all the knowledge and information required to perform function calls on a specific platform.

While building a validator requires a significant amount of effort, it helps reduce processing time and improve accuracy, because the DEV encodes the most common and repetitive errors that occur during function calling.
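A toy version of this idea, with a hypothetical HubSpot-style rule set standing in for the DEV’s platform knowledge, might look like the following. The point is that the check is deterministic code, not a second LLM:

```python
# Minimal rule-based validator in the spirit of the Domain Expert
# Validator (DEV): pure code, no critic LLM. The schema below is a
# hypothetical stand-in for the platform knowledge the paper describes.

REQUIRED_FIELDS = {
    "create_contact": {"email"},            # hypothetical HubSpot-style rules
    "create_deal": {"dealname", "amount"},
}

def validate_call(call: dict) -> list[str]:
    """Return a list of error messages for one generated API call."""
    fn = call.get("function")
    if fn not in REQUIRED_FIELDS:
        return [f"unknown function: {fn!r}"]
    missing = REQUIRED_FIELDS[fn] - set(call.get("arguments", {}))
    return [f"missing required argument: {m!r}" for m in sorted(missing)]

# A call with a missing field is caught deterministically and cheaply,
# and the error messages are fed back to the LLM as corrections:
bad = {"function": "create_contact", "arguments": {"firstname": "Jane"}}
print(validate_call(bad))  # ["missing required argument: 'email'"]
```

Each validation pass costs essentially nothing compared to an extra LLM call, which is the cost and latency argument the authors make.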

Multiple API Functions in a Single-Step

One of ThorV2’s other advantages is that it can generate multiple API calls in a single step. With ThorV2, a single query is sufficient for two dependent tasks, even when the first task must retrieve information from the API for the second task to use.

The approach involves using a placeholder to represent unknown values; once the first task retrieves the API response, the value is injected into the second task.

“Generating multiple API calls at once requires sophisticated planning and reasoning capabilities, which is very challenging for ordinary LLMs. Our Agent-Validator architecture simplifies this process as well by correcting errors in the planning step,” the researchers added.

This approach is a significant improvement over the traditional, sequential handling of API calls in current LLMs, which often require a step-by-step execution process.

And the Numbers Don’t Lie – 50% Cost Reduction With 100% Reliability

The ThorV2 architecture was compared with OpenAI’s GPT-4o and GPT-4 Turbo and Anthropic’s Claude 3 Opus on a set of operations on HubSpot’s CRM.

The authors developed a dataset called HubBench, on which the models were evaluated for accuracy, reliability, speed, and cost. In a conversation with AIM, Sudipta mentioned that ThorV2 used the Llama 3 70B model as its underlying LLM for the comparison.

ThorV2 came out on top in every single test, including a 100% score in the reliability test, which checks for consistent output when the model performs the same task ten times.

On single API call functions, ThorV2 scored 90% accuracy, well ahead of the next-best model, Claude 3 Opus, at 78%.

The tests also revealed that ThorV2 cost only $1.60 per thousand queries, roughly three times cheaper than OpenAI’s models. Even with multiple API calls, ThorV2 performed better on every single metric.

While reading the comparison benchmark scores, one wonders whether they are still relevant five months after the tests were conducted, with several new and capable models, such as Claude 3.5 Sonnet and OpenAI’s o1, having launched since.

However, it is important to understand that ThorV2 is an architecture built to enhance the performance and capabilities of an existing LLM. The integration will, in fact, work better with new and more capable models.

“We will soon come up with Thor v3, which will definitely compare with other models that have come up recently. But again, the framework is not a model-level innovation that we’re doing,” Sudipta said. “So even if the underlying model keeps on getting better, our framework will keep supporting even better than that.”

It Isn’t Perfect, But Floworks Wants to Get There

One of ThorV2’s limitations is that the DEV relies on knowledge of common, well-established error patterns, so it may struggle with unseen ones. Moreover, the research currently tests the ThorV2 architecture only on single and double API call functions.

The authors acknowledge the limitations, and plan to perform a comparison with three or more function calls in future research.

In the conversation with AIM, Sudipta revealed that ThorV3 is currently in the works and will challenge some of today’s latest market-leading models. One can also expect other limitations to be resolved in future iterations.

A Vision to Solve More Real-World Problems

The authors envision ThorV2 to overcome the limitations of existing LLMs and solve problems that can truly create an impact.

They mentioned that LLMs have revolutionised NLP and AI, demonstrating remarkable capabilities across a wide range of tasks. However, their economic impact has been somewhat limited, particularly in domains requiring precise interaction with external tools and APIs.

Over the last few months, we’ve also seen a meteoric rise in AI agents and their capabilities, and frameworks like ThorV2 can only extend their reach further in sectors that require large amounts of automation and knowledge transfer between different applications.

“LLMs seem very cool, but if you front-load them with a high number of tokens, the cost will be prohibitively high. For large-scale operations where lots of automation needs to be done, that price point will not suit enterprises and small businesses,” Sudipta said.

The post Stop Paying for GPT-4o—This YC Startup Offers 4x the Savings appeared first on Analytics India Magazine.
