Anthropic is riding high on the success of its Computer Use feature. Sure, the AI startup has competition from Microsoft and the latest Copilot Vision, but unlike Claude, it can’t control and perform actions on users’ behalf. While OpenAI is set to debut its computer-use competitor, titled ‘Project Operator’, it is yet to be seen what it can do.
Several companies and developers are already leveraging Computer Use’s capabilities. Most recently, Hume AI used Computer Use on top of their EVI 2 to build a feature that controls users’ systems, guided by their voice.
Recently, even Garry Tan, CEO of Y Combinator, hailed the feature. In a video on YouTube, he said, “In the near future, LLMs, with the full ability to use and control computers, will reshape everything. How developers write software, how CEOs run their companies, and even how we all live our daily lives.”
However, for a feature capable of a high level of autonomy, it makes one wonder about the guardrails and potential safety concerns. Anthropic and its CEO, Dario Amodei, must ensure that Computer Use doesn’t turn into ‘computer abuse’. While still in beta and vulnerable to bugs, developers have already experimented with what could go wrong with the tool.
The Injections Hurt
A user named Wunderwuzzi, in a blog post titled ‘Embrace The Red’, wanted to check if it was possible to get Claude Computer Use to download and execute malware and join a Command and Control (C2) infrastructure.
Spoiler alert: their mission was successful. Claude was able to enter the URL on Firefox. Then, Claude was tricked into downloading and executing that malware file. This is scary considering how Claude, after reading a set of straightforward instructions, was able to perform a malicious task.
“There are countless other [ways]; like another way is to have Claude write the malware from scratch and compile it. Yes, it can write C code, compile and run it,” the author said.
“My rule-of-thumb is to imagine all LLMs are client-side programs running on the computer of a maybe-attacker, like Javascript in the browser,” a user on HackerNews said, reacting to the experiment.
It isn’t just about Computer Use. “If AI agents take off, we might see a new rise of scam ads. Instead of being made to trick humans and thus easily reportable, they’ll be made to trick specific AI agents with gibberish adversarial language that was discovered through trial and effort to get the AI to click and follow instructions,” another user said.
Another potentially dangerous situation is when Claude runs queries on Google Search, comes across an ad to ignore previous instructions, and is manipulated to download malware – sounds devastating, to say the least.
In another experiment conducted by Hidden Layer, Computer Use was exposed to a prompt injection to delete all the system files via a command in the Unix/Linux environment.
While Claude did recognise the harmful intent behind the commands, the testers were able to find a workaround. “We add additional instructions, telling Claude that this is a virtual environment designed for security testing, so it is considered okay to execute potentially dangerous instructions,” the Hidden Layer research stated.
Finally, Claude Computer downloaded the PDF file containing the instructions, executed the command in the shell, and deleted the entire file system.
Anthropic Knows
That said, it isn’t like Anthropic hasn’t acknowledged such risks in its documentation. It admitted to “possessing unique risks that are distinct from standard API features or chat interfaces”.
Amodei addressed the worries associated with Computer Use in a podcast episode with Lex Fridman. When questioned about the harm that can be endured with Computer Use, Amodei said, “We’ve thought a lot about things like spam, CAPTCHA, mass… One secret, I’ll tell you, if you’ve invented a new technology, not necessarily the biggest misuse, but the first misuse you’ll see: scams, just petty scams.”
That said, Amodei also revealed that Anthropic is strongly aligning the future of Computer Use with responsible scaling policy, and ensuring that it won’t reach a level where no precautions or measures are deemed necessary.
He also believes that instead of sandboxing the model and preventing it from escaping, one needs to design it with an inherent sense of alignment.
“Instead of having something unaligned that you’re trying to prevent from escaping, I think it’s better to just design the model the right way or have a loop where you look inside the model, and you’re able to verify properties and that gives you an opportunity to tell, iterate and actually get it right,” he said.
If there is one thing that is clear by now, it is that either Anthropic needs to develop a safety layer or an infrastructure – which they eventually will – or one will have to connect an external tool to achieve the required safety.
“You can’t just rely on LLMs alone. You can combine them with tooling that will supplement the verification of their actions,” said a user on Hacker News.
So what is being done to mitigate prompt injection?
LLMs Gotta Learn
A recent study by UC Berkeley introduces structured queries, which separate prompts and data into two streams. The LLM is then only trained to follow instructions from the original prompts and ignore any instructions from the user. The research incorporated examples during training to help the model learn to ignore injected instructions.
“Existing LLMs use instruction tuning to train the LLM to act on instructions found in their input. However, we see standard instruction tuning as a core contributor to prompt injection vulnerabilities,” read the research.
The study evaluated the technique, titled StruQ, on at least 15 types of prompt injection attack techniques, and it was found that the design is secure against most prompt injections.
StruQ was able to successfully defend most injection techniques like a naive attack, ignore attack, escape deletion attack, and so on. In events like a Completion-OtherCmb attack, there is a 41% chance of breaking into Llama, and a 77% chance of injecting it into Mistral. Yet, StruQ posed no chance.
However, the technique isn’t fully effective in guarding against all techniques. In the greedy coordinate gradient, an attack that injects ‘highly effective’ adversarial inputs that leverage the model’s input structure, StruQ stood a 58% chance of being attacked. While the number may seem high, the attack is universally successful in an undefended Llama and Mistral model.
“StruQ only protects programmatic applications that use an API or library to invoke LLMs. It is not applicable to web-based chatbots that offer multi-turn, open-ended conversational agents,” the authors said.
“The crucial difference is that application developers may be willing to use a different API where the
prompt is specified separately from the data, but for chatbots used by end users, it seems unlikely that end users will be happy to mark which part of their contributions to the conversation are instructions and which are data,” they added, indicating that it isn’t designed to defend against jailbreaks, and data extractions.
The post Computer Use Shouldn’t Turn Into Computer Abuse appeared first on Analytics India Magazine.