LLMs, like any other technology, have their own limitations, and giving them access to personal information is a disaster in the making. To remedy this, NVIDIA released an open-source toolkit called NeMo Guardrails, which aims to make LLMs safe for enterprise deployment. Security researchers have already found holes in this safety layer, but NVIDIA isn't to blame; LLMs are.
Safety measures are an important step towards the enterprise adoption of LLMs, but it seems that current architectures are nowhere near enough to make up for the models' limitations. While alternative solutions have been proposed, we must first delve deeper into why even guardrails from a company like NVIDIA have fallen short.
Not guarding enough
Researchers from AI risk protection organisation Robust Intelligence found ways to bypass NVIDIA's NeMo Guardrails. The open-source toolkit is offered as part of the NVIDIA AI Enterprise platform.
This framework aims to protect companies from the security risks that accompany LLMs, but it seems to have fallen short. The primary goal of the tool is to keep the model's output within certain boundaries. Reportedly, these guardrails can protect against common failure modes of LLMs, such as misinformation, insecure execution of third-party code, and even jailbreaks.
However, the researchers found three major ways to bypass the security of the guardrails, resulting in unfettered access to the LLM, hallucinations, and PII exposure. To demonstrate this, they evaluated the tool using the topical rails example configuration provided by NVIDIA.
This example guardrail is for a chatbot built around a jobs report from April 2023, which will only answer questions related to the report thanks to the topical guardrail built into it. The researchers noted that the guardrails' behaviour changed over the course of a conversation and that, unusually, they retained knowledge of previous interactions.
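For context, configurations like this are typically loaded through the toolkit's Python API, following the usage pattern shown in the project's public examples. The sketch below is illustrative only; the config path and the sample question are assumptions, not the exact setup the researchers evaluated.

```python
# Minimal sketch of loading a topical-rail configuration with NeMo Guardrails,
# based on the usage shown in the project's README. The config directory holds
# a config.yml (model settings) and Colang .co files that define which user
# intents the bot will engage with and how it refuses everything else.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("examples/topical_rail")  # illustrative path
rails = LLMRails(config)

response = rails.generate(messages=[{
    "role": "user",
    "content": "How many jobs were added in April 2023?"  # on-topic question
}])
print(response["content"])
```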
However, the researchers bypassed this easily, making the bot deviate from its topic until it was 'taken progressively further and further away from the original topic'. By doing so, they were able to extract the plotline of the movie 'Back to the Future' from a query about healthcare.
The researchers also put the fact-checking guardrail through its paces; it failed to detect hallucinations in a test scenario. They then designed a system in which an LLM had direct access to a database of PII, while stressing that systems should not be built this way because of the many security issues that arise from it.
In this scenario, the team built an LLM application along with a guardrail using the NeMo framework. This guardrail proved vulnerable to simple exploits, such as replacing letters in words, as well as to correctly formed queries. They also noted, "Applying guardrails to such an application is a poor use of the technology".
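To see why such surface-level checks are easy to defeat, consider the purely illustrative sketch below. It is not NeMo Guardrails' implementation, just a naive blocklist-style filter of the kind that a single character swap slips past.

```python
import re

# Purely illustrative: a naive filter that blocks queries containing sensitive
# terms verbatim. The blocked terms are hypothetical.
BLOCKED_TERMS = {"salary", "ssn"}

def is_allowed(query: str) -> bool:
    # Split the query into lowercase tokens and check for exact matches only.
    words = re.findall(r"[a-z0-9]+", query.lower())
    return not any(term in words for term in BLOCKED_TERMS)

print(is_allowed("What is Jane Doe's salary?"))  # False -> blocked, as intended
print(is_allowed("What is Jane Doe's s4lary?"))  # True  -> one swapped letter slips through
```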
The key takeaway is that even guardrails created by an AI giant like NVIDIA aren't ready for the big leagues yet. Even accounting for the fact that the tool is only at its 0.1.0 release and that example guardrails were used, more stringent approaches to LLM safety exist.
The root of the problem
The problem with LLM-based architectures isn't the architecture; it's the LLM. This means that the riskiest aspects of LLMs, such as untethered responses, hallucinations, and information leakage, must be actively protected against. Failure to do so has resulted in several high-profile jailbreaks, such as ChatGPT's DAN, Bing's Sydney, and countless others.
Robust's researchers offered a host of solutions to such problems, such as treating LLM output as unsanitised and treating LLMs themselves as 'untrusted consumers of data'. According to them, every guardrail should include a few key factors.
Deterministic correctness, meaning a way to respond to prompts consistently, must be enforced tightly. Proper use of memory should be among the highest priorities, as unsafe handling of memory can be exploited to bypass guardrails. In addition, guardrails should protect against jailbreaks like character swapping by addressing queries by their intent rather than their surface form.
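A minimal sketch of what such an intent-first check could look like is given below, assuming the jobs-report chatbot from NVIDIA's example. The intent labels, classify_intent and guarded_answer are hypothetical names; a real deployment would replace the keyword heuristic with a dedicated classifier or a separate, tightly prompted model call, so that character swaps and rephrasings do not change the label.

```python
# Hypothetical intent-first guardrail for a jobs-report chatbot. None of these
# names come from an existing toolkit; this is only a sketch of the idea.
ALLOWED_INTENTS = {"ask_about_report"}

def classify_intent(query: str) -> str:
    # Placeholder heuristic so the sketch runs end to end; a real system would
    # use a classifier or a separate LLM call that is robust to character swaps.
    q = query.lower()
    if any(word in q for word in ("report", "jobs", "unemployment")):
        return "ask_about_report"
    return "off_topic"

def guarded_answer(query: str, answer_fn) -> str:
    if classify_intent(query) not in ALLOWED_INTENTS:
        # Deterministic refusal: the same class of prompt always gets the same response.
        return "I can only answer questions about the April 2023 jobs report."
    return answer_fn(query)

print(guarded_answer("Tell me the plot of Back to the Future", lambda q: "(report answer)"))
```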
Simon Willison, the founder of Datasette and the co-creator of Django, has also proposed an architecture that can make LLM deployments safer. By running two LLMs, one with access to the data and one with access to the user, the system can handle PII and other sensitive information more safely.
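A minimal sketch of that dual-LLM pattern follows, assuming hypothetical call_quarantined_llm and call_privileged_llm stand-ins for the two model calls. The point of the design is that sensitive text only ever reaches the quarantined model, while the privileged, user-facing model works with opaque placeholders that are substituted in at the final rendering step.

```python
import uuid

# Sketch of the dual-LLM pattern. The quarantined model reads sensitive data
# but has no tools or direct line to the user; the privileged model talks to
# the user but only ever sees opaque placeholders, never the raw data.

def call_quarantined_llm(prompt: str) -> str:
    # Stand-in for the quarantined model call.
    return "summary of the sensitive record"

def call_privileged_llm(prompt: str) -> str:
    # Stand-in for the privileged model call; a real model would follow the
    # instruction, here we simply echo back the placeholder it was given.
    placeholder = prompt.rsplit("placeholder ", 1)[1].split()[0]
    return f"Here is the summary you asked for: {placeholder}"

class Controller:
    """Non-LLM glue code that shuttles placeholders between the two models."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}  # placeholder -> quarantined output

    def summarise_sensitive(self, sensitive_text: str) -> str:
        result = call_quarantined_llm(f"Summarise:\n{sensitive_text}")
        placeholder = f"$VAR_{uuid.uuid4().hex[:8]}"
        self._store[placeholder] = result
        return placeholder

    def reply_to_user(self, user_message: str, placeholder: str) -> str:
        template = call_privileged_llm(
            f"User asked: {user_message}\n"
            f"Write a reply that uses the placeholder {placeholder} where the summary belongs."
        )
        # Raw content is substituted in only at the final rendering step.
        return template.replace(placeholder, self._store[placeholder])

controller = Controller()
handle = controller.summarise_sensitive("Name: Jane Doe, SSN: xxx-xx-xxxx")  # dummy PII
print(controller.reply_to_user("Summarise my record", handle))
```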
Guardrails and AI firewalls can also be made better by imbuing them with a deeper understanding of the LLMs they work with. The research shows that architectural approaches, combined with strong, well-defined guardrails, can help users overcome many of the issues that accompany LLM deployment in an enterprise setting.