Large Language Model Misbehavior is Dangerous
“Google it” is a legitimate way to settle an argument. Now, folks are asking ChatGPT increasingly nuanced questions. Are the answers safe for public consumption?
We recently won a fancy award at a highfalutin’ ML conference (NeurIPS 2022) for our paper “Ignore Previous Prompt: Attack Techniques for Language Models.” What it said was pretty interesting and highly relevant.
Despite the meteoric rise of GPT-3 and other transformer-based large language models (LLMs), studies exploring vulnerabilities to malicious users are few and far between. So we donned our white hats and started hacking. We wrote PromptInject to demonstrate how easily simple inputs can cause increasingly-intelligent language models to become misaligned with their users. Two types of attacks, goal-hijacking and prompt-leaking, were attempted. We learned, frighteningly, that even less-technologically-sophisticated (but ill-intentioned) agents can exploit these models and create significant risks.
We, along with a multitude of authors, submitted our work to the 2022 NeurIPS ML Safety Workshop. This workshop brought together researchers studying robustness, monitoring, alignment, and systemic safety for machine learning.
Of the papers submitted, over one hundred were accepted. Ten received a best paper award.
We won one, in the category of “adversarial robustness”1
Other award winning papers came from MIT, Oxford, DeepMind, and Berkeley, and Stuart Russell, to name a few.
This is where you’re likely asking yourself what the paper is about, and why exactly you might care. If you’ve played with ChatGPT and you haven’t been living under a rock for the past several months, then you’re probably noticing that LLMs are becoming ubiquitous, finding their way into a growing number of products and applications. This is great if you’re a college student in need of a nondescript, 500-word essay on the life of Chaucer, a corporate professional looking to compose some anodyne email to your colleagues about the upcoming meeting on facilities planning, or a developer looking to remember the proper syntax for a try/catch in some language. But the current iteration of LLMs are easily hacked, which means all of those applications above are highly susceptible to injections and manipulations.
Getting a poor grade on an essay is one thing, but writing vulnerable, exploitative code inadvertently or seeing your app perverted to deliver offensive, inaccurate, and misleading content is another altogether. Our team built a framework to help support research in this profoundly neglected area. The goal is simple - safer and more robust use of language models in product applications. The technology isn’t going away. So we’d better make sure it’s safe.
We’re not here just to write papers and win awards (fun, though it is). We’re here to build useful, agency-increasing products for clients and users with state-of-the-art data science and machine learning. That means not only staying current, but increasingly, staying ahead of the vulnerabilities, exploits, and dark corners of that new technology.
AE was founded with the mission of increasing human agency - among other things, making users the most focused, most intentional versions of themselves.2 Now, that mission means paying close attention to the existential risk posed by misaligned artificial general intelligence (AGI) and evaluating and potentially contributing to neglected approaches to decrease this risk. Now, that mission means thinking deeply about not just math and models but about what occurs when those models begin making increasingly intimate decisions on our behalf. Working with us means not only a mature approach to project management and machine learning development, but a mature approach to the trajectory of technology and its pitfalls. This is why we devote our time and dollars to research on LLMs, BCI, and other crucial issues in the modern tech landscape.
This is why we work in data science, machine learning, predictive analytics, and annoying SEO keywords. Explore our cutting-edge research and consulting in neurotech, brain-computer-interfaces (BCIs), and quantum computing. Learn what we’re doing to help protect your privacy. Let our agency increase yours.
As the world enters a recession, the companies that grow healthily and continue to invest in growth will succeed and take the next step into the future, as winners.
Thinking long term while identifying the highest impact short term things next baby steps to take right now is how we think at AE. Step into the future with us, and win a free doughnut.3
Anyway, if you’re awesome and have money to spend to make more money with some of the best technologists in the world, reach out to schedule a call.
1 The simpler definition is basically “how well does this model perform when exposed to unpredictable human beings, as opposed to the controlled environments in which we train them?”
2 Not the version distracted by incessant alerts and pop-up ads.
3 (minimum $100K spend for limited time doughnut offer)
No one works with an agency just because they have a clever blog. To work with my colleagues, who spend their days developing software that turns your MVP into an IPO, rather than writing blog posts, click here (Then you can spend your time reading our content from your yacht / pied-a-terre). If you can’t afford to build an app, you can always learn how to succeed in tech by reading other essays.
Large Language Model Misbehavior is Dangerous
“Google it” is a legitimate way to settle an argument. Now, folks are asking ChatGPT increasingly nuanced questions. Are the answers safe for public consumption?
We recently won a fancy award at a highfalutin’ ML conference (NeurIPS 2022) for our paper “Ignore Previous Prompt: Attack Techniques for Language Models.” What it said was pretty interesting and highly relevant.
Despite the meteoric rise of GPT-3 and other transformer-based large language models (LLMs), studies exploring vulnerabilities to malicious users are few and far between. So we donned our white hats and started hacking. We wrote PromptInject to demonstrate how easily simple inputs can cause increasingly-intelligent language models to become misaligned with their users. Two types of attacks, goal-hijacking and prompt-leaking, were attempted. We learned, frighteningly, that even less-technologically-sophisticated (but ill-intentioned) agents can exploit these models and create significant risks.
We, along with a multitude of authors, submitted our work to the 2022 NeurIPS ML Safety Workshop. This workshop brought together researchers studying robustness, monitoring, alignment, and systemic safety for machine learning.
Of the papers submitted, over one hundred were accepted. Ten received a best paper award.
We won one, in the category of “adversarial robustness”1
Other award winning papers came from MIT, Oxford, DeepMind, and Berkeley, and Stuart Russell, to name a few.
This is where you’re likely asking yourself what the paper is about, and why exactly you might care. If you’ve played with ChatGPT and you haven’t been living under a rock for the past several months, then you’re probably noticing that LLMs are becoming ubiquitous, finding their way into a growing number of products and applications. This is great if you’re a college student in need of a nondescript, 500-word essay on the life of Chaucer, a corporate professional looking to compose some anodyne email to your colleagues about the upcoming meeting on facilities planning, or a developer looking to remember the proper syntax for a try/catch in some language. But the current iteration of LLMs are easily hacked, which means all of those applications above are highly susceptible to injections and manipulations.
Getting a poor grade on an essay is one thing, but writing vulnerable, exploitative code inadvertently or seeing your app perverted to deliver offensive, inaccurate, and misleading content is another altogether. Our team built a framework to help support research in this profoundly neglected area. The goal is simple - safer and more robust use of language models in product applications. The technology isn’t going away. So we’d better make sure it’s safe.
We’re not here just to write papers and win awards (fun, though it is). We’re here to build useful, agency-increasing products for clients and users with state-of-the-art data science and machine learning. That means not only staying current, but increasingly, staying ahead of the vulnerabilities, exploits, and dark corners of that new technology.
AE was founded with the mission of increasing human agency - among other things, making users the most focused, most intentional versions of themselves.2 Now, that mission means paying close attention to the existential risk posed by misaligned artificial general intelligence (AGI) and evaluating and potentially contributing to neglected approaches to decrease this risk. Now, that mission means thinking deeply about not just math and models but about what occurs when those models begin making increasingly intimate decisions on our behalf. Working with us means not only a mature approach to project management and machine learning development, but a mature approach to the trajectory of technology and its pitfalls. This is why we devote our time and dollars to research on LLMs, BCI, and other crucial issues in the modern tech landscape.
This is why we work in data science, machine learning, predictive analytics, and annoying SEO keywords. Explore our cutting-edge research and consulting in neurotech, brain-computer-interfaces (BCIs), and quantum computing. Learn what we’re doing to help protect your privacy. Let our agency increase yours.
As the world enters a recession, the companies that grow healthily and continue to invest in growth will succeed and take the next step into the future, as winners.
Thinking long term while identifying the highest impact short term things next baby steps to take right now is how we think at AE. Step into the future with us, and win a free doughnut.3
Anyway, if you’re awesome and have money to spend to make more money with some of the best technologists in the world, reach out to schedule a call.
1 The simpler definition is basically “how well does this model perform when exposed to unpredictable human beings, as opposed to the controlled environments in which we train them?”
2 Not the version distracted by incessant alerts and pop-up ads.
3 (minimum $100K spend for limited time doughnut offer)