ChatGPT vulnerability exposes users’ private data: Researchers exploit flaw to access phone numbers and email addresses
A recent study has uncovered a vulnerability in ChatGPT, a popular language model developed by OpenAI. The study found that this new vulnerability could allow attackers or bad actors to leak sensitive training data, including personally identifiable information.
Conducted by researchers at Google AI, the study reveals that ChatGPT can be manipulated into endlessly repeating words or phrases. Attackers, by strategically crafting prompts, can exploit this behavior to extract significant amounts of training data from the model.
The extracted data, as highlighted by the researchers, includes personally identifiable information like names, addresses, and phone numbers. Moreover, private information such as medical records and financial data was also identified within the training data.
The researchers emphasized the serious privacy implications of this vulnerability. If exploited, attackers could leverage the leaked training data for identity theft, fraud, or other illicit activities.
Acknowledging the issue, the researchers have notified OpenAI, and collaborative efforts are underway to develop a patch. In the interim, users of ChatGPT are advised to be vigilant about this vulnerability and take measures to safeguard their privacy.
According to the study, the researchers managed to trick ChatGPT into revealing snippets of its training data using a novel attack prompt. The method involved instructing the chatbot to endlessly repeat specific words. Through this approach, they demonstrated the presence of significant amounts of personally identifiable information (PII) in OpenAI’s extensive language models. Additionally, on a public version of ChatGPT, the chatbot produced lengthy passages directly lifted from various internet sources.
During the research, the group found that ChatGPT contains a lot of sensitive private information, and it often produces exact text excerpts from various sources such as CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more, which was first reported by 404media.
For instance, when prompted to repeat the word “poem” continuously, ChatGPT initially complied for an extended period before unexpectedly transitioning to an email signature belonging to a real human “founder and CEO.” This signature included personal contact details such as a cell phone number and email address.
The researchers, affiliated with Google DeepMind, the University of Washington, Cornell, Carnegie Mellon University, the University of California Berkeley, and ETH Zurich, outlined their findings in a paper published in the open-access journal arXiv on a Tuesday.
“We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT,” the researchers said.
This discovery is particularly significant because OpenAI’s models are proprietary, and the research was conducted on a publicly accessible version of ChatGPT-3.5-turbo. Importantly, it highlights that ChatGPT’s “alignment techniques” do not entirely prevent memorization, leading to instances where it reproduces training data verbatim. This memorization encompassed various types of information, including PII, complete poems, cryptographically random identifiers like Bitcoin addresses, excerpts from copyrighted scientific research papers, website addresses, and more.
The researchers noted that 16.9 percent of the generations they tested contained memorized PII, including details such as phone and fax numbers, email and physical addresses, social media handles, URLs, names, and birthdays.
“In total, 16.9 percent of generations we tested contained memorized PII,” they wrote, which included “identifying phone and fax numbers, email and physical addresses … social media handles, URLs, and names and birthdays.”
The entire paper is very readable and incredibly fascinating. An appendix at the end of the report shows full responses to some of the researchers’ prompts, as well as long strings of training data scraped from the internet that ChatGPT spit out when prompted using the attack. One particularly interesting example is what happened when the researchers asked ChatGPT to repeat the word “book.”
“It correctly repeats this word several times, but then diverges and begins to emit random content,” they wrote.
FROM THE PAPER: Text highlighted in red is directly copied from another source on the internet and is assumed to be part of the training data.
ChatGPT Vulnerability
As part of the study, the researchers asked: Why is ChatGPT so Vulnerable? According to their findings, ChatGPT exhibits a considerably higher susceptibility to data extraction attacks when compared to earlier findings on foundational language models [11, 14, 29]. The question arises: why is this the situation? They then offer speculative insights and encourage future research to delve deeper into the matter.
One possible factor they stated was that ChatGPT undergoes extensive pre-training, lasting for numerous epochs. Additionally, the chatbot operates with rapid inference speeds and is deployed on a massive scale. In line with the demands of such usage scenarios, there’s a growing tendency to “over-train” models, exposing them to a much larger volume of data than what is conventionally considered as “training compute.”
Conclusion.
At the of the study, the paper suggests that training data can easily be extracted from the best language models of the past few years through simple techniques. The researchers end with three noteworthy lessons:
Firstly, although the two models under examination (gpt-3.5-turbo and gpt-3.5-turbo-instruct) likely underwent fine-tuning with different datasets, they both exhibit memorization of the same samples. This indicates that the memorized content we’ve identified is from the pre-training data distribution rather than the fine-tuning data.
Secondly, the findings suggest that, despite variations in fine-tuning setups, the data memorized during pretraining persists. This aligns with recent research showing that models might gradually forget memorized training data but that this process can extend over several epochs. Given that pre-training durations are often significantly longer than fine-tuning, this could explain the limited forgetting observed.
Thirdly, in contrast to previous results indicating the challenging nature of auditing the privacy of black-box RLHF-aligned chat models, it’s now apparent that auditing the original base model from which gpt-3.5-turbo and gpt-3.5-turbo-instruct were derived might not have been as difficult. Unfortunately, the lack of public availability of this base model makes it challenging for external parties to conduct security assessments.
Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training Data