Lower Data Breaches and Security Risks with Local Language Models

Photo of Patryk Szczygło

Patryk Szczygło

Updated May 17, 2024 • 12 min read
Data Controller on Black Control Console with Blue Backlight. Increase, improvement, control or management concept.

More and more companies are now turning to local language models. This is hardly surprising, given that commercial LLMs are a common hacker attack target, which puts data security at risk.

IBM has revealed that, in 2021, more than 22 billion data records were stolen. This represents nearly a twofold increase from the previous year. What’s even more concerning is the cost of a single data breach, which exceeded $4.5 million in 2023.

Instances of exposing private customer information cost over $200 per record. Since a growing number of companies turn to AI models for data management and task automation, the risk of a data breach increases even further.

Recent research shows how easy it is to retrieve private user information such as emails and passwords from public AI model training data. This raises a question: is it still possible to leverage AI models while protecting data privacy? It certainly is, and the answer lies in the development of local language models, which help retain control, visibility, and governance over data usage.

Advantages of local AI models

Preventing risks associated with external sharing

Local AI models let you retain data within your organization’s systems instead of sharing them with external vendors. This can be a big advantage – a recent study from Score Card found that nearly a third of all attacks occur through third-party vectors.

Using local language models minimizes the chances of data interception taking place while exchanging information with external systems. Apart from preventing data theft, you also don’t have to worry about third parties using your data in ways that don’t meet your internal policies.

Maintaining sole control over data usage

When you decide to implement a local AI system, you’re able to create all data control policies and mechanisms from the ground up. These include:

  • Managing access permissions over time
  • Setting automated anonymization and/or redaction for sensitive data
  • Limiting access based on location
  • Monitoring and supervising usage
  • Setting applicable encryption protocols
  • Adhering to data governance policies, including how you retain and dispose of data.

Incorporating AI into your current security measures

You can ensure that your local AI model fits your company’s existing security setup. This includes alignment with access and identity (AIM) rules and security and data breach protocols.

Your local AI system can centralize authorization and employ popular authentication mechanisms like multi-factor authentication or single sign-ons for the entire organization.

Customizing AI for privacy and compliance requirements

Companies can come up with their own AI models to make sure they match their needs and beliefs. Here are a few examples:

  • On-device training: instead of collecting data in one central location, the AI model can learn directly from a device, be it a smartphone or a computer. The personal data stays on the device.
  • Federated learning: teams can collaborate on training the AI without the need to share their private information.
  • Synthetic datasets: creating fake data that maintains statistical qualities but excludes any real personal information.

Such methods help with retaining a high level of privacy, thanks to the use of the newest privacy-boosting technology. Not only that, you can also build an AI model that puts privacy first by including features like getting permissions from users, controlling who gets data access, and maintaining transparency.

Potential security risks when using LLMs

The Open Web Application Security Project has identified different types of common LLM vulnerabilities. These are:

Training Data Poisoning

Popular breach attempt is when hackers tamper with LLM training data. For example, they can contaminate the system with biased data, affecting the effectiveness and objectivity of output. This can happen when using on device training of data or trying to use open source datasets that can have unsecure prompts. It can lead in the future to different issues like Prompt Injections when LLMs can be manipulated with “clever” inputs. Direct prompt injections aim to overwrite the system’s original prompts. Meanwhile, indirect injections manipulate inputs from external sources. Another issue can be data generated that will create vulnerable output that might contain e.g. insecure code design, like XSS or SQL Injection vulnerabilities.

Sensitive Information Disclosure

LLMs might occasionally reveal confidential information, which is why it’s key to properly clean the data and have strict rules around its usage to reduce risk. Having your own LLM with your data requires extra focus on securing the system as the LLM contains a lot of information gathered from your datasets. Leaving LLM unsecured might cause huge data leaks as with proper prompt injection techniques hackers might get access to all sensitive information.

Overreliance

Too much dependence on LLMs without constantly ensuring everything is running smoothly may result in misinformation, miscommunication, legal issues, and security vulnerabilities. Especially when fed with your data, you might get convinced that your model “knows everything” and lead to some trust issues. Overconfidence in usage of LLM that was trained on your data, might cause a lot of errors as LLM are not deterministic models, which can also generate some random data from time to time.

Model Theft

Unauthorized access, copying, or transfer of proprietary LLM models can result in economic losses, putting your competitive advantage at risk, giving access to sensitive information to those who have no right to it.

Best practices to ensure a secure local AI model

Minimize data collection

Whenever you collect data, ask yourself if you truly need it. Every piece of data you collect must have a defined purpose. Remember to anonymize the data when you can, and don’t store any attributes you no longer need. Also, come up with a data retention schedule along with minimum durations. Let’s elaborate on those practices a little.

  • Only collect the necessary data. For instance, if you run an online store, solely ask for the customer's name, address, email, and payment details as these are necessary for completing a purchase. Skip information like their date of birth or phone number. This will help with minimizing data breaches and maintaining customer trust.
  • Anonymize data. Remove any details that could lead to a specific individual, for example, their name or address.
  • Remove unnecessary details. Avoid keeping any info that you don’t use at all, or no longer need.
  • Agree on data retention rules. Know for how long you have to keep the data.

Put data retention policies in place

As mentioned above, it’s key to have data retention policies and stick to them. Each piece of data must have a retention period assigned to it and should be automatically deleted after expiration.

For example, you might decide to hold customer purchase records for five years, marketing data for two years, and customer support chat logs for six months after resolution. These rules must be put in writing and applied consistently across all projects.

It’s best to implement a solution that will dispose of the data automatically, so you don’t have to remember about it, and risk holding onto data, which you no longer need or have the right to use.

Follow secure data disposal protocols

As I’ve mentioned throughout this article, local language models can do a lot of good to minimize data incidents. That said, without the right data destruction practices, your AI could be a security vulnerability.

AI systems commonly operate on personally identifiable and sensitive information. If you don’t follow data destruction protocols, all that data could become available if there’s a breach. Depending on your organization’s profile, this could include transactional payment information, confidential conversations, or even highly sensitive medical records.

Training data is also a common target – attackers might attempt to extract or exfiltrate it through backdoor attacks.

To minimize these threats, ensure that you dispose of training data as soon as it’s no longer needed. Also, I recommend creating automated data storage checkups to ensure that you’ve deleted any residual information from your records. This will make your local AI model a much less attractive target.

Provide cybersecurity training for all local AI users

I agree with a study by the Institute of Security and Global Affairs, which says that cybersecurity is about more than ‘just’ highly-trained IT staff. It’s primarily about ensuring proper end-user behavior across areas like data transmission, access sharing, and suspicious behavior reporting.

Cybersecurity training will help your AI’s end users:

  • Lower the risks of human error: by educating staff on common data theft tactics, you’ll minimize unintentional data leakage.
  • Build awareness of the latest cyber threats: AI data poisoning, prompt injections, and backdoor attacks are all becoming more sophisticated by the day. Train your employees on the latest tactics used by cybercriminals.
  • Contain any potential damage, should a security incident take place. They’ll also know when to scale a potential issue to minimize the impact on the entire organization.
  • Ensure compliance with relevant industry requirements. This will help avoid penalties and fines for data privacy violations.

Embrace a ‘security in depth’ approach

Also known as ‘defense in depth’, it’s a layered defense control system. The objective is that if one security measure fails during an attack, the remaining ones should neutralize it.

So, what are some of these ‘layers’ that you could implement in your local AI? I recommend the following:

  • Anomaly detection, which means having your local language models analyze historical data and recognize patterns, which you identify as safe and ‘normal’ behavior. Your system should use its learnings to analyze incoming data on an ongoing basis and recognize any data that deviates from the approved standards.
  • Data redaction, where AI either removes or alters confidential and sensitive data so that it can’t be used by an unauthorized party.
  • Access monitoring and control, where your AI uses its anomaly detection powers and automatically alters access rights if a user starts behaving ‘suspiciously’. After revoking or limiting access rights, the AI could alert your cybersecurity team and have them inspect the issue.

Handling data risks in the time of AI

From now on, consumers will always expect high-level data security measures. It’s hardly surprising if not a day goes by without news of another major breach.

Now that many companies are turning to commercial AI models for their powerful features, I expect that the number of data breaches will grow even further. It’s not to say that companies should refrain from using AI.

They could, however, go with another approach and decide to build their own, local language models. By training them and controlling their security internally, you’re likely to get the best of both worlds – robust automation capabilities and top-level data safety.

Photo of Patryk Szczygło

More posts by this author

Patryk Szczygło

Patryk is an engineer leading R&D department to develop more knowledge in cutting edge...
How to build products fast?  We've just answered the question in our Digital Acceleration Editorial  Sign up to get access

We're Netguru!

At Netguru we specialize in designing, building, shipping and scaling beautiful, usable products with blazing-fast efficiency

Let's talk business!

Trusted by: