How to NOT Pass Customer PII or PHI to OpenAI LLMs?
Want to call LLM APIs but don't want to share sensitive PII or PHI with the LLM for HIPAA, PCI DSS, or ISO 27001 compliance reasons?
With the breakthrough in Foundation Models (FMs) and Large Language Models (LLMs), developers are building apps that improve this world. Around 15% of workers inadvertently pass sensitive data to ChatGPT. Let's dive deep into why developers are building apps on FMs and LLMs, what kind of sensitive data is shared with AI models like ChatGPT, GPT-4, or custom ones, why it is recommended not to pass sensitive data to AI models, and how to prevent sharing sensitive data with AI models.
Developers build applications on language models like ChatGPT (Large Language Models, or LLMs in general) for several reasons:
Advanced Natural Language Understanding: FMs and LLMs can understand and process human language at an unprecedented level. This enables developers to create applications that communicate effectively with users, making interactions more engaging and efficient.
Improved Decision-Making and Prediction: These models can analyze vast amounts of data and generate insights, offering valuable predictions and recommendations. This helps developers create applications that can make better decisions, automate processes, and optimize workflows.
Enhanced Creativity: FMs and LLMs can generate original content, such as text, music, or images, by learning from existing data. This capability enables developers to create applications that offer creative suggestions, generate personalized content, or inspire new ideas.
Time and Cost Savings: By leveraging the powerful capabilities of FMs and LLMs, developers can build applications that would previously have required significant time and resources, saving both time and money.
Scalability and Adaptability: Foundation Models can be fine-tuned and adapted to various domains and industries, enabling developers to create tailored solutions that address specific needs or problems.
Employees and developers inadvertently post sensitive data that may vary depending on the context and use case. Some common types of sensitive data include:
Personally Identifiable Information (PII): This refers to any information that can be used to identify an individual, directly or indirectly. Examples include names, addresses, phone numbers, email addresses, Social Security numbers, and driver's license numbers.
Financial Information: Data related to an individual's financial status or transactions, such as bank account numbers, credit card numbers, transaction history, and credit scores.
Health Information: Medical records, health conditions, diagnoses, treatments, and other health-related data that can be linked to an individual. This information is often regulated under laws such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States.
Biometric Data: Unique physical or behavioral characteristics of an individual, such as fingerprints, facial recognition data, voice patterns, or DNA sequences.
Employment Information: Data related to an individual's employment history, including job titles, salary, performance evaluations, and disciplinary records.
Education Information: Records related to an individual's educational background, such as transcripts, test scores, or enrollment history.
Location Data: Precise geolocation information that can reveal an individual's whereabouts or movements over time.
Communications Data: Contents of private communications, such as emails, text messages, or instant messages, which may contain sensitive information or reveal personal relationships.
Online Behavior: Data related to an individual's online activities, such as browsing history, search queries, or social media profiles, which can reveal personal interests, preferences, or affiliations.
Legal Information: Data related to an individual's legal history, such as criminal records, court proceedings, or background checks.
Personally Identifiable Information (PII) can be masked before it is sent to a Large Language Model (LLM), ensuring that the LLM operates only on masked or synthetic data rather than real customer information. This approach has several advantages: by keeping customer PII secure and out of reach of the LLM, businesses can achieve their objectives with a Responsible AI approach that prioritizes privacy and data security. Storing the original PII does introduce considerations around data retention and deletion policies, but this isn't necessarily a disadvantage. Rather, it's a conscious and necessary action from a security perspective, requiring proactive data management to ensure compliance with policies and regulations.
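To make this concrete, here is a minimal sketch of masking a prompt before it ever reaches the LLM, using the OpenAI Python SDK (v1+). The regex patterns, placeholder labels, and model name are illustrative assumptions, not a complete PII detector:

```python
# Minimal sketch: mask common PII patterns before the prompt reaches the LLM.
# The regexes below are illustrative, not exhaustive -- production systems should rely
# on a dedicated detection/redaction service rather than hand-rolled patterns.
import re
from openai import OpenAI  # OpenAI Python SDK v1+

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with placeholder tokens such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw_prompt = "Draft a follow-up email to john.doe@example.com, SSN 123-45-6789, phone 555-123-4567."
safe_prompt = mask_pii(raw_prompt)
# safe_prompt: "Draft a follow-up email to [EMAIL], SSN [SSN], phone [PHONE]."

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": safe_prompt}],
)
print(response.choices[0].message.content)
```

Because the model only ever sees placeholders, the response can be post-processed inside your own systems, where the real values remain under your control.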
Developers can redact sensitive text from prompts or from documents via Strac APIs.
Here is the API to redact sensitive text: https://docs.strac.io/#operation/redactTextDocument
Here is the API to redact sensitive documents: https://docs.strac.io/#operation/redactDocument
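For illustration, here is a hypothetical sketch of calling a text-redaction endpoint before prompting an LLM. The endpoint URL, auth header, and field names are assumptions made for readability; consult the linked Strac docs above for the actual request and response schema:

```python
# Hypothetical sketch of calling a Strac text-redaction endpoint before prompting an LLM.
# The endpoint path, auth header, and field names ("text", "redacted_text") are assumptions
# made for illustration -- see https://docs.strac.io for the actual schema.
import requests

STRAC_API_KEY = "your-strac-api-key"                    # assumption: API-key based auth
STRAC_REDACT_URL = "https://api.strac.io/redact/text"   # assumption: illustrative URL

def redact_with_strac(text: str) -> str:
    resp = requests.post(
        STRAC_REDACT_URL,
        headers={"Authorization": f"Bearer {STRAC_API_KEY}"},
        json={"text": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["redacted_text"]  # assumed response field

safe_prompt = redact_with_strac("Summarize the claim filed by Jane Roe, DOB 01/02/1980.")
# safe_prompt now contains placeholders instead of the name and date of birth,
# and can be passed to the LLM exactly as in the masking sketch above.
```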
Developers should take several precautions to ensure they do not pass sensitive data to AI models. Here are some best practices to follow:
Data Anonymization or Redaction: Remove or obfuscate any sensitive information, such as names, addresses, phone numbers, email addresses, and identification numbers from the data before passing it to the AI model. Techniques like masking, pseudonymization, and generalization can anonymize the data while retaining its utility.
Data Tokenization or Encryption: Tokenize or encrypt data before transmitting it to the AI model to protect it from unauthorized access (see the sketch after this list).
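Below is a minimal sketch of tokenization under simplifying assumptions: email addresses are the only sensitive field, and an in-memory dictionary stands in for a real, access-controlled token vault. The LLM sees only opaque tokens, and the originals are restored after the response comes back:

```python
# Minimal sketch of tokenization: swap each sensitive value for an opaque token,
# keep the mapping in a secure store (here, an in-memory dict for illustration),
# and re-insert the originals only after the LLM response is received.
import re
import uuid

token_vault: dict[str, str] = {}  # illustrative stand-in for a real, access-controlled vault

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def tokenize(text: str) -> str:
    """Replace each email address with an opaque token and record the mapping."""
    def _swap(match: re.Match) -> str:
        token = f"<TOKEN_{uuid.uuid4().hex[:8]}>"
        token_vault[token] = match.group(0)
        return token
    return EMAIL_RE.sub(_swap, text)

def detokenize(text: str) -> str:
    """Restore the original values for any tokens present in the text."""
    for token, original in token_vault.items():
        text = text.replace(token, original)
    return text

prompt = tokenize("Write a renewal reminder for jane.roe@example.com.")
# prompt: "Write a renewal reminder for <TOKEN_ab12cd34>." -- the LLM never sees the real address.
# llm_output = call_llm(prompt)         # hypothetical LLM call, as in the earlier sketches
# final_text = detokenize(llm_output)   # restore the real value only inside your own system
```

Unlike one-way redaction, tokenization is reversible by design, which is useful when the final output must contain the real values but the model provider should never see them.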
Strac was built to secure sensitive data (PII, PHI, API keys). It protects businesses by automatically redacting sensitive data across all communication channels, including email (Gmail, Microsoft 365), Slack DLP, customer support tools (Zendesk, Intercom, Kustomer, Help Scout, Salesforce, ServiceNow), and cloud storage solutions (OneDrive, SharePoint, Google Drive, Box, Dropbox).
Explore more on AI and Data security: