How to NOT Pass Customer PII or PHI to OpenAI LLMs?
Want to call LLM APIs but don't want to share sensitive PII or PHI with the LLM for HIPAA, PCI DSS, or ISO 27001 compliance reasons?
With the breakthrough in Foundation Models (FMs) and Large Language Models (LLMs), developers are building apps that improve this world. Around 15% of workers inadvertently pass sensitive data to ChatGPT. Let's dive deep into why developers are building apps on FMs and LLMs, what kind of sensitive data is shared with AI models like ChatGPT, GPT-4, or custom ones, why it is recommended not to pass sensitive data to AI models, and how to prevent sharing sensitive data with AI models.
Developers build applications on language models like ChatGPT (Large Language Models, or LLMs in general) for several reasons:
Advanced Natural Language Understanding: FMs and LLMs can understand and process human language at an unprecedented level. This enables developers to create applications that communicate effectively with users, making interactions more engaging and efficient.
Improved Decision-Making and Prediction: These models can analyze vast amounts of data and generate insights, offering valuable predictions and recommendations. This helps developers create applications that can make better decisions, automate processes, and optimize workflows.
Enhanced Creativity: FMs and LLMs can generate original content, such as text, music, or images, by learning from existing data. This capability enables developers to create applications that offer creative suggestions, generate personalized content, or inspire new ideas.
Time and Cost Savings: By leveraging the powerful capabilities of FMs and LLMs, developers can build applications that would previously have required significant time and resources, saving both time and money.
Scalability and Adaptability: Foundation Models can be fine-tuned and adapted to various domains and industries, enabling developers to create tailored solutions that address specific needs or problems.
Employees and developers inadvertently post sensitive data that may vary depending on the context and use case. Some common types of sensitive data include:
Personally Identifiable Information (PII): This refers to any information that can be used to identify an individual, directly or indirectly. Examples include names, addresses, phone numbers, email addresses, Social Security numbers, and driver's license numbers.
Financial Information: Data related to an individual's financial status or transactions, such as bank account numbers, credit card numbers, transaction history, and credit scores.
Health Information: Medical records, health conditions, diagnoses, treatments, and other health-related data that can be linked to an individual. This information is often regulated under laws such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States.
Biometric Data: Unique physical or behavioral characteristics of an individual, such as fingerprints, facial recognition data, voice patterns, or DNA sequences.
Employment Information: Data related to an individual's employment history, including job titles, salary, performance evaluations, and disciplinary records.
Education Information: Records related to an individual's educational background, such as transcripts, test scores, or enrollment history.
Location Data: Precise geolocation information that can reveal an individual's whereabouts or movements over time.
Communications Data: Contents of private communications, such as emails, text messages, or instant messages, which may contain sensitive information or reveal personal relationships.
Online Behavior: Data related to an individual's online activities, such as browsing history, search queries, or social media profiles, which can reveal personal interests, preferences, or affiliations.
Legal Information: Data related to an individual's legal history, such as criminal records, court proceedings, or background checks.
Personally Identifiable Information (PII) can be masked before it is sent to a Large Language Model (LLM), ensuring that the LLM operates only on masked or synthetic data rather than real customer information. This approach has several advantages: by keeping customer PII secure and out of reach of the LLM, businesses can achieve their objectives with a Responsible AI approach that prioritizes privacy and data security. Storing the original PII does introduce considerations around data retention and deletion policies, but this isn't necessarily a disadvantage. Rather, it's a conscious and necessary action from a security perspective, requiring proactive data management to ensure compliance with policies and regulations.
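To make this concrete, here is a minimal sketch of masking a prompt before it ever reaches the LLM, using the OpenAI Python SDK (v1+). The regex patterns, placeholder labels, and model name are illustrative assumptions, not a complete PII detector:

```python
# Minimal sketch: mask common PII patterns before the prompt reaches the LLM.
# The regexes below are illustrative, not exhaustive -- production systems should rely
# on a dedicated detection/redaction service rather than hand-rolled patterns.
import re
from openai import OpenAI  # OpenAI Python SDK v1+

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with placeholder tokens such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw_prompt = "Draft a follow-up email to john.doe@example.com, SSN 123-45-6789, phone 555-123-4567."
safe_prompt = mask_pii(raw_prompt)
# safe_prompt: "Draft a follow-up email to [EMAIL], SSN [SSN], phone [PHONE]."

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": safe_prompt}],
)
print(response.choices[0].message.content)
```

Because the model only ever sees placeholders, the response can be post-processed inside your own systems, where the real values remain under your control.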
Developers can redact sensitive text from prompts or from documents via Strac APIs.
Here is the API to redact sensitive text: https://docs.strac.io/#operation/redactTextDocument
Here is the API to redact sensitive documents: https://docs.strac.io/#operation/redactDocument
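For illustration, here is a hypothetical sketch of calling a text-redaction endpoint before prompting an LLM. The endpoint URL, auth header, and field names are assumptions made for readability; consult the linked Strac docs above for the actual request and response schema:

```python
# Hypothetical sketch of calling a Strac text-redaction endpoint before prompting an LLM.
# The endpoint path, auth header, and field names ("text", "redacted_text") are assumptions
# made for illustration -- see https://docs.strac.io for the actual schema.
import requests

STRAC_API_KEY = "your-strac-api-key"                    # assumption: API-key based auth
STRAC_REDACT_URL = "https://api.strac.io/redact/text"   # assumption: illustrative URL

def redact_with_strac(text: str) -> str:
    resp = requests.post(
        STRAC_REDACT_URL,
        headers={"Authorization": f"Bearer {STRAC_API_KEY}"},
        json={"text": text},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["redacted_text"]  # assumed response field

safe_prompt = redact_with_strac("Summarize the claim filed by Jane Roe, DOB 01/02/1980.")
# safe_prompt now contains placeholders instead of the name and date of birth,
# and can be passed to the LLM exactly as in the masking sketch above.
```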
Developers should take several precautions to ensure they do not pass sensitive data to AI models. Here are some best practices to follow:
Data Anonymization or Redaction: Remove or obfuscate any sensitive information, such as names, addresses, phone numbers, email addresses, and identification numbers from the data before passing it to the AI model. Techniques like masking, pseudonymization, and generalization can anonymize the data while retaining its utility.
Data Tokenization or Encryption: Tokenize or encrypt data before transmitting it to the AI model to protect it from unauthorized access (see the sketch after this list).
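Below is a minimal sketch of tokenization under simplifying assumptions: email addresses are the only sensitive field, and an in-memory dictionary stands in for a real, access-controlled token vault. The LLM sees only opaque tokens, and the originals are restored after the response comes back:

```python
# Minimal sketch of tokenization: swap each sensitive value for an opaque token,
# keep the mapping in a secure store (here, an in-memory dict for illustration),
# and re-insert the originals only after the LLM response is received.
import re
import uuid

token_vault: dict[str, str] = {}  # illustrative stand-in for a real, access-controlled vault

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def tokenize(text: str) -> str:
    """Replace each email address with an opaque token and record the mapping."""
    def _swap(match: re.Match) -> str:
        token = f"<TOKEN_{uuid.uuid4().hex[:8]}>"
        token_vault[token] = match.group(0)
        return token
    return EMAIL_RE.sub(_swap, text)

def detokenize(text: str) -> str:
    """Restore the original values for any tokens present in the text."""
    for token, original in token_vault.items():
        text = text.replace(token, original)
    return text

prompt = tokenize("Write a renewal reminder for jane.roe@example.com.")
# prompt: "Write a renewal reminder for <TOKEN_ab12cd34>." -- the LLM never sees the real address.
# llm_output = call_llm(prompt)         # hypothetical LLM call, as in the earlier sketches
# final_text = detokenize(llm_output)   # restore the real value only inside your own system
```

Unlike one-way redaction, tokenization is reversible by design, which is useful when the final output must contain the real values but the model provider should never see them.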
Strac was built to secure sensitive data (PII, PHI, API keys). It protects businesses by automatically redacting sensitive data across all communication channels, including email (Gmail, Microsoft 365), Slack DLP, customer support tools (Zendesk, Intercom, Kustomer, Help Scout, Salesforce, ServiceNow), and cloud storage solutions (OneDrive, SharePoint, Google Drive, Box, Dropbox).
Explore more on AI and Data security: