6 min read

Why healthcare organizations should avoid non-HIPAA compliant AI tools


Artificial intelligence is rapidly transforming healthcare operations. From automating administrative workflows to assisting in patient education, tools like ChatGPT and other large language models (LLMs) promise efficiency and innovation. But in a highly regulated industry like healthcare, not all AI is created equal. When healthcare professionals use non-HIPAA compliant AI tools, they expose their organizations to compliance, ethical, and security risks.

As David Holt, Owner, Holt Law LLC, says, “ChatGPT definitely has the potential to make a big difference in healthcare by speeding up administrative work, helping staff, and making patient education more engaging. There are some important limitations to keep in mind. First, it doesn’t actually ‘understand’ medicine—it can sound confident even when it gives incorrect or misleading information, which could be risky in a clinical setting. It’s also not up to date with the latest medical guidelines or treatments if you're using versions trained on older data. Another issue is bias—since ChatGPT was trained on large sets of data from the internet, it can reflect gaps and inequalities that already exist in healthcare, especially for underrepresented communities. Plus, as of today, it can only work with text, so it’s not helpful for anything that involves images, like X-rays or visual diagnoses. Sometimes the answers it gives can be too general or surface-level, missing the detail you’d need in complex medical situations. And maybe most importantly, the public versions aren’t HIPAA-compliant, which means using them with any patient data could lead to privacy risks or security breaches.”

David Holt’s insights reflect the growing realization that data protection and patient privacy cannot be outsourced to general-purpose AI. Public versions of LLMs like ChatGPT, Claude, or Gemini were never designed to handle protected health information (PHI). Using them in clinical, administrative, or even educational workflows that involve PHI may violate HIPAA, risking severe penalties and reputational harm.

 

Risks of using non-medical LLMs in clinical and administrative settings

Outdated knowledge and sources

Public LLMs are trained on static datasets that stop at a certain date. That means guidance about treatments, drug approvals, or clinical protocols can be out of date. “Large Language Models (LLMs) are often paired with a reported cutoff date, the time at which training data was gathered. Such information is crucial for applications where the LLM must provide up-to-date information,” states the study Dated Data: Tracing Knowledge Cutoffs in Large Language Models. With healthcare guidelines continuously changing and new evidence appearing frequently, relying on a model that doesn’t automatically reference the latest literature risks recommending obsolete or unsafe actions.

Even when models are updated more frequently, they are not substitutes for curated, peer-reviewed clinical guidance or local formularies. For use cases such as patient education or administrative drafting, LLMs can help with tone and structure, but they must be paired with verified, up-to-date clinical checks before the content is used with patients.
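As a rough illustration of that pairing, the sketch below holds back an LLM-drafted patient handout unless every clinical reference a reviewer checked it against is reasonably fresh. The GuidelineReference structure, the 24-month threshold, and the sample guideline are illustrative assumptions, not part of any standard or product.

```python
# A minimal sketch (not a vetted clinical tool): hold back an LLM-drafted patient
# handout unless every clinical reference it was checked against is reasonably fresh.
# GuidelineReference, the 24-month threshold, and the sample data are illustrative
# assumptions, not part of any standard.
from dataclasses import dataclass
from datetime import date


@dataclass
class GuidelineReference:
    title: str
    last_reviewed: date  # when the guideline was last reviewed or updated


def draft_is_publishable(draft: str,
                         references: list[GuidelineReference],
                         max_age_months: int = 24) -> tuple[bool, list[str]]:
    """Return (ok, stale_titles); block publication if any reference is stale."""
    today = date.today()
    stale = [
        ref.title for ref in references
        if (today.year - ref.last_reviewed.year) * 12
           + (today.month - ref.last_reviewed.month) > max_age_months
    ]
    return (bool(draft.strip()) and not stale, stale)


# Example: a draft checked only against a 2019 guideline is held for re-review.
ok, stale = draft_is_publishable(
    "Take your blood pressure medication at the same time each day...",
    [GuidelineReference("Hypertension management guideline", date(2019, 5, 1))],
)
print(ok, stale)  # False ['Hypertension management guideline']
```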

 

AI hallucinations

LLMs frequently hallucinate, generating false, fabricated, or misleading information. Even a low rate of hallucinations can be problematic in clinical settings, because outputs are delivered in formal, authoritative language that can mislead administrators or busy clinicians.

The authors of the study, Developing and evaluating large language model–generated emergency medicine handoff notes, compared LLM-generated clinical notes to physician-written notes and found higher error rates in the model outputs: LLM notes had a 9.6% error rate compared to 2.0% for physician notes. Though many errors in that study were not catastrophic, the phenomenon is real and measurable. When hallucinations affect patient-facing or decision-influencing content, patient safety can quickly be jeopardized.

Examples from everyday life highlight the stakes: a high-profile error in a Google research write-up, the invented term “basilar ganglia,” showed how model-style mistakes can slip into clinical materials and be missed by reviewers, raising alarms about automation bias.


 

Algorithmic bias 

LLMs reflect the vast quantities of online content they were trained on, and that data often contains systemic biases, preconceptions, and gaps. Algorithmic bias in healthcare can cause models to underidentify symptoms in specific populations, recommend culturally inappropriate solutions, or widen disparities by favoring the language and norms of the majority group.

Research shows how bias can enter AI systems at multiple stages, including data collection, labeling, model construction, and deployment, and how, if not actively mitigated, these biases can replicate or exacerbate health disparities. For this reason, any plan for implementing AI in healthcare must include diverse datasets, fairness testing, and stakeholder engagement.

 

Limitations in modality and clinical detail

David Holt noted that “as of today, it can only work with text”—an important operational limitation for many consumer LLMs. Clinical work often depends on multimodal data: imaging (X-rays, CTs), waveforms (ECGs), scans, and photos. While specialized multimodal models are emerging, generic public LLMs are not designed to parse or interpret clinical images, nor to integrate them meaningfully into diagnostic reasoning.

Even in pure-text tasks, LLMs tend to produce generalist answers. They may miss the nuance required for complex cases: differential diagnosis subtleties, drug interactions in polypharmacy, dose adjustments for renal impairment, or contraindications tied to comorbidities. Those gaps make them unsuitable to replace clinical judgment.

 

Compliance and privacy

Perhaps the most immediate operational risk for providers is privacy and regulatory compliance. Public versions of consumer LLM platforms do not enter into business associate agreements (BAAs) with covered entities, and data sent to those services can be retained and used to improve models. That means putting protected health information (PHI) into a public LLM may constitute a HIPAA violation or a data breach.

Privacy experts offer unambiguous guidance on this point: clinicians and staff should not paste PHI into non-medical LLMs without a HIPAA-compliant contractual and technical arrangement in place, such as a signed BAA, zero-data-retention endpoints, and enterprise solutions with appropriate access controls.
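As one illustration of what those controls can look like in practice, here is a minimal, hypothetical pre-flight check that refuses to send text containing obvious identifier patterns to a public LLM endpoint. The regex patterns are simplified assumptions; a heuristic like this is a backstop, not a substitute for a validated de-identification pipeline or a signed BAA.

```python
# A rough illustration only: a pre-flight check that refuses to send text containing
# obvious identifier patterns to a public LLM endpoint. The patterns below are
# simplified assumptions and will miss many identifiers; treat this as a backstop,
# not a substitute for a validated de-identification pipeline or a signed BAA.
import re

PHI_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "mrn_like": re.compile(r"\bMRN[:\s]*\d{6,10}\b", re.IGNORECASE),
    "date_like": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def safe_to_send(text: str) -> tuple[bool, list[str]]:
    """Return (ok, hits); block the request if any identifier pattern matches."""
    hits = [name for name, pattern in PHI_PATTERNS.items() if pattern.search(text)]
    return (not hits, hits)


ok, hits = safe_to_send("Pt John Doe, MRN: 00482913, DOB 04/12/1961, needs a discharge FAQ")
if not ok:
    print(f"Blocked before sending: possible identifiers detected ({', '.join(hits)})")
```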

 

Risks to safety in high-stakes, urgent situations

The use of poorly regulated AI in healthcare has been identified by ECRI as a major health-technology risk. Misleading AI outputs can cause disproportionate harm in emergency and acute settings, where time is of the essence and decisions have urgent implications. Left unchecked, the combination of clinician automation bias, time pressure, and authoritative-sounding language creates a serious hazard.

Read more: Dangers of AI tops health tech hazards list for 2025

 

Vulnerabilities to transcription errors and adversarial attacks

Adversarial inputs and transcription errors also pose practical dangers beyond accidental hallucinations. In the study Multi-model assurance analysis showing large language models are highly vulnerable to adversarial hallucination attacks during clinical decision support, the researchers show that LLMs can be manipulated or can mistakenly transcribe content, at times inserting fabricated sentences or misattributing statements in medical conversations. Even a small percentage of errors in clinical transcripts can have outsized consequences in legal or clinical documentation. 

Read also: Hospitals use a transcription tool powered by an error-prone OpenAI model

 

Best practices to use LLMs safely in healthcare

The systematic review Implementing large language models in healthcare while balancing control, collaboration, costs and security highlights stakeholder engagement, continuous monitoring, workflow alignment, and ethical governance as pillars of AI adoption in healthcare. From it, we can derive a robust set of safe-use practices for LLMs in clinical settings:

  • Engage multidisciplinary stakeholders early: The review underscored that successful AI systems adopt a human-centred, problem-driven approach with engagement from clinicians, biomedical scientists, operational leads, IT staff, and patients. For LLM deployment: build a team that spans compliance, clinicians, legal, data science, and patient-education specialists. Define clearly what tasks the LLM will assist with (e.g., draft education materials, summarize admin forms) and map out human review steps.
  • Define appropriate use-cases for LLMs (low-risk first): The study found that many AI pilots falter when they are misaligned with clinical workflows or attempt to solve the wrong problem. For LLMs, begin with nonclinical, administrative, or patient-communication tasks rather than diagnostic or treatment decision roles. For example: generating patient onboarding emails, drafting FAQs, and summarizing policy updates, always with human oversight.
  • Pair LLM outputs with curated, up-to-date knowledge sources: One key takeaway from the review is that model performance alone (e.g., retrospective accuracy) does not guarantee clinical utility or safety. For LLM use, ensure outputs reference verified clinical guidelines, formulary documents, or institution-specific protocols. Institute a retrieval-augmented generation (RAG) workflow: the LLM drafts, then a human reviews using current evidence (a minimal sketch of this flow follows this list).
  • Implement human-in-the-loop (HITL) workflows and preserve final clinical judgment: The review emphasized that AI should augment, not replace, human intelligence, particularly in healthcare, where nuance, judgment, and context matter. For LLMs: every piece of content used with patients should be reviewed by a qualified staff member; if the LLM influences clinical decision-making, that decision must be documented alongside the human reviewer’s sign-off.
  • Continuous monitoring, maintenance, and feedback loops: Post-deployment monitoring was a major theme in the systematic review, which treats “monitor and maintain” as a core component of a reliable AI system. For LLMs: log usage, track error reports or feedback from staff and patients, and audit for hallucinations, bias, or misalignment. Update workflows if model drift or new risk patterns appear.
  • Bias testing, fairness assessment, and ethical governance: The literature highlighted the need to address algorithmic bias, fairness, ethics, and governance in AI adoption. For LLMs: conduct fairness audits across patient demographics (age, gender, ethnicity, and language dialects). Ensure that templates generated don’t perpetuate stereotypes or exclude underserved populations. Implement governance structures: ethics oversight and documentation of risk mitigation strategies.
  • Privacy, compliance, and data protection safeguards: While the review focused broadly on AI systems, its governance findings apply equally to LLMs used in healthcare. Data privacy, security, traceability, and accountability are essential. For LLM use: avoid feeding un-redacted PHI into public LLMs; use enterprise versions with zero retention or sign a business associate agreement (BAA) where required; log data flows, encryption, and access controls; ensure that any content interacting with patient data aligns with Health Insurance Portability and Accountability Act (HIPAA) requirements and your local jurisdiction’s laws.
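To make the RAG and human-in-the-loop points above concrete, here is a minimal sketch of the overall flow under stated assumptions: retrieve_guidelines() and llm_draft() are hypothetical stand-ins for an institution-approved knowledge base and a BAA-covered LLM endpoint. The point is the structure (retrieve, draft with sources, mandatory sign-off), not any specific vendor API.

```python
# A minimal sketch of the retrieval-augmented, human-in-the-loop flow described above.
# retrieve_guidelines() and llm_draft() are hypothetical stand-ins for an
# institution-approved knowledge base and a BAA-covered LLM endpoint; the point is
# the structure (retrieve, draft with sources, mandatory sign-off), not a vendor API.
from dataclasses import dataclass


@dataclass
class ReviewedDraft:
    text: str
    sources: list[str]
    approved: bool = False
    reviewer: str = ""


def retrieve_guidelines(topic: str) -> list[str]:
    # Placeholder: query your curated, institution-approved guideline store.
    return [f"Internal protocol excerpt on {topic} (last reviewed 2025-01)"]


def llm_draft(topic: str, sources: list[str]) -> str:
    # Placeholder: call your approved LLM with the retrieved sources as grounding context.
    return f"Draft patient handout on {topic}, grounded in {len(sources)} cited source(s)."


def generate_for_review(topic: str) -> ReviewedDraft:
    sources = retrieve_guidelines(topic)
    return ReviewedDraft(text=llm_draft(topic, sources), sources=sources)


def sign_off(draft: ReviewedDraft, reviewer: str) -> ReviewedDraft:
    # Nothing is released to patients until a qualified reviewer approves it.
    draft.approved, draft.reviewer = True, reviewer
    return draft


draft = generate_for_review("post-operative wound care")
released = sign_off(draft, reviewer="RN J. Smith")  # human review is mandatory
print(released.approved, released.sources)
```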

See also: HIPAA Compliant Email: The Definitive Guide (2025 Update)

 

FAQs

What are non-medical LLMs?

Non-medical LLMs are large language models like ChatGPT or Gemini that were trained on general internet data rather than healthcare-specific, peer-reviewed medical datasets. They can write or summarize text effectively, but were not designed for clinical accuracy, safety, or compliance with healthcare regulations such as HIPAA.

 

Are there HIPAA compliant versions of ChatGPT or other LLMs?

Yes, some enterprise-grade platforms, such as Microsoft Azure OpenAI Service, can offer HIPAA compliance if a business associate agreement (BAA) is in place and appropriate data-handling safeguards are configured. Always confirm this directly with the vendor before using PHI.
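For illustration only, the snippet below sketches what calling an Azure OpenAI deployment (rather than a public chatbot) might look like, assuming a signed BAA and properly configured data-handling controls; the endpoint, deployment name, and API version are placeholders. Configuration like this supports, but does not by itself establish, HIPAA compliance.

```python
# Illustrative only: calling an Azure OpenAI deployment instead of a public chatbot,
# assuming your organization has a signed BAA and has configured the service's
# data-handling controls. The endpoint, deployment name, and API version are
# placeholders; code alone does not make a workflow HIPAA compliant.
import os

from openai import AzureOpenAI  # requires the openai Python package, v1.x

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",                                  # placeholder version
    azure_endpoint="https://your-resource.openai.azure.com",   # placeholder endpoint
)

response = client.chat.completions.create(
    model="your-deployment-name",  # the Azure *deployment* name, not a public model
    messages=[
        {"role": "system", "content": "Draft plain-language patient education text. Do not invent clinical facts."},
        {"role": "user", "content": "Explain how to prepare for a fasting blood test."},
    ],
)
print(response.choices[0].message.content)
```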

 

What should an organization do if it accidentally enters PHI into a public LLM?

Treat it as a potential HIPAA breach. Notify your compliance officer immediately, document the exposure, and follow your organization’s breach response plan. Evaluate whether the data can be contained and whether patient notification or HHS reporting is required.