Feeding the Machine: Improving Population Health Analyses Through AI Tools

With the increased need for data to support artificial intelligence (AI) and large language models, data aggregation and de-identification are more important than ever when health data is involved. Although both concepts have long been used to support healthcare analytics, their use in connection with AI models requires careful attention. Poorly structured data practices can create regulatory risk, undermine contractual commitments, and erode patient trust.

For organizations seeking to use health data in AI-enabled population health work, there are several important pathways to consider. In broad terms, those pathways include:

data aggregation by a business associate;
use of a shared vendor or platform operating under appropriate business associate agreements (BAAs);
Health Insurance Portability and Accountability Act of 1996 (HIPAA)-compliant de-identification; and
treatment, payment, and healthcare operations activities.

These pathways often overlap in practice, but they are not interchangeable. Each has different requirements and limitations. A vendor that can perform data aggregation under one arrangement may not be permitted to use the same data for broader AI development. Similarly, a dataset that has been masked or tokenized may still be PHI unless it satisfies HIPAA’s de-identification standard. Understanding the differences among these pathways is particularly important as organizations use AI tools for population health analysis, quality improvement, and public health planning.

USING PHI IN AI TOOLS: FOUR PRIMARY PATHWAYS

Absent a HIPAA-compliant authorization from the individual or the individual’s personal representative, or another permitted basis under HIPAA, protected health information (PHI) subject to HIPAA may not be utilized by AI tools. Even where HIPAA permits a use or disclosure, the activity may still be constrained by the relevant BAA, data use agreement, notice of privacy practices, minimum necessary requirements, or other federal or state laws or contractual commitments.

Pathway 1 – Data Aggregation

In the HIPAA Privacy Rule, data aggregation is recognized as a service that business associates may perform for covered entities when authorized by an applicable BAA. See 65 Fed. Reg. 82462, 82475 (Dec. 28, 2000).

In this context, however, data aggregation is not simply the large-scale combination of health data for anonymization purposes. HIPAA defines “data aggregation” more narrowly: it is the combining of PHI created or received by a business associate in its capacity as the business associate of one covered entity with PHI that the business associate receives in its capacity as a business associate of another covered entity, to permit data analyses that relate to the healthcare operations of the respective covered entities. 45 CFR 164.501.

Pathway 2 – Shared Vendors

A second pathway involves the use of a shared vendor or platform that maintains appropriate BAAs with participating covered entities or business associates. This pathway may overlap with data aggregation, but it should be analyzed separately. A common vendor may support AI-enabled analytics, quality improvement, risk stratification, or care coordination services across multiple covered entities. If the vendor receives, maintains, or processes PHI on behalf of those entities, it will generally need to operate as a business associate for each of the covered entities and comply with the terms of each BAA.

BAAs should address permitted uses and disclosures, safeguards, subcontractor data handling, return or destruction of PHI, and whether data aggregation or other AI-related services are authorized. Importantly, the use of a common vendor does not, by itself, authorize broad secondary use of PHI.

Pathway 3 – De-Identification

De-identification is the removal of specific personal identifiers from PHI in accordance with HIPAA’s de-identification standard, enabling the data to be used and disclosed outside the HIPAA Privacy Rule’s restrictions.

Health information that does not identify the individual—or for which there is no reasonable basis to believe the information can be used to identify the individual—is, by definition, not individually identifiable health information. Therefore, once properly de-identified, health data is no longer PHI subject to HIPAA and may be used or disclosed without the restraints of that statute. Of course, such data may still be subject to other federal or state laws, contractual restrictions, or other data use commitments.

HIPAA’s de-identification standard may be satisfied through either the Safe Harbor method, which requires removal of 18 categories of identifiers and no actual knowledge that the remaining information could be used to identify the individual, or the Expert Determination method, under which a qualified expert applies generally accepted statistical and scientific principles and methods, determines that the risk of identification is very small, and documents the methods and results of the analysis. See 45 CFR 164.514(b)(1)–(2).
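As a rough illustration of the Safe Harbor method, the sketch below applies a few of the required transformations to a single record: dropping direct identifiers, reducing dates to the year, grouping ages over 89, and truncating ZIP codes to three digits. The field names, the function name, and the ZIP prefix set are hypothetical placeholders. The sketch covers only some of the 18 identifier categories and would not, by itself, satisfy the standard.

```python
from datetime import date

# Hypothetical field names; a real dataset will differ. This set covers only
# some of the 18 Safe Harbor identifier categories.
DIRECT_IDENTIFIER_FIELDS = {"name", "street_address", "phone", "email", "ssn", "mrn"}

# Placeholder values: 3-digit ZIP prefixes covering 20,000 or fewer people are
# replaced with "000". The real list must be derived from current Census data.
LOW_POPULATION_ZIP3 = {"036", "059"}


def safe_harbor_sketch(record: dict) -> dict:
    """Apply a few Safe Harbor style transformations to one record."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIER_FIELDS}

    # Dates directly related to the individual are reduced to the year.
    if "admit_date" in out:
        out["admit_year"] = out.pop("admit_date").year

    # Ages over 89 are grouped into a single "90 or older" category.
    if "age" in out:
        out["age"] = "90+" if out["age"] > 89 else out["age"]

    # ZIP codes are truncated to the first three digits, or zeroed out.
    if "zip" in out:
        zip3 = str(out.pop("zip"))[:3]
        out["zip3"] = "000" if zip3 in LOW_POPULATION_ZIP3 else zip3

    return out


example = safe_harbor_sketch(
    {"name": "Jane Doe", "age": 93, "zip": "03601",
     "admit_date": date(2023, 5, 1), "dx": "E11.9"}
)
```

Even after transformations like these, the "no actual knowledge" condition still has to be met, and free-text fields would need separate review.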

Pathway 4 – Treatment, Payment, Operations

Finally, HIPAA permissions for treatment, payment, and healthcare operations may support certain AI-enabled analytics activities, particularly where the activity is tied to the covered entity’s own operations or to permitted arrangements among participating entities. These may include quality assessment and improvement, population-based activities relating to improving health or reducing healthcare costs, case management, care coordination, and related functions. But the fact that an activity is framed as population health or quality improvement does not automatically mean that all proposed AI uses of PHI are permissible. The analysis remains fact-specific.

PRACTICE TIP: DE-IDENTIFICATION IS NOT THE SAME AS MASKING

De-identification is one of the most useful tools for AI-enabled health analytics because, if done properly, it removes the data from HIPAA’s restrictions. AI and machine learning tools can also assist with de-identification, especially when organizations are working with large datasets or unstructured text such as clinical notes, discharge summaries, and laboratory narratives.

Where PHI is appropriately de-identified, it can be further analyzed and potentially aggregated with other datasets. De-identified data can support medical research studies, policy assessments, quality improvement initiatives, effectiveness analyses, and other studies without violating patient privacy or requiring individual authorizations under HIPAA. At the same time, organizations should be cautious about assuming that technical changes to a dataset are enough.

A common misconception is that masking or pseudonymizing data (replacing specific identifiers with fictional values or tokens) amounts to de-identification. Although the practice is common, it does not qualify as true de-identification by itself because the data can be re-identified if the appropriate key is available. Tokenization, hashing, or coding may support a broader de-identification strategy, particularly under the Expert Determination method, but those techniques must be evaluated in context and should not be treated as automatically sufficient.
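The re-identification concern can be shown with a minimal sketch. It assumes identifiers are tokenized with a keyed hash (HMAC-SHA256 from Python's standard library). The function names and sample medical record numbers are hypothetical. The point is only that anyone holding the key and a list of candidate identifiers can map a token back to a person.

```python
import hashlib
import hmac
from typing import List, Optional


def tokenize(mrn: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (a pseudonym)."""
    return hmac.new(secret_key, mrn.encode("utf-8"), hashlib.sha256).hexdigest()


def reidentify(token: str, candidate_mrns: List[str], secret_key: bytes) -> Optional[str]:
    """With the key and a candidate list, a token can be matched to its source."""
    for mrn in candidate_mrns:
        if hmac.compare_digest(tokenize(mrn, secret_key), token):
            return mrn
    return None


key = b"demo-key-not-for-production"
token = tokenize("MRN-001", key)
recovered = reidentify(token, ["MRN-000", "MRN-001"], key)
```

Here `recovered` is the original identifier, which is why a tokenized dataset may remain PHI as long as the key, or an equivalent lookup table, is accessible to the party holding the data.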

AI tools can make de-identification more efficient and scalable for large and complex datasets, particularly when combined with appropriate validation and human oversight. Datasets can also be dynamically adjusted to reduce re-identification risks, including through active replacement of data values, generalization, aggregation of small cell sizes, or replacement of high-risk values. Because re-identification risk can change as outside datasets, recipients, and analytic techniques evolve, organizations should document their methodology and periodically reassess whether the data remains appropriate for the intended use.
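One of the techniques mentioned above, handling small cell sizes, can be sketched as follows. The function name, the field names, and the default threshold of 11 are assumptions for illustration only. The appropriate threshold depends on the dataset, the applicable agreements, and any expert determination.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Sequence, Tuple


def suppress_small_cells(
    rows: Iterable[dict],
    group_keys: Sequence[str],
    min_cell_size: int = 11,  # assumed threshold, not a HIPAA-mandated value
) -> Dict[Tuple, Optional[int]]:
    """Count rows per group and suppress (None) any count below the threshold."""
    counts = Counter(tuple(row[k] for k in group_keys) for row in rows)
    return {
        group: (n if n >= min_cell_size else None)
        for group, n in counts.items()
    }


rows = (
    [{"county": "A", "dx": "asthma"}] * 4
    + [{"county": "B", "dx": "asthma"}] * 2
)
table = suppress_small_cells(rows, ["county", "dx"], min_cell_size=3)
```

This sketch does not perform complementary suppression, so a suppressed count could still be inferred if row or column totals are published alongside it.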

HOW AGGREGATED DATA MOVES THROUGH THE AI LIFE CYCLE

Once an organization identifies a viable legal pathway to use health data for AI modeling, the next question is how the data will actually move through the AI process. In practice, aggregated health data tends to follow a life cycle: sources are identified, data is prepared and harmonized, the data is analyzed, outputs are validated, and the model or analytic process is governed over time.

AI can help at the earliest stages by identifying potential data sources, harmonizing inconsistent coding systems, normalizing variables, and surfacing patterns across populations appropriate for research studies and health analytics. But the starting point matters. Health data may be outdated, incomplete, or limited by permissions, authorizations, waivers, data use agreements, BAAs, or other restrictions applicable to the individuals whose data is used. It may also be limited by the number of available data points for a particular type of research.

Stage 1 – Data Selection

Organizations should evaluate data provenance, including where the data originated, how it was collected, what transformations have already been applied (if applicable), and whether the proposed AI use is consistent with the purpose for which the data was collected or disclosed.

Stage 2 – Data Harmonization

Once data sources are identified, organizations often need to clean, harmonize, and standardize the information before it can be used effectively. AI can assist by identifying inconsistent coding systems, normalizing variables, detecting outliers, and helping transform unstructured data into analyzable formats. Where the organization intends to use de-identified data, this is also the point at which HIPAA-compliant de-identification should occur.
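A minimal sketch of the harmonization step, assuming two source systems that code sex and weight differently. The code map, field names, and function name are hypothetical. In practice the mappings would come from each source system's data dictionary.

```python
from typing import Optional

# Assumed mappings for illustration; real source systems must be checked.
SEX_CODE_MAP = {
    "m": "male", "male": "male", "1": "male",
    "f": "female", "female": "female", "2": "female",
}
LB_PER_KG = 2.20462


def harmonize(record: dict) -> dict:
    """Map inconsistent sex codes and weight units onto one common format."""
    raw_sex = str(record.get("sex", "")).strip().lower()
    sex = SEX_CODE_MAP.get(raw_sex, "unknown")

    weight = record.get("weight")
    unit = str(record.get("weight_unit", "")).strip().lower()
    weight_kg: Optional[float]
    if weight is None:
        weight_kg = None
    elif unit == "kg":
        weight_kg = float(weight)
    elif unit in ("lb", "lbs"):
        weight_kg = round(float(weight) / LB_PER_KG, 1)
    else:
        weight_kg = None  # unknown unit: leave missing rather than guess

    return {"sex": sex, "weight_kg": weight_kg}


site_a = harmonize({"sex": "F", "weight": 220.462, "weight_unit": "lb"})
site_b = harmonize({"sex": "1", "weight": 70, "weight_unit": "KG"})
```

Unrecognized values are mapped to "unknown" or left missing so that harmonization does not silently invent data.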

Stage 3 – Data Aggregation and Analysis

The aggregation stage is where AI offers some of its most significant advantages. AI tools can analyze large datasets, identify correlations, generate hypotheses, and support population health research in areas that are typically underfunded or overlooked. Public health organizations, such as accountable communities of health, public health plans, and health associations that have data suitable for HIPAA-compliant de-identification or already maintain large de-identified datasets, can leverage AI tools to identify community and population-level health trends and develop evidence-based solutions.

AI can also generate synthetic data, allowing researchers to fill gaps in datasets and simulate data where real inputs are missing. Synthetic data may be useful for testing models, developing analytic tools, or supplementing limited datasets. However, synthetic data should not be assumed to be de-identified merely because it is synthetic. Organizations should evaluate whether the generation process could reproduce, reveal, or be reverse engineered to expose information about real individuals.
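One simple screen for the concern described above is to check whether any synthetic record exactly reproduces a real record on a chosen set of fields. The sketch below is an illustrative assumption, not an established test. The function and field names are hypothetical, and passing this check does not show that the synthetic data is de-identified.

```python
from typing import List, Sequence


def find_copied_records(
    real_rows: List[dict],
    synthetic_rows: List[dict],
    fields: Sequence[str],
) -> List[dict]:
    """Return synthetic rows that exactly match a real row on the given fields."""
    real_keys = {tuple(row[f] for f in fields) for row in real_rows}
    return [
        row for row in synthetic_rows
        if tuple(row[f] for f in fields) in real_keys
    ]


real = [{"age": 47, "zip3": "981", "dx": "E11.9"}]
synthetic = [
    {"age": 47, "zip3": "981", "dx": "E11.9"},  # identical to a real record
    {"age": 52, "zip3": "980", "dx": "I10"},
]
copies = find_copied_records(real, synthetic, ["age", "zip3", "dx"])
```

An exact-match screen catches only verbatim copies. Near matches and rare attribute combinations can still reveal information about real individuals and would need a separate review.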

Stage 4 – Output Validation and Monitoring

The same qualities that make AI useful also create risk. AI is only as reliable as the data it is fed. Bias, outliers, and inconsistencies can compromise a model’s accuracy, and errors in AI can be amplified, leading to unintended or misleading results. For that reason, organizations should validate AI outputs before relying on them for clinical, operational, research, or policy decisions. Validation should include review of data quality, representativeness, model performance, potential bias, auditability, security controls, and whether outputs are explainable and appropriate for the intended use.
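As one small piece of such a validation, the sketch below compares a model's accuracy across subgroups to surface possible bias. The function names and the input shape, tuples of (group, actual, predicted), are assumptions for illustration. A real review would use metrics suited to the model and the population.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def accuracy_by_group(
    records: Iterable[Tuple[Hashable, int, int]]
) -> Dict[Hashable, float]:
    """Compute the share of correct predictions within each subgroup."""
    correct: Dict[Hashable, int] = defaultdict(int)
    total: Dict[Hashable, int] = defaultdict(int)
    for group, actual, predicted in records:
        total[group] += 1
        correct[group] += int(actual == predicted)
    return {group: correct[group] / total[group] for group in total}


def largest_gap(accuracy: Dict[Hashable, float]) -> float:
    """Difference between the best and worst performing subgroup."""
    return max(accuracy.values()) - min(accuracy.values())


results = [("rural", 1, 1), ("rural", 0, 0), ("urban", 1, 0), ("urban", 1, 1)]
per_group = accuracy_by_group(results)
gap = largest_gap(per_group)
```

A large gap does not by itself establish bias, but it signals that the affected subgroup's data quality and representativeness deserve a closer look.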

The analysis does not end when a model produces an output. Re-identification risk can change over time as outside datasets, analytic techniques, and recipients evolve. A model that performs adequately at one point may also degrade as clinical practices, coding systems, populations, or data inputs change. Organizations should document their methodology, monitor model performance, and periodically reassess whether the data and model remain appropriate for the intended use.
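Ongoing monitoring can start with something as simple as comparing recent performance against a documented baseline. In the sketch below, the function name and the tolerance of 0.05 are hypothetical. The right metric and threshold depend on the model and its intended use.

```python
from typing import Sequence


def needs_review(
    baseline_accuracy: float,
    recent_outcomes: Sequence[bool],
    max_drop: float = 0.05,  # assumed tolerance for illustration
) -> bool:
    """Flag the model for review if recent accuracy falls too far below baseline.

    recent_outcomes holds one boolean per recent prediction (True = correct).
    """
    if not recent_outcomes:
        return True  # no recent evidence: treat as needing review
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return (baseline_accuracy - recent_accuracy) > max_drop


degraded = needs_review(0.9, [True] * 8 + [False] * 2)
stable = needs_review(0.9, [True] * 9 + [False] * 1)
```

A flag like this only prompts human review. It does not explain why performance changed or whether re-identification risk has shifted.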

CONSIDERATIONS

AI introduces powerful new tools for data de-identification and aggregation that may help organizations focus on quality improvement, population health, and public health trends. It also raises familiar privacy questions in a new form. Before feeding health data into an AI tool, organizations should identify the applicable HIPAA pathway, confirm that the proposed use is permitted by relevant agreements and privacy commitments, and understand how the data will be governed throughout its life cycle.

For AI-enabled population health work, de-identification and properly structured data aggregation are critical tools. But they operate differently and should not be treated as interchangeable safe harbors. Given the contractual requirements and the narrow scope of the de-identification standards, limited data set rules, HIPAA data aggregation permissions, and healthcare operations pathways, interested persons should consult a qualified healthcare privacy attorney before using health data in AI tools.
