
Synthetic Data Guide

Mike Eng edited this page Apr 6, 2022 · 60 revisions

For development and testing purposes, the following claims data APIs:

provide a set of public synthetic beneficiary and claims data, captured in synthetic Patient, Coverage, and ExplanationOfBenefit (EOB) resources. You can use this data as you explore and develop against the APIs in the sandbox environments. The synthetic data contains no personally identifiable information (PII) or protected health information (PHI).

Our synthetic data will remain unchanged so that you can rely on it for ongoing use. That is, you can write tests, demos, etc. against this data, and the underlying data should not change later and produce unexpected results.

Use Cases for Synthetic Data

This data is intended for use in development and testing activities, e.g. building out an application before you have production access, or as a safe set of data for ongoing development. It is also useful for demoing your application without exposing any beneficiary PII or PHI.

How to Access the Synthetic Data

The synthetic data is available in each API's sandbox environment. Refer to the specific API's documentation for details.

Distinguishing Synthetic Data from Production Data

It's easy to distinguish the synthetic data from production data: all synthetic records have Patient.id and ExplanationOfBenefit.id values that are negative. All production records have positive values for those fields.
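The sign check described above can be sketched as a small helper. This is a minimal illustration, not part of any API client: the function name `is_synthetic_id` is hypothetical, and the optional claim-type prefix (such as `carrier-`) in front of the numeric part is an assumption about how ids may be formatted. Adjust the pattern to the ids your API actually returns.

```python
import re

# Assumption: ids arrive as strings, optionally prefixed with a claim-type
# label and a hyphen (e.g. "carrier--10000000000123"). The prefix format is
# hypothetical; only the "negative means synthetic" rule comes from this guide.
_ID_PATTERN = re.compile(r"^(?:[A-Za-z]+-)?(-?\d+)$")


def is_synthetic_id(resource_id: str) -> bool:
    """Return True when the numeric part of the id is negative."""
    match = _ID_PATTERN.match(resource_id.strip())
    if match is None:
        raise ValueError(f"Unrecognized id format: {resource_id!r}")
    return int(match.group(1)) < 0
```

A check like this can guard demo or test code so that it refuses to run against records whose ids are positive.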

Please note that, while we are working to improve the coverage and quality of our synthetic data, there are still differences between it and our production data. In general, you should expect that:

  • Not all fields will be present in the synthetic data.
  • Some of the fields that are present may not have realistic values.
  • The various releases of synthetic data (see below) will have differences from each other.
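Because some fields may be absent from synthetic records, client code may want to read them defensively. The sketch below assumes the resource has already been parsed into plain Python dicts and lists; `get_path` and the sample `patient` record are hypothetical and exist only for illustration.

```python
from typing import Any, Optional


def get_path(resource: dict, *keys: Any, default: Optional[Any] = None) -> Any:
    """Walk nested dicts/lists, returning `default` when any step is missing."""
    current: Any = resource
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical, trimmed-down synthetic Patient record with no birthDate.
patient = {
    "resourceType": "Patient",
    "id": "-10000000000001",
    "name": [{"family": "Doe"}],
}

family_name = get_path(patient, "name", 0, "family")            # field present
birth_date = get_path(patient, "birthDate", default="unknown")  # field absent
```

Returning a default instead of raising keeps tests stable when they run against releases that populate different sets of fields.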

Available Synthetic Beneficiaries

| Title | Beneficiary ID Ranges | Characteristics |
| --- | --- | --- |
| from October, 2021 | -10000000000000 to -10000000009999 | Beneficiary Characteristics File |
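As a rough sketch, the October 2021 ID range listed above can be checked in code. The constant and function names are hypothetical; only the two bounds come from this guide. The last line also shows that these values fall outside the signed 32-bit range, so a 64-bit integer type is needed in languages with fixed-width integers.

```python
# Bounds copied from the beneficiary ID range listed above.
OCT_2021_MIN = -10000000009999  # most negative id in the range
OCT_2021_MAX = -10000000000000  # least negative id in the range


def in_october_2021_range(beneficiary_id: int) -> bool:
    """Return True when the id falls inside the October 2021 range."""
    return OCT_2021_MIN <= beneficiary_id <= OCT_2021_MAX


# Number of ids in the inclusive range.
range_size = OCT_2021_MAX - OCT_2021_MIN + 1

# Whether the most negative id would fit in a signed 32-bit integer.
fits_in_int32 = -(2**31) <= OCT_2021_MIN
```

The inclusive range holds 10,000 ids, which matches the 10,000 beneficiaries described for that release.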

Release History

| Time | Description | Details |
| --- | --- | --- |
| Spring, 2022 | Additional 10,000 enhanced synthetic beneficiaries | This updated synthetic data set includes many enhancements and fixes described in the categories below.

Population

Now covers a nationwide population, rather than the single state covered previously. Populations in different states reflect the demographics of those states, based on US Census data, as well as geographic price adjustments (based on the 2019 CMS Geographic Variation Public Use File).

Beneficiary

More accurate Medicare eligibility determinations based on End Stage Renal Disease. Additional beneficiary eligibility fields are now populated. Approximately 20% of beneficiaries change their current Part D plan each January, to simulate real rates of plan changes. Eligible beneficiaries now also use Indian Health Service (IHS) Health Centers and Health Stations.

Claims

Annual wellness encounter claims are now moved from the outpatient file to the carrier file, based on provider.

Outpatient and Inpatient claims can now include encounters with coded "reasons for visit", even if no diagnoses or procedures are recorded for that encounter. Diagnoses that were "present on admission" are now properly identified, and external diagnosis codes are populated.

Durable Medical Equipment (DME) claims now include supplies, not just implantable medical devices.

Clinical

This synthetic data set adds the following conditions and treatments: COVID-19 vaccinations, updates to Sepsis, Spina Bifida, Cerebral Palsy, and the prescribing of opioids for chronic pain and Opioid Use Disorder (OUD) treatment.

Data Field Additions

This data set includes 143 additional claim fields. The count per claim file is below. Note that some fields repeat across claim types, so the per-file counts add up to more than 143.

  1. Beneficiary: 41
  2. Beneficiary History: 3
  3. Inpatient: 71
  4. Outpatient: 50
  5. Carrier: 26
  6. PDE: 0
  7. DME: 25
  8. HHA: 47
  9. Hospice: 48
  10. SNF: 57
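The arithmetic behind the list above can be checked quickly. This sketch assumes that the stated total of 143 counts each field once, so the gap between it and the per-file sum reflects fields that appear in more than one claim file; the variable names are illustrative.

```python
# Per-file counts copied from the list above.
additional_fields_per_file = {
    "Beneficiary": 41,
    "Beneficiary History": 3,
    "Inpatient": 71,
    "Outpatient": 50,
    "Carrier": 26,
    "PDE": 0,
    "DME": 25,
    "HHA": 47,
    "Hospice": 48,
    "SNF": 57,
}

STATED_TOTAL = 143  # total given in the text above

# Sum of the per-file counts, which counts a shared field once per file.
total_listed = sum(additional_fields_per_file.values())

# Appearances beyond the first, under the assumption stated in the lead-in.
repeated_appearances = total_listed - STATED_TOTAL
```

The per-file counts sum to 368, which leaves 225 appearances attributable to fields shared across files under that assumption.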

Bug Fixes

This release fixes a tab-related formatting error, correctly handles "future" death dates, corrects the data types for Claim ID and Beneficiary ID (from Integer to Long), and now correctly groups claim line items together.
|
| October, 2021 | 10,000 Enhanced synthetic beneficiaries | Adding 10,000 more synthetic beneficiaries to the sandbox and production environments with more realistic and robust data in response to user requests. For example, this set will contain:

  • More recent dates
  • More realistic values for NPIs and ZIP Codes (e.g. not all "99999", though NPIs will not tie to real organizations or providers)
  • More realistic patient information (e.g. dates of birth that are in line with most Medicare beneficiaries)
  • More realistic clinical scenarios (e.g. prescriptions and procedures that make sense with given diagnoses)
  • All of the same Explanation of Benefits (EOB) profiles that production data can include, which are:
    • Carrier
    • Durable Medical Equipment (DME)
    • Home Health Agency (HHA)
    • Hospice
    • Inpatient
    • Outpatient
    • Part D Events (PDE)
    • Skilled Nursing Facility (SNF)
|
| Early 2021 | Added outpatient claims | Enhanced the initial set of beneficiaries to include outpatient claims, bringing the full list of EOB profiles to:
  • Carrier
  • Inpatient
  • Outpatient
  • Part D Events (PDE)
|
| 2017 | Initial 30,000 Synthetic Beneficiaries | 30,000 synthetic beneficiaries and about 1,000,000 synthetic claims that covered the following Explanation of Benefits (EOB) profiles:
  • Carrier
  • Inpatient
  • Part D Events (PDE)
|

For further questions on this synthetic data, please see the FAQ.
