PatientGenerator

Lifecycle: experimental

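PatientGenerator helps create synthetic test datasets for the OMOP Common Data Model (CDM) using two complementary approaches: LLM-driven generation from natural-language prompts with patientChat, and interactive review and editing with patientDesigner().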

The package also supports Hecate-powered concept lookups, which help ensure that generated records use valid OMOP concept IDs.

Installation

# install.packages("remotes")
remotes::install_github("mi-erasmusmc/PatientGenerator")

Workflow Overview

  1. Generate an initial synthetic cohort using patientChat.
  2. Save JSON test sets to the local filesystem.
  3. Refine patients using patientDesigner().

Synthetic Patient Generation with patientChat

Set an OPENAI_API_KEY environment variable (e.g., via usethis::edit_r_environ()) to enable LLM access.

Available models can be listed using PatientGenerator::availableModels().
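
It can help to confirm that the key is visible to the current R session before creating a patientChat instance. The following is a minimal sketch in base R; the helper name has_openai_key is hypothetical and not part of PatientGenerator.

```r
# Hypothetical helper (not part of PatientGenerator): reports whether the
# OPENAI_API_KEY environment variable is visible to the current R session.
has_openai_key <- function() {
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

if (!has_openai_key()) {
  message("OPENAI_API_KEY is not set; add it to .Renviron and restart R.")
}
```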

library(PatientGenerator)

patientGenerator <- patientChat$new(
  model = "gpt-5.4",
  echo = "none"
)

Generating Patients via Natural Language Prompts

Provide detailed prompts, including specific concept sets, for optimal results.

patientGenerator$prompt(
  "Population (person table):
     - 10 adult patients
     - 5 female
     - 5 male
  
   Observation Period:
     - Start date between date of birth and 2025-12-31
  
   Condition Occurrence:
     - All patients must have Diabetes (condition_concept_id: 201826)
     - Start date between 2015-01-01 and 2020-12-31
  
   Drug Exposure:
     - All patients must have Semaglutide (drug_concept_id: 19079450)
     - Exposure within 30 days post-index date
  
   Measurement:
     - All patients must have Fasting glucose (measurement_concept_id: 3018251)
  
   Procedure Occurrence:
     - 50% of patients must have Amputation of toe (procedure_concept_id: 4159766)
  
   Output Requirements:
     - Populate only the tables specified in this prompt"
)
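
When the same prompt structure is reused with different concept sets, the prompt text can be assembled programmatically instead of typed by hand. The sketch below uses only base R; concept_lines and build_prompt are hypothetical helpers, not functions exported by PatientGenerator, and the section layout simply mirrors the prompt shown above.

```r
# Hypothetical helper (not part of PatientGenerator): turns a named vector of
# concept IDs into one requirement line per concept.
concept_lines <- function(concepts, id_field) {
  sprintf("     - All patients must have %s (%s: %d)",
          names(concepts), id_field, as.integer(concepts))
}

# Hypothetical helper: assembles a prompt with the same section layout as the
# hand-written example.
build_prompt <- function(n_patients, conditions, drugs) {
  paste(
    c(
      "Population (person table):",
      sprintf("     - %d adult patients", as.integer(n_patients)),
      "",
      "Condition Occurrence:",
      concept_lines(conditions, "condition_concept_id"),
      "",
      "Drug Exposure:",
      concept_lines(drugs, "drug_concept_id"),
      "",
      "Output Requirements:",
      "     - Populate only the tables specified in this prompt"
    ),
    collapse = "\n"
  )
}

prompt_text <- build_prompt(
  n_patients = 10,
  conditions = c("Diabetes" = 201826),
  drugs = c("Semaglutide" = 19079450)
)
cat(prompt_text)

# The resulting string can then be passed on, e.g.:
# patientGenerator$prompt(prompt_text)
```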

Integration with testthat

Save the generated dataset as a JSON file, then use TestGenerator::patientsCDM() to create a CDM reference from it.

patientGenerator$save(name = "diabetes-patients")

cdm <- TestGenerator::patientsCDM(
  testName = "diabetes-patients",
  cdmVersion = "5.4"
)

# Collect the person table into memory and show the first five rows
cdm$person |>
  dplyr::collect() |>
  head(5)
#>    person_id gender_concept_id year_of_birth person_source_value
#>        <int>             <int>         <int>              <char>
#> 1:         1              8532          1965              SYN001
#> 2:         2              8532          1972              SYN002
#> 3:         3              8532          1958              SYN003
#> 4:         4              8532          1981              SYN004
#> 5:         5              8532          1949              SYN005
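
Once the CDM reference exists, the collected tables can be checked in ordinary testthat tests. The sketch below is self-contained: it uses a small stand-in data frame that copies the rows printed above, and the helper check_person_table is hypothetical. In a real test, person would come from cdm$person |> dplyr::collect(). The IDs 8507 and 8532 are the standard OMOP gender concepts for male and female.

```r
# Stand-in for `cdm$person |> dplyr::collect()`, copied from the output above.
person <- data.frame(
  person_id = 1:5,
  gender_concept_id = rep(8532L, 5),
  year_of_birth = c(1965L, 1972L, 1958L, 1981L, 1949L),
  person_source_value = sprintf("SYN%03d", 1:5)
)

# Hypothetical helper: returns TRUE when person IDs are unique and every
# gender concept is one of the standard OMOP gender concepts.
check_person_table <- function(person) {
  valid_gender <- c(8507L, 8532L)  # 8507 = male, 8532 = female
  anyDuplicated(person$person_id) == 0 &&
    all(person$gender_concept_id %in% valid_gender)
}

# Run the check as a testthat test when testthat is installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("person table is well formed", {
    testthat::expect_true(check_person_table(person))
    testthat::expect_equal(nrow(person), 5L)
  })
}
```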

Iterative Refinement

The LLM can be instructed to modify the current test set within the same patientChat instance.

patientGenerator$prompt("Remove all male patients")

# First five rows of the person table after the change:
#>    person_id gender_concept_id year_of_birth person_source_value
#>        <int>             <int>         <int>              <char>
#> 1:         1              8532          1965              SYN001
#> 2:         2              8532          1972              SYN002
#> 3:         3              8532          1958              SYN003
#> 4:         4              8532          1981              SYN004
#> 5:         5              8532          1949              SYN005

Visual Review and Editing with patientDesigner()

Launch the interactive editor to review and refine datasets:

PatientGenerator::patientDesigner()

The interface supports:

Concept Search with Hecate

patientDesigner integrates a concept search module powered by hecateSearch(). This allows users to search for and insert valid OMOP concept IDs directly into the CDM tables.

Configure Hecate globally via environment variables:

Sys.setenv(
  HECATE_BASE_URL = "https://your-hecate-server/api",
  HECATE_API_KEY = "your-api-key"
)

Or via package options:

options(PatientGenerator.hecate = list(
  base_url = "https://your-hecate-server/api",
  timeout_ms = 15000,
  api_key = "your-api-key"
))
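
To see which Hecate settings the current session would pick up, both sources can be inspected together. The sketch below is only an illustration in base R: the helper hecate_settings and the operator %or% are hypothetical, and the precedence shown (package options before environment variables) is an assumption rather than documented package behavior.

```r
# Hypothetical operator: fall back to `y` when `x` is NULL or an empty string.
`%or%` <- function(x, y) if (is.null(x) || !nzchar(x)) y else x

# Hypothetical helper (not part of PatientGenerator). Assumption: values in
# the PatientGenerator.hecate option win over the environment variables.
hecate_settings <- function() {
  opts <- getOption("PatientGenerator.hecate", default = list())
  list(
    base_url = opts$base_url %or% Sys.getenv("HECATE_BASE_URL"),
    # Report only whether a key is present, never the key itself.
    api_key_set = nzchar(opts$api_key %or% Sys.getenv("HECATE_API_KEY"))
  )
}

str(hecate_settings())
```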

Further Documentation