This repository contains a small Python script that converts an XLSX file in “dataquieR format” (metadata used to control automated data quality checks with the R package dataquieR) into one or multiple CDISC ODM XML files.
The goal is to transform the metadata table (variables, labels, value labels/codelists, missing lists, …) into a valid ODM structure (StudyEventDef, FormDef, ItemGroupDef, ItemDef, CodeList, …).
- The first sheet is treated as the main metadata sheet (variables).
- All remaining sheets are treated as lookup sheets (e.g. missing list tables).
The output ODM has the following structure:
Study(OID = file basename)MetaDataVersion(OID =MDV.1)- One
StudyEventDefper generated “group” - Inside each study event:
- one
FormDefper form - one
ItemGroupDefper form (1:1 mapping) - multiple
ItemDef(one per variable/row) CodeListelements for value labels (including merged missing lists)
- one
Rows are grouped into a 2-level structure:
-
StudyEvent key
- by default derived from
HIERARCHY - if column
DCEis present and not empty, it overrides the StudyEvent key
- by default derived from
-
Form key
- by default derived from
HIERARCHY - if column
STUDY_SEGMENTis present and not empty, it overrides the Form key
- by default derived from
So conceptually:
DCE→ StudyEvent (if present)STUDY_SEGMENT→ Form (if present)- otherwise: derived from
HIERARCHY
VALUE_LABELS(English) andVALUE_LABELS_DE(German) are parsed into dictionaries.- Identical codelists are deduplicated: they are only written once and referenced from all corresponding variables.
- If a row contains
MISSING_LIST_TABLE, the script attaches the missing list codes to the variable’s final codelist. - Missing list tables are taken from the corresponding additional sheet (same sheet name).
- Missing codes are appended and marked with an alias:
Alias Context="ORIGIN_CODELIST" Name="<sheet>"
To avoid huge ODM files, the script can split output automatically:
-
Default behavior (without
--force_single_odm):- if any generated output group exceeds ~5700 variables, it will be split further
- splitting logic uses
HIERARCHY-based repartitioning/chunking
-
With
--force_single_odm:- everything is written into a single ODM output (even if very large)
The script expects a “dataquieR-like” metadata table and typically uses these columns:
VARNAMESorVAR_NAMES(variable name)HIERARCHYSTUDY_SEGMENT(optional, affects forms)DCE(optional, affects study events)LABEL,LABEL_DENOTE,NOTE_DEDATA_TYPEVALUE_LABELS,VALUE_LABELS_DEMISSING_LIST_TABLE(optional)
If MISSING_LIST_TABLE references a sheet name, that sheet should usually contain:
CODE_VALUECODE_LABEL- and optionally additional columns (they will be written as
<Alias Context="..." Name="..."/>)
python3 -m venv .venv source .venv/bin/activate python -m pip install --upgrade pip pip install -r requirements.txt
py -m venv .venv ..venv\Scripts\Activate.ps1 python -m pip install --upgrade pip pip install -r requirements.txt
python3 dataquieR2ODM.py /path/to/your/file.xlsx
python dataquieR2ODM.py "C:\path\to\your\file.xlsx"
python3 dataquieR2ODM.py /path/to/your/file.xlsx --force_single_odm
Ist written to ../output/ relative to the script location. File naming: Study__.xml