
IMI-HD/dataquieR

dataquieR → ODM (XLSX Converter)

This repository contains a small Python script that converts an XLSX file in “dataquieR format” (the metadata that controls automated data quality checks in the R package dataquieR) into one or more CDISC ODM XML files.

The goal is to transform the metadata table (variables, labels, value labels/codelists, missing lists, …) into a valid ODM structure (StudyEventDef, FormDef, ItemGroupDef, ItemDef, CodeList, …).


What the script does

1) Reads the XLSX

  • The first sheet is treated as the main metadata sheet (variables).
  • All remaining sheets are treated as lookup sheets (e.g. missing list tables).
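The first-sheet rule can be sketched as follows. This is a minimal illustration, not the script's code: `split_sheets` is a hypothetical helper, and it works on any ordered mapping of sheet name to table. With pandas, `pandas.read_excel(path, sheet_name=None)` returns such a mapping, but whether the script uses pandas is not stated here.

```python
from typing import Any, Dict, Mapping, Tuple


def split_sheets(sheets: Mapping[str, Any]) -> Tuple[Any, Dict[str, Any]]:
    """Return (main_sheet, lookup_sheets) from an ordered name -> table mapping."""
    names = list(sheets)
    if not names:
        raise ValueError("workbook contains no sheets")
    # First sheet: main metadata (one row per variable).
    main_sheet = sheets[names[0]]
    # Every other sheet: lookup table, kept under its sheet name.
    lookup_sheets = {name: sheets[name] for name in names[1:]}
    return main_sheet, lookup_sheets


# Hypothetical workbook content, already loaded into plain Python structures.
workbook = {
    "item_level": [{"VAR_NAMES": "v1"}],
    "missing_table": [{"CODE_VALUE": 99, "CODE_LABEL": "refused"}],
}
main_sheet, lookup_sheets = split_sheets(workbook)
```

Keeping the lookup sheets keyed by name makes it straightforward to resolve a MISSING_LIST_TABLE value later.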

2) Builds the ODM structure

The output ODM has the following structure:

  • Study (OID = file basename)
  • MetaDataVersion (OID = MDV.1)
  • One StudyEventDef per generated “group”
  • Inside each study event:
    • one FormDef per form
    • one ItemGroupDef per form (1:1 mapping)
    • multiple ItemDef (one per variable/row)
    • CodeList elements for value labels (including merged missing lists)
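As a rough sketch of this layout, the following builds a bare skeleton with the standard library. The function name, the OID prefixes (`SE.`, `F.`, `IG.`, `I.`), the `Name` of the MetaDataVersion and the fixed `DataType="text"` are assumptions for illustration. In ODM the definitions are siblings under MetaDataVersion and are linked by `FormRef`, `ItemGroupRef` and `ItemRef`; the sketch omits the namespace, GlobalVariables and CodeList elements, so it is not schema-valid on its own.

```python
import xml.etree.ElementTree as ET


def build_skeleton(study_oid, event, form, variables):
    """Build a minimal ODM tree for one study event with one form."""
    odm = ET.Element("ODM")
    study = ET.SubElement(odm, "Study", OID=study_oid)
    mdv = ET.SubElement(study, "MetaDataVersion", OID="MDV.1", Name="MDV.1")

    # Study event -> form reference
    sed = ET.SubElement(mdv, "StudyEventDef", OID=f"SE.{event}", Name=event,
                        Repeating="No", Type="Common")
    ET.SubElement(sed, "FormRef", FormOID=f"F.{form}", Mandatory="No")

    # Form -> item group reference (1:1 mapping)
    form_def = ET.SubElement(mdv, "FormDef", OID=f"F.{form}", Name=form,
                             Repeating="No")
    ET.SubElement(form_def, "ItemGroupRef", ItemGroupOID=f"IG.{form}",
                  Mandatory="No")

    # Item group -> one item reference per variable
    group_def = ET.SubElement(mdv, "ItemGroupDef", OID=f"IG.{form}", Name=form,
                              Repeating="No")
    for var in variables:
        ET.SubElement(group_def, "ItemRef", ItemOID=f"I.{var}", Mandatory="No")

    # One ItemDef per variable (metadata row)
    for var in variables:
        ET.SubElement(mdv, "ItemDef", OID=f"I.{var}", Name=var, DataType="text")
    return odm


odm = build_skeleton("demo", "baseline", "lab", ["age", "sex"])
xml_text = ET.tostring(odm, encoding="unicode")
```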

3) Grouping (StudyEvent / FormDef)

Rows are grouped into a 2-level structure:

  • StudyEvent key

    • by default derived from HIERARCHY
    • if column DCE is present and not empty, it overrides the StudyEvent key
  • Form key

    • by default derived from HIERARCHY
    • if column STUDY_SEGMENT is present and not empty, it overrides the Form key

So conceptually:

  • DCE → StudyEvent (if present)
  • STUDY_SEGMENT → Form (if present)
  • otherwise: derived from HIERARCHY
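The override rule can be sketched like this. How the script derives keys from HIERARCHY is not described here, so the sketch assumes, purely for illustration, that HIERARCHY is a `|`-separated path whose first segment gives the study event and whose second gives the form. `group_keys` and the `DEFAULT` fallback are hypothetical.

```python
def _filled(value):
    """True if a cell holds a usable value (not None, NaN, or blank)."""
    if value is None or value != value:  # NaN is the only value unequal to itself
        return False
    return str(value).strip() != ""


def group_keys(row, sep="|"):
    """Return (study_event_key, form_key) for one metadata row."""
    raw = row.get("HIERARCHY")
    parts = [p.strip() for p in str(raw).split(sep)] if _filled(raw) else []
    parts = [p for p in parts if p]

    # Assumed HIERARCHY-based defaults (see lead-in).
    event_key = parts[0] if parts else "DEFAULT"
    form_key = parts[1] if len(parts) > 1 else event_key

    # Non-empty DCE / STUDY_SEGMENT override the defaults.
    if _filled(row.get("DCE")):
        event_key = str(row["DCE"]).strip()
    if _filled(row.get("STUDY_SEGMENT")):
        form_key = str(row["STUDY_SEGMENT"]).strip()
    return event_key, form_key
```

For example, a row with `HIERARCHY="A|B"` and `DCE="visit1"` would land in study event `visit1` and form `B` under these assumptions.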

4) Codelists / VALUE_LABELS

  • VALUE_LABELS (English) and VALUE_LABELS_DE (German) are parsed into dictionaries.
  • Identical codelists are deduplicated: they are only written once and referenced from all corresponding variables.
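Parsing and deduplication might look roughly like the sketch below. It assumes value labels are written as `code = label` pairs separated by `|` (for example `0 = no | 1 = yes`); the function names and the `CL.<n>` OID scheme are made up for illustration and are not taken from the script.

```python
def parse_value_labels(text, item_sep="|", kv_sep="="):
    """Parse 'code = label | code = label' into an ordered dict of strings."""
    labels = {}
    for chunk in (text or "").split(item_sep):
        if kv_sep not in chunk:
            continue  # skip empty or malformed fragments
        code, label = chunk.split(kv_sep, 1)
        labels[code.strip()] = label.strip()
    return labels


def assign_codelists(rows):
    """Give each distinct codelist one OID and map variables to it."""
    registry = {}    # codelist content -> OID (written once)
    var_to_oid = {}  # variable name -> OID (reference)
    for row in rows:
        en = parse_value_labels(row.get("VALUE_LABELS"))
        de = parse_value_labels(row.get("VALUE_LABELS_DE"))
        if not en and not de:
            continue
        key = (tuple(en.items()), tuple(de.items()))
        if key not in registry:
            registry[key] = f"CL.{len(registry) + 1}"
        var_to_oid[row["VAR_NAMES"]] = registry[key]
    return registry, var_to_oid


rows = [
    {"VAR_NAMES": "smoker", "VALUE_LABELS": "0 = no | 1 = yes"},
    {"VAR_NAMES": "diabetes", "VALUE_LABELS": "0 = no | 1 = yes"},
    {"VAR_NAMES": "sex", "VALUE_LABELS": "1 = male | 2 = female"},
]
registry, var_to_oid = assign_codelists(rows)
```

Using the parsed content as the dictionary key means two variables share a codelist only when both the English and the German labels match exactly.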

5) Missing lists (MISSING_LIST_TABLE)

  • If a row has a non-empty MISSING_LIST_TABLE value, the script attaches the missing list codes to the variable’s final codelist.
  • Missing list tables are taken from the corresponding additional sheet (same sheet name).
  • Missing codes are appended and marked with an alias:
    • Alias Context="ORIGIN_CODELIST" Name="<sheet>"
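A minimal sketch of appending missing codes to an existing CodeList element follows. `append_missing_codes` is a hypothetical helper, and skipping codes that are already present is an assumption; the script may resolve such collisions differently.

```python
import xml.etree.ElementTree as ET


def append_missing_codes(codelist, sheet_name, sheet_rows):
    """Append missing-list codes to a CodeList and tag their origin."""
    existing = {item.get("CodedValue") for item in codelist.findall("CodeListItem")}
    for row in sheet_rows:
        code = str(row["CODE_VALUE"])
        if code in existing:
            continue  # assumption: do not duplicate an existing coded value
        item = ET.SubElement(codelist, "CodeListItem", CodedValue=code)
        decode = ET.SubElement(item, "Decode")
        ET.SubElement(decode, "TranslatedText").text = str(row["CODE_LABEL"])
        # Mark which lookup sheet this code came from.
        ET.SubElement(item, "Alias", Context="ORIGIN_CODELIST", Name=sheet_name)
        existing.add(code)
    return codelist


# Hypothetical codelist with one regular value label.
codelist = ET.Element("CodeList", OID="CL.1", Name="CL.1", DataType="text")
regular = ET.SubElement(codelist, "CodeListItem", CodedValue="1")
ET.SubElement(ET.SubElement(regular, "Decode"), "TranslatedText").text = "yes"

append_missing_codes(
    codelist,
    "missing_table",
    [{"CODE_VALUE": 99, "CODE_LABEL": "refused"},
     {"CODE_VALUE": 98, "CODE_LABEL": "unknown"}],
)
```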

6) Splitting into multiple ODM files

To avoid huge ODM files, the script can split output automatically:

  • Default behavior (without --force_single_odm):

    • if any generated output group exceeds ~5700 variables, it will be split further
    • splitting logic uses HIERARCHY-based repartitioning/chunking
  • With --force_single_odm:

    • everything is written into a single ODM output (even if very large)
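The two modes can be illustrated with the sketch below. It assumes each generated group becomes one output by default, and it uses plain fixed-size chunking as a stand-in for the HIERARCHY-based repartitioning, which is not specified here. `plan_outputs` and `MAX_VARS` are hypothetical names.

```python
MAX_VARS = 5700  # approximate threshold from the description above


def plan_outputs(groups, force_single=False, max_vars=MAX_VARS):
    """Return a list of variable lists, one per ODM file to be written."""
    if force_single:
        # Everything goes into one output, regardless of size.
        return [[var for variables in groups.values() for var in variables]]

    outputs = []
    for variables in groups.values():
        # Groups at or below the threshold yield exactly one chunk.
        for start in range(0, len(variables), max_vars):
            outputs.append(variables[start:start + max_vars])
    return outputs


groups = {"A": list(range(12000)), "B": list(range(10))}
split_plan = plan_outputs(groups)
single_plan = plan_outputs(groups, force_single=True)
```

With these inputs, group `A` is cut into three chunks and group `B` stays whole, while `force_single=True` produces one list containing all variables.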

Input expectations (XLSX)

Required / commonly used columns

The script expects a “dataquieR-like” metadata table and typically uses these columns:

  • VARNAMES or VAR_NAMES (variable name)
  • HIERARCHY
  • STUDY_SEGMENT (optional, affects forms)
  • DCE (optional, affects study events)
  • LABEL, LABEL_DE
  • NOTE, NOTE_DE
  • DATA_TYPE
  • VALUE_LABELS, VALUE_LABELS_DE
  • MISSING_LIST_TABLE (optional)
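Because the variable name column can appear under two spellings, a small lookup such as the following can resolve it up front. The helper name and the preference for `VAR_NAMES` when both columns exist are assumptions, not documented behavior.

```python
def variable_name_column(columns):
    """Return the column holding variable names, or raise if neither exists."""
    for candidate in ("VAR_NAMES", "VARNAMES"):
        if candidate in columns:
            return candidate
    raise KeyError("expected a VAR_NAMES or VARNAMES column")


name_column = variable_name_column(["VARNAMES", "HIERARCHY", "LABEL"])
```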

Missing list tables (other sheets)

If MISSING_LIST_TABLE references a sheet name, that sheet should usually contain:

  • CODE_VALUE
  • CODE_LABEL
  • and optionally additional columns (they will be written as <Alias Context="..." Name="..."/>)
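One plausible reading of the additional-column rule is sketched below: the column name becomes the alias Context and the cell value becomes the alias Name. That mapping is an assumption, as is the helper name `extra_aliases`; empty cells are skipped.

```python
RESERVED = ("CODE_VALUE", "CODE_LABEL")


def extra_aliases(row):
    """Return (Context, Name) pairs for the non-reserved, non-empty cells of a row."""
    pairs = []
    for column, value in row.items():
        if column in RESERVED:
            continue
        if value is None or str(value).strip() == "":
            continue
        pairs.append((column, str(value).strip()))
    return pairs


# Hypothetical lookup-sheet row with two extra columns, one of them empty.
aliases = extra_aliases(
    {"CODE_VALUE": 99, "CODE_LABEL": "refused", "CODE_CLASS": "MISSING", "COMMENT": ""}
)
```

Each pair could then be written as an `Alias` child of the corresponding `CodeListItem`.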

Installation

Linux/macOS

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt

Windows

py -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
pip install -r requirements.txt


Usage

Linux/macOS

python3 dataquieR2ODM.py /path/to/your/file.xlsx

Windows

python dataquieR2ODM.py "C:\path\to\your\file.xlsx"

Force a single ODM

python3 dataquieR2ODM.py /path/to/your/file.xlsx --force_single_odm

Output

Output is written to ../output/ relative to the script location. File naming: Study__.xml
