Implementation of the KarmaLego time-interval pattern mining pipeline with end-to-end data ingestion, pattern discovery, and patient-level application.
Based on the paper:
Moskovitch, Robert, and Yuval Shahar. "Temporal Patterns Discovery from Multivariate Time Series via Temporal Abstraction and Time-Interval Mining."
(See original for theoretical grounding.)
This implementation is designed to be used as a temporal analysis and feature extraction tool in my thesis.
KarmaLego mines frequent time-interval relation patterns (TIRPs) by combining two key stages:
KarmaLego first scans each patient's timeline to identify the pairwise Allen relations (e.g., before, overlaps, meets) between intervals.
Each cell in the resulting pairwise relation matrix holds the temporal relation between two intervals (e.g., A¹ o B¹ = A¹ overlaps B¹). These relations become the building blocks of complex temporal patterns.
Patterns are built incrementally by traversing a tree of symbol and relation extensions, starting from frequent 1-intervals (K=1) and growing to longer TIRPs (K=2,3,...). Only frequent patterns are expanded (Apriori pruning), and relation consistency is ensured using transitivity rules.
Practical flow in this implementation. The pipeline enumerates singletons and all frequent pairs (k=2) in the Karma stage. The Lego stage then skips extending singletons and starts from k≥2, extending patterns to length k+1. This avoids regenerating pairs (and their support checks) a second time. I also apply CSAC (see below), which anchors each extension on the actual parent embeddings inside each entity, ensuring only consistent child embeddings are considered.
This structure enables efficient discovery of high-order, temporally consistent patterns without exhaustively searching all combinations.
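To make the Karma stage concrete, here is a minimal, illustrative sketch of classifying the Allen relation between two intervals under an `epsilon` tolerance and a `max_distance` cutoff. The function name and relation labels are illustrative only; the actual relation definitions and transition tables live in `core/relation_table.py`.

```python
def allen_relation(a, b, epsilon=0, max_distance=None):
    """Coarse Allen relation of interval a relative to interval b.

    Intervals are (start, end) pairs in integer time units, with a starting
    no later than b (as in the lexicographically sorted per-entity views).
    Illustrative sketch only, not the library's exact relation table.
    """
    a_start, a_end = a
    b_start, b_end = b
    gap = b_start - a_end
    if max_distance is not None and gap > max_distance:
        return None                          # too far apart: the pair is ignored
    if gap > epsilon:
        return "before"                      # a ends clearly before b starts
    if abs(gap) <= epsilon:
        return "meets"                       # a ends (roughly) where b starts
    # From here on, b starts strictly inside a.
    same_start = abs(b_start - a_start) <= epsilon
    same_end = abs(b_end - a_end) <= epsilon
    if same_start and same_end:
        return "equals"
    if same_start:
        return "starts"                      # same start, different ends
    if same_end:
        return "finished-by"                 # same end, b starts later
    if b_end < a_end - epsilon:
        return "contains"                    # b lies strictly inside a
    return "overlaps"                        # partial overlap

# e.g., allen_relation((0, 10), (12, 20), epsilon=1) -> "before"
```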
This repository provides:
- A clean, efficient implementation of KarmaLego (Karma + Lego) for discovering frequent Time Interval Relation Patterns (TIRPs).
- Support for pandas / Dask-backed ingestion of clinical-style interval data.
- Symbol encoding, pattern mining, and per-patient pattern application (apply modes: counts and cohort-normalized features).
- Utilities for managing temporal relations, pattern equality/deduplication, and tree-based extension.
The design goals are: clarity, performance, testability, and reproducibility.
This implementation incorporates several core performance techniques from the KarmaLego framework:
- Apriori pruning: Patterns are extended only if all their (k−1)-subpatterns are frequent, cutting unpromising branches early.
- Temporal relation transitivity (+ memoization): Allen relation composition reduces the relation search at extension time; the `compose_relation()` function is memoized to eliminate repeated small-table lookups.
- SAC (Subset of Active Candidates): Support checks for a child TIRP are restricted to the entities that supported its parent, avoiding scans of unrelated entities at deeper levels.
- CSAC (Consistent SAC, embedding-level): Beyond entity filtering, we maintain the exact parent embeddings (index tuples) per entity and only accept child embeddings that extend those specific tuples (see the sketch after this list).
  - Accuracy: identical to full search; CSAC is pruning, not approximation.
  - Speed: large savings in dense timelines; no wasted checks on impossible extensions.
- Skip duplicate pair generation: Pairs (k=2) are produced once in Karma and not re-generated in Lego. This eliminates ~×2 duplication for pairs and can reduce Lego runtime dramatically.
- Precomputed per-entity views (reused everywhere): Lexicographic sorting and symbol→positions maps are built once and reused in support checks and extension, avoiding repeat work.
- Integer time arithmetic: Timestamps are held as `int64`; relation checks use pure integer math. If source data were datetimes, they are converted to ns; if they were numeric, they remain in your unit.
These optimizations ensure that KarmaLego runs efficiently on large temporal datasets and scales well as pattern complexity increases.
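The CSAC bookkeeping can be pictured with the sketch below (names such as `extend_embeddings` and `entity_views` are illustrative, not the actual API): child embeddings are produced only by extending the exact parent index tuples recorded per entity. The real extension code additionally checks the temporal relation of the new interval against every interval in the parent tuple (with the transitivity table used for pruning); that part is omitted here.

```python
from collections import defaultdict

def extend_embeddings(parent_embeddings_map, entity_views, new_symbol):
    """Extend each parent embedding with a later occurrence of `new_symbol`.

    parent_embeddings_map: {entity_idx: [index_tuple, ...]} for the parent TIRP
    entity_views:          {entity_idx: {symbol: [interval positions...]}} precomputed once
    Returns the child's embeddings map; entities with no valid extension drop out,
    which is exactly the SAC/CSAC entity filtering described above.
    """
    child_embeddings_map = defaultdict(list)
    for entity_idx, parent_tuples in parent_embeddings_map.items():
        positions = entity_views[entity_idx].get(new_symbol, [])
        for tup in parent_tuples:
            last = tup[-1]
            for pos in positions:
                if pos > last:                      # only extend past the parent's last interval
                    child_embeddings_map[entity_idx].append(tup + (pos,))
    return dict(child_embeddings_map)
```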
Performance Notes:
- The core KarmaLego algorithm operates on in-memory Python lists (`entity_list`) and is not accelerated by Dask.
- The current Lego phase runs sequentially. Attempts to parallelize it (e.g., with Dask or multiprocessing) introduced overhead that slowed performance.
- Dask can still be useful during ingestion and preprocessing (e.g., using `dd.read_csv()` for large CSVs).
- Fine-grained parallelism is not recommended: per-node checks are fast, so task-management overhead dominates. If support checks become significantly more expensive, patient-level parallelism within a single TIRP may become useful.
- Better scaling can be achieved by:
  - Splitting the dataset into concept clusters or patient cohorts and running in parallel across jobs.
  - Using `min_ver_supp` and `max_k` to control pattern explosion.
  - Persisting symbol maps to ensure consistent encoding across runs.
- No k=1→k=2 in Lego: pairs are already created in Karma; Lego starts from k≥2. This removes structural duplicates and their support checks.
- CSAC memory hygiene: parent embedding state is released after each support check; leaf nodes release their own embedding maps, keeping peak RAM lower on large runs.
KarmaLego/
├── core/
│ ├── __init__.py # package marker
│ ├── karmalego.py # algorithmic core: TreeNode, TIRP, KarmaLego/Karma/Lego pipeline
│ ├── io.py # ingestion / preprocessing / mapping / decoding helpers
│ ├── relation_table.py # temporal relation transition tables and definitions
│ └── utils.py # low-level helpers
├── data/
│ ├── synthetic_diabetes_temporal_data.csv # example input dataset (output from the Mediator)
│ ├── symbol_map.json # saved symbol encoding (concept:value -> int)
│ └── inverse_symbol_map.json # reverse mapping for human-readable decoding
├── unittests/
│ ├── test_treenode.py # TreeNode behavior
│ ├── test_tirp.py # TIRP equality, support, relation semantics
│ └── test_karmalego.py # core pipeline / small synthetic pattern discovery
├── main.py # example end-to-end driver / demo script
├── pyproject.toml # editable installation manifest
├── pytest.ini # pytest configuration
├── requirements.txt # pinned dependencies (pandas, dask, tqdm, pytest, numpy, etc.)
├── README.md # human-readable version of this document
└── .gitignore # ignored files for git
Recommended Python version: 3.8+
Use a virtual environment:
```
python -m venv .venv
# Windows:
.\.venv\Scripts\activate
pip install -e .
pip install -r requirements.txt pytest
```

The `-e .` flag makes the local package importable as `core.karmalego` during development.
Input must be a table (CSV or DataFrame) with these columns:
- `PatientId`: identifier per entity (patient)
- `ConceptName`: event or concept (e.g., lab test name)
- `StartDateTime`: interval start (e.g., `"08/01/2023 00:00"` in `DD/MM/YYYY HH:MM`)
- `EndDateTime`: interval end (same format)
- `Value`: discrete value or category (e.g., `'High'`, `'Normal'`)
You can freely adapt the input and output shapes and formats in the `io.py` module, as long as you preserve this general structure.
Example row:
PatientId,ConceptName,StartDateTime,EndDateTime,Value
p1,HbA1c,08/01/2023 0:00,08/01/2023 0:15,High
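A minimal pandas sketch of loading this table and parsing the day-first timestamps; the `read_interval_csv` wrapper is illustrative, and the real helpers live in `core/io.py`.

```python
import pandas as pd

REQUIRED_COLUMNS = ["PatientId", "ConceptName", "StartDateTime", "EndDateTime", "Value"]

def read_interval_csv(path):
    """Load the interval table and parse DD/MM/YYYY HH:MM timestamps (illustrative helper)."""
    df = pd.read_csv(path)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    for col in ("StartDateTime", "EndDateTime"):
        df[col] = pd.to_datetime(df[col], dayfirst=True)  # day-first format as above
    return df

df = read_interval_csv("data/synthetic_diabetes_temporal_data.csv")
```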
The provided main.py demonstrates the full pipeline:
- Load the CSV (switch to Dask if scaling).
- Validate schema and required fields.
- Build or load symbol mappings (`ConceptName:Value` → integer codes).
- Preprocess: parse dates, apply mapping.
- Convert to entity_list (list of per-patient interval sequences).
- Discover patterns using KarmaLego.
- Decode patterns back to human-readable symbol strings.
- Apply patterns to each patient using the apply modes (see below): `tirp-count`, `tpf-dist`, `tpf-duration`.
- Persist outputs: `discovered_patterns.csv` and a single long-format CSV with five feature columns per (PatientId, Pattern) row.
Example invocation:
```
python main.py
```

This produces:

- `discovered_patterns.csv` — flat table of frequent TIRPs with support and decoded symbols.
- `patient_pattern_vectors.ALL.csv` — one row per (PatientId, Pattern) with 5 columns: `tirp_count_unique_last`, `tirp_count_all`, `tpf_dist_unique_last`, `tpf_dist_all`, `tpf_duration`.
You can pivot patient_pattern_vectors.ALL.csv to a wide feature matrix for modeling.
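For example, a pivot along these lines (a sketch; it assumes the file and column names described above, and the output path is just an example) produces one feature column per pattern:

```python
import pandas as pd

long_df = pd.read_csv("data/patient_pattern_vectors.ALL.csv")

# One row per patient, one column per pattern, here using the tpf_dist_unique_last
# feature; (patient, pattern) pairs missing from the long table become 0.
wide = long_df.pivot_table(index="PatientId",
                           columns="Pattern",
                           values="tpf_dist_unique_last",
                           fill_value=0.0)
wide.to_csv("data/patient_feature_matrix.csv")  # example output path
```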
Key parameters:

- `epsilon`: temporal tolerance for equality/meet decisions (same unit as your preprocessed timestamps). If source columns were datetimes, they are converted to ns, so you may pass a numeric ns value or a `pd.Timedelta`.
- `max_distance`: maximum gap between intervals to still consider them related (e.g., 1 hour → `pd.Timedelta(hours=1)`); same unit rule as above.
- `min_ver_supp`: minimum vertical support threshold (fraction of patients that must exhibit a pattern for it to be retained).
Apply modes:

- `tirp-count`: horizontal support per patient. Counting strategy:
  - `unique_last` (default): count one occurrence per distinct last-symbol index among valid embeddings (e.g., in `A…B…A…B…C`, `A<B<C` counts 1).
  - `all`: count every embedding.
- `tpf-dist`: min–max normalize the `tirp-count` values across the cohort, per pattern, into [0,1].
- `tpf-duration`: for each patient and pattern, take the union of the pattern's embedding windows (each window is the full span from the start of the first interval in the embedding to the end of the last), so overlapping/touching windows are merged and not double-counted (gaps inside a window are included). The per-patient total is then min–max normalized across patients (per pattern) to [0,1]. See the sketch below.
  Example: if `A<B` occurs with windows of 10h and 5h that don't overlap, duration = 10 + 5 = 15; if the second starts 2h before the first ends, duration = 10 + 5 − 2 = 13.
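The window-union step behind `tpf-duration` can be sketched as follows (an illustrative helper, not the exact apply-phase code): sort the embedding windows, merge overlapping or touching ones, and sum the merged lengths.

```python
def union_duration(windows):
    """Total covered time of a list of (start, end) embedding windows,
    merging overlapping/touching windows so nothing is double-counted."""
    if not windows:
        return 0
    total = 0
    cur_start, cur_end = None, None
    for start, end in sorted(windows):
        if cur_start is None:
            cur_start, cur_end = start, end
        elif start <= cur_end:                 # overlapping or touching: extend the current window
            cur_end = max(cur_end, end)
        else:                                  # disjoint: close the current window, start a new one
            total += cur_end - cur_start
            cur_start, cur_end = start, end
    total += cur_end - cur_start
    return total

# The worked example above (in hours for readability):
assert union_duration([(0, 10), (12, 17)]) == 15   # disjoint windows: 10 + 5
assert union_duration([(0, 10), (8, 13)]) == 13    # 2h overlap: 10 + 5 - 2
```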
Support semantics:

- Horizontal support (per entity): number of embeddings (index tuples) of the TIRP found in a given entity.
  - Discovery counts all valid embeddings (standard KarmaLego).
  - Apply offers a strategy: `unique_last` (one per distinct last index; default) or `all` (every embedding); both are sketched below.
  - Example: in `A…B…A…B…C`, the pattern `A…B…C` has 3 embeddings; discovery counts 3: `(A₀,B₁,C₄)`, `(A₀,B₃,C₄)`, `(A₂,B₃,C₄)`, while `tirp-count` with `unique_last` counts 1.
- Vertical support (dataset level): fraction of entities that have at least one embedding of the TIRP.

If you prefer "non-overlapping" or "one-per-window" counts for downstream modeling, compute that in the apply phase without changing discovery (a toggle can be added there).
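A tiny sketch of the two counting strategies (the helper name is illustrative; the real logic sits in the apply phase):

```python
def horizontal_support(embeddings, strategy="unique_last"):
    """Per-entity count of a TIRP's embeddings (each an index tuple).

    'all' counts every embedding; 'unique_last' counts one occurrence per
    distinct last-symbol index, as tirp-count does in the apply phase.
    """
    if strategy == "all":
        return len(embeddings)
    return len({emb[-1] for emb in embeddings})

# The A…B…A…B…C example above: three embeddings share the same last index (C at position 4).
embeddings = [(0, 1, 4), (0, 3, 4), (2, 3, 4)]
assert horizontal_support(embeddings, "all") == 3
assert horizontal_support(embeddings, "unique_last") == 1
```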
```python
import pandas as pd
from core.karmalego import KarmaLego

# Prepare the entity list; examples are in the io.py module (may vary between data sources).
# A full running example is in main.py.
kl = KarmaLego(epsilon=pd.Timedelta(minutes=1),
               max_distance=pd.Timedelta(hours=1),
               min_ver_supp=0.03)
df_patterns = kl.discover_patterns(entity_list, min_length=1, max_length=None)  # returns a DataFrame
```
```python
import pandas as pd
from core.parallel_runner import run_parallel_jobs

# 1. Define your jobs.
#    Each job is a dict with 'name', 'data' (entity_list), and 'params'.
jobs = [
    {
        'name': 'cohort_A',
        'data': entity_list_A,
        'params': {
            'epsilon': pd.Timedelta(minutes=1),
            'max_distance': pd.Timedelta(hours=1),
            'min_ver_supp': 0.5,
            'min_length': 2
        }
    },
    {
        'name': 'cohort_B',
        'data': entity_list_B,
        'params': {
            'epsilon': pd.Timedelta(minutes=1),
            'max_distance': pd.Timedelta(hours=1),
            'min_ver_supp': 0.4,
            'min_length': 2
        }
    }
]

# 2. Run in parallel (uses multiprocessing).
#    Returns a single DataFrame with a 'job_name' column identifying the source job.
df_all = run_parallel_jobs(jobs, num_workers=4)
```
```python
import pandas as pd

# Build all 5 feature columns in one CSV.
rep_to_str = {repr(t): s for t, s in zip(df_patterns["tirp_obj"], df_patterns["tirp_str"])}
pattern_keys = [repr(t) for t in df_patterns["tirp_obj"]]

vec_count_ul = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
                                             mode="tirp-count", count_strategy="unique_last")
vec_count_all = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
                                              mode="tirp-count", count_strategy="all")
vec_tpf_dist_ul = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
                                                mode="tpf-dist", count_strategy="unique_last")
vec_tpf_dist_all = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
                                                 mode="tpf-dist", count_strategy="all")
vec_tpf_duration = kl.apply_patterns_to_entities(entity_list, df_patterns, patient_ids,
                                                 mode="tpf-duration", count_strategy="unique_last")

rows = []
for pid in patient_ids:
    for rep in pattern_keys:
        rows.append({
            "PatientId": pid,
            "Pattern": rep_to_str.get(rep, rep),
            "tirp_count_unique_last": vec_count_ul.get(pid, {}).get(rep, 0.0),
            "tirp_count_all": vec_count_all.get(pid, {}).get(rep, 0.0),
            "tpf_dist_unique_last": vec_tpf_dist_ul.get(pid, {}).get(rep, 0.0),
            "tpf_dist_all": vec_tpf_dist_all.get(pid, {}).get(rep, 0.0),
            "tpf_duration": vec_tpf_duration.get(pid, {}).get(rep, 0.0),
        })

pd.DataFrame(rows).to_csv("data/patient_pattern_vectors.ALL.csv", index=False)
```

This block can be parallelized at the patient level or at the function level, but since its usage changes between projects, no single parallelism method is built in.
Run the full test suite:
```
pytest -q -s
```

Run a single test file:

```
pytest unittests/test_tirp.py -q -s
```

The `-s` flag shows pattern printouts and progress bars for debugging.
`discovered_patterns.csv` contains:

- `symbols` (tuple of encoded ints)
- `relations` (tuple of temporal relation codes)
- `k` (pattern length)
- `vertical_support`
- `support_count`
- `entity_indices_supporting`
- `indices_of_last_symbol_in_entities`
- `tirp_obj` (internal object; drop before sharing)
- `symbols_readable` (if decoded)
`patient_pattern_vectors.ALL.csv` is in long format: one row per (PatientId, Pattern) with the following columns:

- `tirp_count_unique_last` — horizontal support per patient using unique_last counting.
- `tirp_count_all` — horizontal support per patient counting all embeddings.
- `tpf_dist_unique_last` — min–max of `tirp_count_unique_last` across patients, per pattern.
- `tpf_dist_all` — min–max of `tirp_count_all` across patients, per pattern.
- `tpf_duration` — union of embedding spans per patient (no overlap double-counting), then min–max across patients, per pattern.

Note: `tpf-*` values are normalized per pattern to [0,1] across the cohort.
Example for singletons: if `A` spans are `[1, 2, 1]` across patients, they normalize to `[0.0, 1.0, 0.0]`.
Scaling tips:

- Replace `pandas.read_csv` with `dask.dataframe.read_csv` for large inputs; the ingestion helpers in `io.py` support Dask (see the Dask sketch below).
- Persist precomputed symbol maps to keep encoding stable across runs.
- Use a categorical dtype for the symbol column after mapping to reduce memory pressure.
- Tune `min_ver_supp` to control pattern explosion vs. sensitivity.
- If memory is tight on extremely dense data, consider limiting `max_k` or post-processing horizontal support to non-overlapping counts; CSAC itself preserves correctness but can retain many embeddings per entity for highly frequent TIRPs.
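A minimal Dask ingestion sketch under the schema above (`blocksize` is just an example value; `dd.read_csv` forwards keyword arguments such as `parse_dates` and `dayfirst` to `pandas.read_csv` per partition):

```python
import dask.dataframe as dd

# Lazily read a large CSV in partitions, parsing the day-first timestamps per partition.
ddf = dd.read_csv(
    "data/synthetic_diabetes_temporal_data.csv",
    blocksize="64MB",
    parse_dates=["StartDateTime", "EndDateTime"],
    dayfirst=True,
)

df = ddf.compute()  # the core mining itself still runs on an in-memory entity_list
```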
This implementation is entirely in-memory. The raw data (entity_list), the pattern tree, and the CSAC embedding maps (which store valid index-tuples for every active pattern) all reside in RAM.
Memory usage is driven by:
- Dataset Size: Number of records (intervals).
- Pattern Density: How many frequent patterns exist and how many embeddings they have per patient.
- CSAC Overhead: Storing exact embedding tuples for active candidates is memory-intensive for dense data.
Rough estimation table, assuming an average of 500 records (intervals) per patient:
| Patients | Records (Total) | Unique Events | Est. RAM (Min) | Est. RAM (Safe) | Recommendation |
|---|---|---|---|---|---|
| 10k | 5M | 20 | 4 GB | 8 GB | Laptop OK. |
| 10k | 5M | 50 | 8 GB | 16 GB | Workstation. |
| 10k | 5M | 100 | 16 GB | 32 GB | High-RAM Workstation. |
| 50k | 25M | 20 | 16 GB | 32 GB | High-RAM Workstation. |
| 50k | 25M | 50 | 32 GB | 64 GB | Server / Split. |
| 50k | 25M | 100 | 64 GB | 128 GB | Split Cohort. |
| 100k | 50M | 20 | 32 GB | 64 GB | Server / Split. |
| 100k | 50M | 50 | 64 GB | 128 GB | Split Cohort. |
| 100k | 50M | 100 | 128 GB+ | 256 GB+ | Must Split. |
These estimates were produced with AI assistance and have not been verified in real usage.
Mitigation Strategy: If your data exceeds these limits, do not run as a single job.
- Split the cohort: Divide patients into chunks (e.g., 10k patients per chunk).
- Run in parallel: Use the `run_parallel_jobs` utility to process chunks independently.
- Merge results: Concatenate the resulting pattern DataFrames. Note that vertical support will be local to each chunk; you may need to re-aggregate global support if exact global statistics are required (a sketch follows below).
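As a sketch of the merge step, assuming each chunk's pattern table keeps the `tirp_str` and `support_count` columns and that identical pattern strings denote the same TIRP across chunks: sum `support_count` over chunks and divide by the total patient count. This gives a lower bound on the exact global support, because a pattern pruned below a chunk's `min_ver_supp` contributes nothing from that chunk.

```python
import pandas as pd

def merge_chunk_results(chunk_results, chunk_sizes):
    """Re-aggregate an approximate global vertical support from per-chunk tables.

    chunk_results: list of DataFrames with 'tirp_str' and 'support_count' columns
    chunk_sizes:   number of patients in each corresponding chunk
    """
    total_patients = sum(chunk_sizes)
    merged = pd.concat(chunk_results, ignore_index=True)
    agg = merged.groupby("tirp_str", as_index=False)["support_count"].sum()
    # Lower bound: patterns pruned inside a chunk contribute 0 for that chunk.
    agg["global_vertical_support"] = agg["support_count"] / total_patients
    return agg
```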
- Core logic lives in `core/karmalego.py`. Utilities (relation inference, transition tables) live in `core/utils.py` and `core/relation_table.py`.
- Input/output logic lives in `core/io.py` and controls the formats, data structures, sources, and destinations; it is currently set up for local CPU runs with CSV files.
- The pattern tree is built lazily/iteratively; flat exports are used downstream for speed.
- Equality and hashing ensure duplicate candidate patterns merge correctly.
- Tests provide deterministic synthetic scenarios for regression.
- SAC/CSAC implementation notes:
  - We store `embeddings_map` on each TIRP (per-entity lists of index tuples) and pass it to children as `parent_embeddings_map`, so Lego only considers child embeddings that extend actual parent tuples (CSAC).
  - After support filtering, we free `parent_embeddings_map`. If a node has no valid extensions, we also free its own `embeddings_map`.
  - Lego does not extend singletons; pairs are produced once in Karma and reused, eliminating duplicate pair discovery.

