
Conversation

@mart-r (Collaborator) commented Dec 18, 2025

Problem

The filter setting mechanism in the embedding linker was inefficient, causing significant performance bottlenecks when processing datasets with many small documents, each with its own filters. This was particularly problematic during metric calculation on the COMETA dataset, where filter setup dominated the overall runtime.

Performance Impact

Before: metric calculation with the spacy tokenizer took 14,671 seconds on COMETA

After: 205.7 seconds with spacy on COMETA

Speedup: ~70x improvement

Changes Made

  1. Added inverted index precomputation (_initialize_filter_structures):
    • Built _cui_idx_to_name_idxs: maps CUI indices to lists of name indices containing them
    • This flips the lookup direction: each filter becomes an O(1) dictionary lookup per CUI instead of an O(n) scan over all names
    • Cached _has_cuis_all_cached to avoid recomputation
  2. Optimized filter methods
    • _get_include_filters_1cui: Single CUI include filter using inverted index
    • _get_include_filters_multi_cui: Multi-CUI include filter with NumPy concatenation
    • _get_exclude_filters_1cui / _get_exclude_filters_multi_cui: Corresponding exclude filters
    • Routing methods _get_include_filters and _get_exclude_filters choose appropriate implementation
  3. Refactored _set_filters method:
    • Replaced nested loops and list comprehensions with direct index lookups
    • Simplified logic flow using the new optimized methods
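To make the idea above concrete, here is a minimal, self-contained sketch of the inverted-index approach. The function names mirror those in the PR (`_cui_idx_to_name_idxs`, the `1cui`/`multi_cui` filter variants), but the bodies are hypothetical reconstructions from the description, not the actual MedCAT implementation; the input format (a list of per-name CUI-index lists) is an assumption.

```python
import numpy as np

def build_inverted_index(name_cui_idxs):
    """Build the analogue of _cui_idx_to_name_idxs: map each CUI index
    to a NumPy array of the name indices that contain it.
    name_cui_idxs is assumed to be a list of CUI-index lists, one per name."""
    index = {}
    for name_idx, cui_idxs in enumerate(name_cui_idxs):
        for cui_idx in cui_idxs:
            index.setdefault(cui_idx, []).append(name_idx)
    # Store as arrays so masks can be set with a single fancy-index assignment.
    return {c: np.asarray(idxs, dtype=np.int64) for c, idxs in index.items()}

def include_filter_1cui(index, cui_idx, num_names):
    """Single-CUI include filter: one dict lookup, no scan over names."""
    mask = np.zeros(num_names, dtype=bool)
    mask[index.get(cui_idx, np.empty(0, dtype=np.int64))] = True
    return mask

def include_filter_multi_cui(index, cui_idxs, num_names):
    """Multi-CUI include filter: look up each CUI, then concatenate hits."""
    hits = [index[c] for c in cui_idxs if c in index]
    mask = np.zeros(num_names, dtype=bool)
    if hits:
        mask[np.concatenate(hits)] = True
    return mask

def exclude_filter(include_mask):
    """Exclude filters are just the complement of the include mask."""
    return ~include_mask
```

The key design point is that the expensive work (scanning every name's CUIs) happens once at initialization; each subsequent per-document filter is a handful of dictionary lookups plus one vectorized mask assignment.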

Checks to make sure this doesn't change behaviour

I ran the metrics on both the COMETA and Linking Challenge datasets before and after. The precision/recall/F1 are identical, so I'm fairly confident the change hasn't broken anything.
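Beyond the end-to-end metric comparison, the equivalence can also be checked directly with a small property test: the inverted-index mask should match a naive O(n) scan on random inputs. This is a standalone sketch with made-up data, not part of the PR's test suite.

```python
import numpy as np

rng = np.random.default_rng(0)
num_names, num_cuis = 50, 10
# Random toy data: each name carries 1-3 distinct CUI indices.
name_cui_idxs = [
    rng.choice(num_cuis, size=int(rng.integers(1, 4)), replace=False).tolist()
    for _ in range(num_names)
]

def naive_mask(allowed):
    """Old-style O(n) scan: check every name against the allowed set."""
    return np.array([any(c in allowed for c in cuis) for cuis in name_cui_idxs])

# Precompute the inverted index once, as in _initialize_filter_structures.
inv = {}
for name_idx, cuis in enumerate(name_cui_idxs):
    for c in cuis:
        inv.setdefault(c, []).append(name_idx)

def fast_mask(allowed):
    """Inverted-index lookup: only touch names that actually match."""
    mask = np.zeros(num_names, dtype=bool)
    for c in allowed:
        mask[inv.get(c, [])] = True
    return mask

# Property check: both paths must agree on random filters.
for _ in range(20):
    allowed = set(rng.choice(num_cuis, size=3, replace=False).tolist())
    assert np.array_equal(naive_mask(allowed), fast_mask(allowed))
```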


