
Conversation

@mart-r (Collaborator) commented Dec 18, 2025

Problem

The filter setting mechanism in the embedding linker was inefficient, causing significant performance bottlenecks when processing datasets with many small documents, each with its own filters. This was particularly problematic during metric calculation on the COMETA dataset, where filter setup dominated the overall runtime.

Performance Impact

Before: metric calculation with the spacy tokenizer took 14,671 seconds on COMETA

After: 205.7 seconds with spacy on COMETA

Speedup: ~70x improvement

Changes Made

  1. Added inverted index precomputation (_initialize_filter_structures):
    • Built _cui_idx_to_name_idxs: maps CUI indices to lists of name indices containing them
    • This flips the lookup direction: each filter becomes an O(1) dictionary lookup per CUI instead of an O(n) scan over all names
    • Cached _has_cuis_all_cached to avoid recomputation
  2. Optimized filter methods
    • _get_include_filters_1cui: Single CUI include filter using inverted index
    • _get_include_filters_multi_cui: Multi-CUI include filter with NumPy concatenation
    • _get_exclude_filters_1cui / _get_exclude_filters_multi_cui: Corresponding exclude filters
    • Routing methods _get_include_filters and _get_exclude_filters choose appropriate implementation
  3. Refactored _set_filters method:
    • Replaced nested loops and list comprehensions with direct index lookups
    • Simplified logic flow using the new optimized methods
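To make the idea above concrete, here is a minimal, self-contained sketch of the inverted-index approach. The function names mirror those in the PR (`_cui_idx_to_name_idxs`, the `1cui`/`multi_cui` filter variants), but the bodies are hypothetical reconstructions from the description, not the actual MedCAT implementation; the input format (a list of per-name CUI-index lists) is an assumption.

```python
import numpy as np

def build_inverted_index(name_cui_idxs):
    """Build the analogue of _cui_idx_to_name_idxs: map each CUI index
    to a NumPy array of the name indices that contain it.
    name_cui_idxs is assumed to be a list of CUI-index lists, one per name."""
    index = {}
    for name_idx, cui_idxs in enumerate(name_cui_idxs):
        for cui_idx in cui_idxs:
            index.setdefault(cui_idx, []).append(name_idx)
    # Store as arrays so masks can be set with a single fancy-index assignment.
    return {c: np.asarray(idxs, dtype=np.int64) for c, idxs in index.items()}

def include_filter_1cui(index, cui_idx, num_names):
    """Single-CUI include filter: one dict lookup, no scan over names."""
    mask = np.zeros(num_names, dtype=bool)
    mask[index.get(cui_idx, np.empty(0, dtype=np.int64))] = True
    return mask

def include_filter_multi_cui(index, cui_idxs, num_names):
    """Multi-CUI include filter: look up each CUI, then concatenate hits."""
    hits = [index[c] for c in cui_idxs if c in index]
    mask = np.zeros(num_names, dtype=bool)
    if hits:
        mask[np.concatenate(hits)] = True
    return mask

def exclude_filter(include_mask):
    """Exclude filters are just the complement of the include mask."""
    return ~include_mask
```

The key design point is that the expensive work (scanning every name's CUIs) happens once at initialization; each subsequent per-document filter is a handful of dictionary lookups plus one vectorized mask assignment.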

Checks to make sure this doesn't change behaviour

I ran the metrics on both the COMETA and Linking Challenge datasets before and after. The precision/recall/F1 are identical, so I'm fairly confident the change hasn't broken anything.
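Beyond the end-to-end metric comparison, the equivalence can also be checked directly with a small property test: the inverted-index mask should match a naive O(n) scan on random inputs. This is a standalone sketch with made-up data, not part of the PR's test suite.

```python
import numpy as np

rng = np.random.default_rng(0)
num_names, num_cuis = 50, 10
# Random toy data: each name carries 1-3 distinct CUI indices.
name_cui_idxs = [
    rng.choice(num_cuis, size=int(rng.integers(1, 4)), replace=False).tolist()
    for _ in range(num_names)
]

def naive_mask(allowed):
    """Old-style O(n) scan: check every name against the allowed set."""
    return np.array([any(c in allowed for c in cuis) for cuis in name_cui_idxs])

# Precompute the inverted index once, as in _initialize_filter_structures.
inv = {}
for name_idx, cuis in enumerate(name_cui_idxs):
    for c in cuis:
        inv.setdefault(c, []).append(name_idx)

def fast_mask(allowed):
    """Inverted-index lookup: only touch names that actually match."""
    mask = np.zeros(num_names, dtype=bool)
    for c in allowed:
        mask[inv.get(c, [])] = True
    return mask

# Property check: both paths must agree on random filters.
for _ in range(20):
    allowed = set(rng.choice(num_cuis, size=3, replace=False).tolist())
    assert np.array_equal(naive_mask(allowed), fast_mask(allowed))
```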


