Releases: mcvickerlab/GenVarLoader

0.19.1

20 Dec 18:55

0.19.1 (2025-12-20)

Fix

  • perf: upgrade genoray for (much) faster writes. test: idempotent fixtures in test_ds_haps

[main e41f7f7] bump: version 0.19.0 → 0.19.1
2 files changed, 7 insertions(+), 1 deletion(-)

0.19.0

03 Dec 03:05

0.19.0 (2025-12-03)

Feat

  • allow subsetting by region name. fix: convert eligible ak.Array to sp.rag.Ragged for gvl.RaggedVariants whenever possible

[main babe777] bump: version 0.18.3 → 0.19.0
2 files changed, 7 insertions(+), 1 deletion(-)

0.18.3

09 Nov 04:27

0.18.3 (2025-11-09)

Fix

  • perf: faster reverse complementing

[main 0690647] bump: version 0.18.2 → 0.18.3
2 files changed, 7 insertions(+), 1 deletion(-)

0.18.2

03 Nov 04:28

0.18.2 (2025-11-03)

Fix

  • correctly parse and load variant fields (skip duplicates)
  • use ak.str.length instead of ak.num to get ref and alt lengths. docs: shape docstring

[main c1900e9] bump: version 0.18.1 → 0.18.2
2 files changed, 8 insertions(+), 1 deletion(-)

0.18.1

23 Oct 06:21

0.18.1 (2025-10-23)

Fix

  • track file format version

[main 877b6d4] bump: version 0.18.0 → 0.18.1
2 files changed, 7 insertions(+), 1 deletion(-)

0.18.0

22 Oct 05:19

0.18.0 (2025-10-22)

Feat

  • make RaggedVariants an Awkward Array subclass supporting arbitrary additional fields.

[main 27420c0] bump: version 0.17.0 → 0.18.0
2 files changed, 7 insertions(+), 1 deletion(-)

0.17.0

22 Aug 21:58

0.17.0 (2025-08-22)

❗Breaking Changes❗

  • The layout of Ragged offsets has changed to match Awkward for better performance. This ONLY affects SVAR datasets, which must have their offsets rewritten to have shape (2, n_regions, n_samples, ploidy) instead of (n_regions, n_samples, ploidy, 2). The SVAR genotype metadata shape info must also be updated to reflect this. In addition, any code that uses Ragged.offsets directly may need to be adjusted, as 2-D offsets have had their axes transposed.
  • The transform and return_indices arguments have been moved to Dataset.to_torch_dataset and Dataset.to_dataloader
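
For code that reads Ragged.offsets directly, the axis change described above can be sketched with plain NumPy. This is only an illustration of the shape change on toy data, assuming the size-2 axis holds the start and end offsets. It is not GVL's migration procedure, and it does not touch SVAR files on disk or their metadata.

```python
import numpy as np

# Toy dimensions, chosen only for illustration.
n_regions, n_samples, ploidy = 3, 4, 2

# Old layout: the size-2 axis (assumed to be start/end) is the LAST axis.
old_offsets = np.arange(
    n_regions * n_samples * ploidy * 2, dtype=np.int64
).reshape(n_regions, n_samples, ploidy, 2)

# New layout: the size-2 axis is the FIRST axis.
# np.moveaxis returns a view, so copy it into a C-contiguous array.
new_offsets = np.ascontiguousarray(np.moveaxis(old_offsets, -1, 0))

# With the new layout, the two halves unpack directly.
starts, ends = new_offsets  # each has shape (n_regions, n_samples, ploidy)
print(new_offsets.shape)
```

Code that previously indexed the last axis (for example `offsets[..., 0]`) would index the first axis instead (`offsets[0]`) under the new layout.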

📦Additions📦

  • Dataset.n_intervals() and Dataset.n_variants() methods to get arrays of how many intervals and variants are present in each region/sample/ploid/track
  • Dataset.with_seqs('variants') and Dataset.with_tracks(kind='intervals') to return RaggedVariants and RaggedIntervals, respectively. These are simple containers of Ragged arrays. Also adds methods to convert RaggedVariants and RaggedIntervals to dictionaries of PyTorch nested tensors.
  • Ragged arrays are now a subclass of Awkward Arrays. As a result, Ragged arrays now support all element-wise operators, all awkward functions, and NumPy ufuncs. Their indexing behavior also directly inherits from awkward arrays. As a result, the Ragged.to_awkward() method is deprecated. If you want to remove the Ragged behavior from a Ragged array and get a plain awkward array, you can use Ragged.to_ak(). Awkward arrays can be converted to Ragged ones by simply passing an awkward array to the Ragged constructor directly: Ragged(ak_array).
  • gvl.data_registry.fetch() to obtain pre-processed datasets. Currently includes Geuvadis with RNA-seq tracks and the full 1000 Genomes Project WGS data as VCF/PLINK and BigWigs that can be passed to gvl.write().

Commits

Feat

  • use new seqpro.Ragged interface
  • method to obtain the number of variants per region, sample, and haplotype.
  • support for reverse complementing alt alleles of RaggedVariants
  • ragvariant methods for pytorch, fix squeeze for scalar indexing, upstream genoray fix for extending genotypes from PGEN, (perf) batched contig normalization
  • subsetting for RefDataset

Fix

  • right shape for double slice indexing
  • shape/broadcasting bug for reverse complementing variants
  • perf: breaking changes to ragged data layout from seqpro to eliminate copies when converting non-contiguous ragged data to/from awkward arrays.
  • when ref not passed to Dataset.open, emit a warning instead of error and let ds return raggedvariants
  • refdataset transforms with indices

[main d38797c] bump: version 0.16.0 → 0.17.0
2 files changed, 19 insertions(+), 1 deletion(-)

0.16.0

05 Jun 19:40

0.16.0 (2025-06-05)

Feat

  • make DatasetWithSites return both wild-type and mutant haplotypes
  • data_registry submodule

Fix

  • transform not applied when dataset returns single item. docs: add basenji2 evaluation
  • use genoray>=0.12. docs: basenji2 eval
  • ensure samples are re-ordered by subset_to if necessary
  • Let GVL recognize bgz-compressed VCFs
  • pad ref_coords with max value for dtype to ensure ref coords are sorted
  • remove transform arg for dummy dataset
  • finish deprecating the transform setting on Dataset, which was moved to dataloading functionality
  • PR #101, ensure variable length output corresponds to ArrayDataset
  • update genoray pixi version
  • permit ragged output for dataloading, emitting a warning instead of raising an error
  • constrain genoray for breaking changes
  • torch dataset issues
  • bump seqpro to 0.4.2
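
The fix above that pads ref_coords with the dtype's maximum value can be motivated with a small pure-Python sketch. The helper names and toy coordinates here are hypothetical and this is not GVL's implementation. It only shows why right-padding with the largest representable value keeps a coordinate array sorted, so binary search still works, while padding with a small value such as 0 does not.

```python
from bisect import bisect_left

I32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def pad_right(coords, length, pad_value):
    """Right-pad a list of coordinates to a fixed length (hypothetical helper)."""
    return coords + [pad_value] * (length - len(coords))


def is_sorted(xs):
    """Return True if xs is in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


coords = [100, 101, 102]  # sorted reference coordinates (toy values)

padded_max = pad_right(coords, 6, I32_MAX)
padded_zero = pad_right(coords, 6, 0)

print(is_sorted(padded_max))         # True: padding sorts after every real coordinate
print(is_sorted(padded_zero))        # False: zeros break the ordering
print(bisect_left(padded_max, 102))  # 2: binary search still finds the real entry
```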

[main 30ef648] bump: version 0.15.0 → 0.16.0
2 files changed, 23 insertions(+), 1 deletion(-)

0.15.0

23 May 02:25

0.15.0 (2025-05-23)

⚠️ Breaking Changes ⚠️

  • The .with_indices and .with_transform methods have been replaced by arguments to the .to_dataloader and/or .to_torch_dataset methods. Motivation: returning indices and applying transforms are not necessary outside of a dataloading context. The best practice for subsetting with GVL is to obtain a PyTorch Dataset via gvl.Dataset.to_torch_dataset(), pass it to torch.utils.data.Subset, and then pass the result to gvl._torch.get_dataloader. This is a very thin wrapper around the native PyTorch DataLoader with better defaults for GVL to take advantage of multithreading (better than multiprocessing). Documentation updates are pending.

📦 Notable Features 📦

  • The RefDataset class, an interface for lazily accessing and organizing subsequences from a reference genome
  • The DatasetWithSites class, which combines a Dataset with a table of site-only variants (e.g. from ClinVar or gnomAD) and inserts the site-only variants into haplotypes. Currently supports only one variant at a time, and only SNPs.
  • Annotation tracks for Datasets, which take a BED file and broadcast its values to be treated as a track for all samples.
  • Now includes SeqPro >= 0.4.2, which adds functionality for scanning GTFs as polars.LazyFrames and extracting GTF attributes via the scan and attr functions under the seqpro.gtf namespace.

Feat

  • add reference property to Dataset and add path attribute to Reference
  • add RefDataset to work with one or more reference genomes. Also change internal indexing to never materialize a full dataset index, dramatically reducing memory usage to support datasets with 1M+ regions. BREAKING CHANGE: move returning indices and transforms to torch API, since these features are generally unnecessary for non-dataloading contexts.
  • sites-only changes for QoL. fix: consider output length < region length for sites overlap
  • allow tracks to pass through dataset with sites since SNPs have no effect on them
  • wip: pad ragged annotated ref coords with max dtype value. pass sanity checks.
  • wip: change ref_coord annotation so that right-pad values have position MAX_I32
  • type-safe Dataset, passes all tests.
  • refactor Dataset implementation to be (almost) fully type-safe.
  • wip: sites-only variants
  • wip: use sp.bed functions.
  • wip: small updates
  • apply sites-only SNPs, filtering non-SNPs out from VCFs.
  • sites-only classes, intersecting them with Datasets, and obtaining information necessary to apply variants.
  • add annot track to dummy dataset
  • wip: initial implementation for read/write annotation tracks, incorporating them along the track dimension.
  • deprecate unphased variants
  • wip: dosages/CCFs on ragged variants
  • prototype of returning ragged variants from Dataset

Fix

  • shape of single item from RefDataset
  • update init for seqpro bump
  • bump seqpro to 0.4.0 which includes basic gtf ops
  • update dummy dataset for changes to Reference. docs: add more docstrings
  • jittering by folding it into data (re)construction
  • contig offset mapping for in-memory reference and incrementing offset when writing cache
  • bump genoray version to handle unsorted PVAR contigs
  • bump genoray version for filtered PGEN fix
  • ensure annot tracks match on-disk ordering
  • update for internal breaking changes
  • check for SNPs
  • bump genoray so bioconda pgenlib is valid
  • pass all tests
  • internal breaking changes
  • treat POS as 1-based to match VCF spec
  • type annotations
  • type annotation
  • contig naming for reference fasta.
  • contig normalization
  • pass all tests.
  • pass tests.
  • pass tests.
  • Dataset.open returns highest complexity ds by default (haps + all tracks, sorted).
  • use pandera polars not pandas
  • correct manipulation of active tracks
  • dummy dataset
  • wrong germline ccfs for 3rd germline variant and beyond
  • wrong geno path
  • parsing SVAR metadata, bump genoray
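
The "treat POS as 1-based to match VCF spec" fix above concerns a common off-by-one source. As a minimal sketch (the function name is hypothetical and this is not GVL's internal code), a 1-based VCF POS plus its REF allele maps to a 0-based, half-open interval like this:

```python
def vcf_to_half_open(pos: int, ref: str) -> tuple[int, int]:
    """Convert a 1-based VCF POS and REF allele to a 0-based, half-open interval."""
    start = pos - 1          # shift from 1-based to 0-based
    end = start + len(ref)   # half-open: end is one past the last REF base
    return start, end


contig = "ACGTACGT"  # toy reference sequence
start, end = vcf_to_half_open(pos=3, ref="GT")

# Slicing the reference with the converted interval recovers the REF allele.
print(start, end, contig[start:end])  # 2 4 GT
```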

Refactor

  • use genoray

[main c0fd8be] bump: version 0.14.4 → 0.15.0
2 files changed, 60 insertions(+), 1 deletion(-)

0.14.4

12 May 20:29

0.14.4 (2025-05-12)

Fix

  • data corruption when rc_helper is parallelized

[main 39e4ac3] bump: version 0.14.3 → 0.14.4
2 files changed, 7 insertions(+), 1 deletion(-)