Releases · mcvickerlab/GenVarLoader

20 Dec 18:55

d-laub

0.19.1

e41f7f7

0.19.1 Latest

0.19.1 (2025-12-20)

Fix

perf: upgrade genoray for (much) faster writes. test: idempotent fixtures in test_ds_haps

[main e41f7f7] bump: version 0.19.0 → 0.19.1
2 files changed, 7 insertions(+), 1 deletion(-)

03 Dec 03:05

d-laub

0.19.0

babe777

0.19.0

0.19.0 (2025-12-03)

Feat

allow subsetting by region name. fix: convert eligible ak.Array to sp.rag.Ragged for gvl.RaggedVariants whenever possible

[main babe777] bump: version 0.18.3 → 0.19.0
2 files changed, 7 insertions(+), 1 deletion(-)

09 Nov 04:27

d-laub

0.18.3

0690647

0.18.3

0.18.3 (2025-11-09)

Fix

perf: faster reverse complementing
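For readers unfamiliar with the operation, here is a minimal pure-Python sketch of reverse complementing a DNA sequence. It is not GVL's implementation (the release note does not describe how the speedup was achieved); the `COMPLEMENT` table and `reverse_complement` function are hypothetical names used only for illustration.

```python
# Translation table mapping each DNA base to its complement.
# N (unknown base) maps to itself. Illustrative only, not GVL's code.
COMPLEMENT = bytes.maketrans(b"ACGTNacgtn", b"TGCANtgcan")


def reverse_complement(seq: bytes) -> bytes:
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement(b"AAGC"))
```

Using `bytes.translate` keeps the per-base lookup in C rather than in a Python loop, which is why it is a common choice for this task in plain Python.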

[main 0690647] bump: version 0.18.2 → 0.18.3
2 files changed, 7 insertions(+), 1 deletion(-)

03 Nov 04:28

d-laub

0.18.2

c1900e9

0.18.2

0.18.2 (2025-11-03)

Fix

correctly parse and load variant fields (skip duplicates)
use ak.str.length instead of ak.num to get ref and alt lengths. docs: shape docstring

[main c1900e9] bump: version 0.18.1 → 0.18.2
2 files changed, 8 insertions(+), 1 deletion(-)

23 Oct 06:21

d-laub

0.18.1

877b6d4

0.18.1

0.18.1 (2025-10-23)

Fix

track file format version

[main 877b6d4] bump: version 0.18.0 → 0.18.1
2 files changed, 7 insertions(+), 1 deletion(-)

22 Oct 05:19

d-laub

0.18.0

27420c0

0.18.0

0.18.0 (2025-10-22)

Feat

make RaggedVariants an Awkward Array subclass supporting arbitrary additional fields.

[main 27420c0] bump: version 0.17.0 → 0.18.0
2 files changed, 7 insertions(+), 1 deletion(-)

22 Aug 21:58

d-laub

0.17.0

d38797c

0.17.0

0.17.0 (2025-08-22)

❗Breaking Changes❗

The layout of Ragged offsets has changed to match Awkward for better performance. This ONLY affects SVAR datasets, which must have their offsets rewritten to have shape (2, n_regions, n_samples, ploidy) instead of (n_regions, n_samples, ploidy, 2). The SVAR genotype metadata shape info must be updated to reflect this as well. In addition, any code that uses Ragged.offsets directly may need to be adjusted, as 2-D offsets have had their axes transposed.
The transform and return_indices arguments have been moved to Dataset.to_torch_dataset and Dataset.to_dataloader
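The offsets layout change above amounts to moving the trailing length-2 axis to the front. The following NumPy sketch shows only that axis move on an in-memory array. The sizes are made up, and the assumption that the length-2 axis holds (start, stop) pairs is mine; rewriting an actual SVAR dataset on disk and updating its metadata is not shown here.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
n_regions, n_samples, ploidy = 3, 2, 2

# Old layout: the length-2 axis (assumed to be start/stop) comes last.
old = np.arange(n_regions * n_samples * ploidy * 2).reshape(
    n_regions, n_samples, ploidy, 2
)

# New layout: the length-2 axis comes first.
# np.moveaxis returns a view; ascontiguousarray copies it so the result
# is C-contiguous in the new layout.
new = np.ascontiguousarray(np.moveaxis(old, -1, 0))

print(old.shape, "->", new.shape)
```

With this layout, `new[0]` and `new[1]` are each contiguous blocks, whereas in the old layout the two values for every element were interleaved.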

📦Additions📦

Dataset.n_intervals() and Dataset.n_variants() methods to get arrays of how many intervals and variants are present in each region/sample/ploid/track
Dataset.with_seqs('variants') and Dataset.with_tracks(kind='intervals') to return RaggedVariants and RaggedIntervals, respectively. These are simple containers of Ragged arrays. Methods to convert RaggedVariants and RaggedIntervals to dictionaries of PyTorch Nested tensors.
Ragged arrays are now a subclass of Awkward Arrays. As a result, they support all element-wise operators, all Awkward functions, and NumPy ufuncs, and their indexing behavior is inherited directly from Awkward Arrays. The Ragged.to_awkward() method is therefore deprecated. To remove the Ragged behavior from a Ragged array and get a plain Awkward Array, use Ragged.to_ak(). To convert an Awkward Array to a Ragged one, pass it directly to the Ragged constructor: Ragged(ak_array).
gvl.data_registry.fetch() to obtain pre-processed datasets. Currently includes Geuvadis with RNA-seq tracks and the full 1000 Genomes Project WGS data as VCF/PLINK and BigWigs that can be passed to gvl.write().

Commits

Feat

use new seqpro.Ragged interface
method to obtain the number of variants per region, sample, and haplotype.
support for reverse complementing alt alleles of RaggedVariants
ragvariant methods for pytorch, fix squeeze for scalar indexing, upstream genoray fix for extending genotypes from PGEN, (perf) batched contig normalization
subsetting for RefDataset

Fix

right shape for double slice indexing
shape/broadcasting bug for reverse complementing variants
perf: breaking changes to ragged data layout from seqpro to eliminate copies when converting non-contiguous ragged data to/from awkward arrays.
when ref not passed to Dataset.open, emit a warning instead of error and let ds return raggedvariants
refdataset transforms with indices

[main d38797c] bump: version 0.16.0 → 0.17.0
2 files changed, 19 insertions(+), 1 deletion(-)

05 Jun 19:40

d-laub

0.16.0

30ef648

0.16.0

0.16.0 (2025-06-05)

Feat

make DatasetWithSites return both wild-type and mutant haplotypes
data_registry submodule

Fix

transform not applied when dataset returns single item. docs: add basenji2 evaluation
use genoray>=0.12. docs: basenji2 eval
ensure samples are re-ordered by subset_to if necessary
Let GVL recognize bgz-compressed VCFs
pad ref_coords with max value for dtype to ensure ref coords are sorted
remove transform arg for dummy dataset
finish deprecating the transform setting on Dataset, which was moved to dataloading functionality
PR #101, ensure variable length output corresponds to ArrayDataset
update genoray pixi version
permit ragged output for dataloading, emitting a warning instead of raising an error
constrain genoray for breaking changes
torch dataset issues
bump seqpro to 0.4.2
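One of the fixes above pads ref_coords with the maximum value of the dtype so that the coordinates stay sorted. The NumPy sketch below illustrates why that choice of pad value preserves sortedness while a zero pad would not. The array contents and variable names are made up for illustration and are not taken from GVL.

```python
import numpy as np

# Made-up reference coordinates for a short sequence.
coords = np.array([100, 101, 102], dtype=np.int32)

# Right-pad with the largest value the dtype can hold.
pad_value = np.iinfo(coords.dtype).max
padded = np.concatenate([coords, np.full(2, pad_value, dtype=coords.dtype)])

# For comparison: padding with zeros breaks the ascending order.
zero_padded = np.concatenate([coords, np.zeros(2, dtype=coords.dtype)])


def is_sorted(a: np.ndarray) -> bool:
    """True if the array is in non-decreasing order."""
    return bool(np.all(a[:-1] <= a[1:]))


# Binary search is only valid on sorted input, which the max-value pad keeps.
idx = int(np.searchsorted(padded, 102))

print(is_sorted(padded), is_sorted(zero_padded), idx)
```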

[main 30ef648] bump: version 0.15.0 → 0.16.0
2 files changed, 23 insertions(+), 1 deletion(-)

23 May 02:25

d-laub

0.15.0

c0fd8be

0.15.0

0.15.0 (2025-05-23)

⚠️ Breaking Changes ⚠️

The .with_indices and .with_transform methods have been removed; their functionality is now provided by arguments to the .to_dataloader and/or .to_torch_dataset methods. Motivation: returning indices and applying transforms are not necessary outside of a dataloading context. The best practice for subsetting with GVL is to obtain a PyTorch Dataset via gvl.Dataset.to_torch_dataset(), pass it to torch.utils.data.Subset, and then pass the resulting Subset to gvl._torch.get_dataloader. This is a very thin wrapper around the native PyTorch DataLoader with better defaults for GVL that take advantage of multithreading (which works better than multiprocessing here). Documentation updates are pending.

📦 Notable Features 📦

The RefDataset class, an interface for lazily accessing and organizing subsequences from a reference genome
The DatasetWithSites class, which combines a Dataset with a table of site-only variants (e.g. from ClinVar or gnomAD) and inserts the site-only variants into haplotypes. Currently supports only one variant at a time and only SNPs.
Annotation tracks for Datasets, which take a BED file and broadcast its values to be treated as a track for all samples.
Now includes SeqPro >= 0.4.2, which adds functionality for scanning GTFs as polars.LazyFrames and extracting GTF attributes via the scan and attr functions under the seqpro.gtf namespace.

Feat

add reference property to Dataset and add path attribute to Reference
add RefDataset to work with one or more reference genomes. Also change internal indexing to never materialize a full dataset index, dramatically reducing memory usage to support datasets with 1M+ regions. BREAKING CHANGE: move returning indices and transforms to torch API, since these features are generally unnecessary for non-dataloading contexts.
sites-only changes for QoL. fix: consider output length < region length for sites overlap
allow tracks to pass through a dataset with sites, since SNPs have no effect on them
wip: pad ragged annotated ref coords with max dtype value. pass sanity checks.
wip: change ref_coord annotation so that right-pad values have position MAX_I32
type-safe Dataset, passes all tests.
refactor Dataset implementation to be (almost) fully type-safe.
wip: sites-only variants
wip: use sp.bed functions.
wip: small updates
apply sites-only SNPs, filtering non-SNPs out from VCFs.
sites-only classes, intersecting them with Datasets, and obtaining information necessary to apply variants.
add annot track to dummy dataset
wip: initial implementation for read/write annotation tracks, incorporating them along the track dimension.
deprecate unphased variants
wip: dosages/CCFs on ragged variants
prototype of returning ragged variants from Dataset
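The RefDataset entry above mentions never materializing a full dataset index. One common way to do this, sketched here as an assumption rather than as GVL's actual implementation, is to compute the (region, sample) pair from a flat index on demand instead of storing every pair. The `unravel` function and the row-major ordering are hypothetical choices for illustration.

```python
def unravel(flat_idx: int, n_regions: int, n_samples: int) -> tuple[int, int]:
    """Map a flat dataset index to (region, sample) in row-major order.

    Nothing is precomputed, so memory use does not grow with
    n_regions * n_samples.
    """
    if not 0 <= flat_idx < n_regions * n_samples:
        raise IndexError(f"index {flat_idx} is out of range")
    return divmod(flat_idx, n_samples)


if __name__ == "__main__":
    # 1,000,000 regions x 500 samples: no 500M-row index table is built.
    print(unravel(123_456_789, 1_000_000, 500))
```

The same arithmetic is what `numpy.unravel_index` performs for arbitrary shapes; the pure-Python version is shown only to make the idea explicit.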

Fix

shape of single item from RefDataset
update init for seqpro bump
bump seqpro to 0.4.0 which includes basic gtf ops
update dummy dataset for changes to Reference. docs: add more docstrings
jittering by folding it into data (re)construction
contig offset mapping for in-memory reference and incrementing offset when writing cache
bump genoray version to handle unsorted PVAR contigs
bump genoray version for filtered PGEN fix
ensure annot tracks match on-disk ordering
update for internal breaking changes
check for SNPs
bump genoray so bioconda pgenlib is valid
pass all tests
internal breaking changes
treat POS as 1-based to match VCF spec
type annotations
type annotation
contig naming for reference fasta.
contig normalization
pass all tests.
pass tests.
pass tests.
Dataset.open returns highest complexity ds by default (haps + all tracks, sorted).
use pandera polars not pandas
correct manipulation of active tracks
dummy dataset
wrong germline ccfs for 3rd germline variant and beyond
wrong geno path
parsing SVAR metadata, bump genoray
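One of the fixes above treats POS as 1-based to match the VCF specification. The sketch below shows the usual conversion from a 1-based VCF POS to a 0-based, half-open interval covering the REF allele. That GVL uses 0-based half-open coordinates internally is an assumption on my part, and `vcf_pos_to_interval` is a hypothetical helper, not a GVL function.

```python
def vcf_pos_to_interval(pos: int, ref: str) -> tuple[int, int]:
    """Convert a 1-based VCF POS and REF allele to a 0-based half-open interval."""
    if pos < 1:
        raise ValueError("this sketch expects POS >= 1")
    start = pos - 1  # shift from 1-based to 0-based
    end = start + len(ref)  # half-open: end is one past the last REF base
    return start, end


if __name__ == "__main__":
    print(vcf_pos_to_interval(100, "A"))
    print(vcf_pos_to_interval(100, "ACG"))
```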

Refactor

use genoray

[main c0fd8be] bump: version 0.14.4 → 0.15.0
2 files changed, 60 insertions(+), 1 deletion(-)

12 May 20:29

d-laub

0.14.4

39e4ac3

0.14.4

0.14.4 (2025-05-12)

Fix

data corruption when rc_helper is parallelized

[main 39e4ac3] bump: version 0.14.3 → 0.14.4
2 files changed, 7 insertions(+), 1 deletion(-)
