Releases: mcvickerlab/GenVarLoader
0.19.1
0.19.0
0.18.3
0.18.2
0.18.1
0.18.0
0.17.0
0.17.0 (2025-08-22)
❗Breaking Changes❗
- The layout of Ragged offsets has changed to match Awkward for better performance. This ONLY affects SVAR datasets which must have their offsets rewritten to have shape (2, n_regions, n_samples, ploidy) instead of (n_regions, n_samples, ploidy, 2). The SVAR genotype metadata shape info must be updated to reflect this as well. In addition, any code that uses Ragged.offsets directly may need to be adjusted as 2-D offsets have had their axes transposed.
- The `transform` and `return_indices` arguments have been moved to `Dataset.to_torch_dataset` and `Dataset.to_dataloader`
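As a rough illustration of the Ragged offsets layout change described above, the following NumPy sketch moves a trailing start/end axis to the front. The array names and toy sizes are assumptions made for illustration. It covers only the axis permutation, not how SVAR datasets store offsets on disk or how the genotype metadata should be updated.

```python
import numpy as np

# Toy dimensions (hypothetical, for illustration only).
n_regions, n_samples, ploidy = 3, 2, 2

# Build example offsets in the old layout: the last axis holds (start, end).
rng = np.random.default_rng(0)
lengths = rng.integers(0, 5, size=(n_regions, n_samples, ploidy))
ends = np.cumsum(lengths.ravel()).reshape(lengths.shape)
starts = ends - lengths
old_offsets = np.stack([starts, ends], axis=-1)  # (n_regions, n_samples, ploidy, 2)

# New layout: the start/end axis comes first.
# np.moveaxis returns a view, so make it contiguous before writing it anywhere.
new_offsets = np.ascontiguousarray(np.moveaxis(old_offsets, -1, 0))

print(old_offsets.shape)  # (3, 2, 2, 2)
print(new_offsets.shape)  # (2, 3, 2, 2)
```

In the new layout, `new_offsets[0]` holds every start and `new_offsets[1]` holds every end, so each is a single contiguous block.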
📦Additions📦
- `Dataset.n_intervals()` and `Dataset.n_variants()` methods to get arrays of how many intervals and variants are present in each region/sample/ploid/track
- `Dataset.with_seqs('variants')` and `Dataset.with_tracks(kind='intervals')` to return `RaggedVariants` and `RaggedIntervals`, respectively. These are simple containers of Ragged arrays.
- Methods to convert `RaggedVariants` and `RaggedIntervals` to dictionaries of PyTorch Nested tensors.
- Ragged arrays are now a subclass of Awkward Arrays. As a result, Ragged arrays now support all element-wise operators, all awkward functions, and NumPy ufuncs. Their indexing behavior also directly inherits from awkward arrays. As a result, the `Ragged.to_awkward()` method is deprecated. If you want to remove the Ragged behavior from a Ragged array and get a plain awkward array, you can use `Ragged.to_ak()`. Awkward arrays can be converted to Ragged ones by simply passing an awkward array to the Ragged constructor directly: `Ragged(ak_array)`.
- `gvl.data_registry.fetch()` to obtain pre-processed datasets. Currently includes Geuvadis with RNA-seq tracks and the full 1000 Genomes Project WGS data as VCF/PLINK and BigWigs that can be passed to `gvl.write()`.
Commits
Feat
- use new seqpro.Ragged interface
- method to obtain the number of variants per region,sample,haplotype.
- support for reverse complementing alt alleles of RaggedVariants
- ragvariant methods for pytorch, fix squeeze for scalar indexing, upstream genoray fix for extending genotypes from PGEN, (perf) batched contig normalization
- subsetting for RefDataset
Fix
- right shape for double slice indexing
- shape/broadcasting bug for reverse complementing variants
- perf: breaking changes to ragged data layout from seqpro to eliminate copies when converting non-contiguous ragged data to/from awkward arrays.
- when ref not passed to Dataset.open, emit a warning instead of error and let ds return raggedvariants
- refdataset transforms with indices
[main d38797c] bump: version 0.16.0 → 0.17.0
2 files changed, 19 insertions(+), 1 deletion(-)
0.16.0
0.16.0 (2025-06-05)
Feat
- make DatasetWithSites return both wild-type and mutant haplotypes
- data_registry submodule
Fix
- transform not applied when dataset returns single item. docs: add basenji2 evaluation
- use genoray>=0.12. docs: basenji2 eval
- ensure samples are re-ordered by subset_to if necessary
- Let GVL recognize bgz-compressed VCFs
- pad ref_coords with max value for dtype to ensure ref coords are sorted
- remove transform arg for dummy dataset
- finish deprecating the transform setting on Dataset, which was moved to dataloading functionality
- PR #101, ensure variable length output corresponds to ArrayDataset
- update genoray pixi version
- permit ragged output for dataloading, emitting a warning instead of raising an error
- constrain genoray for breaking changes
- torch dataset issues
- bump seqpro to 0.4.2
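The fix above that pads `ref_coords` with the maximum value for the dtype can be illustrated with a small NumPy sketch. The helper name `pad_sorted_rows` and the toy coordinates are hypothetical and are not GVL's implementation. The point is only that right-padding with the largest representable value keeps each row sorted, so binary search still works on padded rows.

```python
import numpy as np

def pad_sorted_rows(rows, length, dtype=np.int32):
    """Right-pad each sorted row to `length` using the dtype's max value."""
    pad_value = np.iinfo(dtype).max
    out = np.full((len(rows), length), pad_value, dtype=dtype)
    for i, row in enumerate(rows):
        out[i, : len(row)] = row
    return out

# Two ragged rows of sorted reference coordinates (toy values).
coords = pad_sorted_rows([[100, 101, 105], [7, 9]], length=4)

# Every row is still non-decreasing after padding
# (cast to int64 so the differences cannot overflow).
is_sorted = bool((np.diff(coords.astype(np.int64), axis=1) >= 0).all())

# Binary search on a padded row behaves as it would on the unpadded data.
idx = int(np.searchsorted(coords[1], 8))
```

Padding with 0 or -1 instead would break the sort order of any row shorter than `length`.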
[main 30ef648] bump: version 0.15.0 → 0.16.0
2 files changed, 23 insertions(+), 1 deletion(-)
0.15.0
0.15.0 (2025-05-23)
⚠️ Breaking Changes ⚠️
- The `.with_indices` and `.with_transform` methods have been moved to be arguments to the `.to_dataloader` and/or `.to_torch_dataset` methods. Motivation: returning indices and applying transforms are not necessary outside of a dataloading context. Best practice to facilitate subsetting with GVL is to obtain a PyTorch Dataset via `gvl.Dataset.to_torch_dataset()`, pass this to `torch.utils.data.Subset`, and then provide the subset to `gvl._torch.get_dataloader`. This is a very thin wrapper around the native PyTorch DataLoader with better defaults for GVL to take advantage of multithreading (better than multiprocessing). Documentation updates are pending.
📦 Notable Features 📦
- The RefDataset class, an interface for lazily accessing and organizing subsequences from a reference genome
- The DatasetWithSites class, which combines a Dataset with a table of site-only variants e.g. from ClinVar or gnomAD and inserts the site-only variants into haplotypes. Currently only 1 variant at a time and only SNPs.
- Annotation tracks for Datasets, which take a BED file and broadcast its values to be treated as a track for all samples.
- Now includes SeqPro >= 0.4.2, which adds functionality for scanning GTFs as `polars.LazyFrame`s and extracting GTF attributes via the `scan` and `attr` functions under the `seqpro.gtf` namespace.
Feat
- add reference property to Dataset and add path attribute to Reference
- add RefDataset to work with one or more reference genomes. Also change internal indexing to never materialize a full dataset index, dramatically reducing memory usage to support datasets with 1M+ regions. BREAKING CHANGE: move returning indices and transforms to torch API, since these features are generally unnecessary for non-dataloading contexts.
- sites-only changes for QoL. fix: consider output length < region length for sites overlap
- allow tracks to pass through dataset with sites since SNPs have no effect on them
- wip: pad ragged annotated ref coords with max dtype value. pass sanity checks.
- wip: change ref_coord annotation so that right-pad values have position MAX_I32
- type-safe Dataset, passes all tests.
- refactor Dataset implementation to be (almost) fully type-safe.
- wip: sites-only variants
- wip: use sp.bed functions.
- wip: small updates
- apply sites-only SNPs, filtering non-SNPs out from VCFs.
- sites-only classes, intersecting them with Datasets, and obtaining information necessary to apply variants.
- add annot track to dummy dataset
- wip: initial implementation for read/write annotation tracks, incorporating them along the track dimension.
- deprecate unphased variants
- wip: dosages/CCFs on ragged variants
- prototype of returning ragged variants from Dataset
Fix
- shape of single item from RefDataset
- update init for seqpro bump
- bump seqpro to 0.4.0 which includes basic gtf ops
- update dummy dataset for changes to Reference. docs: add more docstrings
- jittering by folding it into data (re)construction
- contig offset mapping for in-memory reference and incrementing offset when writing cache
- bump genoray version to handle unsorted PVAR contigs
- bump genoray version for filtered PGEN fix
- ensure annot tracks match on-disk ordering
- update for internal breaking changes
- check for SNPs
- bump genoray so bioconda pgenlib is valid
- pass all tests
- internal breaking changes
- treat POS as 1-based to match VCF spec
- type annotations
- type annotation
- contig naming for reference fasta.
- contig normalization
- pass all tests.
- pass tests.
- pass tests.
- Dataset.open returns highest complexity ds by default (haps + all tracks, sorted).
- use pandera polars not pandas
- correct manipulation of active tracks
- dummy dataset
- wrong germline ccfs for 3rd germline variant and beyond
- wrong geno path
- parsing SVAR metadata, bump genoray
Refactor
- use genoray
[main c0fd8be] bump: version 0.14.4 → 0.15.0
2 files changed, 60 insertions(+), 1 deletion(-)