Efficient library for large-scale multimodal (speech and video) datasets with native SQL querying capabilities.

WSDS

wsds combines SQL querying with native support for multimodal data (speech and video) in a single data format and a unified API. Datasets are split into shards, which keeps access efficient and supports highly scalable parallel data processing.

wsds has an integrated database query engine built on top of Polars. This makes database-style operations such as duplicate detection, group-by operations, and aggregations fast and easy to write. This tight integration lets you run both SQL queries and efficient dataloaders directly on your data, without any conversion or importing.
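To make the kinds of queries mentioned above concrete, here is a minimal sketch of duplicate detection and a group-by aggregation written as plain SQL. It uses Python's built-in sqlite3 module as a stand-in, not the wsds API or its Polars-based engine, and the table and column names (utterances, speaker, duration_s, audio_hash) are hypothetical.

```python
import sqlite3

# Hypothetical per-utterance metadata. The column names are illustrative
# and are not taken from the wsds schema.
rows = [
    ("utt1", "spk_a", 3.5, "hash1"),
    ("utt2", "spk_a", 2.0, "hash2"),
    ("utt3", "spk_b", 4.0, "hash1"),  # same content hash as utt1
    ("utt4", "spk_b", 1.5, "hash3"),
]

# sqlite3 is only a stand-in SQL engine for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE utterances "
    "(id TEXT, speaker TEXT, duration_s REAL, audio_hash TEXT)"
)
conn.executemany("INSERT INTO utterances VALUES (?, ?, ?, ?)", rows)

# Duplicate detection: content hashes that occur more than once.
duplicates = conn.execute(
    "SELECT audio_hash, COUNT(*) AS n "
    "FROM utterances "
    "GROUP BY audio_hash "
    "HAVING COUNT(*) > 1 "
    "ORDER BY audio_hash"
).fetchall()

# Group-by aggregation: utterance count and total duration per speaker.
per_speaker = conn.execute(
    "SELECT speaker, COUNT(*) AS n, SUM(duration_s) AS total_s "
    "FROM utterances "
    "GROUP BY speaker "
    "ORDER BY speaker"
).fetchall()

print(duplicates)
print(per_speaker)
```

With wsds, queries of this shape would run against the dataset itself rather than a separately imported table; the exact query interface is not shown in this README.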

Getting Started

# create environment
conda create -n wsds python=3.10
conda activate wsds

# install hume_wsds
pip install git+https://github.com/HumeAI/wsds.git

Tests

To run the tests, you currently need a local copy of the librilight dataset. Point WSDS_DATASET_PATH at the dataset folder and run:

WSDS_DATASET_PATH=/path/to/the/librilight/folder python tests.py
