Code for the paper "Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer" (NAACL 2022).
AudioSet can be downloaded and preprocessed via this tool.
See AudioSet, which describes our customized index files for pre-training on AudioSet.
See AudioTxt, which describes our curation methods and customized index files for audio-text fine-tuning.
For VA pre-training, see the running script `bash/run_bimodal_va.sh`.
For audio-text (AT) fine-tuning, see the running script `bash/run_bimodal_at.sh`. Fine-tuning starts from a VA pre-trained audio encoder.
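Because fine-tuning depends on a VA pre-trained audio encoder, the two scripts are meant to run in order. The following is a minimal sketch of that ordering, not part of the repo: the helper names `stage_commands` and `run_stages` are hypothetical, and it assumes the scripts are launched from the repository root without extra arguments. In practice the scripts may need paths or options edited before they run.

```python
import subprocess

# Stage order taken from the README: VA pre-training, then AT fine-tuning.
STAGES = [
    ("VA pre-training", "bash/run_bimodal_va.sh"),
    ("AT fine-tuning", "bash/run_bimodal_at.sh"),
]


def stage_commands():
    """Return (stage name, command) pairs in the order they should run."""
    # Assumption: each script is invoked with plain `bash <script>`.
    return [(name, ["bash", script]) for name, script in STAGES]


def run_stages(repo_root=".", dry_run=True):
    """Print each stage command; execute it only when dry_run is False."""
    for name, cmd in stage_commands():
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the pipeline if a stage fails, so that
            # fine-tuning never starts without a pre-trained encoder.
            subprocess.run(cmd, check=True, cwd=repo_root)
```

Calling `run_stages()` with the default `dry_run=True` only prints the commands, which is a cheap way to confirm the order before starting a long job.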
We provide the best-performing checkpoint for each task, listed by model in the tables below.
| Model | AudioCaps | Clotho (18s) | Clotho (10s) |
|---|---|---|---|
| VIP-ANT | 00051623 | 00043681 | 00043681 |
| +AT w/ GC | 00006210 | 00006900 | 00004140 |

| Model | ESC50 (w/ prompt) | US8K (w/ prompt) |
|---|---|---|
| VIP-ANT | 00083391 | 00079420 |
| +AT w/ GC | 00004140 | 00004140 |
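
For scripted evaluation it can help to have the two tables above in one lookup structure. The sketch below simply transcribes the table entries; the names `BEST_CHECKPOINTS` and `best_checkpoint` are hypothetical and not part of the repo. It treats each entry as an opaque identifier (kept as a string so the leading zeros survive) and makes no assumption about how an identifier maps to a file name or download location.

```python
# Table entries transcribed verbatim from the README, keyed by model and task.
BEST_CHECKPOINTS = {
    "VIP-ANT": {
        "AudioCaps": "00051623",
        "Clotho (18s)": "00043681",
        "Clotho (10s)": "00043681",
        "ESC50 (w/ prompt)": "00083391",
        "US8K (w/ prompt)": "00079420",
    },
    "+AT w/ GC": {
        "AudioCaps": "00006210",
        "Clotho (18s)": "00006900",
        "Clotho (10s)": "00004140",
        "ESC50 (w/ prompt)": "00004140",
        "US8K (w/ prompt)": "00004140",
    },
}


def best_checkpoint(model, task):
    """Return the checkpoint identifier listed for a model and task."""
    try:
        return BEST_CHECKPOINTS[model][task]
    except KeyError:
        # Raise a clearer message than the bare missing key.
        raise KeyError(
            f"no checkpoint listed for model={model!r}, task={task!r}"
        ) from None
```

For example, `best_checkpoint("+AT w/ GC", "AudioCaps")` returns the identifier from the first table's second row.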
The Dockerfile defines the minimum dependencies of the repo.
MIT