Evaluation toolkit from SoundReactor
Contact: Koichi Saito (koichi.saito@sony.com)
This repository supports the evaluation of:
- Fréchet Distances (FD)
- Fréchet Stereo Audio Distances (FSAD)
- Inception Scores (IS)
- Mean KL Distances (MKL)
- CLAP Scores
  - LAION_CLAP: cosine similarity between text and audio embeddings computed by LAION-CLAP.
- ImageBind Score
  - Cosine similarity between video and audio embeddings computed by ImageBind, sometimes scaled by 100.
- DeSync Score
  - Average misalignment (in seconds) predicted by Synchformer with the `24-01-04T16-39-21` model trained on AudioSet. We average the results from the first 4.8 seconds and the last 4.8 seconds of each video-audio pair.
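For reference, the similarity-based scores above amount to a cosine similarity between embeddings. Using our own notation purely for illustration ($e_t$, $e_v$, $e_a$ are the text, video, and audio embeddings):

$$\mathrm{CLAP} = \frac{e_t \cdot e_a}{\lVert e_t \rVert \, \lVert e_a \rVert}, \qquad \mathrm{ImageBind} = \frac{e_v \cdot e_a}{\lVert e_v \rVert \, \lVert e_a \rVert} \;\; (\times 100 \text{ when scaled})$$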
Install Docker and build the Docker container. The dockerfile is located at `container/dockerfile`.

```bash
docker build -t tag .
```

Or you can install via miniforge. The yaml file is located at `container/environment.yml`.

```bash
mamba env create -f environment.yml
```

Then

```bash
mamba activate v2a_eval
```

Then install PyTorch and flash-attn (we only tested the versions below, but it might work with different ones):

```bash
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install flash-attn==2.4.1 --no-build-isolation
```

If you plan to evaluate on videos, you will also need ffmpeg. Note that torchaudio imposes a maximum version limit (ffmpeg<7).
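As an optional sanity check (not a script shipped with this repository), you can verify the environment like this:

```bash
# Optional environment check; not part of the repository's scripts.
python -c "import torch, torchaudio, flash_attn; print(torch.__version__, torch.version.cuda)"
ffmpeg -version | head -n 1   # torchaudio requires ffmpeg < 7
```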
- Download the LAION-CLAP checkpoint. Specify the path to the checkpoint at `--clap_ckpt_path` in the shell script.
- Download the Synchformer checkpoint. Specify the path to the checkpoint at `--syncformer_ckpt_path` in the shell script.
- Download the StereoCRW checkpoint. Put the checkpoint under `stereoFAD/checkpoints/pretrained-models`.
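For example, a possible setup (all file names and paths below are placeholders, not the official checkpoint names):

```bash
# Placeholder paths; point the flags in the shell scripts at your actual downloads:
#   --clap_ckpt_path       /path/to/laion_clap_checkpoint.pt
#   --syncformer_ckpt_path /path/to/synchformer_checkpoint.pth
# The StereoCRW checkpoint goes to a fixed location:
mkdir -p stereoFAD/checkpoints/pretrained-models
mv stereocrw_checkpoint.pth stereoFAD/checkpoints/pretrained-models/   # placeholder file name
```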
Evaluation is a two-stage process (see the sketch below):
- Extraction: extract video/text/audio features for the ground-truth data and audio features for the predicted audio.
- Evaluation: compute the desired metrics using the extracted features.
*For FSAD, please check here.
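Putting the two stages together, a typical run might look like this (only the scripts named in this README; which extraction scripts you need depends on your application):

```bash
# Stage 1: extract ground-truth features
bash extract_video.sh   # visual features, for video-to-audio applications
bash extract_text.sh    # text features, for text-conditioned applications

# Stage 2: compute metrics (audio features are extracted and cached here)
bash evaluate.sh
```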
For video-to-audio applications, visual features are extracted from input videos. Input requirements:
- Videos in .mp4 format (any FPS or resolution).
- Video names must match the corresponding audio file names (excluding extensions).
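For illustration, a matching layout could look like this (the directory and file names are hypothetical):

```bash
# Hypothetical layout: each video shares its base name with the corresponding audio file.
ls data/videos   # dog_bark_001.mp4  rain_on_roof_002.mp4
ls data/audios   # dog_bark_001.wav  rain_on_roof_002.wav
```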
Run the following to extract visual features using Synchformer and ImageBind:
```bash
bash extract_video.sh
```

For applications using text, text features are extracted from input text data.
Input requirements:
- A CSV file with a header row and at least the following two columns:
  - `name`: matches the corresponding audio file name (excluding extensions).
  - `caption`: the text associated with the audio.
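A minimal example of such a file (the file name, names, and captions are placeholders):

```bash
# Hypothetical metadata CSV; all values are placeholders.
cat > metadata.csv <<'EOF'
name,caption
dog_bark_001,"A dog barks twice in a quiet backyard"
rain_on_roof_002,"Heavy rain falling on a tin roof"
EOF
```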
Run the following to extract text features using LAION-CLAP:
```bash
bash extract_text.sh
```

You can run evaluation via

```bash
bash evaluate.sh
```

Since the pretrained audio feature extractors are trained on mono audio (except for StereoCRW, for which FSAD can be computed separately via `stereoFAD/run_eval.sh`), audio samples are loaded as mono.
You can specify the mono type via `--mono_type`, e.g. 'mean', 'left', 'right', or 'side' (the difference between left and right).
If you don't want to recompute audio feature extraction, you can skip it by excluding:
- `--recompute_gt_cache` for ground-truth audio features.
- `--recompute_pred_cache` for predicted audio features.
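As a sketch, these options might be combined inside `evaluate.sh` like this (the Python entry point and any other arguments are hypothetical; only the flags themselves appear in this README):

```bash
# Hypothetical invocation; the entry-point name is a placeholder.
python eval.py \
    --mono_type mean \
    --recompute_gt_cache \
    --recompute_pred_cache
```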
If there are no extracted features for text or videos, the metrics related to those features are automatically skipped.
To cite this repository, please use the following BibTeX entry:
```bibtex
@article{saito2025soundreactor,
  title={SoundReactor: Frame-level Online Video-to-Audio Generation},
  author={Koichi Saito and Julian Tanke and Christian Simon and Masato Ishii and Kazuki Shimada and Zachary Novack and Zhi Zhong and Akio Hayakawa and Takashi Shibuya and Yuki Mitsufuji},
  year={2025},
  eprint={2510.02110},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2510.02110},
  journal={arXiv preprint arXiv:2510.02110},
}
```

- https://github.com/hkchengrex/av-benchmark
- https://github.com/PeiwenSun2000/Both-Ears-Wide-Open/