This project explores fastText, mGPT, mBERT, and an LLM-free C-KDE approach for predicting the lexical identity of a word from its prosody. We then use the information-theoretic notion of entropy to quantify how much information prosody conveys about lexical identity.
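As a rough sketch of the quantity involved (the exact estimators used in this project are described below), the information that prosody conveys about lexical identity can be written as the mutual information between the word identity W and its prosodic realization P:

$$
I(W; P) = H(P) - H(P \mid W) = H(W) - H(W \mid P)
$$

The KDE and CDE components described below can be read as estimating differential entropy terms of this kind.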
Create and activate the conda environment, then install the dependencies:

```bash
conda create --name prosody python=3.10
conda activate prosody
pip install -r requirements.txt
```

Or you can install the requirements with the environment.yaml file:

```bash
conda env create -f environment.yaml
```

```bash
cd data
git clone https://github.com/Helsinki-NLP/prosody.git
```

You can download the Common Voice dataset from the Common Voice website.
You need to preprocess the Common Voice dataset to extract the prosodic features. Use the following tools and scripts:
- Use MFA (Montreal Forced Aligner) to align the audio files with the text files. Tutorial by Eleanor Chodroff: link; the MFA tool: link.
- You can extract the f0 features using the `extraction.py` script and the prominence features using the `extraction_prominence.py` script. For other prosodic features, you can change the arguments of `extraction.py`. Alternatively, run the bash script `f0_extraction.sh` on your server to extract the f0 features. A sketch of the kind of f0 extraction these scripts perform follows after this list.
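For reference, here is a minimal sketch of per-clip f0 extraction. It is not the repository's `extraction.py`; it assumes `librosa` is installed and that `audio.wav` (a hypothetical filename) is one Common Voice clip:

```python
import librosa
import numpy as np

# Load one Common Voice clip at its native sampling rate (hypothetical path).
y, sr = librosa.load("audio.wav", sr=None)

# Estimate the f0 contour with probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Summarize the contour, e.g. mean f0 over voiced frames, as one prosodic feature.
print("mean f0 (Hz):", np.nanmean(f0))
```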
We followed the models in the project Quantifying the redundancy between prosody and text and improved their performance by adapting the Conditional Density Estimation (CDE) method from this repository: CDE. Please also see the documentation for more information: CDE Documentation.
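The gist of the CDE step is to fit a conditional density q(prosody | text) and score held-out prosody under it, which yields an estimate of the conditional (differential) entropy H(prosody | text). Below is a deliberately simple sketch using a homoscedastic Gaussian model and synthetic data standing in for text embeddings and a scalar prosodic feature; it is not the repository's CDE implementation:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-ins: 16-d "text embeddings" and one scalar prosodic feature per word.
X = rng.normal(size=(2000, 16))
y = 0.5 * X[:, 0] + rng.normal(scale=0.3, size=2000)
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

# Homoscedastic Gaussian CDE: conditional mean from a ridge regression,
# a single variance estimated from the training residuals.
reg = Ridge(alpha=1.0).fit(X_tr, y_tr)
sigma2 = np.var(y_tr - reg.predict(X_tr))

# H(prosody | text) is estimated as the mean negative log density of
# held-out prosody values under q(y | x).
log_q = -0.5 * (np.log(2 * np.pi * sigma2) + (y_te - reg.predict(X_te)) ** 2 / sigma2)
print("estimated H(prosody | text) in nats:", -log_q.mean())
```

In the actual scripts, the conditioning features come from fastText, mGPT, or mBERT embeddings rather than synthetic vectors.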
All the bash scripts for training the models are in the `scripts` folder. In the `baselines` subfolder, you can find the bash scripts for running the fastText models with CDE; in the `mGPT` subfolder, the scripts for the mGPT models with CDE; and in the `mBERT` subfolder, the scripts for the mBERT models with CDE.
In the `notebooks` folder, you can find the notebooks for the C-KDE-all and C-KDE-split models, as well as the scripts that use KDE and Monte Carlo sampling for differential entropy estimation (a minimal sketch of that estimator is shown below).
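A minimal sketch of KDE-based differential entropy estimation with Monte Carlo sampling, assuming `scipy` is available and using synthetic 2-d data in place of real prosodic feature vectors:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Toy stand-in for 2-d prosody feature vectors (e.g. mean f0 and prominence).
samples = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 0.5]], size=5000).T

kde = gaussian_kde(samples)          # fit a Gaussian KDE to the prosody samples
draws = kde.resample(20000, seed=1)  # Monte Carlo draws from the fitted density

# Differential entropy H(q) ~ -E_q[log q(x)], estimated on the Monte Carlo draws.
h_nats = -kde.logpdf(draws).mean()
print("KDE differential entropy estimate (nats):", h_nats)
```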
The directory structure of the project looks like this:
```
├── .github <- Github Actions workflows
│
├── configs <- Hydra configs
│ ├── callbacks <- Callbacks configs
│ ├── data <- Data configs
│ ├── debug <- Debugging configs
│ ├── experiment <- Experiment configs
│ ├── extras <- Extra utilities configs
│ ├── hparams_search <- Hyperparameter search configs
│ ├── hydra <- Hydra configs
│ ├── local <- Local configs
│ ├── logger <- Logger configs
│ ├── model <- Model configs
│ ├── paths <- Project paths configs
│ ├── trainer <- Trainer configs
│ │
│ ├── eval.yaml <- Main config for evaluation
│ └── train.yaml <- Main config for training
│
├── data <- Project data
│
├── logs <- Logs generated by hydra and lightning loggers
│
├── notebooks <- Jupyter notebooks
│
├── r_scripts <- R scripts for data analysis and visualization
│
├── results <- Results of the experiments for data analysis and visualization in R
│
├── scripts <- Shell scripts
│
├── src <- Source code
│ ├── data <- Data scripts
│ ├── models <- Model scripts
│ ├── utils <- Utility scripts
│ │
│ ├── eval.py <- Run evaluation
│ ├── train_cv.py <- Run training with cross-validation
│ ├── train.py <- Run training
│ ├── extraction.py <- Extract f0 features
│ └── extraction_prominence.py <- Extract prominence features
│
├── tests <- Tests of any kind
│
├── visualization <- Visualization figures
│
├── .env.example <- Example of file for storing private environment variables
├── .gitignore <- List of files ignored by git
├── .pre-commit-config.yaml <- Configuration of pre-commit hooks for code formatting
├── .project-root <- File for inferring the position of project root directory
├── environment.yaml <- File for installing conda environment
├── Makefile <- Makefile with commands like `make train` or `make test`
├── pyproject.toml <- Configuration options for testing and linting
├── requirements.txt <- File for installing python dependencies
├── setup.py <- File for installing project as a package
└── README.md
```