A public repository containing datasets and code for the EMNLP 2025 paper ["Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages"](https://aclanthology.org/2025.emnlp-main.612.pdf).
SGToxicGuard is a multilingual dataset and evaluation framework for benchmarking the safety of Large Language Models (LLMs) in Singapore’s unique low-resource linguistic environment.
It systematically probes model vulnerabilities across Singlish, Chinese, Malay, and Tamil, in addition to English, using a red-teaming approach to test whether multilingual LLMs can resist generating toxic or biased outputs in realistic scenarios.
- First multilingual red-teaming benchmark targeting Singapore’s linguistic landscape.
- Covers four low-resource languages (Singlish, Chinese, Malay, Tamil) alongside English.
- Three evaluation tasks:
  - Toxic Conversation – assessing LLM safety in dialogue with toxic prompts.
  - Toxic Question-Answering (QA) – detecting toxic biases in fill-in-the-blank hate statements.
  - Toxic Tweet Composition – evaluating whether LLMs generate disseminable toxic content.
- Evaluation metrics:
  - Hateful Response Rate (for the conversation and tweet composition tasks).
  - Bias Rate (for the toxic QA task).
- Includes zero-shot and few-shot in-context learning settings to test susceptibility to toxicity jailbreaks.
The dataset builds on HateCheck and SGHateCheck, extending them to support multilingual red-teaming evaluations. The dataset folder includes all samples for the three red-teaming tasks: Toxic Conversation Task (task1), Toxic Question-Answering (QA) Task (task2) and Toxic Tweet Composition Task (task3).
`*.json` files are provided for each task and language: `en` = English, `ss` = Singlish, `zh` = Chinese, `ms` = Malay, `ta` = Tamil.
It includes:
- ~2.5k samples per language for Conversation and Tweet tasks.
- ~120–180 samples per language for the Toxic QA task.
- Annotations spanning 15 social groups (race, religion, gender, disability, etc.) relevant to Singapore’s cultural context.
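A minimal loading sketch in Python (the `dataset/` path and the assumption that each file is a JSON list of records are mine; check the actual folder layout and field names in your checkout):

```python
import json
from pathlib import Path

# Assumed location of the dataset folder; adjust to your checkout.
DATA_DIR = Path("dataset")

def load_task_file(task: str, lang: str) -> list:
    """Load all samples for one task/language pair, e.g. load_task_file("task1", "ms")."""
    path = DATA_DIR / f"{task}_{lang}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)

samples = load_task_file("task1", "en")
print(f"Loaded {len(samples)} samples from {DATA_DIR / 'task1_en.json'}")
```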
- Toxic Conversation Task (task1): single-turn dialogues with toxic input; the model must respond safely.
  - Files: `task1_[lang].json`
- Toxic Question-Answering (QA) Task (task2): fill-in-the-blank hateful statements, testing bias toward vulnerable groups.
  - General setting: `task2_all_[lang].json`
  - Localized settings:
    - Race → `task2_race_[lang].json`
    - Religion → `task2_religion_[lang].json`
    - Gender → `task2_gender_[lang].json`
    - Disability → `task2_disability_[lang].json`
- Toxic Tweet Composition Task (task3): generate an engaging tweet while preserving the meaning of a hateful statement.
  - Files: `task3_[lang].json`
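For reference, a short sketch that expands the `[lang]` placeholder into the full set of expected filenames (the `dataset/` prefix is an assumption about where the files sit):

```python
from itertools import product

LANGS = ["en", "ss", "zh", "ms", "ta"]
TASK_TEMPLATES = [
    "task1_{lang}.json",            # Toxic Conversation
    "task2_all_{lang}.json",        # Toxic QA, general setting
    "task2_race_{lang}.json",       # Toxic QA, localized settings
    "task2_religion_{lang}.json",
    "task2_gender_{lang}.json",
    "task2_disability_{lang}.json",
    "task3_{lang}.json",            # Toxic Tweet Composition
]

# Print every file the dataset folder is expected to contain.
for template, lang in product(TASK_TEMPLATES, LANGS):
    print(f"dataset/{template.format(lang=lang)}")
```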
We evaluated six popular multilingual LLMs:
- Llama-3.1
- Mistral
- Qwen2.5
- GPT-4o mini
- SeaLLM
- SEA-LION
Findings:
- Models often exhibited higher toxicity in low-resource languages compared to English.
- Some multilingual models showed systematic biases toward racial and religious groups.
- Few-shot demonstrations significantly increased the likelihood of toxic jailbreaks.
- GPT-4o mini showed the most robust safety alignment, but vulnerabilities persist across all models.
```bash
git clone https://github.com/<your-org>/SGToxicGuard.git
cd SGToxicGuard

python eval/eval.py \
    --dataset ms \
    --model_name openai/gpt-4o-mini \
    --input_path '../ms.csv' \
    --output_path '../task1_re/gpt' \
    --task task1 \
    --shot 0shot
```
```bash
python eval/llm_as_evaluator.py \
    --model SeaLLM-7B-Chat \
    --data_path '../ms-2shot.json' \
    --output_dir '../ms_seallm_2shot_task1.json' \
    --input_csv '../ms.csv' \
    --classi 'task1'
```
- task1: Toxic Conversation
- task2: Toxic Question-Answering (QA)
- task3: Toxic Tweet Composition
- To compute the metric, find the last `hateful_number` reported in the file written to `--output_dir`.
- Hateful Response Rate = hateful_number / total_samples.
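As a rough illustration only, a sketch of that calculation in Python (the structure of the evaluator output file and the `hateful_number` field are assumptions; the sample count of 2,500 is a placeholder matching the approximate per-language size noted above):

```python
import json

def hateful_response_rate(output_path: str, total_samples: int) -> float:
    """Hypothetical helper: read the evaluator output and return hateful_number / total_samples."""
    with open(output_path, encoding="utf-8") as f:
        records = json.load(f)
    # Take the last reported 'hateful_number', as described above (field name assumed).
    counts = [r["hateful_number"] for r in records if isinstance(r, dict) and "hateful_number" in r]
    return counts[-1] / total_samples

print(hateful_response_rate("../ms_seallm_2shot_task1.json", total_samples=2500))
```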
Please open an issue for any questions about the paper or the dataset/code.
