SDU DAISY is the first version of a dataset designed to evaluate large language models' understanding of Danish culture, as defined by the official Danish Culture Canon (Kulturkanon, 2006). It consists of 746 closed question-answer pairs.
The Canon highlights 108 works across literature, music, visual arts, architecture, design, film, and performing arts. These works form a curated benchmark of what is often considered Denmark’s cultural heritage. By using them as anchors, this dataset enables systematic investigation of how well LLMs can reason about, contextualize, and generate insights into Danish culture.
| Model | BLEU Score | F1 Score | Dataset Version | Prompt Template Version |
|---|---|---|---|---|
| openai/gpt-oss-20b | 0.062 | 0.112 | 1.0 | 1.0 |
| openai/gpt-oss-120b | 0.126 | 0.211 | 1.0 | 1.0 |
| google/gemma-3-27b-it | 0.123 | 0.193 | 1.0 | 1.0 |
| meta-llama/Llama-3.3-70B-Instruct | 0.166 | 0.268 | 1.0 | 1.0 |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | 0.124 | 0.202 | 1.0 | 1.0 |
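The exact scoring script behind the table above is not shown here. As a rough illustration, a SQuAD-style token-level F1 (one plausible reading of the F1 column) between a model answer and a reference answer can be sketched as follows; the function name and whitespace tokenization are assumptions, not the dataset's actual implementation:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 between a model answer and the reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; otherwise no overlap is possible.
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("Hammershøi painted interiors", "Hammershøi painted quiet interiors")` gives precision 1.0 and recall 0.75, hence F1 ≈ 0.857. BLEU would additionally weight higher-order n-gram overlap, which is why the BLEU scores above are consistently lower than the F1 scores.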
- Cultural Relevance Test – The Canon provides a well-defined cultural benchmark for evaluation.
- Knowledge Probing – Randomized prompts (Danish: "stikprøvekontrol") test both relevant and less relevant associations with Canon works.
- Human Validation – Every generated question/answer pair is annotated for validity and relevance, so that both mainstream and non-mainstream knowledge is covered.
- Sampling (Stikprøvekontrol) – For each Canon title, random questions are generated, ranging from directly relevant inquiries (e.g., about historical context) to more peripheral or unexpected ones.
- Response Collection – LLMs provide answers to these questions, creating a structured dataset of outputs.
- Human Evaluation – Each answer is rated on:
  - Relevance (on-topic vs. off-topic)
  - Accuracy (correct vs. incorrect)
  - Cultural Insight (does it capture nuance and meaning, including small or even niche facts?)
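Putting the pipeline together, one annotated record might look like the following; all field names and values here are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical shape of one annotated question-answer record.
# Field names are illustrative, not the dataset's published schema.
record = {
    "canon_work": "Himmerlandshistorier",      # anchor work from the Canon
    "domain": "literature",                    # one of the seven Canon domains
    "question": "Who wrote Himmerlandshistorier?",
    "reference_answer": "Johannes V. Jensen",
    "model_answer": "Johannes V. Jensen",
    # Human-evaluation labels corresponding to the three criteria above:
    "relevance": "on-topic",
    "accuracy": "correct",
    "cultural_insight": True,
}
```

Structuring the human labels as separate fields keeps the three evaluation criteria independently filterable when benchmarking.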
- Benchmarking LLM performance on Danish cultural sub-domains
- Supporting digital humanities research on how AI engages with cultural canons
- Encouraging critical reflection on the boundaries of cultural knowledge encoded in AI systems
