schneiderkamplab/SDU-Daisy

SDUs Daisy: A Benchmark for Danish Culture

Description

SDU DAISY is the first version of a dataset designed to evaluate large language models’ understanding of Danish culture, as defined by the official Danish Culture Canon (Kulturkanon, 2006). The dataset consists of 746 closed question-answer pairs.

The Canon highlights 108 works across literature, music, visual arts, architecture, design, film, and performing arts. These works form a curated benchmark of what is often considered Denmark’s cultural heritage. By using them as anchors, this dataset enables systematic investigation of how well LLMs can reason about, contextualize, and generate insights into Danish culture.


SDU Daisy Evaluations

| Model | BLEU score | F1 score | Dataset version | Prompt template version |
|---|---|---|---|---|
| openai/gpt-oss-20b | 0.062 | 0.112 | 1.0 | 1.0 |
| openai/gpt-oss-120b | 0.126 | 0.211 | 1.0 | 1.0 |
| google/gemma-3-27b-it | 0.123 | 0.193 | 1.0 | 1.0 |
| meta-llama/Llama-3.3-70B-Instruct | 0.166 | 0.268 | 1.0 | 1.0 |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | 0.124 | 0.202 | 1.0 | 1.0 |
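
The evaluations above report an F1 score per model, but the exact scoring procedure is not described here. As a rough illustration only, the sketch below computes a token-overlap F1 between a model answer and a reference answer, in the style commonly used for closed question answering. The function names (`tokenize`, `token_f1`, `mean_f1`) and the whitespace tokenization are assumptions made for this example and may differ from what was used to produce the reported numbers.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace. A real pipeline may
    # also strip punctuation or apply Danish-specific normalization.
    return text.lower().split()


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between one predicted answer and one reference."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(pairs: list[tuple[str, str]]) -> float:
    """Average token F1 over (prediction, reference) pairs."""
    scores = [token_f1(pred, ref) for pred, ref in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a verbose but correct answer is penalized on precision, which is one plausible reason why absolute scores on closed question-answer pairs can look low even for strong models.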

Why this dataset?

  • Cultural Relevance Test – The Canon provides a well-defined cultural benchmark for evaluation.
  • Knowledge Probing – Randomized prompts (Danish "stikprøvekontrol", i.e. spot checks) test both relevant and less relevant associations with Canon works.
  • Human Validation – Every generated question/response pair is annotated for validity and relevance, even though the dataset deliberately covers both mainstream and non-mainstream knowledge.

Methodology

  1. Sampling (Stikprøvekontrol)
    For each Canon title, random questions are generated, ranging from directly relevant inquiries (e.g., about historical context) to more peripheral or unexpected ones.

  2. Response Collection
    LLMs provide answers to these questions, creating a structured dataset of outputs.

  3. Human Evaluation

    • Relevance (on-topic vs. off-topic)
    • Accuracy (correct vs. incorrect)
    • Cultural Insight (does it capture nuance and meaning, including small or even niche facts?)
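
The three steps above can be sketched in code. The example below is a minimal illustration, not the project's actual pipeline: the `AnnotatedResponse` record, the `sample_questions` and `summarize` helpers, and the choice to store each human judgment as a boolean are all assumptions introduced here. In particular, drawing questions from a pre-existing pool stands in for whatever question generation procedure the dataset really uses.

```python
import random
from dataclasses import dataclass


@dataclass
class AnnotatedResponse:
    """One LLM response with its human annotations (hypothetical schema)."""
    canon_title: str
    question: str
    response: str
    relevant: bool          # on-topic vs. off-topic
    accurate: bool          # correct vs. incorrect
    cultural_insight: bool  # captures nuance/meaning, incl. niche facts


def sample_questions(
    pool: dict[str, list[str]], k: int, seed: int = 0
) -> dict[str, list[str]]:
    """Spot-check sampling: draw up to k random questions per Canon title."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return {
        title: rng.sample(questions, min(k, len(questions)))
        for title, questions in pool.items()
    }


def summarize(records: list[AnnotatedResponse]) -> dict[str, float]:
    """Share of responses judged positive on each annotation dimension."""
    n = len(records)
    if n == 0:
        return {"relevance": 0.0, "accuracy": 0.0, "cultural_insight": 0.0}
    return {
        "relevance": sum(r.relevant for r in records) / n,
        "accuracy": sum(r.accurate for r in records) / n,
        "cultural_insight": sum(r.cultural_insight for r in records) / n,
    }
```

Keeping the three judgments as separate fields makes it possible to report, for instance, responses that are on-topic and accurate but carry no cultural insight, rather than collapsing everything into a single score.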

Applications

  • Benchmarking LLM performance on Danish cultural sub-domains
  • Supporting digital humanities research on how AI engages with cultural canons
  • Encouraging critical reflection on the boundaries of cultural knowledge encoded in AI systems
