Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models

Publication: Contribution to book/anthology/report · Conference contribution in proceedings · Research · peer-reviewed

Standard

Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models. / Pedersen, Bolette Sandford; Sørensen, Nathalie Carmen Hau; Olsen, Sussi; Nimb, Sanni; Gray, Simon.

Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, 2024. p. 16356.


Harvard

Pedersen, BS, Sørensen, NCH, Olsen, S, Nimb, S & Gray, S 2024, Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models. in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, p. 16356. <https://aclanthology.org/2024.lrec-main.1421/>

APA

Pedersen, B. S., Sørensen, N. C. H., Olsen, S., Nimb, S., & Gray, S. (2024). Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (p. 16356). ELRA and ICCL. https://aclanthology.org/2024.lrec-main.1421/

Vancouver

Pedersen BS, Sørensen NCH, Olsen S, Nimb S, Gray S. Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL. 2024. p. 16356

Author

Pedersen, Bolette Sandford ; Sørensen, Nathalie Carmen Hau ; Olsen, Sussi ; Nimb, Sanni ; Gray, Simon. / Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). ELRA and ICCL, 2024. p. 16356

BibTeX

@inproceedings{28c3231f13424161a0f1d81cdd8809d1,
title = "Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models",
abstract = "We present the first version of a semantic reasoning benchmark for Danish compiled semi-automatically from a number of human-curated lexical-semantic resources, which function as our gold standard. Taken together, the datasets constitute a benchmark for assessing selected language understanding capacities of large language models (LLMs) for Danish. This first version comprises 25 datasets across 6 different tasks and include 3,800 test instances. Although still somewhat limited in size, we go beyond comparative evaluation datasets for Danish by including both negative and contrastive examples as well as low-frequent vocabulary; aspects which tend to challenge current LLMs when based substantially on language transfer. The datasets focus on features such as semantic inference and entailment, similarity, relatedness, and ability to disambiguate words in context. We use ChatGPT to assess to which degree our datasets challenge the ceiling performance of state-of-the-art LLMs, average performance being relatively high with an average accuracy of 0.6 on ChatGPT 3.5 turbo and 0.8 on ChatGPT 4.0.",
author = "Pedersen, {Bolette Sandford} and S{\o}rensen, {Nathalie Carmen Hau} and Sussi Olsen and Sanni Nimb and Simon Gray",
year = "2024",
language = "English",
pages = "16356",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
publisher = "ELRA and ICCL",

}
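
The abstract describes evaluating ChatGPT on the benchmark's test instances and reporting average accuracy (0.6 for ChatGPT 3.5 turbo, 0.8 for ChatGPT 4.0). A minimal sketch of such an evaluation loop is given below; the file name danish_benchmark.jsonl, the field names "prompt" and "gold", and the query_llm helper are hypothetical placeholders for illustration only, not part of the published benchmark or its tooling.

import json

def query_llm(prompt: str) -> str:
    """Hypothetical helper that sends a prompt to an LLM (e.g. ChatGPT)
    and returns its answer as a string. Replace with a real API call."""
    raise NotImplementedError

def evaluate(path: str) -> float:
    """Compute average accuracy over benchmark instances stored as JSON lines.
    Each line is assumed to hold a task prompt and its gold-standard label."""
    correct, total = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            instance = json.loads(line)  # assumed fields: "prompt", "gold"
            answer = query_llm(instance["prompt"]).strip().lower()
            correct += answer == instance["gold"].strip().lower()
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    print(f"Average accuracy: {evaluate('danish_benchmark.jsonl'):.2f}")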

RIS

TY - GEN

T1 - Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models

AU - Pedersen, Bolette Sandford

AU - Sørensen, Nathalie Carmen Hau

AU - Olsen, Sussi

AU - Nimb, Sanni

AU - Gray, Simon

PY - 2024

Y1 - 2024

N2 - We present the first version of a semantic reasoning benchmark for Danish compiled semi-automatically from a number of human-curated lexical-semantic resources, which function as our gold standard. Taken together, the datasets constitute a benchmark for assessing selected language understanding capacities of large language models (LLMs) for Danish. This first version comprises 25 datasets across 6 different tasks and includes 3,800 test instances. Although still somewhat limited in size, we go beyond comparative evaluation datasets for Danish by including both negative and contrastive examples as well as low-frequency vocabulary; aspects which tend to challenge current LLMs when based substantially on language transfer. The datasets focus on features such as semantic inference and entailment, similarity, relatedness, and the ability to disambiguate words in context. We use ChatGPT to assess the degree to which our datasets challenge the ceiling performance of state-of-the-art LLMs, average performance being relatively high with an average accuracy of 0.6 on ChatGPT 3.5 turbo and 0.8 on ChatGPT 4.0.

AB - We present the first version of a semantic reasoning benchmark for Danish compiled semi-automatically from a number of human-curated lexical-semantic resources, which function as our gold standard. Taken together, the datasets constitute a benchmark for assessing selected language understanding capacities of large language models (LLMs) for Danish. This first version comprises 25 datasets across 6 different tasks and includes 3,800 test instances. Although still somewhat limited in size, we go beyond comparative evaluation datasets for Danish by including both negative and contrastive examples as well as low-frequency vocabulary; aspects which tend to challenge current LLMs when based substantially on language transfer. The datasets focus on features such as semantic inference and entailment, similarity, relatedness, and the ability to disambiguate words in context. We use ChatGPT to assess the degree to which our datasets challenge the ceiling performance of state-of-the-art LLMs, average performance being relatively high with an average accuracy of 0.6 on ChatGPT 3.5 turbo and 0.8 on ChatGPT 4.0.

UR - https://aclanthology.org/2024.lrec-main.1421/

M3 - Article in proceedings

SP - 16356

BT - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

PB - ELRA and ICCL

ER -
