Evaluation of Large Language Models in Cooperative Language Games

Samuel Knoche

Independent

Abstract

This report investigates the potential of cooperative language games as an evaluation tool for language models. Specifically, it examines the ability of LLMs to act as both the “spymaster” and the “guesser” in the game of Codenames: the spymaster's capability to provide hints that guide a teammate to the “target” words, and the guesser's ability to identify those target words from the given hint. We investigate both the capability of different LLMs at self-play and their ability to play cooperatively with a human teammate. The report concludes with some promising results and suggestions for further investigation.

Keywords: Scale oversight, benchmarks, ML safety

Download

results.pkl 33 kB
codemaster.ipynb 86 kB
ScaleOversight hackathon Write-up - Player of Games.pdf 220 kB
