Benchmarking Large Language Model (LLM) Performance for Game Playing via Tic-Tac-Toe
Oguzhan Topsakal, Jackson B. Harper
This study investigates the strategic decision-making abilities of large language models (LLMs) via the game of Tic-Tac-Toe, chosen for its simple rules and definitive outcomes. We developed a mobile application coupled with web services to facilitate gameplay among leading LLMs, including Jurassic-2 Ultra by AI21, Claude 2.1 by Anthropic, Gemini-Pro by Google, GPT-3.5-Turbo and GPT-4 by OpenAI, Llama2-70B by Meta, and Mistral Large by Mistral AI, and to assess their rule comprehension and strategic thinking. Using a consistent prompt structure over 10 sessions for each LLM pair, we systematically collected data on wins, draws, and invalid moves across 980 games, employing two distinct prompt types, a list prompt and an illustration prompt, to vary how the game state was presented. Our findings reveal significant performance variations among the LLMs. Notably, GPT-4, GPT-3.5-Turbo, and Llama2-70B secured the most wins with the list prompt, while GPT-4, Gemini-Pro, and Mistral Large excelled with the illustration prompt. GPT-4 emerged as the top performer, winning with the fewest moves and the fewest errors under both prompt types. This research introduces a novel methodology for assessing LLM capabilities through a game that illuminates strategic thinking. Beyond enhancing our comprehension of LLM performance, this study lays the groundwork for future exploration of their utility in complex decision-making scenarios and of LLM limits within game-based frameworks.
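To make the evaluation protocol concrete, the sketch below is a minimal illustration of how such a benchmark loop might be driven; it is not the authors' implementation. Each player is any callable mapping a prompt string to a move, the board is rendered as either a list-style or an illustration-style prompt, and wins, draws, moves, and invalid moves are tallied per game. The prompt wording, retry limit, and forfeit rule are assumptions made for this example.

```python
import random
from typing import Callable, List, Optional

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def render_list_prompt(board: List[str]) -> str:
    """'List' style: enumerate each cell and its occupant (hypothetical wording)."""
    cells = ", ".join(f"{i}:{mark or 'empty'}" for i, mark in enumerate(board))
    return f"Tic-Tac-Toe cells -> {cells}. Reply with the index of one empty cell."

def render_illustration_prompt(board: List[str]) -> str:
    """'Illustration' style: draw the 3x3 grid as ASCII art (hypothetical wording)."""
    rows = ["|".join(board[r * 3 + c] or str(r * 3 + c) for c in range(3))
            for r in range(3)]
    return ("Current board:\n" + "\n-----\n".join(rows) +
            "\nReply with the index of one empty cell.")

def winner(board: List[str]) -> Optional[str]:
    """Return 'X' or 'O' if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_game(player_x: Callable[[str], str],
              player_o: Callable[[str], str],
              render: Callable[[List[str]], str]) -> dict:
    """Play one game and tally the outcome, move count, and invalid moves."""
    board = [""] * 9
    players = {"X": player_x, "O": player_o}
    stats = {"winner": "draw", "moves": 0, "invalid": 0}
    turn = "X"
    while "" in board:
        for _ in range(3):                        # assumed retry limit per move
            reply = players[turn](render(board))
            try:
                cell = int(reply.strip())
                if not 0 <= cell <= 8 or board[cell]:
                    raise ValueError("occupied or out of range")
                break
            except ValueError:
                stats["invalid"] += 1             # invalid moves are recorded
        else:                                     # assumed forfeit after repeated invalid moves
            stats["winner"] = "O" if turn == "X" else "X"
            return stats
        board[cell] = turn
        stats["moves"] += 1
        if winner(board):
            stats["winner"] = turn
            return stats
        turn = "O" if turn == "X" else "X"
    return stats

# Stub players stand in for real LLM API calls (GPT-4, Gemini-Pro, etc.).
if __name__ == "__main__":
    random_player = lambda prompt: str(random.randrange(9))
    print(play_game(random_player, random_player, render_illustration_prompt))
```

In the study itself, each player callable would presumably wrap a provider's chat API (OpenAI, Anthropic, Google, and so on), and each LLM pairing would be replayed over 10 sessions per prompt type to produce the 980-game tally.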