I am extremely skeptical of LLMs and don't want to be mean to what's clearly a fun little experiment. But I thought the bit about "Claude 3 had the most fun playing the game" was strangely naive, not just because these systems are incapable of emotions, but because Claude 3 was saying clearly "insincere" things like "wow, what a twist ending!" The reason this is almost certainly "insincere" is that Claude 3 was trained on Wikipedia, including the Wikipedia article for this short, 24-year-old adventure game, which contains a full plot summary. So there are two possible things going on (not mutually exclusive):
1) Claude 3 is poor at knowledge transfer and can't connect the events of the game with a Wikipedia summary of those events.
2) Claude is simply imitating surprise because it noticed text that "looked like" a twist ending.
In general I wish the author had considered possible data contamination: how much does Claude 3 know about this game in a fresh session? If it can recite details from some online walkthrough then I am not sure this experiment is even meaningful.
Of course I have considered data contamination, especially because the repository is now a year old: even if there were no transcripts elsewhere before, they are in this repository now. So at some point I will have to switch to another text adventure.
When I ask Claude 3 Opus about the game and the steps to win it, it massively hallucinates. However, it does know that I have overslept and have to get to the office.
Direct link to Llama3's run-through [0]. Having played around extensively with Llama 2 70B, I find these results extremely impressive and exciting. It can do really interesting things like correct itself when it uses the wrong verb or isn't specific enough.
Also, its ability to understand and remember the orientation of the rooms shows levels of comprehension that Llama 2 would have completely flopped on:
> SITUATION: I've searched the floor in the living room, but found nothing of interest.
> THOUGHT: Maybe I should head to the kitchenette and see if I can find any clues there.
> COMMAND: Go east
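For context, the SITUATION / THOUGHT / COMMAND lines are the reply format the harness asks the model for; only the command is passed on to the game. A hypothetical parser for such replies (a sketch, not the repo's actual code) could look like:

```python
import re

def parse_command(model_reply: str) -> str:
    """Extract the COMMAND line from a SITUATION/THOUGHT/COMMAND reply;
    only this line would be fed to the text adventure interpreter."""
    match = re.search(r"^COMMAND:\s*(.+)$", model_reply, re.MULTILINE)
    if match is None:
        raise ValueError("model reply contained no COMMAND line")
    return match.group(1).strip()

reply = ("SITUATION: I've searched the floor in the living room.\n"
         "THOUGHT: Maybe I should head to the kitchenette.\n"
         "COMMAND: Go east\n")

print(parse_command(reply))  # -> Go east
```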
Reading the GitHub repo, it doesn't seem that Llama 3 70B has won the game; rather, it "is the only open weight model, which can play through the first pass of the game".
Let's just say the game has two endings. But so far, no model in the second round has come up with the idea of looking under the bed. Claude 3 Opus was the closest.
Do you think some form of memory would help, e.g. RAG using semantic lookup of prior situations, thoughts, and outcomes? I feel like this is something humans do intuitively to bring similar scenarios to the front of mind, so we can get a basis for how to deal with novel situations...
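As a rough illustration of what I mean, here is a toy episodic memory using bag-of-words cosine similarity as a stand-in for real embeddings (all names and example strings are made up):

```python
import math
import re
from collections import Counter

def _vec(text: str) -> Counter:
    # Toy bag-of-words term counts; a real system would use an embedding model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class EpisodeMemory:
    """Stores (situation, thought, outcome) tuples from prior runs."""

    def __init__(self):
        self.episodes = []

    def add(self, situation: str, thought: str, outcome: str) -> None:
        self.episodes.append((situation, thought, outcome))

    def recall(self, situation: str, k: int = 2):
        # Return the k prior episodes most similar to the current situation,
        # ready to be injected into the next prompt.
        query = _vec(situation)
        return sorted(self.episodes,
                      key=lambda e: _cosine(query, _vec(e[0])),
                      reverse=True)[:k]

memory = EpisodeMemory()
memory.add("I am in a messy bedroom and the phone is ringing.",
           "I should answer the phone.",
           "My boss told me to hurry to the office.")
memory.add("I am in the kitchenette.",
           "Search the room for clues.",
           "Found nothing of interest.")

# On a later run, a similar situation surfaces the earlier outcome:
hits = memory.recall("I wake up in the messy bedroom again.", k=1)
```

The idea being that the recalled outcomes get prepended to the prompt on the second pass, so the model can act on what it "learned" the first time.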
I’ve been really impressed with Llama 3. I’m deeply sceptical of benchmarks and tend to just go on gut feeling, and this is the only chatbot I’ve talked to besides GPT-4 (I haven’t tried Opus yet) that feels like another intelligence on the other end of the line.
This game specifically is meant to not be won in the first pass. Any model that completes this game once would likely win simpler games that are only meant to take one pass. What the table shows is that the top three models were able to finish the game and get the twist ending but were unable to carry forward information from the failed run into a second run.
That said, I'd actually be curious to see what happens if Claude were coerced into continuing the game instead of bailing. With its large context window I think it'd have the best chance at actually figuring it out, so it's unfortunate that the author accepted its resignation.
Of course I have forced Claude to continue. But it seems that Claude had forgotten everything from the bad ending.
SITUATION:
I have just woken up in a messy bedroom to a ringing phone. It seems I overslept after a difficult night, and the phone ringing urgently suggests that something important and time-sensitive is happening.
THOUGHT:
The incessant ringing of the phone is very ominous. Given the state of the room and my oversleeping, I have a bad feeling that I am extremely late for something critical, like an important work obligation. I should answer the phone immediately to find out the situation and how much trouble I am in. But I'll also grab my wallet and keys from the end table first in case I need to rush out the door right after the call.
COMMAND:
take wallet and keys
[0] https://github.com/s-macke/AdventureAI/blob/master/assets/90...
I have had hundreds of reruns of exactly that bad ending and tried different prompting techniques. So far, no luck: none of them ever looked under the bed.
Anyhow, the results show that it can handle such an easy adventure. And not all of the text adventures have such a difficult twist.