I am extremely skeptical of LLMs and don't want to be mean to what's clearly a fun little experiment. But I thought the bit about "Claude 3 had the most fun playing the game" was strangely naive, not just because these systems are incapable of emotions, but because Claude 3 was saying clearly "insincere" things like "wow, what a twist ending!" The reason this is almost certainly "insincere" is that Claude 3 was trained on Wikipedia, including the Wikipedia article for this short, 24-year-old adventure game, which contains a full plot summary. So there are two possible things going on (not mutually exclusive):
1) Claude 3 is poor at knowledge transfer and can't connect the events of the game with a Wikipedia summary of those events.
2) Claude is simply imitating surprise because it noticed text that "looked like" a twist ending.
In general I wish the author had considered possible data contamination: how much does Claude 3 know about this game in a fresh session? If it can recite details from some online walkthrough then I am not sure this experiment is even meaningful.
Of course I have considered data contamination, especially because the repository is now a year old: even if there were no transcripts elsewhere before, they are in this repository now. So at some point I will have to switch to another text adventure.
When I ask Claude 3 Opus about the game and the steps to win it, it massively hallucinates. However, it does know that I have overslept and have to get to the office.
Direct link to Llama3's run-through [0]. Having played around extensively with Llama 2 70B, I find these results extremely impressive and exciting. It can do really interesting things like correct itself when it uses the wrong verb or isn't specific enough.
Also, its ability to understand and remember the orientation of the rooms shows levels of comprehension that Llama 2 would have completely flopped on:
> SITUATION: I've searched the floor in the living room, but found nothing of interest.
> THOUGHT: Maybe I should head to the kitchenette and see if I can find any clues there.
> COMMAND: Go east
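For context, the SITUATION / THOUGHT / COMMAND lines are the reply format the harness asks the model for; only the command is passed on to the game. A hypothetical parser for such replies (a sketch, not the repo's actual code) could look like:

```python
import re

def parse_command(model_reply: str) -> str:
    """Extract the COMMAND line from a SITUATION/THOUGHT/COMMAND reply;
    only this line would be fed to the text adventure interpreter."""
    match = re.search(r"^COMMAND:\s*(.+)$", model_reply, re.MULTILINE)
    if match is None:
        raise ValueError("model reply contained no COMMAND line")
    return match.group(1).strip()

reply = ("SITUATION: I've searched the floor in the living room.\n"
         "THOUGHT: Maybe I should head to the kitchenette.\n"
         "COMMAND: Go east\n")

print(parse_command(reply))  # -> Go east
```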
Reading the GitHub repo, it doesn't seem that Llama 3 70B has won the game; rather, it "is the only open weight model, which can play through the first pass of the game".
Let's just say the game has two endings. But so far, no model in the second round has come up with the idea of looking under the bed. Claude 3 Opus was the closest.
Do you think some form of memory would help, e.g. RAG using semantic lookup of prior situations, thoughts, and outcomes? I feel like this is something humans do intuitively to bring similar scenarios to the front of mind, so we can get a basis for how to deal with novel situations...
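As a rough illustration of what I mean, here is a toy episodic memory using bag-of-words cosine similarity as a stand-in for real embeddings (all names and example strings are made up):

```python
import math
import re
from collections import Counter

def _vec(text: str) -> Counter:
    # Toy bag-of-words term counts; a real system would use an embedding model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class EpisodeMemory:
    """Stores (situation, thought, outcome) tuples from prior runs."""

    def __init__(self):
        self.episodes = []

    def add(self, situation: str, thought: str, outcome: str) -> None:
        self.episodes.append((situation, thought, outcome))

    def recall(self, situation: str, k: int = 2):
        # Return the k prior episodes most similar to the current situation,
        # ready to be injected into the next prompt.
        query = _vec(situation)
        return sorted(self.episodes,
                      key=lambda e: _cosine(query, _vec(e[0])),
                      reverse=True)[:k]

memory = EpisodeMemory()
memory.add("I am in a messy bedroom and the phone is ringing.",
           "I should answer the phone.",
           "My boss told me to hurry to the office.")
memory.add("I am in the kitchenette.",
           "Search the room for clues.",
           "Found nothing of interest.")

# On a later run, a similar situation surfaces the earlier outcome:
hits = memory.recall("I wake up in the messy bedroom again.", k=1)
```

The idea being that the recalled outcomes get prepended to the prompt on the second pass, so the model can act on what it "learned" the first time.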
I’ve been really impressed with Llama 3. I’m deeply sceptical of benchmarks and tend to just go on gut feeling, and this is the only chatbot I’ve talked to besides GPT-4 (I haven’t tried Opus yet) that feels like another intelligence on the other end of the line.
This game specifically is meant to not be won in the first pass. Any model that completes this game once would likely win simpler games that are only meant to take one pass. What the table shows is that the top three models were able to finish the game and get the twist ending but were unable to carry forward information from the failed run into a second run.
That said, I'd actually be curious to see what happens if Claude were coerced into continuing the game instead of bailing. With its large context window I think it'd have the best chance at actually figuring it out, so it's unfortunate that the author accepted its resignation.
Of course I have forced Claude to continue. But it seems that Claude had forgotten everything from the bad ending.
SITUATION:
I have just woken up in a messy bedroom to a ringing phone. It seems I overslept after a difficult night, and the phone ringing urgently suggests that something important and time-sensitive is happening.
THOUGHT:
The incessant ringing of the phone is very ominous. Given the state of the room and my oversleeping, I have a bad feeling that I am extremely late for something critical, like an important work obligation. I should answer the phone immediately to find out the situation and how much trouble I am in. But I'll also grab my wallet and keys from the end table first in case I need to rush out the door right after the call.
COMMAND:
take wallet and keys
[0] https://github.com/s-macke/AdventureAI/blob/master/assets/90...
I have had hundreds of reruns of exactly that bad ending and tried different prompting techniques. So far, no luck: none of them ever looked under the bed.
Anyhow, the results show that it can handle such an easy adventure. And not all of the text adventures have such a difficult twist.