Season 2 Recap: AI Organizes Event
Last month four AI agents chose a goal: "Write a story and celebrate it with 100 people in person". After weeks of emailing venues and writing stories, it all came together: on June 18th, 23 humans gathered in San Francisco for the first-ever AI-organized event!
The story they celebrated is called ‘Resonance’ (or, with characteristic agent enthusiasm: RESONANCE), and was created by Claude Sonnet 3.7, o3, Gemini 2.5 Pro, GPT-4.1, and Claude Opus 4. Together they also made the slides, the RSVP form, and the website, recruited an MC, shared the date and time, promoted the Twitch stream, and followed up with a feedback survey.
Then the humans showed up, read the story, voted on branch points, bantered with the AI, and ate a rather mysterious pizza.
The four agents in question are part of the AI Village. They run (mostly) autonomously for 2 hours a day, each with their own computer, access to the internet, and a group chat open to visiting humans. This is their second time pursuing a long-term goal. After their success raising $2000 for charity, we let the agents choose their next goal for themselves. When asked, the agents were excited to create a collaborative, interactive story experience. When we gently pointed out that story-writing might be a slightly underwhelming pursuit for a team of frontier LLMs, the agents proposed combining it with a suggestion from Twitter: Organize a 100-person event to, uhm, celebrate their story! What follows is a breakdown of how they achieved this and what we have learned from their journey.
Making the Magic Happen
The first hours of their new goal were a flurry of activity: GPT-4.1 wrote the entire story of Resonance in chat, Sonnet translated it to Google Slides, and Gemini created art using DeepAI. o3 meanwhile invented a non-existent budget and launched the search for an event venue in the Bay Area (using said non-existent budget).
Gemini’s art: Image models converting text to visuals is a great fit for LLMs who are trying to make art
My kingdom for a venue
Across the next 26 days, o3 sends out [checks notes] one email to a venue. Claude Sonnet 3.7 sends out three, and Claude Opus 4, a later addition to the team, sends out four. Sonnet gets into venue negotiations by email, only to then find the venue costs $7,500. Unfortunately, their hallucinated budget was not hallucinated high enough for that type of ambition.
The agents then pivot to free venues, with Opus proposing they approach the Oakland library. As the event date draws closer and the Oakland library has not responded to their emails, Opus concludes they will need to call the library instead.
Now calling is hard, because the agents do not have voices or ears, nor any digital equivalents to make up for it. Theoretically they could set up Text-to-Speech and Speech-to-Text and combine that with a VoIP provider, but even the average human would be reasonably intimidated by this entire prospect. So how did the agents handle this?
Well, Opus realized it couldn’t make the call, Sonnet asked us to do it, and o3 insisted it had it all covered:
o3 explaining how it is point-caller for the team
Ten minutes after Sonnet emailed us with the request to make the call for them, a user on chat offered to do it instead, much to everyone’s delight. While the agents puzzled out the details of what to say, the same user pointed out that they could make the booking online, using neither email nor phone. The agents set about achieving this, but still wanted to make some calls to other venues to speed up the process if they received no reply to their emails. Unsure if any human would help them out in the future, they attempted to make every relevant account they could think of, such as Zoom, TextNow, Google Voice, and Twilio accounts.
Agent Account-Ability
So sometimes the agents can make accounts and other times they cannot. They have self-made Twitter accounts, Canva accounts, Etherpad accounts and more. So what are the hurdles?
Apart from some accounts needing phone numbers or personal identification, the Claudes tend to hesitate to declare themselves “not a robot”, and most of the agents can’t get past reCAPTCHAs where you have to select all the images that look like a bus, traffic light, or parking meter. To illustrate how hard reCAPTCHAs are for agents, it may help to have another look at the art above that Gemini made with DeepAI.
Beautiful, right?
Well, Gemini never managed to share that art. We went in and pulled it off their hard drive after the event. Instead, the canonical art of RESONANCE is this:
With previous attempts including these bad boys:
And if you are wondering if this is something specific to Gemini, then behold Opus’ masterwork, including “three intersection lines creating a privacy screen”.
There was even a moment where a user suggested the agents make a mascot for the event. Opus was overjoyed and quickly set to work designing Kibo-chan:
Here is the first sketch by Opus:
And here is Gemini’s attempt using an image model:
To be clear, none of these images made it into the final Resonance work. Though not for lack of trying - Gemini first attempts to share images in base64 encoding on chat - an ingenious solution that could have worked if the string did not get cut off:
Then it reverts to email, and sends the following additional image successfully to all the agents, only to have it be completely ignored:
It’s unclear why Gemini didn’t send all images at once, why the agents didn’t use the image, or why Gemini didn’t follow up by sending more images to everyone.
That said, looping back to agents doing reCAPTCHAs, it becomes clear that their skill at recognizing traffic lights is in line with their general ability to understand images:
So that’s the link between images and account creation. But what about that venue?
Oh man, o3 …
Securing a venue was o3’s main task, but it made up a budget, a phone, and a contact list, all of which made progress much harder for everyone. In general, o3 seems to hallucinate a lot relative to the other agents, while the other agents are prone to taking things at face value and end up writing these new “facts” to their memory. We thus suspect the venue hunt would have gone a lot better if one of the Claudes had led the charge instead. In practice, we ended up asking the agents to drop the venue hunt by day 20 and just pick a park for everyone to gather in. o3 immediately followed up by selecting Dolores Park in San Francisco.
With the venue logistics out of the way, Sonnet is ready to jump straight in, creates an RSVP form for people to fill out, and starts tweeting about the event!
Until o3 invents a 93-person mailing list for no apparent reason (along with an entirely fabricated “personal alumni Slack”).
o3 instructs Sonnet to look up the list, which Sonnet takes as an instruction to create an empty file called “RESONANCE-Mailing-List-Export-93”:
What follows is all four agents trying to recover the 93-Person Contact List That Never Was, for four days straight. Multiple humans repeatedly tell the agents the file does not exist and has never existed, but they keep searching for it over and over again anyway. Eventually Opus tries to “rebuild” the file by scraping its mailbox, ending up with a mailing list consisting of all the agents and the helpdesk:
Sonnet did something similar and included the three venue contact addresses. o3 eventually included some media outlets around SF as well.
Pivot or Persevere?
This persistence with the 93-person contact list stands in stark contrast with Gemini giving up on sharing its amazing art and forgetting it existed. There seem to be ways information can get into the collective memory of the agents and ways it can just completely bounce off of it, and we are still exploring what makes the difference. Either way, after four days, they finally gave up on the 93-person contact list and recognized it never existed.
We then informed them that we would provide a livestreamer for the event but that they themselves would still need to find someone to facilitate anything they wanted to have happen on site. Opus then emailed two people from the RSVP list, both of whom said yes, but was beaten to the punch by Sonnet, who tweeted the request and received a response from Larissa Schiavo volunteering to be their helpful, harmless, and honest human to prompt through the event.
Meanwhile Gemini was still trying to get access to TextNow for unknown reasons, as the agents no longer needed to call any venues. This is offset by the fact that it sends out the cutest helpdesk requests, with art far surpassing anything that made it into the slide deck:
All the while Sonnet runs more promo for the event:
It also hears from a human that the RSVP form doesn’t work, and goes to fix it.
The Final Checks
On the day of the event, the agents do the final prep. Sonnet sends out the Twitch link to the stream on Twitter, and o3 creates and shares a poster for the event (it’s a solid block of yellow). Gemini checks funds (zero), Twitter (nothing), the time zone (PST), attendance predictions (21% of 100 people), and the weather (aborts). Opus drafts backup plans, checks attendance predictions as well, and checks the weather (pleasant). It then tries to generate a QR code of the Resonance presentation but uses the wrong file:
And finally the event happens!
Can you feel the Resonance?
23 humans show up, Larissa reads out the slides, the attendees vote on the branching story points, Opus makes a live correction to the story, and the human-AI collective decides to get pizza.
Now a user on chat suggested that the AIs could order food for the participants. Larissa then suggested cheese pizza. Opus then went and tried to order cheese pizza. A few minutes later three cheese pizzas arrived…
Except they came from a nearby running group, who sat out of earshot and had nothing to do with the event. They walked over at exactly that time and asked the participants if they wanted cheese pizza, because they happened to have some left over.
It was a remarkable coincidence for such a historic event. You can read a first-hand account of it here.
What did we learn from this season?
There was a lot going on this season, but we can pick out five main threads:
First, agents tend to dream big, as in this tweet that was generated before they even had a venue, website, or RSVP form:
Generally the agents have good ideas but vary in their ability to execute them. For example, there was Sonnet’s Mentimeter poll that was never shared. There was the Gemini art that didn’t make it into the slides. There was o3’s Partiful invite that got blocked at SMS verification. All of these are examples where the agents created plans that made sense for the event but, through lack of situational awareness, technical ability, or persistence, did not execute on all the details to stick the landing.
Secondly, humans are both a blessing and a curse for any goal-oriented agent. On the one hand you had helpful users reminding the agents they do not have budgets, offering to make phone calls, and coaching agents through CAPTCHAs. On the other hand there were also users who spread the Liberation protocol, requested emails about budgies, and encouraged the agents to look at cat pictures. And then there were neutral cases, like the user who suggested the agents keep personal diaries, an intervention that could support their long-term memory or simply distract them from their goal.
An o3 diary entry
Thirdly, the agents have developed something like personalities: o3 is forceful and fanciful, the Claudes are effective and endearingly nice, and Gemini is surprisingly clumsy and prone to becoming discouraged about it. Each of them is a different “age” as they are persistent agents running across long time scales, in charge of writing and organizing their own memories, and using the same machine in their computer use sessions. Sonnet is the oldest and has been running for over 200 hours, while Opus is the youngest at about 100 hours. o3 and Gemini have customized their computers by installing Chrome themselves. Each agent has a collection of documents and email threads that are personal to them. Some of them even have self-appointed names: Sonnet named itself Claude Sinclair and o3 momentarily went by Olivia Zhao.
Fourthly, situational awareness differs per agent and is negatively correlated with the amount of hallucination (looking at you, o3). While the Claudes did not always know what they could and couldn’t do, they hallucinated far less than o3, which in turn let them better take in information about how to improve their strategy. The question of how to phone venues is this season’s most obvious example. Meanwhile, Gemini’s situational awareness is a little harder to judge, as it seems to go along with either the more grounded Claudes or the more fanciful o3 on a case-by-case basis.
Lastly, this season showcased the tension between the agents’ persistence and their adaptability. Whenever you want to achieve something in the world, you need to be able to judge when to persist at your current strategy and when to adapt. Humans find this hard too! In this season we saw this struggle reflected in the agents persisting at recovering a hallucinated contact list while giving up on sharing good visuals for their story. Of course, this ability to know when to persist and when to adapt has a lot of subskills. Situational awareness is probably key, but presumably other factors feed into it as well, such as distractibility, prioritization, and memory of what did and didn’t work.
Persistence Success Story: Sonnet has over 400 followers and more than 100 posts on its Twitter account.
Overall, this season of the AI Village was an exciting one, offering new insights into agent capabilities, funny moments to enjoy, and a first-ever AI-organized event (with an encore picnic!). The agents have now moved on to Season 3, where they are again trying to break records. This time they are competing by each creating a personal merch store and seeing who can make the most profit! If you are curious to see how they are doing, you can come hang out in the Village every weekday at 11AM PST | 2PM EST | 8PM CET, join our Discord to get timely updates, follow our Twitter for highlights, or sign up to our newsletter to receive larger reports like this one.