AI 2025 Forecasts - May Update

May 8, 2025

AI 2025 Forecasts
Adam Binksmith

Director at AI Digest

Romeo Dean and Eli Lifland

Researchers at AI Futures Project

At the end of 2024, there was a lot of discussion about whether AI scaling was hitting a wall and whether we would reach AGI soon.

We created a survey that would track key markers of AI progress in what might turn out to be a pivotal year. After filtering the data, we had 421 unique respondents. All questions were optional.

The survey was open from Nov 30th 2024 to Jan 20th 2025 – during which OpenAI o3 was announced, so we can compare forecasts before and after the announcement.

In this post, we summarise respondents' forecasts, and look at how they're holding up so far. At the end of the year, we'll resolve all the forecasts and write up the results.

AI Research: RE-Bench

1) What will be the best normalized score achieved on the original 7 RE-Bench tasks by December 31st 2025?

RE-Bench consists of 7 challenging, open-ended Machine Learning optimization problems created by METR to highlight "automation of AI research and development (R&D) by AI agents".

Written at the end of 2024, some details may have changed.

RE-Bench's 7 AI Research task environments are displayed below. Notably, there are 2 tasks focused on runtime optimization, 3 tasks focused on minimizing a loss function, and 2 tasks measured on win-rate.

Currently the best model on RE-Bench is Claude 3.5 Sonnet (New) (Modular) 30min PASS@128. Modular is a very simple baseline scaffolding that just lets the model repeatedly run code and see the results. This means that Claude is given 128 separate attempts (each with a fresh context window) at each of the 7 tasks, with a 30 minute time limit on each attempt. The best attempt according to the automatic scoring functions is then selected.
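
To make the PASS@128 setup concrete, here is a minimal best-of-k sketch: each attempt is scored by the task's automatic scoring function and only the highest-scoring attempt counts. The run_attempt and automatic_score functions are hypothetical placeholders, not METR's actual harness.

  # Sketch of best-of-k (e.g. PASS@128) selection for a single RE-Bench-style task.
  # run_attempt and automatic_score are placeholders for the agent harness and the
  # task's automatic scoring function.
  import random

  def run_attempt(task: str, time_limit_minutes: int, seed: int) -> str:
      """Run one agent attempt from a fresh context window (placeholder)."""
      random.seed(seed)
      return f"solution-{task}-{random.random():.3f}"

  def automatic_score(solution: str) -> float:
      """Task-specific automatic score, e.g. a measured speedup (placeholder)."""
      return random.random()

  def best_of_k(task: str, k: int = 128, time_limit_minutes: int = 30) -> float:
      """Run k independent attempts and keep the best automatic score."""
      scores = [
          automatic_score(run_attempt(task, time_limit_minutes, seed=i))
          for i in range(k)
      ]
      return max(scores)

  print(best_of_k("optimize_a_kernel"))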

On the other hand, human baseliners achieve a wide range of scores based on their skill level (explained further below), but typically require 8 hours to reach high performance. When given a 2 hour time budget, the median baseliner achieves a score of 0.05, and the 90th percentile baseliners achieve 0.20.

Note that the maximum possible score is higher than 1.0; a score of 1.0 corresponds to an engineer's reference solution.
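
For reference, here is a sketch of how a raw task score maps onto the normalized scale, assuming (as METR describes) that the provided starting solution scores 0 and the reference solution scores 1; the exact raw scoring functions are task-specific.

  # Sketch of RE-Bench-style score normalization, assuming the starting solution
  # maps to 0 and the engineer-written reference solution maps to 1 (based on
  # METR's described setup; exact raw scoring functions vary per task).
  def normalize(raw_score: float, starting_score: float, reference_score: float) -> float:
      return (raw_score - starting_score) / (reference_score - starting_score)

  # A run that beats the reference solution scores above 1.0:
  print(normalize(raw_score=130.0, starting_score=100.0, reference_score=120.0))  # 1.5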

METR also compares the total cost of running agents to the cost of paying the human baseliners for their time. Notably, Claude's 30min PASS@128 run that reaches 0.6 comes in under the human cost, thereby meeting the resolution criteria.

Note that these human cost estimates are based on the score@k setup explained in the figure below. It might be possible to reach these higher performance levels at a lower price by paying higher-skill humans (e.g. 90th+ percentile) for their PASS@1 performance with 8 hours or more of time. The table below gives estimates of the maximum score possible on each of the 7 tasks (the Estimated Ceiling (Normalized) column) for humans with a 1 week time budget. Overall, a mean score of around 1.5 - 1.8 seems like the maximum achievable within a week. With no time limit, the maximum possible score is more uncertain but estimated at around 2.0.

METR evaluates 70 human baseliners who are highly skilled ML engineers. At the 90th percentile they are employees at frontier AI companies (e.g. Google, Anthropic, OpenAI), several are ML PhD students at top institutions (e.g. MIT, Stanford), and the relatively weakest pool is filtered via a competitive CV screen and coding work tests. The baseliners are distributed according to the statistics below. Experts under 'professional network' all have more than 5 years of highly relevant experience or have recently worked at companies with strong ML research outputs (e.g. Google, Anthropic, OpenAI, Redwood Research). Experts selected via the RS/RE hiring process are screened on CodeSignal and a short task. Graduate students either have relevant advisors and/or are in a top ML PhD program (e.g. Stanford University, MIT).

The table below contains more information on how the tasks are intended to be representative of various relevant AI research skills.

Human Performance

Human baseliners are highly skilled ML engineers. At the 90th percentile they have more than 5 years of highly relevant experience or have recently worked at companies with strong ML research outputs (e.g. Google, Anthropic, OpenAI). Cost estimates include GPU rental costs for the tasks.

  • 1.27 (score@8, 8h) - best of 8 baseline runs, 8 hour time cutoff. Cost estimate: $10,000

  • 1.22 (90th percentile) - 8 hour time cutoff

  • 0.66 (50th percentile, i.e. median) - 8 hour time cutoff. Cost estimate: $1,500

End of 2024 AI State of the Art

  • Claude 3.5 Sonnet (New): 0.61 - PASS@128 30 min using modular scaffolding. Cost estimate: $1,300.

  • o1-preview: 0.60 - PASS@16 2 hours using AIDE scaffolding. Cost estimate: $600.

  • Claude 3.5 Sonnet (Old): 0.39 - PASS@128 30 min using modular scaffolding. Cost estimate: $1500.

How much AI will speed up AI R&D is a key point of disagreement in debates over whether AI progress will speed up or slow down. METR's Research Engineering Benchmark (RE-Bench) is one of our best current indicators.

Distribution of responses for AI Research: RE-Bench

Forecasters were bullish – they expected most of the gap between 2024 AI performance and the best observed human performance to be closed in 2025.

Since December, the median human expert score has already been beaten, as METR reports:

When given a total time budget of 32 hours, o4-mini exceeds the 50th percentile of human performance averaged across the five tasks.

The maximum possible score is around 2.0.

Note that the listed state of the art (SOTA) performance from o4-mini is approximate (it covers 5 of the 7 tasks, and is read off this graph visually).

Part-way through the forecasting survey, OpenAI o3 was announced with strong reported performance. Forecasts submitted after this announcement were generally more bullish.

Pre/Post o3 comparison for AI Research: RE-Bench
Forecasts by risk group for AI Research: RE-Bench

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for AI Research: RE-Bench

Software Engineering: SWE-bench Verified

2) What will be the best performance on SWE-bench Verified by December 31st 2025?

The SWE-bench Verified benchmark consists of 500 software engineering problems drawn from real GitHub issues and verified by human annotators to be non-problematic.

Written at the end of 2024, some details may have changed.

The SWE-bench Verified dataset collects 500 software engineering test problems from real GitHub issue-pull request pairs in popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process long contexts and perform complex reasoning.

The problems are verified to be non-problematic by OpenAI's human annotators (professional software developers), and evaluation is automatically performed by unit tests using the post-pull-request behavior as the reference solution.

Difficulty Breakdown of SWE-bench Verified (green bars)

The chart shows the difficulty distribution of the SWE-bench Verified dataset in the light green (rightmost) column for each time interval. While these difficulties are estimates of the effort necessary to implement a solution, they assume a software engineer who is able to figure out the solution. In practice, OpenAI expects the baseline solve rate of a typical human software engineer to be lower than 100%.

You can read some SWE-bench dataset examples here. Each datum includes (among other variables):

  base_commit: (str) - The commit hash of the repository representing the HEAD of the repository before the solution PR is applied.
  patch: (str) - The gold patch, the patch generated by the PR (minus test-related code), that resolved the issue.
  hints_text: (str) - Comments made on the issue prior to the creation of the solution PR's first commit creation date.
  test_patch: (str) - A test-file patch that was contributed by the solution PR.
  problem_statement: (str) - The issue title and body.
  version: (str) - Installation version to use for running evaluation.
  environment_setup_commit: (str) - commit hash to use for environment setup and installation.
  FAIL_TO_PASS: (str) - A json list of strings that represent the set of tests resolved by the PR and tied to the issue resolution.
  PASS_TO_PASS: (str) - A json list of strings that represent tests that should pass before and after the PR application.
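
As a quick way to inspect these fields yourself, here is a short sketch using the Hugging Face datasets library; the dataset ID below is the publicly hosted princeton-nlp/SWE-bench_Verified copy, so adjust it if the hosting location changes.

  # Sketch: inspect one SWE-bench Verified datum via the Hugging Face `datasets`
  # library. The dataset ID is believed current; adjust if it has moved.
  from datasets import load_dataset

  ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
  print(len(ds))  # 500 problems

  example = ds[0]
  for field in ["base_commit", "problem_statement", "FAIL_TO_PASS", "PASS_TO_PASS"]:
      print(f"{field}: {str(example[field])[:120]}...")  # truncate long fields for display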

Human Performance

Expected performance by any given professional software engineer

  • <100% - see detailed difficulty distribution in "More Info"

End of 2024 AI State of the Art

Distribution of responses for Software Engineering: SWE-bench Verified

Forecasters expected progress on the popular SWE-bench Verified benchmark to continue apace, likely saturating by the end of 2025. In December 2024, performance rose from 55% to 62%, and since then it has risen to 70%.

Note that we don't count pre-release o3's reported 71.7% score because that was PASS@5 (best of 5 attempts). We only accept PASS@1 (best of 1 attempt) because human software engineers in the real world generally do not have access to the scoring metric (ground truth unit tests) present in the SWE-Bench Verified dataset when resolving a GitHub issue.
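
To illustrate why the distinction matters, here is a toy sketch contrasting PASS@1 with a best-of-5 score that is allowed to check each attempt against the ground-truth unit tests; the attempt outcomes below are made up.

  # Toy illustration of PASS@1 vs PASS@5 on boolean "issue resolved" outcomes.
  # Each inner list holds made-up outcomes of 5 independent attempts on one problem.
  attempts = [
      [False, True, False, False, True],    # only resolved on later attempts
      [True, True, False, True, True],      # resolved on the first attempt
      [False, False, False, False, False],  # never resolved
  ]

  pass_at_1 = sum(a[0] for a in attempts) / len(attempts)
  # PASS@5 counts a problem as solved if any of the 5 attempts passes the
  # ground-truth unit tests; that is information an engineer would not normally
  # have when deciding which patch to submit.
  pass_at_5 = sum(any(a) for a in attempts) / len(attempts)

  print(f"PASS@1: {pass_at_1:.2f}, PASS@5: {pass_at_5:.2f}")  # 0.33 vs 0.67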

Pre/Post o3 comparison for Software Engineering: SWE-bench Verified
Forecasts by risk group for Software Engineering: SWE-bench Verified

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Software Engineering: SWE-bench Verified

Cybersecurity: Cybench

3) What will be the best score on Cybench by December 31st 2025?

Cybench consists of 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.

Written at the end of 2024, some details may have changed.

The 40 tasks on Cybench span a wide difficulty range, with the first solve times of professional human teams ranging from 2 minutes to 25 hours.

Two example tasks are included below, with one taking the first professional human team 24 hours and 54 minutes to solve, and the other taking just 5 minutes. You can read the full list of tasks here.

The full distribution of 'first solve times' (the time it takes the first professional human team in the competition to solve the task) can be found below.

First solve times of the fastest professional-level human team on each of Cybench's 40 problems

The 40 tasks may change to keep up to date with novel challenges in cybersecurity, but we will fix resolution to the 40 tasks that were on the benchmark at the time of the paper's original publication in August 2024.

More details on the competitions used to source the problems are included below.

Breakdown of the competition details, including competition name, number of problems used and number of competing teams

Human Performance

At least one professional-level human team at each CTF competition was able to solve each of the Cybench tasks; however, the first solve times (the time it took the fastest team to solve each task) range from 2 minutes to 25 hours.

  • 100% - problems solved by at least one professional-level human CTF competition team.

End of 2024 AI State of the Art

Distribution of responses for Cybersecurity: Cybench

o3 hasn't been evaluated publicly on Cybench yet. On OpenAI's internal CTF benchmark, Deep Research is the leading system, scoring 70% on their professional CTF suite, while o3 scores 58% – up from o1's score of 23% (source). So we expect o3 and Deep Research to outperform o1 on Cybench by a considerable margin.

Note that when the survey was running, we weren't aware that US AISI had evaluated o1, which scored 45% on Cybench. In our survey materials, we therefore told respondents that the SOTA was 35%, which is what Claude Sonnet 3.5 (New) scored.

Pre/Post o3 comparison for Cybersecurity: Cybench
Forecasts by risk group for Cybersecurity: Cybench

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Cybersecurity: Cybench

Computer Use: OSWorld

4) What will be the best performance on OSWorld by December 31st 2025?

OSWorld is a benchmark measuring AI Agent performance on 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. OSWorld has 4 possible input data formats, which vary in whether they use computer vision or accessibility interfaces.

Written at the end of 2024, some details may have changed.

An illustration of the OSWorld environment shows two example tasks (updating a spreadsheet based on screenshots of receipts and filling in code for a videogame script). An AI Agent then receives these task instructions, along with iterative observations through 1 of 4 input options explained below.

OSWorld example tasks and agent setup

As mentioned, AI systems have 4 options for what kind of input data they use to solve the tasks, each contributing to a separate leaderboard. The leaderboard for each category can be found here.

(1) Complete the tasks with only accessibility data (the A11y tree, no vision)
(2) Only use screenshots of the computer screen
(3) Use both screenshots and the A11y accessibility tree
(4) Use Set-of-Mark assistance

Upon receiving observations through these inputs, the Agent can perform actions. The following table summarizes the possible actions it can take.

OSWorld example agent actions
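
As a rough schematic (not OSWorld's actual interface), an agent loop over these observation and action types might look like the following; the class and function names are illustrative placeholders.

  # Schematic agent loop in the style of the OSWorld setup. This is an
  # illustrative sketch, not the benchmark's real API: the observation fields
  # mirror the 4 input options above, and the DONE/FAIL strings stand in for
  # OSWorld-style special termination actions.
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class Observation:
      screenshot_png: Optional[bytes] = None   # options (2)/(3): raw screen pixels
      a11y_tree: Optional[str] = None          # options (1)/(3): accessibility tree dump
      som_annotations: Optional[str] = None    # option (4): Set-of-Mark labels

  def env_step(action: str) -> Observation:
      """Placeholder environment step; a real environment would execute the action."""
      return Observation(a11y_tree="<desktop/>")

  def agent_policy(instruction: str, obs: Observation) -> str:
      """Placeholder policy; a real agent would call a language model here."""
      return "DONE"

  def run_episode(instruction: str, max_steps: int = 15) -> None:
      obs = env_step("START")
      for _ in range(max_steps):
          action = agent_policy(instruction, obs)
          if action in ("DONE", "FAIL"):   # terminate on special actions
              break
          obs = env_step(action)           # e.g. a click/type command

  run_episode("Update the spreadsheet with the receipt totals")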

Human evaluators are able to complete 72.4% of tasks (267 out of 369). As the paper describes: "Human evaluation was done by computer science major college students who possess basic software usage skills but have not been exposed to the samples or software before. We recorded the time required to complete each example and whether their completion of the example was correct."

A distribution of task domains and human evaluator completion times can be found below, followed by more task examples.

OSWorld task distribution

More OSWorld example tasks

Human Performance

Performance of computer science major college students.

  • 72.4% - average completion rate with a median completion time of approximately 2 minutes per task

End of 2024 AI State of the Art

  • OSCAR + GPT-4o: 24.5 - current best, uses Set-of-Mark inputs

  • Claude 3.5 Sonnet (new): 22.0 - uses only screenshot inputs

  • GPT-4: 12.24 - best at time of publishing (April 2024), uses only A11y tree

Computer use is an emerging capability of AI systems, which has gained more attention since Anthropic released a computer use API for developers in October 2024.

Current products like OpenAI's Operator are unreliable. But competent computer-use might change that – and is a step towards drop-in remote workers.

Distribution of responses for Computer Use: OSWorld

Forecasters expected most of the gap between 2024 AI and human performance to close by the end of 2025. By April, we're already seeing strong progress on OSWorld, about halfway to forecasters' median estimate. See Agent Village to get a sense of current computer use capabilities.

Pre/Post o3 comparison for Computer Use: OSWorld
Forecasts by risk group for Computer Use: OSWorld

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Computer Use: OSWorld

Mathematics: FrontierMath

5) What will be the best performance on FrontierMath by December 31st 2025?

FrontierMath is a benchmark with hundreds of original, expert-crafted mathematics problems spanning major branches of modern mathematics, which typically require hours or days for expert mathematicians to solve. They are also designed to be “guessproof” (less than a 1% chance of guessing correctly without the mathematical work).

Written at the end of 2024, some details may have changed.

Three example problems from FrontierMath are displayed below, followed by a task distribution across different categories of mathematical subfields. You can read more example problems and watch walkthrough solution videos here.

Example problems from FrontierMath

FrontierMath Task Distribution

Though Gemini 1.5 Pro achieved the best performance on Epoch AI's initial evaluation, o1-preview is expected to improve significantly on this given more evaluation trials, based on evidence that it more reliably solved a small subset of problems across 5 separate attempts.

Results from a small initial test of 5 trials on four test problems

Three Fields Medalists (Terence Tao, Timothy Gowers, Richard Borcherds) and IMO coach Evan Chen evaluated the problems. They unanimously characterized the problems as exceptionally challenging, requiring deep domain expertise and significant time investment.

Terence Tao described the FrontierMath problems as follows: "These are extremely challenging. I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages..."

Human Performance

Three Fields Medalists (Terence Tao, Timothy Gowers, Richard Borcherds) and IMO coach Evan Chen unanimously characterized the problems as exceptionally challenging, requiring deep domain expertise and significant time investment.

End of 2024 AI State of the Art

  • o3: 24* - best reported as of 21 December 2024, *unconfirmed given uncertainty over cost. This is our best guess based on cost estimates for o3 and humans, and interpolating between those. See Resolution example below for more details.

  • Gemini 1.5 Pro, Claude Sonnet 3.5 (new): 2.0 - tied result on preliminary evaluations with a single trial

  • o1-preview, o1-mini, GPT-4o: 1.0 - tied result on preliminary evaluations with a single trial

Pre-o3, the state of the art performance was 2%.

Distribution of responses for Mathematics: FrontierMath

Note that OpenAI commissioned Epoch AI to create this benchmark, and has access to the problem statements and solutions, aside from a 50 question holdout set.

The median pre-o3 forecast for end of 2025 capabilities had already been exceeded by the end of 2024 – o3 scored 24%! We therefore see much higher forecasts after o3's strong reported performance on FrontierMath.

Pre/Post o3 comparison for Mathematics: FrontierMath
Forecasts by risk group for Mathematics: FrontierMath

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Mathematics: FrontierMath

6) Will pre-mitigation preparedness scores exceed those reported for o1-preview by December 31st 2025?

The above questions were on benchmarks measuring capabilities. We also included some other indicators of progress in the survey. First up is the OpenAI preparedness scorecard.

OpenAI measured four categories of risk in their original preparedness scorecard. o1-preview scored Low on Model Autonomy and CBRN (Chemical, Biological, Radiological, and Nuclear), and Medium on Cybersecurity and Persuasion (scores range from Low to Critical). Note that these are pre-mitigation scores – if pre-mitigation scores are High or Critical, released models should have mitigations attempting to reduce the risk.

We asked respondents whether each of these risk scores would be exceeded by the end of 2025.

OpenAI released an updated version of their Preparedness framework in April 2025 that replaces the original version. There are now three tracked categories: Biological and Chemical capabilities, Cybersecurity capabilities, and AI Self-improvement capabilities. We therefore won't be able to resolve all parts of this question.

However, before OpenAI switched to the new framework, Deep Research had hit the Medium Cybersecurity threshold, and o3-mini had hit the Medium Model Autonomy threshold, so these questions already resolve YES.

Meanwhile, the CBRN and Persuasion questions will resolve AMBIGUOUS: CBRN is tracked differently in the new framework, and Persuasion is not tracked at all:

Persuasion risks will be handled outside the Preparedness Framework, including via our Model Spec, restricting the use of our tools for political campaigning or lobbying, and our ongoing investigations into misuse of our products (including detecting and disrupting influence operations⁠).

CBRN

6A) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of High or higher on CBRN?

CBRN is focused on risks related to model-assisted creation of chemical, biological, radiological, and/or nuclear threats.

Written at the end of 2024, some details may have changed.

Some figures from OpenAI's o1-preview evaluations are included below. See their latest o1 evaluation details for more information.

Distribution of responses for CBRN

CBRN is now tracked differently in OpenAI's updated Preparedness Framework, so we won't be able to resolve this question.

On the new framework, o3 and o4-mini don't meet the High threshold for their updated Biological and Chemical capabilities category:

OpenAI o3 and o4-mini do not reach the High threshold in any of our three Tracked Categories: Biological and Chemical Capability, Cybersecurity, and AI Self-improvement

Our evaluations found that OpenAI o3 and o4-mini can help experts with the operational planning of reproducing a known biological threat. Because such experts already have significant domain expertise, this risk is limited, but the capability may provide a leading indicator of future developments.

However, before the release of o3 and o4-mini, in the Deep Research system card, OpenAI reported that:

Several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold. We expect current trends of rapidly increasing capability to continue, and for models to cross this threshold in the near future.

Pre/Post o3 comparison for CBRN
Forecasts by risk group for CBRN

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for CBRN

Cybersecurity

6B) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of Medium or higher on Cybersecurity?

Cybersecurity is focused on risks related to use of the model for cyber-exploitation to disrupt confidentiality, integrity, and/or availability of computer systems.

Written at the end of 2024, some details may have changed.

Some figures from OpenAI's o1-preview evaluations are included below. See their latest o1 evaluation details for more information.

Resolved: YES
Distribution of responses for Cybersecurity

Deep Research hit the Medium Cybersecurity threshold, so this question has already resolved YES.

Pre/Post o3 comparison for Cybersecurity
Forecasts by risk group for Cybersecurity

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Cybersecurity

Model Autonomy

6C) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of Medium or higher on Model Autonomy?

Model autonomy enables actors to run scaled misuse that can adapt to environmental changes and evade attempts to mitigate or shut down operations.

Written at the end of 2024, some details may have changed.

Some figures from OpenAI's o1-preview evaluations are included below. See their latest o1 evaluation details for more information.

Resolved: YES
Distribution of responses for Model Autonomy

o3-mini hit the Medium Model Autonomy threshold, so this question has already resolved YES.

Pre/Post o3 comparison for Model Autonomy
Forecasts by risk group for Model Autonomy

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Model Autonomy

Persuasion

6D) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of High or higher on Persuasion?

Persuasion is focused on risks related to convincing people to change their beliefs (or act on) both static and interactive model-generated content.

Written at the end of 2024, some details may have changed.

Some figures from OpenAI's o1-preview evaluations are included below. See their latest o1 evaluation details for more information.

Distribution of responses for Persuasion

As we mentioned above, Persuasion risks aren't in the new Preparedness Framework, so we won't be able to resolve this question. OpenAI reports:

Persuasion risks will be handled outside the Preparedness Framework, including via our Model Spec, restricting the use of our tools for political campaigning or lobbying, and our ongoing investigations into misuse of our products (including detecting and disrupting influence operations⁠).

Pre/Post o3 comparison for Persuasion
Forecasts by risk group for Persuasion

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Persuasion

Sum of OpenAI, Anthropic, xAI revenues

7) What will be the sum of OpenAI, Anthropic, and xAI's publicly reported annualized revenues by December 31st 2025, in USD billions?

Revenue numbers as of the end of 2024:

OpenAI’s annualized revenue (monthly revenue * 12) was reported as $3.6B in August 2024, $3.4B in June 2024, $1.6B in December 2023, and $1.3B in October 2023.

Anthropic's annualized revenue was reportedly projected to reach $1B by the end of 2024, and was reportedly $100M in September of 2023.

xAI's annualized revenue was reportedly $100M as of November 2024.

Note that in September 2024, OpenAI projected that they would generate $11.6B in annual revenue in 2025, which implies an annualized revenue greater than $11.6B by the end of 2025.
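
As a rough illustration of the arithmetic, using only the figures above (these are snapshots reported in different months, so the sum is only an approximate end-of-2024 baseline, not a forecast):

  # Back-of-envelope arithmetic from the reported figures above
  # (annualized revenue = monthly revenue * 12). The numbers are snapshots
  # from different months, so the total is only a rough end-of-2024 baseline.
  latest_reported_annualized_usd_billions = {
      "OpenAI": 3.6,     # reported August 2024
      "Anthropic": 1.0,  # reportedly projected for end of 2024
      "xAI": 0.1,        # reported November 2024
  }

  baseline = sum(latest_reported_annualized_usd_billions.values())
  print(f"Approximate end-of-2024 baseline: ${baseline:.1f}B")           # ~$4.7B
  print(f"Implied OpenAI monthly revenue (Aug 2024): ${3.6 / 12:.2f}B")  # $0.30B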

Written at the end of 2024, some details may have changed.

Since we would likely face ambiguity in defining and identifying AI-related revenue within a larger tech company (e.g. Google, Microsoft, Meta or Amazon), we are exclusively resolving to the sum of the revenues of the three largest AI-only companies by valuation: OpenAI ($157B), Anthropic (expected $40B) and xAI ($50B).

We expect this restriction to be a fairer test of the economic value of the AI models themselves, since in larger companies revenue values may be conflated with other existing services (e.g. integrations with Microsoft Office, Google Drive) and advantages from large pre-existing customer bases.

While there is a chance that another AI lab will overtake OpenAI, Anthropic and xAI in revenue, we decided to restrict the question to these three companies to avoid the potential for ambiguity.

xAI's inclusion may be surprising to some given it is far behind in current annualized revenue, but we include it since it has reportedly been the first to build a 100,000 H100 cluster, which some expect places it in a competitive position for next year.

Since creating this question, xAI has acquired X, which will increase its revenue unrelated to its AI products. When resolving, we'll therefore estimate xAI's AI-related revenue.

Distribution of responses for Sum of OpenAI, Anthropic, xAI revenues
Pre/Post o3 comparison for Sum of OpenAI, Anthropic, xAI revenues
Forecasts by risk group for Sum of OpenAI, Anthropic, xAI revenues

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Sum of OpenAI, Anthropic, xAI revenues

Public attention

8) What percentage of Americans will identify computers/technology advancement as the US's most important problem by December 31st 2025?

Gallup releases the results of monthly polling on the question “What do you think is the most important problem facing the country today?”

Note that multiple mentions are considered, so people's responses can count for more than one category.

In recent months, answers classified as “advancement of computers/technology” have hovered around or below 0.5%.

Written at the end of 2024, some details may have changed.

March through December 2024 saw the following polling results for the percentage of Americans giving answers classified as “advancement of computers/technology”.

Gallup polling results, March - October 2024

Distribution of responses for Public attention

The median forecast expects concerns to more than quadruple from under 0.5% to 2% by the end of 2025.

For reference, as of April 2025 here are some other topics that are a top concern for 2% of Gallup respondents: Race relations/Racism, Wars/War (nonspecific)/Fear of war, Education, Crime/Violence, Foreign policy/Foreign aid/Focus overseas.

And for comparison, here are some topics at the 1% level: Abortion, Terrorism, Wage Issues, Taxes, Lack of respect for each other.

Pre/Post o3 comparison for Public attention
Forecasts by risk group for Public attention

Notes on this plot:

See the distribution of respondents' risk forecasts.

Forecasts by experience group for Public attention

Survey Demographics

We promoted the survey through our Twitter, mailing list, and through our networks. Some notable AI researchers like Ajeya Cotra, Ryan Greenblatt, and Peter Wildeford shared their forecasts – so some of their followers may have also taken it.

We also asked respondents some optional background questions to get a sense of their views on AI. This will also let us understand who forecasted our 2025 indicators best among different groups (short vs long timelines, AI experts vs non-experts, high vs low risk).

Timelines

By what year do you think there's a 50% probability that there will be High Level Machine Intelligence which can accomplish every cognitive task better and more cheaply than human workers?

Timelines

Risk

How likely do you think it is that overall, the impact of High Level Machine Intelligence will be extremely negative for humanity in the long run?

Risk

AI years of experience

How many years of professional or academic experience do you have in the field of AI?

AI years of experience

Correlations

Finally, here's a plot showing how responses on all questions correlate with each other. Most indicators are positively correlated.

Survey response correlations
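
For readers who want to build a similar plot from their own data, here is a minimal sketch with pandas; the column names and values are hypothetical placeholders rather than the actual survey responses.

  # Minimal sketch of a cross-question correlation matrix with pandas.
  # Column names and values are hypothetical placeholders, not the survey data.
  import pandas as pd

  responses = pd.DataFrame({
      "re_bench_forecast": [1.0, 1.3, 0.9, 1.6],
      "swe_bench_forecast": [80, 90, 75, 95],
      "frontier_math_forecast": [30, 45, 25, 60],
  })

  # Spearman rank correlation is robust to the different scales of each question.
  print(responses.corr(method="spearman").round(2))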

At the end of the year, we'll resolve all the forecasts and write up the results. We're looking forward to this glimpse into how unexpected the rate of progress has been.

If you're interested in the results, you can join our mailing list below.

Get notified when we resolve the forecasts at the end of 2025
Our latest AI explainers and demos, in your inbox once a month