AI Forecast Tracker

Can AI Do Your Entire Workday Without You?

Will AI systems autonomously complete a full 8-hour professional workday — multiple tasks, context switching, decision-making — without human intervention by end of 2027?

This isn't about whether AI takes your job tomorrow — it's about how fast the 'AI can't do that' list is shrinking.

Target: Dec 2027 (664 days until resolution)
Assessed Probability
60%
More likely than not
Based on 6 expert predictions, 4 evidence items

METR confirmed that Opus 4.6 handles individual tasks that take 14+ hours of expert work. The task horizon more than doubled, from ~5.3 to 14.5 hours, in just 4 months. But the real signal isn't benchmarks; it's practitioners. Boris Cherny ships 22-27 PRs per day with 100% AI-written code, effectively running an AI workday for coding tasks. Claude agent teams coordinate multiple agents on different parts of a codebase simultaneously. The Opus 4.5/4.6 leap (November 2025) was qualitatively different from prior improvements: not just faster, but able to handle the kind of multi-step reasoning, context management, and decision-making that workdays require. If the 89-day doubling rate holds through 2027, the math works. The power law applies here too: for the top 10% of AI-fluent professionals, the autonomous workday is already approaching reality in specific domains. For the average knowledge worker, it's further out.
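
The doubling arithmetic can be checked with a short sketch. It uses only the figures quoted above (5.3 hours, 14.5 hours, roughly 4 months, an 89-day doubling rate). The 120-day span, the mid-February measurement date, and the function names are assumptions made for illustration, not part of the tracker's methodology.

```python
import math
from datetime import date


def implied_doubling_days(start_hours, end_hours, elapsed_days):
    """Doubling time implied by exponential growth between two measurements."""
    return elapsed_days * math.log(2) / math.log(end_hours / start_hours)


def projected_horizon(current_hours, days_ahead, doubling_days):
    """Task horizon after days_ahead days, if the doubling time holds."""
    return current_hours * 2 ** (days_ahead / doubling_days)


# Figures quoted in the text; 120 days is an assumed reading of "4 months".
implied = implied_doubling_days(5.3, 14.5, 120)

# Assumed dates: a mid-February 2026 measurement and a Dec 31, 2027 resolution.
days_left = (date(2027, 12, 31) - date(2026, 2, 15)).days
projection = projected_horizon(14.5, days_left, 89)

print(f"implied doubling time: {implied:.0f} days")
print(f"projected horizon at resolution: {projection:.0f} hours")
```

Under these assumptions the implied doubling time lands in the low 80s of days, close to the 89-day figure, and the extrapolated horizon clears an 8-hour workday by a wide margin. The projection is only as good as the premise that the trend holds, and it describes single-task horizon, not multi-task coordination.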

Scenarios

Current value: 14.5 hours on single METR tasks (Opus 4.6, Feb 2026); Boris Cherny running what is effectively an AI workday for coding; Claude agent teams coordinating multi-agent workflows

S-curve position: Steep mid-curve — single-task autonomy nearly solved, multi-task coordination emerging rapidly

Bear Case

Single tasks only through 2028 (multi-task coordination, real-world messiness, interpersonal judgment too hard)

Base Case

6-8 hour semi-autonomous work sessions for structured professional work; full autonomy for coding/analysis domains

Bull Case

Full autonomous workday by Q3 2027 (Opus 4.5/4.6 leap suggests nonlinear progress in planning + memory)

How We'll Know

What we measure
Whether AI systems can autonomously complete a realistic 8-hour professional workday simulation involving multiple diverse tasks, context switching, and decision-making
Confirmed if
Frontier AI models demonstrate autonomous completion of multi-task 8-hour workday simulations, OR multiple companies publicly deploy AI for full-day autonomous work
Refuted if
Best frontier models remain limited to single-task autonomy below 4 hours on realistic workday simulations
Data sources
  • METR autonomous task evaluations
  • SWE-bench Pro
  • RE-bench (ML research)
  • Company-reported agent evaluations
  • Third-party autonomous work benchmarks
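
The confirm/refute criteria above can be read as a simple decision rule. The sketch below is one possible encoding, not the tracker's actual resolution procedure: the `Observation` fields are hypothetical, and reading "multiple companies" as two or more is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical summary of the evidence available at resolution time."""
    full_workday_demonstrated: bool   # multi-task 8-hour simulation completed autonomously
    public_full_day_deployments: int  # companies publicly deploying AI for full-day work
    best_autonomous_hours: float      # longest autonomous run on a realistic workday simulation
    multi_task: bool                  # whether that best run spanned multiple tasks


MIN_COMPANIES = 2       # assumption: "multiple companies" read as two or more
REFUTE_BELOW_HOURS = 4  # threshold taken from the "Refuted if" criterion


def resolve(obs: Observation) -> str:
    """Return 'confirmed', 'refuted', or 'unresolved' for one observation."""
    if obs.full_workday_demonstrated or obs.public_full_day_deployments >= MIN_COMPANIES:
        return "confirmed"
    if not obs.multi_task and obs.best_autonomous_hours < REFUTE_BELOW_HOURS:
        return "refuted"
    # Neither criterion is met, e.g. a 6-hour multi-task run with no deployments.
    return "unresolved"
```

Writing the rule out this way shows that the two criteria are not exhaustive: an outcome such as a 6-hour multi-task session with a single public deployment satisfies neither, so a resolver would need a judgment call for that middle ground.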

Evidence Trail

Evidence For

  • Mar 7, 2026

    METR Opus 4.6: 14.5-hour task horizon (50% success). Task horizon more than doubled, from ~5.3 hr to 14.5 hr, in ~4 months. Claude agent teams mode in production. 57% of enterprises running multi-step agent workflows. → Probability: 40%

  • Mar 7, 2026

    Boris Cherny: 22-27 PRs/day with 100% AI code, effectively an AI coding workday. Opus 4.5/4.6 qualitative leap in multi-step reasoning. 89-day doubling rate projects 40+ hour task horizon by late 2027. Inference cost collapse (200x/year) enables longer autonomous sessions economically. Power law: top 10% already approaching AI workday for specific domains. → Probability: 55%

  • Mar 9, 2026

    GPT-5.4 (March 2026) scored 75% on OSWorld desktop automation, exceeding the human expert baseline of 72.4%. First frontier model to beat humans on full desktop workflow automation. It also scored 83% on GDPval, matching industry professionals across 44 occupations. Gartner predicts 40% of enterprise apps will embed AI agents by end of 2026. → Probability: 60%

Evidence Against

  • Mar 7, 2026

    METR notes its task suite is 'nearly saturated' — unclear if results transfer to new task types. A workday involves context switching, interpersonal judgment, exception handling — qualitatively different from benchmark tasks. Diminishing returns likely as tasks become more open-ended.

How Our View Evolved

  • Mar 9, 2026: 55% → 60%

    GPT-5.4 exceeded the human expert baseline on OSWorld desktop automation (75% vs 72.4%). First model to beat humans on full desktop workflow automation. Significant milestone for the autonomous workday thesis.

  • Mar 8, 2026: initial assessment 55%

    Baseline — initial published assessment

What Experts Say

Dario Amodei

CEO, Anthropic

Track record: 8/10
AI models will handle most aspects of software engineering tasks from start to finish within 6-12 months
Jan 2026 | interview
We assess this claim as 65% (likely)

Dario Amodei

CEO, Anthropic

Track record: 8/10
Systems capable of outperforming Nobel laureates across most fields could arrive by 2027-2028
Oct 2025 | blog
We assess this claim as 10% (very unlikely)

Demis Hassabis

CEO, Google DeepMind; Nobel Laureate

Track record: 9/10
AGI is 3-5 years away; current systems lack reasoning, hierarchical planning, and long-term memory
Feb 2026 | interview
We assess this claim as 35% (roughly even odds)

Andrej Karpathy

AI Researcher, former Tesla AI Director, educator

Track record: 8/10
Agentic engineering (AI agents writing 99% of code, humans as oversight) becomes the default professional workflow
Feb 2026 | blog
We assess this claim as 50% (more likely than not)

Gary Marcus

AI Researcher, NYU Professor Emeritus, AI critic

Track record: 7/10
AGI will not arrive in 2026 or 2027
Dec 2025 | blog
We assess this claim as 85% (very likely)

Boris Cherny

Head of Claude Code, Anthropic

Track record: 8/10
AI can already write 100% of production code; top engineers using AI are 10x more productive
Feb 2026 | interview
We assess this claim as 70% (likely)

What Could Go Wrong

Benchmark saturation could create an illusion of general capability. Real workdays involve ambiguity, social interaction, and judgment calls that don't appear in standardized evaluations. The doubling trend may break down above 16 hours if longer tasks require fundamentally different capabilities.
