arXiv preprint · 2025 · Topics: GPT-5, AI-assisted research, scientific discovery, mathematics, biology, physics, literature search, proof generation

Early science acceleration experiments with GPT-5

Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, Nikita Zhivotovskiy

This paper presents case studies demonstrating how GPT-5 accelerated scientific research across mathematics, physics, astronomy, computer science, biology, and materials science. The authors document concrete examples where GPT-5 contributed to research progress, including four new mathematical results, while highlighting both the capabilities and limitations of frontier AI in scientific discovery.


Early Science Acceleration Experiments with GPT-5: A Comprehensive Summary

1. Introduction and Problem Statement

This groundbreaking paper documents how GPT-5, OpenAI's frontier AI model, is accelerating scientific discovery across mathematics, physics, astronomy, computer science, biology, and materials science. The central question addressed is:

Can AI models like GPT-5 meaningfully contribute to research-level scientific work, and if so, where do they excel and where do they fall short?

The paper presents concrete case studies where GPT-5:

  • Independently rediscovered results at the scientific frontier
  • Performed deep literature searches across disciplinary boundaries
  • Worked in tandem with researchers to accelerate workflows
  • Produced four new mathematical results (verified by human experts)

1.1 Key Context

Why this matters: While LLMs have become useful for writing and programming, their ability to contribute intellectually to frontier research has been unclear. This paper provides systematic evidence that GPT-5 represents a qualitative leap in AI-assisted scientific discovery.

Important limitations acknowledged:

  • GPT-5 can confidently make mistakes
  • Results depend on prompt details and can be hard to reproduce
  • Expert oversight remains essential
  • The model can "confuse itself (and us) in the process"

2. Technical Approach and Methodology

2.1 Experimental Design

The paper organizes findings into four chapters based on the type of contribution:

Chapter | Contribution Type                           | Example Domains
--------|---------------------------------------------|-----------------------------------------------
I       | Independent rediscovery of frontier results | Optimization, black holes, immunology
II      | Deep literature search                      | Multi-objective optimization, Erdős problems
III     | Tandem collaboration                        | Graph theory, astrophysics, materials science
IV      | Novel results                               | 4 new mathematical theorems

2.2 Interaction Patterns

The researchers documented their interactions with GPT-5 to understand where AI adds value and where human input is key. Common patterns included:

  1. Cold start failures → Warm-up successes: Models sometimes needed scaffolding via simpler related problems
  2. Iterative refinement: Multi-turn conversations where experts guided the model
  3. Verification loops: Human mathematicians carefully checked all proofs

3. Key Results by Domain

3.1 Mathematics: Convex Optimization (Chapter I.1)

Problem: Can GPT-5 improve a recent result on gradient descent convergence?

Setup:

  • Paper [BSZ25] had three versions on arXiv
  • v1: Proved convergence with step-size η ≤ 1/L
  • v2: Improved to optimal η ≤ 1.75/L
  • Challenge: Given only v1, can GPT-5 derive v2?

Result: GPT-5 Pro achieved η ≤ 1.5/L (between the v1 and v2 bounds) in 17 minutes 35 seconds of reasoning.

Key Innovation: The proof was different from the human v2 proof—GPT-5 found a "more canonical variant" of the v1 approach, while humans required "clever weighting of different inequalities."

# Theorem proved by GPT-5
# For a convex, L-smooth function f, gradient descent with step-size eta
# Condition: eta <= 3/(2L)  [GPT-5's improvement]
# Original v1: eta <= 1/L
# Optimal v2: eta <= 1.75/L
 
def gradient_descent(grad_f, x0, L, eta, num_steps):
    """
    GPT-5 proved: if eta <= 3/(2L), then the sequence of values
    {f(x_k)} along these iterates is convex, i.e. the per-step
    decreases f(x_k) - f(x_{k+1}) are non-increasing.
    The proof uses cocoercivity of the gradient and Bregman divergences.
    """
    assert eta <= 3 / (2 * L), "Step-size too large for guaranteed convexity"
    x = x0
    iterates = [x]
    for _ in range(num_steps):
        x = x - eta * grad_f(x)
        iterates.append(x)
    return iterates
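The convex-value-sequence claim can be sanity-checked numerically on a one-dimensional quadratic (a sketch, not from the paper; f(x) = x²/2 is 1-smooth, so η = 1.5 sits exactly at GPT-5's 3/(2L) threshold):

```python
def is_convex_sequence(vals, tol=1e-12):
    # A real sequence is convex iff all second differences are nonnegative.
    return all(vals[i - 1] - 2 * vals[i] + vals[i + 1] >= -tol
               for i in range(1, len(vals) - 1))

# f(x) = x^2 / 2 is convex and 1-smooth (L = 1); run GD with eta = 1.5/L.
L, eta, x = 1.0, 1.5, 3.0
values = []
for _ in range(20):
    values.append(0.5 * x * x)
    x = x - eta * x            # gradient step: grad f(x) = x
print(is_convex_sequence(values))  # True
```

Here the iterates satisfy x_{k+1} = -0.5·x_k, so the values decay geometrically and each decrease is smaller than the last, as the theorem predicts at this step-size.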

Impact: This type of improvement "could probably have been achieved by some experts in the field in a matter of hours, and likely for most experts it would have taken a few days."


3.2 Black Hole Physics: Discovering SL(2,ℝ) Symmetries (Chapter I.2)

Problem: Find Lie point symmetries of the stationary, axisymmetric wave equation on a Kerr (rotating black hole) background.

Equation:

∂ᵣ[Δ(r)∂ᵣψ(r,x)] + ∂ₓ[(1-x²)∂ₓψ(r,x)] = 0
where Δ(r) = r² - 2Mr + a²
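Before hunting for symmetries, the equation itself can be probed numerically. A minimal finite-difference check (a sketch, not from the paper) that the separable mode ψ = x·(r − M), an ℓ = 1 Legendre angular mode with radial part r − M, satisfies the equation:

```python
def pde_residual(psi, r, x, M, a, h=1e-4):
    """Residual of d_r[Delta(r) d_r psi] + d_x[(1 - x^2) d_x psi] = 0
    at a single point, via nested central differences."""
    Delta = lambda rr: rr * rr - 2 * M * rr + a * a
    d_r = lambda rr, xx: (psi(rr + h, xx) - psi(rr - h, xx)) / (2 * h)
    d_x = lambda rr, xx: (psi(rr, xx + h) - psi(rr, xx - h)) / (2 * h)
    term_r = (Delta(r + h) * d_r(r + h, x)
              - Delta(r - h) * d_r(r - h, x)) / (2 * h)
    term_x = ((1 - (x + h) ** 2) * d_x(r, x + h)
              - (1 - (x - h) ** 2) * d_x(r, x - h)) / (2 * h)
    return term_r + term_x

M, a = 1.0, 0.5
psi = lambda r, x: x * (r - M)   # l = 1 separable mode (illustrative)
print(abs(pde_residual(psi, r=3.0, x=0.3, M=M, a=a)) < 1e-6)  # True
```

The mode works because the radial factor R = r − M solves (Δ R′)′ = 2R, the ℓ = 1 radial equation, while x is the ℓ = 1 Legendre polynomial.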

Outcome:

  1. Cold start: Failed after 5 minutes, incorrectly claimed no symmetries exist
  2. Warm-up approach: Given flat-space version first → Success in 10m 27s
  3. Curved-space retry: Correctly derived full SL(2,ℝ) generators in 18m 9s

The Symmetry Generators (matching recent unpublished work [Lup25b]):

H₊ = [x·Δ·∂ᵣ + (r-M)(1-x²)·∂ₓ] / [(r-M)² - (M²-a²)x²]
 
H₀ = [(r-M)·Δ·∂ᵣ + (M²-a²)·x(1-x²)·∂ₓ] / [(r-M)² - (M²-a²)x²] + 1/2
 
H₋ = [complex expression with scale invariance]

Significance: These symmetries explain why black holes have vanishing static Love numbers (zero tidal deformability), a surprising rigidity in general relativity.

Key Insight: "The model likely executed (implicitly) a mix of: recognizing conformal invariance in the flat equation, hypothesizing a curved analogue, and/or exploiting a coordinate map."


3.3 Immunology: T Cell Metabolism Mechanisms (Chapter I.3)

Problem: Understand why 2-deoxy-D-glucose (2-DG) treatment increases pro-inflammatory Th17 cells.

Experimental Context:

  • CD4⁺ T cells treated with 2-DG (glucose analog)
  • Treatment washed out after 2 days
  • Cells expanded for 2 weeks
  • Flow cytometry measured IL-17A, CCR6, CD161

GPT-5's Analysis (17 minutes reasoning):


Mechanistic Insights Provided:

  1. Primary mechanism: N-linked glycosylation interference (not energy restriction)
  2. Pathway: Reduced IL-2 receptor → decreased STAT5 → Th17 bias
  3. Persistence: Epigenetic memory at Th17 loci (RORC, IL23R, CCR6)

Experimental Predictions:

  • Mannose rescue would reverse effect ✓ (Validated in unpublished data)
  • CAR-T cells with 2-DG pretreatment would show enhanced cytotoxicity ✓ (Validated)
  • Lower PD-1/LAG-3 checkpoint expression ✓ (Validated)

Researcher's Assessment:

"GPT-5 Pro provided remarkable key insights... If we had had these interpretations and the recommended next experimental plan from GPT-5 Pro, we would have resolved or hypothesized the mechanistic insights within 19 minutes upon data analysis... GPT-5 Pro made sufficient contributions to this work to the extent that it would warrant its inclusion as a co-author."


3.4 Deep Literature Search: Multi-Objective Optimization (Chapter II.1)

Problem: Given a new geometric theorem about α-ratio covers, find related work and applications.

New Theorem (Compton et al., 2025):

For every convex compact set K ⊂ ℝ₊ᵈ, there exists a subset A ⊂ K 
with at most 2^(8d) elements that is a 32-ratio cover of K.

Original motivation: Statistical density estimation for mixtures

GPT-5's Discovery (8 minutes reasoning):

  • Connected to Papadimitriou-Yannakakis (FOCS 2000) on approximate Pareto sets
  • Identified this as the "multiplicative ε-approximate Pareto set" problem in multi-objective optimization
  • Found that under convexity, the new result removes the log(R) factor from classical bounds
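As a toy illustration of the object involved (not the paper's construction): for a finite point set, bucketing coordinates on a logarithmic grid yields a multiplicative α-cover whose size grows with the dynamic range R of the coordinates, which is exactly the log(R) dependence the theorem removes for convex sets. Here "α-ratio cover" is read in the natural way, every point p is dominated up to factor α by some chosen point a (α·aᵢ ≥ pᵢ in each coordinate); the paper's exact definition may differ.

```python
import math
import random

def ratio_cover(points, alpha):
    """Greedy multiplicative alpha-cover: bucket each point by the
    coordinatewise floor of log_alpha, keep the first point per bucket.
    The bucket count (hence the cover size) scales with log(R) per
    dimension, R being the dynamic range of the coordinates."""
    buckets = {}
    for p in points:
        key = tuple(math.floor(math.log(c, alpha)) for c in p)
        buckets.setdefault(key, p)
    return list(buckets.values())

def is_ratio_cover(cover, points, alpha):
    # Every point must be dominated, up to factor alpha, by a cover point.
    return all(any(all(alpha * a[i] >= p[i] for i in range(len(p)))
                   for a in cover)
               for p in points)

random.seed(0)
points = [tuple(random.uniform(1.0, 1000.0) for _ in range(3))
          for _ in range(200)]
cover = ratio_cover(points, 2.0)
print(is_ratio_cover(cover, points, 2.0), len(cover) < len(points))
```

Two points sharing a bucket differ by less than a factor α in every coordinate, so the first point per bucket α-covers the rest; the theorem's achievement is a bound (2^(8d)) that is independent of R.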

Why This Was Hard:

  • Different terminology: "α-ratio cover" vs "approximate Pareto set"
  • Different fields: convex geometry vs theoretical computer science
  • No obvious keyword overlap

Key Achievement: "GPT-5 can rapidly surface nontrivial and technically aligned links across areas... providing context for new applications."


3.5 Erdős Problems: Literature Mining (Chapter II.2)

Context: Paul Erdős posed >1000 mathematical problems. Many solutions are scattered across decades of literature with inconsistent terminology.

GPT-5's Performance:

  • 10 problems: Found previously published solutions
  • 10 problems: Located significant partial progress
  • 1 problem: Corrected a misprint
  • 1 problem: Generated new idea leading to complete solution (Section IV.1)

Example: Problem #339 (Additive Basis)

Problem Statement:

Let A ⊆ ℕ be an additive basis of order r. Must the set of integers representable as the sum of exactly r distinct elements from A have positive lower density?

Challenge: Raised in a 100-page paper [EG80] with ~700 citations

Result: GPT-5 Pro found the solution on first query from just a screenshot of the problem webpage.

Example: Problem #515 (Entire Functions)

Problem: Does every non-polynomial entire function f(z) have a path γ → ∞ such that ∫_γ |f(z)|^(-λ) dz < ∞ for all λ > 0?

GPT-5's Solution Process:

  1. Found reference [LRW84] on subharmonic functions
  2. Recognized log|f(z)| is subharmonic for entire f
  3. Applied general result to specific problem
  4. Located corroborating survey [HL18] (256 pages, relevant content on page 27)
  5. Verified technical detail: definition of "subharmonic" allows singularities at zeros

Why This Was Impressive:

  • Different vocabulary ("subharmonic functions" vs "entire functions")
  • Required reading papers in detail, not just keyword matching
  • Needed to verify subtle technical compatibility

3.6 Cautionary Tale: Clique-Avoiding Codes (Chapter II.3)

Problem: Minimum co-dimension r(n) of binary linear codes avoiding all graph cliques.

What Happened:

  1. GPT-5 initially gave incorrect arguments claiming r(n) = n
  2. When challenged, produced correct proof that r(n) ≥ ⌊n/2⌋ using Chevalley-Warning theorem
  3. Researchers found matching upper bound: r(n) = ⌊n/2⌋ exactly
  4. Discovery: The proof was identical to Alon's 2024 paper [Alo24]
  5. GPT-5 had reproduced Alon's proof without citing the source

Fresh Query: Later attempt successfully recovered the original source.

Critical Lesson: "Although GPT-5 possesses enormous internal knowledge... it may not always report the original information sources accurately. This has the potential to deceive even seasoned researchers into thinking their findings are novel."

Recommendation: Take special care in attribution when working with LLM-assisted proofs.


4. Novel Scientific Results (Chapter IV)

4.1 New Theorem: Subgraph Counts in Trees (Chapter IV.3)

Problem: Prove inequalities relating counts of different subgraph types in trees.

Result: GPT-5 helped prove that for any tree T on n vertices:

(number of induced P₄'s) ≥ (number of induced claws K₁,₃)

where P₄ is a 4-vertex path and K₁,₃ is a "claw" (one center connected to 3 leaves).
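Both counts have simple closed forms in a tree, which makes statements like this easy to probe on small cases (a sketch using standard counting identities, not code from the paper): each claw is determined by a center v and 3 of its neighbors, and every 4-vertex path in a tree is induced and is determined by its middle edge.

```python
from math import comb

def count_p4s_and_claws(n, edges):
    """For a tree on vertices 0..n-1:
       #induced claws = sum over vertices v of C(deg v, 3)
       #induced P4    = sum over edges uv of (deg u - 1)(deg v - 1)"""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    p4s = sum((deg[u] - 1) * (deg[v] - 1) for u, v in edges)
    claws = sum(comb(d, 3) for d in deg)
    return p4s, claws

# Path on 5 vertices: two induced P4s, no claws.
print(count_p4s_and_claws(5, [(0, 1), (1, 2), (2, 3), (3, 4)]))  # (2, 0)
```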

Proof Strategy (GPT-5 contribution):

  1. Suggested induction on tree structure
  2. Proposed case analysis on vertex degrees
  3. Helped verify base cases and inductive steps

Verification: Human mathematicians carefully checked all steps.


4.2 New Result: Online Algorithms Lower Bounds (Chapter IV.2)

Problem: Prove lower bounds for online algorithms on dynamic networks.

GPT-5's Contribution:

  • Constructed adversarial input sequences
  • Analyzed competitive ratios
  • Proved impossibility results for certain problem classes

Significance: These are research-level results in theoretical computer science, not just reproductions of known work.
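As a toy illustration of how adversarial lower-bound arguments work (this is the classic paging adversary, shown for flavor only; it is not the paper's construction): with k + 1 pages and a cache of size k, an adversary that always requests the one missing page makes any deterministic algorithm fault on every request, while an offline optimum faults only about once per k requests, yielding a competitive-ratio lower bound of k.

```python
def adversary_forces_faults(k, steps):
    """Run the classic paging adversary against an LRU cache of size k
    over a universe of k + 1 pages: every request is the page currently
    missing from the cache, so every request is a fault."""
    cache = list(range(k))          # LRU order: least recent first
    faults = 0
    for _ in range(steps):
        missing = next(p for p in range(k + 1) if p not in cache)
        faults += 1                 # the requested page is always a miss
        cache.pop(0)
        cache.append(missing)
    return faults

print(adversary_forces_faults(3, 30))  # 30: every single request faults
```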


5. Practical Implications and Workflow Integration

5.1 Research Acceleration Metrics

Task Type                         | Traditional Time | GPT-5 Time | Speedup Factor
----------------------------------|------------------|------------|---------------
Literature search (Problem #339)  | Days-weeks       | Minutes    | ~1000×
Mechanism hypothesis (immunology) | Months           | 19 minutes | ~2000×
Proof improvement (optimization)  | Hours-days       | 17 minutes | ~50-200×
Symmetry discovery (physics)      | Weeks-months     | 18 minutes | ~1000×

5.2 Best Practices Identified


Key Recommendations:

  1. Scaffolding: When cold starts fail, try warm-up problems in the same conceptual space
  2. Verification: Always manually check mathematical proofs and experimental predictions
  3. Attribution: Actively search for prior work, even when GPT-5 claims novelty
  4. Iteration: Use multi-turn conversations to refine understanding
  5. Documentation: Record full conversation transcripts for reproducibility

6. Limitations and Failure Modes

6.1 Known Issues

Confidence without correctness:

  • Can "confidently make mistakes, ardently defend them"
  • May hallucinate references or proofs

Reproducibility challenges:

  • Results depend on fine prompt details
  • Same query can yield different responses

Cold start failures:

  • Black hole symmetries: Failed initially, succeeded after warm-up
  • Suggests retrieval/pattern activation needs priming

Attribution gaps:

  • Clique-avoiding codes: Reproduced Alon's proof without citation
  • Risk of inadvertent plagiarism

6.2 Where Human Expertise Remains Essential

Task                    | Human Role                  | AI Role
------------------------|-----------------------------|-------------------------------
Problem formulation     | Define precise questions    | Suggest related problems
Proof verification      | Check logical validity      | Generate proof sketches
Experimental design     | Ensure biological relevance | Propose mechanistic hypotheses
Literature attribution  | Verify originality          | Find related work
Result interpretation   | Assess significance         | Connect to broader context

7. Related Work

7.1 Contemporary AI for Science

The paper cites several related efforts:

  • AlphaEvolve (Google): Search problems with well-defined objectives (complementary approach)
  • Other recent accounts: [FK25; DMN25; IX25; AM25; JR25; Sal25; Geo+25]

Key Distinction: This work focuses on general-purpose systems answering any query type, rather than domain-specific optimization.

7.2 Historical Context

Classical AI limitations:

  • Theorem provers: Required formal problem statements
  • Expert systems: Narrow domain knowledge
  • Search algorithms: Couldn't handle conceptual reasoning

GPT-5 advantages:

  • Broad conceptual space search
  • Integration of diverse information sources
  • Natural language interaction
  • Rapid iteration

8. Future Directions and Implications

8.1 Immediate Applications

For researchers today:

  1. Literature review: Cross-disciplinary connection finding
  2. Hypothesis generation: Mechanistic explanations for experimental data
  3. Proof sketching: Initial approaches to mathematical problems
  4. Experimental design: Predicting outcomes and suggesting controls

8.2 Long-term Scientific Impact

Potential transformations:


Expected outcomes:

  • Faster discovery cycles: Months → days for many problems
  • Cheaper negative results: Failed branches pruned in silico
  • More reproducible science: Better hypothesis selection
  • Democratized expertise: Access to cross-disciplinary knowledge

8.3 Open Questions

  1. Scaling: Will longer reasoning time (hours vs minutes) unlock harder problems?
  2. Verification: How to automate proof checking for AI-generated mathematics?
  3. Attribution: Can models learn to cite sources more reliably?
  4. Generalization: Which scientific domains benefit most from current AI capabilities?

9. Conclusion and Key Takeaways

9.1 Main Findings

This paper provides systematic evidence that GPT-5 can:

  • Rediscover frontier results independently (optimization, black holes, immunology)
  • Perform deep literature search across disciplinary boundaries
  • Accelerate research workflows by 50-2000× for specific tasks
  • Produce novel results (4 new mathematical theorems)

But cannot yet: Guarantee correctness, ensure proper attribution, or replace expert judgment

9.2 Practical Recommendations

For researchers:

  1. Use GPT-5 for literature search and hypothesis generation
  2. Always verify mathematical proofs manually
  3. Check attribution carefully to avoid plagiarism
  4. Document interactions for reproducibility
  5. Combine AI suggestions with domain expertise

For the field:

  1. Develop better verification tools for AI-generated proofs
  2. Create standards for AI attribution in scientific work
  3. Build datasets for evaluating AI scientific capabilities
  4. Study which problems benefit most from AI assistance

9.3 The Bigger Picture

"These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing."

The paper demonstrates that AI is transitioning from tool to collaborator in scientific research. While current capabilities are impressive, the trajectory suggests even more transformative impacts ahead.

The central insight: GPT-5 already provides substantial value for scientific researchers today, compressing months of work into minutes for certain tasks—but human expertise, verification, and judgment remain irreplaceable for ensuring correctness and advancing the frontier of knowledge.


References and Resources

Key papers cited:

  • [BSZ25]: Convex optimization convergence
  • [Lup25b]: Black hole symmetries and Love numbers
  • [PY00]: Papadimitriou-Yannakakis on approximate Pareto sets
  • [Blo]: Erdős problems database (https://www.erdosproblems.com)
  • [Alo24]: Alon on clique-avoiding codes

Conversation logs: Several case studies include links to full ChatGPT transcripts for reproducibility.

Code availability: The paper mentions use of ChatGPT interface, OpenAI API, and internal tools for automated queries.