Tianhao Wu
curriculum vitae

PhD · UC Berkeley EECS · agentic systems & self-improvement

Summary

My research focuses on autoresearch and agentic task generation (e.g. Terminal Bench). I built an agentic system, Hive, and used it to win the OpenAI Parameter Golf challenge twice. Previously, I worked on RL research and developed the open-source agentic RL framework rLLM (5k+ GitHub stars). My long-term goal is to design AI systems that can perform scientific discovery and improve themselves over time.

Education

  • 2021 - 2025
    PhD · University of California, Berkeley
    Electrical Engineering and Computer Sciences — advised by Jiantao Jiao and Kannan Ramchandran
  • 2017 - 2021
    Undergrad · Peking University
    School of Mathematical Sciences — Merit Student (Top 5%) — advised by Liwei Wang

Experience

  • 2024.05 - 2024.12
    2025.11 - Now
    Meta FAIR
    • Post-training research on agentic task generation.
    • Autoresearch for data generation pipelines.
    • AI self-improvement and methods for enhancing the reasoning abilities of LLMs.
  • 2025 Summer
    Hudson River Trading
    • Ranked #1 of 30 algo dev interns in an alpha prediction project.
    • Quantitative research on diffusion models and projected denoising.
  • 2024.02 - 2024.05
    Nexusflow
    • Trained Starling-LM-7B-beta, a small model that surpasses Mixtral-8x7b and Gemini Pro.

Honors

  • 2026
    • Won OpenAI Parameter Golf challenge — twice (using Hive).
  • 2025
    • Ranked #1 of 30 algo dev interns at Hudson River Trading (alpha prediction project).
  • 2016
    • IMO Selection Pool — top 30 in China, trained for the International Math Olympiad.
  • 2015
    • Gold Medal, Chinese Mathematical Olympiad (CMO).
    • Gold Medal, Russian Mathematical Olympiad (RMO).

Research

  • EMNLP 2025
    Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
    • Proposed Meta-Judge, which provides feedback for self-improving the model's judging abilities.
    • Via self-play, improved win-rate of Llama-3-8B-Instruct on Arena-Hard from 20.6% to 29.1% and AlpacaEval from 22.9% to 39.4%, approaching Claude-Opus without human supervision.
  • COLM 2024
    Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF
    • Introduced Starling-LM-7B, an open-source language model trained by RLAIF using the GPT-4-labeled ranking dataset Nectar plus an advanced reward training and policy tuning pipeline.
    • Starling-LM-7B scored 1119 Elo on Chatbot Arena, surpassing Mixtral-8x7b and Gemini Pro.
  • COLM 2024
    Pairwise Proximal Policy Optimization: Harnessing Relative Feedback in LLM Alignment
    • Developed P3O, a new policy gradient algorithm whose comparative updates unify the reward learning and RL fine-tuning stages of RLHF.
    • P3O exhibited a high win-rate against established algorithms such as PPO and DPO.
  • ICML 2025
    Thinking LLMs: General Instruction Following with Thought Generation
    • Found that optimizing a model's internal thought process yields gains not just in reasoning and math, but across all instruction-following tasks.
    • Used RL and search to incentivize the model to enhance its own thought process, rather than explicitly teaching it how to think.
  • ICML 2025
    From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
    • Introduced Arena-Hard, a data pipeline to build high-quality benchmarks from live data in Chatbot Arena.
  • ICLR 2025
    RouteLLM: Learning to Route LLMs with Preference Data
    • Trained routers to route between strong and weak model pairs.
    • Reduced cost by over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K relative to using only GPT-4, while retaining 95% of GPT-4's performance.
  • NeurIPS 2023
    A Reduction-based Framework for Sequential Decision Making with Delayed Feedback
    • Proposed a reduction-based framework that turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that manages stochastic delays.
    • Proved sharp upper bounds and provided the first guarantee in sequential decision making with function approximation.
  • ICML 2022
    Nearly Optimal Policy Optimization with Stable at Any Time Guarantee
    • Proposed a novel Online Mirror Descent (OMD) type algorithm that eliminates the data coverage assumption.
    • Showed that a novel reference V estimator and l2 regularization in OMD ensure stability of the V estimation.
    • Demonstrated that stability is the key to nearly optimal regret, obtaining a SOTA regret upper bound in the policy optimization setting.
  • ICLR 2021
    A Unified Framework for Conservative Exploration
    • Proposed a unified framework for conservative bandits and RL based on calculating the necessary and sufficient budget obtained from running a baseline policy.
    • Showed how to turn a non-conservative algorithm into a conservative one within the framework.
    • Obtained SOTA regret upper and lower bounds in tabular and low-rank settings.
  • ICML 2021
    On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP
    • Proposed a new RL algorithm robust to adversarial corruption with intensity lower than C.
    • Proved a regret upper bound of SAC under tabular settings and a matching lower bound, showing the algorithm is optimal.
    • Applying the algorithm to BMDPs yields the first √T regret bound in this setting.
  • NeurIPS 2021
    Sanity-checking Pruning Methods: Random Tickets Can Win the Jackpot
    • Showed experimentally that two conventional beliefs do not hold: "pruning methods exploit training data to find good subnetworks" and "the architecture of the pruned network is crucial for good performance".
    • Proposed a simple zero-shot random pruning method that outperforms or attains similar performance to SOTA.