cv | Tianhao Wu

Summary

My research focuses on autoresearch and agentic task generation (e.g. Terminal Bench). I built an agentic system, Hive, and used it to win the OpenAI Parameter Golf challenge twice. Previously, I worked on RL research and developed the open-source agentic RL framework rLLM (5k+ GitHub stars). My long-term goal is to design AI systems that can perform scientific discovery and improve themselves over time.

Education

2021 - 2025

PhD · University of California, Berkeley

Electrical Engineering and Computer Sciences — advised by Jiantao Jiao and Kannan Ramchandran
2017 - 2021

Undergrad · Peking University

School of Mathematical Sciences — Merit Student (Top 5%) — advised by Liwei Wang

Experience

2024.05 - 2024.12, 2025.11 - Now
Meta FAIR
- Post-training research on agentic task generation.
- Autoresearch for data generation pipelines.
- AI self-improvement and methods for enhancing the reasoning abilities of LLMs.
2025 Summer
Hudson River Trading
- Ranked #1 of 30 algo dev interns in an alpha prediction project.
- Quantitative research on diffusion models and projected denoising.
2024.02 - 2024.05
Nexusflow
- Trained Starling-LM-7B-beta, a small model surpassing Mixtral-8x7b and Gemini Pro.

Honors

2026
- Won the OpenAI Parameter Golf challenge twice (using Hive).
2025
- Ranked #1 of 30 algo dev interns at Hudson River Trading (alpha prediction project).
2016
- IMO Selection Pool — top 30 in China, trained for the International Math Olympiad.
2015
- Gold Medal, Chinese Mathematical Olympiad (CMO).
- Gold Medal, Russian Mathematical Olympiad (RMO).

Research

EMNLP 2025
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Proposed Meta-Judge, which provides feedback for self-improving the model's judging abilities.
- Via self-play, improved win-rate of Llama-3-8B-Instruct on Arena-Hard from 20.6% to 29.1% and AlpacaEval from 22.9% to 39.4%, approaching Claude-Opus without human supervision.
COLM 2024
Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF
- Introduced Starling-LM-7B, an open-source language model trained by RLAIF using the GPT-4-labeled ranking dataset Nectar plus an advanced reward training and policy tuning pipeline.
- Starling-LM-7B scored 1119 Elo on Chatbot Arena, surpassing Mixtral-8x7b and Gemini Pro.
COLM 2024
Pairwise Proximal Policy Optimization: Harnessing Relative Feedback in LLM Alignment
- Developed P3O, a new policy gradient algorithm that unifies the reward learning and RL fine-tuning stages of RLHF through comparative training.
- P3O exhibited a high win-rate against established algorithms such as PPO and DPO.
ICML 2025
Thinking LLMs: General Instruction Following with Thought Generation
- Found that optimizing a model's internal thought process yields gains not just in reasoning and math, but across all instruction-following tasks.
- Used RL and search to incentivize the model to enhance its own thought process, rather than explicitly teaching it how to think.
ICML 2025
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
- Introduced Arena-Hard, a data pipeline to build high-quality benchmarks from live data in Chatbot Arena.
ICLR 2025
RouteLLM: Learning to Route LLMs with Preference Data
- Trained routers to route between strong and weak model pairs.
- Reduced cost by over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K compared with using only GPT-4, while retaining 95% of GPT-4's performance.
NeurIPS 2023
A Reduction-based Framework for Sequential Decision Making with Delayed Feedback
- Proposed a reduction-based framework that turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that manages stochastic delays.
- Proved sharp upper bounds and provided the first guarantee in sequential decision making with function approximation.
ICML 2022
Nearly Optimal Policy Optimization with Stable at Any Time Guarantee
- Proposed a novel Online Mirror Descent type algorithm that eliminates the data coverage assumption.
- Showed that a novel reference V estimator and l2 regularization in OMD ensure stability of the V estimation.
- Demonstrated that stability is the key to nearly optimal regret, obtaining a SOTA upper bound in the policy optimization setting.
ICLR 2021
A Unified Framework for Conservative Exploration
- Proposed a unified framework for conservative bandits and RL based on calculating the necessary and sufficient budget obtained from running a baseline policy.
- Showed how to turn a non-conservative algorithm into a conservative one within the framework.
- Obtained SOTA regret upper and lower bounds in tabular and low-rank settings.
ICML 2021
On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP
- Proposed a new RL algorithm robust to adversarial corruption with intensity lower than C.
- Proved a regret upper bound of SAC under tabular settings and a matching lower bound, showing the algorithm is optimal.
- Applying the algorithm to BMDPs yields the first √T regret bound in this setting.
NeurIPS 2021
Sanity-checking Pruning Methods: Random Tickets Can Win the Jackpot
- Showed experimentally that two conventional beliefs, "pruning methods exploit training data to find good subnetworks" and "the architecture of the pruned network is crucial for good performance", are wrong.
- Proposed a simple zero-shot random pruning method that matches or outperforms SOTA methods.
