Summary
My research focuses on autoresearch and agentic task generation (e.g. Terminal Bench). I built an agentic system, Hive, and used it to win the OpenAI Parameter Golf challenge twice. Previously, I worked on RL research and developed the open-source agentic RL framework rLLM (5k+ GitHub stars). My long-term goal is to design AI systems that can perform scientific discovery and improve themselves over time.
Education
-
2021 - 2025 PhD · University of California, Berkeley
Electrical Engineering and Computer Sciences, advised by Jiantao Jiao and Kannan Ramchandran
-
2017 - 2021 Undergrad · Peking University
School of Mathematical Sciences — Merit Student (Top 5%) — advised by Liwei Wang
Experience
-
2024.05 - 2024.12
2025.11 - Now Meta FAIR
- Post-training research on agentic task generation.
- Autoresearch for data generation pipelines.
- AI self-improvement and methods for enhancing the reasoning abilities of LLMs.
-
2025 Summer Hudson River Trading
- Ranked #1 of 30 algo dev interns in an alpha prediction project.
- Quantitative research on diffusion models and projected denoising.
-
2024.02 - 2024.05 Nexusflow
- Trained Starling-LM-7B-beta, a small model surpassing Mixtral-8x7b and Gemini Pro.
Honors
-
2026 - Won OpenAI Parameter Golf challenge — twice (using Hive).
-
2025 - Ranked #1 of 30 algo dev interns at Hudson River Trading (alpha prediction project).
-
2016 - IMO Selection Pool — top 30 in China, trained for the International Math Olympiad.
-
2015 - Gold Medal, Chinese Mathematical Olympiad (CMO).
- Gold Medal, Russian Mathematical Olympiad (RMO).
Research
-
EMNLP 2025 Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- Proposed Meta-Judge, which provides feedback for self-improving the model's judging abilities.
- Via self-play, improved win-rate of Llama-3-8B-Instruct on Arena-Hard from 20.6% to 29.1% and AlpacaEval from 22.9% to 39.4%, approaching Claude-Opus without human supervision.
-
COLM 2024 Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF
- Introduced Starling-LM-7B, an open-source language model trained by RLAIF using the GPT-4-labeled ranking dataset Nectar plus an advanced reward training and policy tuning pipeline.
- Starling-LM-7B scored 1119 Elo on Chatbot Arena, surpassing Mixtral-8x7b and Gemini Pro.
-
COLM 2024 Pairwise Proximal Policy Optimization: Harnessing Relative Feedback in LLM Alignment
- Developed P3O, a new policy gradient algorithm that updates on pairwise comparisons, unifying the reward learning and RL fine-tuning stages of RLHF.
- P3O exhibited a high win-rate against established algorithms such as PPO and DPO.
-
ICML 2025 Thinking LLMs: General Instruction Following with Thought Generation
- Found that optimizing a model's internal thought process yields gains not just in reasoning and math, but across all instruction-following tasks.
- Used RL and search to incentivize the model to enhance its own thought process, rather than explicitly teaching it how to think.
-
ICML 2025 From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
- Introduced BenchBuilder, a data pipeline that builds high-quality benchmarks from live data in Chatbot Arena, and used it to produce Arena-Hard.
-
ICLR 2025 RouteLLM: Learning to Route LLMs with Preference Data
- Trained routers to route between strong and weak model pairs.
- Cut costs by over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K relative to using only GPT-4, while retaining 95% of GPT-4's performance.
-
NeurIPS 2023 A Reduction-based Framework for Sequential Decision Making with Delayed Feedback
- Proposed a reduction-based framework that turns any multi-batched algorithm for sequential decision making with instantaneous feedback into a sample-efficient algorithm that manages stochastic delays.
- Proved sharp upper bounds and provided the first guarantee in sequential decision making with function approximation.
-
ICML 2022 Nearly Optimal Policy Optimization with Stable at Any Time Guarantee
- Proposed a novel Online Mirror Descent type algorithm that eliminates the data coverage assumption.
- Showed that a novel reference V estimator and l2 regularization in OMD ensure stability of the V estimation.
- Demonstrated that stability is the key to nearly optimal regret, obtaining a SOTA upper bound in the policy optimization setting.
-
ICLR 2021 A Unified Framework for Conservative Exploration
- Proposed a unified framework for conservative bandits and RL based on calculating the necessary and sufficient budget obtained from running a baseline policy.
- Showed how to turn a non-conservative algorithm into a conservative one within the framework.
- Obtained SOTA regret upper and lower bounds in tabular and low-rank settings.
-
ICML 2021 On Reinforcement Learning with Adversarial Corruption and Its Application to Block MDP
- Proposed a new RL algorithm robust to adversarial corruption with intensity lower than C.
- Proved a regret upper bound of SAC under tabular settings and a matching lower bound, showing the algorithm is optimal.
- Applying the algorithm to BMDPs yields the first √T regret bound in this setting.
-
NeurIPS 2020 Sanity-checking Pruning Methods: Random Tickets Can Win the Jackpot
- Experiments show that the conventional beliefs "pruning methods exploit training data to find good subnetworks" and "the architecture of the pruned network is crucial for good performance" are wrong.
- Proposed a simple zero-shot random pruning method that outperforms or attains similar performance to SOTA.