Self-Principled Critique Tuning (SPCT) // DeepSeek-GRM
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that proper learning methods could enable effective inference-time scalability. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e., the inference-time scalability of generalist RM, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in DeepSeek-GRM models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide the voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
Key Takeaways: SPCT (Self-Principled Critique Tuning) by DeepSeek
1. What is SPCT?
- A learning method for generative reward models (GRMs) that enables inference-time scaling (test-time compute) of reward quality.
- Focuses on self-generating adaptive principles and critiques to evaluate responses dynamically during inference.
2. Core Idea
- Unlike traditional training-time scaling (e.g., larger models or longer training), SPCT optimizes performance during inference by:
- Generating task-specific principles (e.g., “technical accuracy,” “clarity”) for each user query.
- Using critiques to evaluate responses against these principles in parallel.
- Combines a generative reward model (GRM) with a meta reward model to vote on the best output (see the sketch after this list).
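To make the core idea concrete, here is a minimal sketch in Python of voting-based inference-time scaling. The `grm_sample` callable is a hypothetical stand-in for one GRM generation (principles, then critiques, then pointwise scores); DeepSeek-GRM's actual prompting and parsing may differ.

```python
# Minimal sketch of SPCT-style inference-time scaling via voting, assuming one
# GRM sample returns a pointwise score per candidate response.
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical signature: one GRM sample maps (query, responses) to per-response scores.
GRMSample = Callable[[str, List[str]], Dict[int, int]]

def vote_on_responses(query: str, responses: List[str],
                      grm_sample: GRMSample, k: int = 8) -> int:
    """Draw k independent GRM samples (each generates its own principles and
    critiques before scoring every response), sum the pointwise scores as
    votes, and return the index of the winning response."""
    totals: Dict[int, int] = defaultdict(int)
    for _ in range(k):  # in practice these k samples are drawn in parallel
        scores = grm_sample(query, responses)
        for idx, score in scores.items():
            totals[idx] += score
    return max(totals, key=totals.get)
```

With k = 1 this reduces to a single GRM judgment; increasing k is the knob that spends more inference compute.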
3. How It Works
- Training Phase:
- Fine-tunes a reward model (not the main LLM) using rejective fine-tuning (filtering out low-quality sampled critiques) and rule-based online reinforcement learning (a data-filtering sketch follows this list).
- Inference Phase:
- For a user query, the system:
- Generates multiple principles (guidelines) and critiques (evaluations) in parallel.
- Scores responses via majority voting or a meta-reward model (for domain-specific tasks).
- Requires no sequential search or value backpropagation (as in Monte Carlo Tree Search), so the samples are independent and scaling stays efficient and parallel.
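For the training phase above, the following is a hedged sketch of the rejective fine-tuning idea: sampled principle/critique generations are kept as fine-tuning data only if their predicted ranking agrees with the label. The function and data layout are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch of rejective fine-tuning: keep only GRM samples whose predicted
# best response matches the labeled best response; reject the rest.
from typing import Dict, List, Tuple

def filter_grm_samples(
    samples: List[Tuple[str, Dict[int, int]]],  # (generated principles+critique text, predicted scores)
    best_index: int,                            # index of the labeled best response
) -> List[str]:
    """Return only the generations whose top-scored response matches the label;
    the rejected generations are dropped before fine-tuning."""
    kept: List[str] = []
    for critique_text, scores in samples:
        if max(scores, key=scores.get) == best_index:
            kept.append(critique_text)
    return kept
```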
4. Advantages
- Adaptive: Principles and critiques adjust to the query (e.g., math problems prioritize precision).
- Efficient: Parallel sampling explores a wider solution space without costly retraining (see the parallel-sampling sketch after this list).
- Performance: DeepSeek’s tests show SPCT outperforms traditional training-time scaling, even with smaller models (e.g., 27B parameters).
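Because the k GRM samples are mutually independent, they can be dispatched concurrently. A minimal sketch, reusing the hypothetical `grm_sample` interface from the earlier block (a thread pool is shown; a batched inference API would work just as well):

```python
# Minimal sketch of parallel sampling: the k GRM calls are independent, so they
# can run concurrently and their scores are summed exactly as in serial voting.
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

GRMSample = Callable[[str, List[str]], Dict[int, int]]

def parallel_vote(query: str, responses: List[str],
                  grm_sample: GRMSample, k: int = 32) -> int:
    """Dispatch k independent GRM samples concurrently, sum their scores as
    votes, and return the winning response index."""
    totals: Dict[int, int] = defaultdict(int)
    with ThreadPoolExecutor(max_workers=min(k, 32)) as pool:
        futures = [pool.submit(grm_sample, query, responses) for _ in range(k)]
        for fut in futures:
            for idx, score in fut.result().items():
                totals[idx] += score
    return max(totals, key=totals.get)
```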
5. Comparison to Existing Methods
- Similar in spirit to Monte Carlo Tree Search as a test-time compute method, but uses independent parallel samples rather than sequential value backpropagation.
- Builds on ideas like GRPO (Group Relative Policy Optimization) and earlier work on principle- and critique-based reward modeling.
6. Potential Applications
- Could enhance Mixture of Experts (e.g., LLaMA-4’s 128 experts) by tailoring principles to each expert’s domain.
- Ideal for specialized tasks (e.g., medicine, finance) where meta-reward models add precision.
7. Study Highlights
- DeepSeek’s experiments used a Gemma-2-27B base model, showing smaller reward models can rival larger ones when optimized for inference-time scaling.
- The meta reward model further boosts performance by guiding the voting intelligently (see the sketch below).
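A hedged sketch of meta-RM guided voting, assuming the meta RM assigns each GRM sample a scalar quality score (interfaces are illustrative): only the top-ranked samples contribute votes.

```python
# Hedged sketch of meta-RM guided voting: rank GRM samples by a meta-RM quality
# score, keep the best ones, and vote with only those samples' scores.
from collections import defaultdict
from typing import Dict, List, Tuple

def meta_guided_vote(
    samples: List[Tuple[Dict[int, int], float]],  # (per-response scores, meta-RM quality score)
    top_k: int = 16,
) -> int:
    """Keep the top_k samples by meta-RM score, sum their per-response scores
    as votes, and return the winning response index."""
    kept = sorted(samples, key=lambda s: s[1], reverse=True)[:top_k]
    totals: Dict[int, int] = defaultdict(int)
    for scores, _meta_score in kept:
        for idx, score in scores.items():
            totals[idx] += score
    return max(totals, key=totals.get)
```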
Final Thought
SPCT represents a paradigm shift — prioritizing inference-time optimization over brute-force model scaling. By making reward models “smarter,” DeepSeek unlocks efficiency gains without massive parameter increases.
For details, check the April 2025 paper by DeepSeek AI and Tsinghua University.
TL;DR: SPCT is a new way to improve AI responses during inference by dynamically generating and scoring principles/critiques, reducing the need for giant models. DeepSeek’s method is faster, adaptive, and highly parallelizable. 🚀