Bibliographic Details
Title: Process Reward Modeling with Entropy-Driven Uncertainty
Authors: Cao, Lang; Chen, Renhong; Zou, Yingtian; Peng, Chao; Ning, Wu; Xu, Huacong; Chen, Qian; Wang, Yuxian; Su, Peishuo; Peng, Mofan; Chen, Zijie; Li, Yitong
Publication Year: 2025
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
More Details:
This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a novel framework that approximates state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism that uses the entropy of the logit distribution to dynamically pinpoint high-uncertainty regions during token generation. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries achieve accuracy closely approximating that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), at a 98% reduction in query cost relative to prior methods. This work establishes EDU-PRM as an efficient approach to scalable process reward model training.
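
The core mechanism summarized above is entropy-guided step partitioning: per-token predictive entropy, computed from the generator's logits, flags where the model is uncertain, and those points serve as step boundaries for process-level feedback. Below is a minimal Python sketch of that idea, assuming access to a (seq_len, vocab_size) tensor of logits; the fixed threshold and both function names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def entropy_guided_step_boundaries(logits: torch.Tensor,
                                   threshold: float = 2.0) -> list[int]:
    """Return token indices whose predictive entropy exceeds `threshold`.

    logits: (seq_len, vocab_size) per-token logits from the generator.
    A fixed threshold is an assumption for illustration; the actual
    partitioning rule in EDU-PRM may differ.
    """
    log_probs = F.log_softmax(logits, dim=-1)      # (seq_len, vocab_size)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(dim=-1)     # Shannon entropy per token
    return torch.nonzero(entropy > threshold).flatten().tolist()

def split_into_steps(token_ids: list[int],
                     boundaries: list[int]) -> list[list[int]]:
    """Cut a sampled response into steps at the high-entropy boundary tokens."""
    steps, start = [], 0
    for b in boundaries:
        steps.append(token_ids[start:b + 1])
        start = b + 1
    if start < len(token_ids):
        steps.append(token_ids[start:])
    return steps
```

Splitting at high-entropy tokens keeps low-uncertainty spans grouped into a single step, so step-level reward queries concentrate where the model is most likely to err.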
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2503.22233
Accession Number: edsarx.2503.22233
Database: arXiv