Generalized Offline Actor-Critic with Behavior Regularization

Authors: Cheng Yu-Hu; Huang Long-Yang; Hou Di-Yuan; Zhang Jia-Zhi; Chen Jun-Long*; Wang Xue-Song
Source: Chinese Journal of Computers, 2023, 46(4): 843-855.
DOI: 10.11897/SP.J.1016.2023.00843

Abstract

The behavior regularized actor-critic (BRAC) is an offline reinforcement learning algorithm. It alleviates the distribution shift problem by taking the Kullback-Leibler (KL) divergence between the current and behavior policies as the regularization term in the policy objective function. However, KL divergence is an unbounded measure of distribution difference. When the policy difference is too large, the expected cumulative return in the policy objective function plays only a limited role in policy improvement, resulting in poor performance of the learned policy. To address this issue, we take the skew-symmetric Jensen-Shannon (JS) divergence between the current and behavior policies as the regularization term in the policy objective function and propose a generalized offline actor-critic with behavior regularization (GOACBR) algorithm. Theoretical analysis shows that, since the skew-symmetric JS divergence is bounded, taking it as the regularization term helps reduce the difference in policy performance. Furthermore, because the behavior policy is unknown, the skew-symmetric JS divergence between policies is difficult to calculate directly, so an auxiliary neural network is designed to estimate it indirectly. Finally, the convergence of GOACBR is proved theoretically. The performance of GOACBR is evaluated on the D4RL benchmark dataset. Compared with BRAC, the total average cumulative return achieved by GOACBR on all testing tasks increases by 289.8%. The source code is available at https://github.com/houge1996/GOAC. © 2023 Science Press.
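For orientation, the sketch below gives one common formulation of a skewed JS divergence between the current policy $\pi$ and the behavior policy $\pi_b$; the skew weight $\alpha$, the mixture $M_\alpha$, and the stated bound are illustrative conventions and may differ from the exact definition used in the paper.

$$
M_\alpha = \alpha\,\pi + (1-\alpha)\,\pi_b, \qquad
\mathrm{JS}_\alpha(\pi \,\|\, \pi_b) = \alpha\, D_{\mathrm{KL}}(\pi \,\|\, M_\alpha) + (1-\alpha)\, D_{\mathrm{KL}}(\pi_b \,\|\, M_\alpha) \;\le\; -\alpha\log\alpha - (1-\alpha)\log(1-\alpha).
$$

In contrast to the KL divergence, whose value can grow without bound as the two policies separate, this quantity is always finite, so the regularization term cannot overwhelm the expected-return term in the policy objective.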

Full Text