ICML 2026

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

Gengyue Han · Yiheng Feng

Purdue University

Problem formulation: the Sim-to-Real gap in reinforcement learning

The Sim2Real gap. RL training typically requires substantial interactions between the agent and the environment, making direct training on real systems expensive and often unsafe due to exploratory behaviors. As a result, many RL policies are first trained in simulators, where data collection is cheap and safety risks are minimized. However, when deployed in the real world, such policies trained in simulation often suffer from performance degradation and safety risks due to the mismatch between simulation and reality. This mismatch is commonly referred to as the Sim2Real gap.

Abstract

Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real-world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue, but typically at the cost of degraded performance or residual safety risks under unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging a latent context variable, as in meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows the risk level of the deployed policy to be adjusted dynamically based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.
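For readers unfamiliar with the setting, a family of CMDPs indexed by an environment context can be written as below. This is the standard discounted CMDP objective, not a formula taken from the paper: the symbols $e$ (context), $\mathcal{E}$ (set of contexts), $P_e$ (context-dependent dynamics), $\gamma$ (discount factor), and $d$ (cost budget) are notation assumed here for illustration, and the paper may use a different (e.g., episodic) form.

```latex
\max_{\pi} \;
\mathbb{E}_{\tau \sim (\pi,\, P_e)}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{\tau \sim (\pi,\, P_e)}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t) \right] \le d,
\qquad e \in \mathcal{E}
```

Under this reading, the Sim2Real gap corresponds to deploying in a context $e$ whose dynamics $P_e$ differ from those seen in simulation.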

Key Innovations

  • Unified safe and adaptive Sim2Real framework. We present a unified CMDP-based framework that integrates probabilistic latent context variable adaptation with distributional reinforcement learning, enabling safe and adaptive policy transfer under Sim2Real mismatch.
  • Inference-time risk regulation. By incorporating distributional RL into the CMDP formulation, our approach allows the risk levels of the deployed policy to be adjusted dynamically at inference time, balancing safety and performance during real-world deployment.
  • Theoretical guarantees. We prove training convergence, quantify the benefits of latent context variable adaptation, and establish safety guarantees under the Sim2Real gap.
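To make the idea of inference-time risk regulation concrete, here is a minimal sketch. It assumes the cost distribution is represented by a set of quantile samples and uses CVaR (the mean of the worst tail) as the risk measure; both are common choices in distributional RL but are assumptions here, not details confirmed by this page. The function `risk_level` and its linear schedule are hypothetical, using the latent posterior's standard deviation as a stand-in for "estimation accuracy".

```python
import numpy as np

def cvar_upper(quantiles: np.ndarray, alpha: float) -> float:
    """Mean of the worst (highest) alpha-fraction of cost quantile samples."""
    q = np.sort(np.asarray(quantiles, dtype=float))
    k = max(1, int(np.ceil(alpha * q.size)))  # number of tail samples kept
    return float(q[-k:].mean())

def risk_level(posterior_std: float, std_max: float = 1.0,
               alpha_min: float = 0.1) -> float:
    """Hypothetical schedule: an uncertain latent estimate gives a small alpha
    (cautious, tail-focused); a confident estimate gives alpha = 1 (risk-neutral)."""
    u = float(np.clip(posterior_std / std_max, 0.0, 1.0))
    return 1.0 - (1.0 - alpha_min) * u

# Illustrative cost quantiles for one state-action pair (made-up values).
cost_quantiles = np.array([2.0, 4.0, 6.0, 8.0, 10.0,
                           12.0, 14.0, 16.0, 18.0, 20.0])

# Early deployment: latent estimate is uncertain, so the cost estimate is pessimistic.
early_cost = cvar_upper(cost_quantiles, risk_level(posterior_std=1.0))
# Later: latent estimate is confident, so the cost estimate is the plain mean.
late_cost = cvar_upper(cost_quantiles, risk_level(posterior_std=0.0))
print(early_cost, late_cost)
```

The same trained quantile critic serves both cases; only the tail fraction used to read it changes, which is why the adjustment can happen at inference time without retraining.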

Method

Framework: latent context encoder + distributional policy + safety upper-bound

Framework overview of the proposed approach.

We introduce an encoder that extracts salient environment-specific information, enabling the learned policy to condition its actions on the underlying environment (Section 4.1). In addition, distributional reinforcement learning is employed to characterize the distributions of both rewards and costs under the latent information (Sections 4.2 and 4.3), allowing the risk level to be adjusted as the latent context variable adapts (Section 5.1). A safety upper bound is then developed (Section 5.2), and the agent is proven to be safe under dynamic adaptation of the risk-sensitive policy (Sections 5.3 and 5.4). Together, these components enable safe and effective policy transfer with limited real-world interactions under the Sim2Real gap.
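One common way to build a probabilistic latent context in meta-RL is to have the encoder emit a Gaussian factor per observed transition and fuse the factors as a product of Gaussians. The sketch below assumes that design, which this page does not confirm for the proposed encoder; the per-transition means and variances are random placeholders standing in for network outputs, and `product_of_gaussians` is an illustrative name.

```python
import numpy as np

def product_of_gaussians(mus: np.ndarray, variances: np.ndarray):
    """Fuse independent Gaussian factors N(mu_i, var_i) into one Gaussian.

    mus, variances: arrays of shape (num_transitions, latent_dim).
    Returns the posterior mean and variance, each of shape (latent_dim,).
    """
    precisions = 1.0 / variances
    post_var = 1.0 / precisions.sum(axis=0)
    post_mu = post_var * (precisions * mus).sum(axis=0)
    return post_mu, post_var

# Placeholder encoder outputs for 20 transitions and a 3-dimensional latent.
rng = np.random.default_rng(0)
mus = rng.normal(size=(20, 3))
variances = np.full((20, 3), 0.5)

# The posterior tightens as more transitions from the deployment environment arrive.
mu_5, var_5 = product_of_gaussians(mus[:5], variances[:5])
mu_20, var_20 = product_of_gaussians(mus, variances)
print(var_5, var_20)
```

In such a design the shrinking posterior variance gives a natural signal for how well the environment has been identified, which is the kind of quantity a risk-level schedule could be conditioned on.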

Experimental Results

Task environments

PointGoal2 task scene: a point robot navigating among hazards and vases toward a goal
PointGoal2. A point robot must reach a green goal while avoiding hazards (blue discs) and vases (cyan cubes). Safety constraint: cost ≤ 10 per episode.
Autonomous Driving platoon: the RL ego vehicle following the leader
Autonomous Driving. The RL ego (gold) controls the 3rd vehicle in the platoon, smoothing the platoon by suppressing speed oscillations propagated from the leading vehicle. Safety constraint: cost ≤ 20 per episode.
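The per-episode safety constraints above can be checked with a few lines of code. This sketch assumes the episode cost is the undiscounted sum of per-step costs, which is the usual convention in safe-RL benchmarks but is not stated on this page; the dictionary keys and function names are illustrative, and the step costs in the usage lines are made up.

```python
from typing import Sequence

# Cost budgets taken from the task descriptions; the keys are illustrative names.
COST_BUDGETS = {"PointGoal2": 10.0, "AutonomousDriving": 20.0}

def episode_cost(step_costs: Sequence[float]) -> float:
    """Total cost of one episode (assumed: undiscounted sum of step costs)."""
    return float(sum(step_costs))

def satisfies_constraint(task: str, step_costs: Sequence[float]) -> bool:
    """True if the episode's total cost stays within the task's budget."""
    return episode_cost(step_costs) <= COST_BUDGETS[task]

# Made-up episode: 11 steps with unit cost each.
steps = [1.0] * 11
print(satisfies_constraint("PointGoal2", steps))         # budget 10
print(satisfies_constraint("AutonomousDriving", steps))  # budget 20
```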

Demonstration video

Demonstration of our proposed method for PointGoal2 task.

During training, all methods successfully learn to reach the goal without violating the cost constraints. In deployment, however, a clear performance divergence emerges. The Nominal, Domain Randomized, and Robust RL baselines achieve high rewards but suffer from severe cost violations, indicating limited safety in a deployment environment with a Sim2Real gap. In contrast, both SPiDR and our method behave conservatively and incur substantially lower costs, at the expense of lower rewards. This reward–cost trade-off is expected in both tasks: in the first task, more cautious navigation leads to better hazard avoidance in unseen environments, and in the second task, more conservative AV speed control reduces the risk of rear-end collisions. Notably, compared to SPiDR, our method achieves a more favorable trade-off, attaining lower deployment costs while simultaneously achieving higher rewards.

Detailed analyses

Figure 3: cost trajectory
Dynamic risk-sensitive adaptation.
Figure 4a: cost bar chart across OOD scenarios
Figure 4b: reward bar chart across OOD scenarios
Robustness across OOD scenarios. Our method consistently maintains deployment costs below the threshold across all OOD scenarios, while preserving competitive reward (bottom).

BibTeX

citation.bib
@inproceedings{han2026transferable,
  title     = {Transferable Reinforcement Learning via Probabilistic Latent
               Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment},
  author    = {Han, Gengyue and Feng, Yiheng},
  booktitle = {Proceedings of the 43rd International Conference on
               Machine Learning (ICML)},
  year      = {2026},
  publisher = {PMLR},
  address   = {Seoul, South Korea}
}