Jump-Start Reinforcement Learning

Training a reinforcement learning agent from scratch is often challenging, as reward sparsity and exploration issues can impede learning. Leveraging offline data to warm-start the policy can significantly enhance learning; however, naive initialisation tends to yield poor results. A recent paper investigates various methods to transition from offline initialisation to online fine-tuning, providing both experimental and theoretical insights on sample complexity.

Reinforcement learning (RL) is known to struggle with sample efficiency, especially when an agent is trained from scratch. This is often because the reward signal is sparse relative to the size of the action or observation space.

One way to address this issue is to leverage offline data to initialise the new policy. This involves using episodes collected from an existing policy or from human demonstrations, i.e. a human expert provides a set of trajectories, each of which is then labelled with the appropriate reward. This approach is particularly useful in industrial settings where a policy is already in place, e.g. rule-based or derived from classical control methods, or where manual data collection is feasible. An example is the alignment of large language models to human preferences, which we have discussed in a series of pills, e.g. on direct preference optimization and advantage-induced policy alignment.

However, offline RL has its own difficulties. In particular, the offline data may not be comprehensive enough to train a robust policy, which leads to bootstrapping errors when the value function is queried on unseen states and actions, and hence to poor generalisation. Recent work has devised effective solutions to this problem: for instance, implicit Q-learning achieves state-of-the-art results on the D4RL benchmarks (see the pill on IQL).

Figure 1 from [Uch23J]. The graph shows the performance of a policy with and without a properly trained critic network. At negative steps, the policy is trained with standard RL techniques until it achieves a good score. At step zero, the policy network is handed to a new agent whose value network is freshly initialised at random. The inadequate learning signal from the critic causes the initially effective policy to be rapidly forgotten.

A related line of research focuses on how to transition seamlessly from offline to online learning, with the aim of reducing the number of online samples needed to train a successful policy. Transitioning from a well-performing offline policy to a new one can be challenging, particularly for actor-critic or value-based methods, because value functions typically require both positive and negative examples to be learned effectively. If the value functions are not initialised correctly, the benefits of offline pre-training can be lost. Figure 1 shows how quickly a good policy is forgotten when the critic is randomly initialised.

The recent paper [Uch23J] proposes a method, named “jump-start reinforcement learning” (JSRL), to effectively transition from a pre-existing “guide” policy $\pi^g$ to a new “exploration” policy $\pi^e$. The guide policy is learned with offline RL techniques, while the exploration policy is optimised with online RL. The challenge lies in the fact that states from $\pi^g$ may be out-of-distribution for $\pi^e$ if introduced all at once. To mitigate this, the paper suggests a multi-stage curriculum in which the guide policy is used up to a certain time step $h$, after which the exploration policy takes over and completes the episode. As the exploration policy improves, the guide policy’s role is gradually reduced until the exploration policy is fully in control (see Figure 2). The paper also analyses an alternative approach in which the switching step is selected at random for each episode. The method is reminiscent of other approaches that build a curriculum from different exploration modes, such as go-explore [Eco21F]. The key distinction is that here the starting position for exploration is not predetermined, but is instead obtained by rolling out the guide policy.
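To make the switching mechanism concrete, here is a minimal sketch of a single rollout under the curriculum, assuming a Gymnasium-style `env` and callable `guide_policy` and `exploration_policy` objects (all names are ours; the paper's implementation may differ in its details):

```python
def jsrl_rollout(env, guide_policy, exploration_policy, switch_step, max_steps=1000):
    """Collect one episode in which the guide policy acts for the first
    `switch_step` steps and the exploration policy completes the episode."""
    obs, _ = env.reset()
    episode = []
    for t in range(max_steps):
        acting = guide_policy if t < switch_step else exploration_policy
        action = acting(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        # Record which policy produced the transition, so the caller can decide
        # how to use it in the online update (a detail left open in this sketch).
        episode.append((obs, action, reward, next_obs, terminated, t >= switch_step))
        obs = next_obs
        if terminated or truncated:
            break
    return episode
```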

Figure 2 from [Uch23J]. In standard RL training, the agent starts from scratch, selecting actions at random and receiving rewards. When a good policy is already available, however, optimisation can proceed in stages. In the first stage of the curriculum, the guide policy is rolled out to a state close to the goal, and exploration only has to complete the final steps of the trajectory. Once the exploration policy performs well on this restricted problem, it is applied to longer segments of the trajectory (stages 2 to N).
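One possible way to schedule the hand-over, reusing `jsrl_rollout` from the sketch above. The stage-advancement criterion (evaluated return within a tolerance of the best return seen so far) is our reading of the curriculum idea, and the `update` and `evaluate` helpers are placeholders for the online RL update and an evaluation rollout:

```python
def jsrl_curriculum(env, guide_policy, exploration_policy, update, evaluate,
                    horizon, n_stages=10, episodes_per_stage=100, tolerance=1.0):
    """Anneal the guide/exploration switch point from late in the episode down
    to step 0, handing ever longer trajectory tails to the exploration policy."""
    best_return = -float("inf")
    stage = 1
    while stage <= n_stages:
        # An earlier switch step means the exploration policy controls more of the episode.
        switch_step = int(horizon * (1 - stage / n_stages))
        for _ in range(episodes_per_stage):
            episode = jsrl_rollout(env, guide_policy, exploration_policy, switch_step)
            update(exploration_policy, episode)  # any online RL update, e.g. IQL fine-tuning
        score = evaluate(exploration_policy, switch_step)
        if score >= best_return - tolerance:  # good enough: hand over a longer tail
            best_return = max(best_return, score)
            stage += 1
```

The random-switch variant mentioned above would instead draw `switch_step` uniformly over the horizon for every episode, with no stage bookkeeping.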

While the idea may seem simple, it can be proven that, under certain conditions on the guide policy, the proposed curriculum significantly reduces the sample complexity. Specifically, when the guide policy covers all the states visited by the optimal policy, and in the case of $\epsilon$-greedy exploration, the sample complexity drops from exponential to polynomial. It is important to note that the assumption on the guide policy does not impose any constraints on the actions taken in those states: the guide policy only has to provide a rough roadmap from any starting point to the goal, just enough for exploration not to get stuck. In practice, even though this requirement is quite permissive, obtaining enough offline data to cover all the “important” states can be challenging and costly.
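Schematically, and in our own notation rather than the paper’s exact statement, the coverage condition can be written as a bound on the ratio of state-visitation distributions, $\sup_{s} \, d_h^{\pi^\star}(s) / d_h^{\pi^g}(s) \le C$ for every step $h$ of the horizon, where $d_h^{\pi}$ denotes the distribution of states visited by policy $\pi$ at step $h$ and $\pi^\star$ is the optimal policy. A coverage constant of this kind then enters the polynomial sample-complexity bound: the better the guide policy covers the optimal policy’s states, the cheaper the online phase.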

The authors conducted several experiments on the D4RL benchmarks to demonstrate the significant improvement in sample efficiency achieved by the proposed method. The results for some of the environments are shown in Figure 3. In these experiments, all the algorithms were first pre-trained on datasets of different sizes and then fine-tuned for one million online steps. In the case of JSRL, the exploration policy and critic can either be copied from the guide policy or initialised at random. The paper analysed both cases and found that the scores depend strongly on the environment and on the size of the offline dataset. The results shown in Figure 3 correspond to the case where the exploration policy is randomly initialised.

The best-performing algorithm trains the guide policy with IQL and gradually phases in the exploration policy. Notably, the benefits of the curriculum approach (compared to direct IQL fine-tuning) are most evident when the offline dataset is small. The paper reports similar results on a more challenging robotic manipulation task (see Table 2 in [Uch23J]).

Figure 3 from [Uch23J]. The table displays the performance of various offline pre-training methods on the D4RL environments. AWAC, BC, CQL, and IQL are well-known offline RL methods, with IQL achieving state-of-the-art scores. All of these methods are used to pre-train a policy on datasets of different sizes, and the resulting networks are fine-tuned for one million steps using online learning. The last column presents the scores for IQL pre-training and JSRL fine-tuning (both with curriculum and random schedules). It is worth noting that curriculum learning is particularly beneficial when the offline dataset is small. Additionally, even the random schedule outperforms naïve IQL fine-tuning.

Curriculum learning is widely recognized as beneficial for RL training, but it is nonetheless reassuring to have a formal proof that confirms it. On the experimental side, this approach provides a valuable method for creating a curriculum that leverages offline training to speed up online RL. It is particularly useful in situations where collecting offline examples is expensive or the state space is large and intricate.

References

[Uch23J] Ikechukwu Uchendu et al., Jump-Start Reinforcement Learning, ICML 2023.

[Eco21F] Adrien Ecoffet et al., First return, then explore, Nature, 2021.
