5 releases

0.0.7 Sep 1, 2024
0.0.6 Sep 19, 2023
0.0.5 Jan 22, 2022
0.0.4 Jul 18, 2021
0.0.3 May 30, 2021

#745 in Science


124 downloads per month
Used in 10 crates

MIT/Apache

87KB
2K SLoC

Core components for reinforcement learning.

Observation and action

The [Obs] and [Act] traits are abstractions of observations and actions in environments. These traits can handle two or more samples, enabling vectorized environments, although no vectorized environment is currently implemented.
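As a hedged sketch of this idea (the crate's actual trait items differ; the `len` method and the `PosObs` type are invented for illustration), an observation type can carry one or more samples:

```rust
// Sketch of an observation trait in the spirit of [Obs]; the exact
// items in this crate differ. `len` is a hypothetical method counting
// how many samples the observation carries.
pub trait Obs: Clone {
    /// Number of samples in this observation (>1 for a vectorized env).
    fn len(&self) -> usize;
}

/// A toy observation: one f32 position per environment instance.
#[derive(Clone)]
pub struct PosObs(pub Vec<f32>);

impl Obs for PosObs {
    fn len(&self) -> usize {
        self.0.len()
    }
}

fn main() {
    // A "batch" of two samples, as a vectorized environment would yield.
    let obs = PosObs(vec![0.0, 1.5]);
    assert_eq!(obs.len(), 2);
}
```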

Environment

The [Env] trait is an abstraction of environments. It has four associated types: Config, Obs, Act, and Info. Obs and Act are the concrete observation and action types of the environment and must implement the [Obs] and [Act] traits, respectively. An environment implementing [Env] generates a [Step&lt;E: Env&gt;] object at every interaction step via the [Env::step()] method. Info stores additional information at each step of the agent's interaction with the environment; it may be empty (a zero-sized struct). Config represents the environment's configuration and is used to build the environment.
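A simplified sketch of this shape (the real [Env] trait has more methods, e.g. for resetting, and different signatures; the `Line` environment is invented for illustration):

```rust
// Simplified sketch of the [Env] abstraction; not the crate's real API.
pub struct Step<O> {
    pub obs: O,        // o_{t+1}
    pub reward: f32,   // r_t
    pub is_done: bool,
    pub info: (),      // `Info` may be a zero-sized struct
}

pub trait Env {
    type Config;
    type Obs;
    type Act;
    /// One interaction step: apply an action, return the next observation.
    fn step(&mut self, act: &Self::Act) -> Step<Self::Obs>;
}

/// A toy 1-D environment: the action moves the position, reward is -|pos|.
pub struct Line {
    pub pos: f32,
}

impl Env for Line {
    type Config = ();
    type Obs = f32;
    type Act = f32;
    fn step(&mut self, act: &f32) -> Step<f32> {
        self.pos += act;
        Step { obs: self.pos, reward: -self.pos.abs(), is_done: false, info: () }
    }
}

fn main() {
    let mut env = Line { pos: 0.0 };
    let step = env.step(&2.0);
    assert_eq!(step.obs, 2.0);
    assert_eq!(step.reward, -2.0);
}
```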

Policy

[Policy&lt;E: Env&gt;] represents a policy. [Policy::sample()] takes an E::Obs and generates an E::Act. A policy may be probabilistic or deterministic.
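A minimal sketch of the observation-to-action mapping (the crate's Policy is parameterized over an Env; here the associated types are inlined and `GoHome` is an invented deterministic example):

```rust
// Sketch of a policy trait; the crate's [Policy<E: Env>] ties these
// types to an environment instead of declaring them directly.
pub trait Policy {
    type Obs;
    type Act;
    /// Map an observation to an action (probabilistic or deterministic).
    fn sample(&mut self, obs: &Self::Obs) -> Self::Act;
}

/// Deterministic toy policy: always step toward the origin.
struct GoHome;

impl Policy for GoHome {
    type Obs = f32;
    type Act = f32;
    fn sample(&mut self, obs: &f32) -> f32 {
        -obs.signum()
    }
}

fn main() {
    let mut pi = GoHome;
    assert_eq!(pi.sample(&3.0), -1.0);
    assert_eq!(pi.sample(&-2.0), 1.0);
}
```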

Agent

In this crate, [Agent&lt;E: Env, R: ReplayBufferBase&gt;] is defined as a trainable [Policy&lt;E: Env&gt;]. An agent is in either training or evaluation mode. In training mode, the agent's policy might be probabilistic for exploration, while in evaluation mode it might be deterministic.

The [Agent::opt()] method performs a single optimization step. What constitutes an optimization step varies per agent; it might comprise multiple stochastic gradient steps. Samples for training are taken from R: ReplayBufferBase.
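A hedged sketch of `opt` drawing a batch from a buffer (all names and the update rule are invented; a real agent would run gradient-based updates):

```rust
// Stand-in replay buffer; the real ReplayBufferBase is a trait.
struct Replay {
    rewards: Vec<f32>,
}

impl Replay {
    /// Stand-in for drawing a training batch from the buffer.
    fn batch(&self) -> &[f32] {
        &self.rewards
    }
}

struct Agent {
    train_mode: bool,
    value: f32, // a single "parameter" updated by optimization
}

impl Agent {
    /// One optimization step. Here it is a single averaging update; a
    /// real agent might take several stochastic gradient steps per call.
    fn opt(&mut self, buffer: &Replay) {
        let b = buffer.batch();
        let mean = b.iter().sum::<f32>() / b.len() as f32;
        self.value += 0.5 * (mean - self.value);
    }
}

fn main() {
    let buffer = Replay { rewards: vec![1.0, 3.0] };
    let mut agent = Agent { train_mode: true, value: 0.0 };
    agent.opt(&buffer);
    assert!(agent.train_mode);
    assert_eq!(agent.value, 1.0); // moved halfway toward the batch mean 2.0
}
```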

This trait also has methods for saving and loading the parameters of the trained policy to and from a directory.

Batch

TransitionBatch is a trait representing a batch of transitions (o_t, r_t, a_t, o_t+1). This trait is used to train [Agent]s with an RL algorithm.
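One plausible column-wise layout for such a batch (a sketch only; the crate's TransitionBatch is a trait with its own items, and the field names here are invented):

```rust
// Sketch of a transition batch (o_t, r_t, a_t, o_{t+1}) stored column-wise,
// i.e. one Vec per component, index i giving the i-th transition.
struct TransitionBatch {
    obs: Vec<f32>,      // o_t
    reward: Vec<f32>,   // r_t
    act: Vec<f32>,      // a_t
    next_obs: Vec<f32>, // o_{t+1}
}

impl TransitionBatch {
    /// Number of transitions in the batch.
    fn len(&self) -> usize {
        self.obs.len()
    }
}

fn main() {
    let batch = TransitionBatch {
        obs: vec![0.0, 1.0],
        reward: vec![-1.0, 0.0],
        act: vec![1.0, -1.0],
        next_obs: vec![1.0, 0.0],
    };
    assert_eq!(batch.len(), 2);
}
```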

Replay buffer and experience buffer

The ReplayBufferBase trait is an abstraction of replay buffers. One of its associated types, ReplayBufferBase::Batch, represents samples taken from the buffer for training [Agent]s. Agents must implement the [Agent::opt()] method, where ReplayBufferBase::Batch has an appropriate type or trait bound(s) to train the agent.

As explained above, the ReplayBufferBase trait can generate batches of samples with which agents are trained. The ExperienceBufferBase trait, on the other hand, can store samples: [ExperienceBufferBase::push()] pushes samples of type ExperienceBufferBase::Item, which might be obtained via interaction steps with an environment.
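The split between the two roles can be sketched as follows (trait and type names are simplified stand-ins, not the crate's API, and the "sampling" is made deterministic to keep the example small):

```rust
// Stand-in for ExperienceBufferBase: stores incoming samples.
trait ExperienceBuffer {
    type Item;
    fn push(&mut self, item: Self::Item);
}

// Stand-in for ReplayBufferBase: produces training batches.
trait ReplayBuffer {
    type Batch;
    fn batch(&self, size: usize) -> Self::Batch;
}

/// A bounded buffer implementing both roles.
struct RingBuffer {
    items: Vec<f32>,
    capacity: usize,
}

impl ExperienceBuffer for RingBuffer {
    type Item = f32;
    fn push(&mut self, item: f32) {
        if self.items.len() == self.capacity {
            self.items.remove(0); // drop the oldest sample
        }
        self.items.push(item);
    }
}

impl ReplayBuffer for RingBuffer {
    type Batch = Vec<f32>;
    fn batch(&self, size: usize) -> Vec<f32> {
        // Deterministic "sampling" (the last `size` items); a real
        // replay buffer would typically sample uniformly at random.
        self.items[self.items.len().saturating_sub(size)..].to_vec()
    }
}

fn main() {
    let mut buf = RingBuffer { items: vec![], capacity: 3 };
    for x in [1.0, 2.0, 3.0, 4.0] {
        buf.push(x);
    }
    assert_eq!(buf.items, vec![2.0, 3.0, 4.0]); // oldest sample evicted
    assert_eq!(buf.batch(2), vec![3.0, 4.0]);
}
```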

A reference implementation

SimpleReplayBuffer&lt;O, A&gt; implements both ReplayBufferBase and ExperienceBufferBase. This type has two parameters, O and A, which are the representations of observations and actions in the replay buffer. O and A must implement BatchBase, which provides the functionality of storing samples, like Vec&lt;T&gt;, for observations and actions. The associated types Item and Batch are the same type, GenericTransitionBatch, representing a set of (o_t, r_t, a_t, o_t+1) transitions.

SimpleStepProcessor&lt;E, O, A&gt; might be used with SimpleReplayBuffer&lt;O, A&gt;. It converts E::Obs and E::Act into BatchBases of the respective types and generates a GenericTransitionBatch. The conversion relies on the trait bounds O: From&lt;E::Obs&gt; and A: From&lt;E::Act&gt;.
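The conversion these bounds enable can be sketched with a plain `From` impl (the `ObsBatch` type is invented for illustration; the crate's BatchBase types have richer storage APIs):

```rust
// Sketch of the conversion a step processor relies on: a buffer-side
// batch type implements `From` for the environment's concrete type.
#[derive(Debug, PartialEq)]
struct ObsBatch(Vec<f32>);

impl From<f32> for ObsBatch {
    fn from(o: f32) -> Self {
        // Wrap a single environment observation as a one-sample batch.
        ObsBatch(vec![o])
    }
}

fn main() {
    let env_obs: f32 = 0.25;
    // `.into()` works because of the `From<f32>` impl above,
    // mirroring the `O: From<E::Obs>` bound in the text.
    let batch: ObsBatch = env_obs.into();
    assert_eq!(batch, ObsBatch(vec![0.25]));
}
```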

Trainer

Trainer manages the training loop and related objects. A Trainer is built with training configuration parameters such as the maximum number of optimization steps and the model directory in which the agent's parameters are saved during training. The Trainer::train method executes online training of an agent in an environment: in the training loop, the agent interacts with the environment to collect samples and performs optimization steps, while some metrics are recorded.
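The interact/store/optimize cycle of such a loop can be sketched as below (everything is a toy stand-in: fixed dynamics instead of a policy, an unbounded Vec instead of a replay buffer, and an averaging update instead of gradient steps):

```rust
/// Toy online training loop: interact, store the sample, optimize.
/// Returns (optimization steps performed, final "parameter").
fn train(max_opts: usize) -> (usize, f32) {
    let mut buffer: Vec<f32> = Vec::new(); // stand-in replay buffer
    let mut param = 0.0_f32;               // stand-in model parameter
    let mut opt_steps = 0;
    let mut pos = 0.0_f32;

    while opt_steps < max_opts {
        // 1. Interaction step: toy dynamics with a fixed "action".
        pos += 1.0;
        let reward = -pos;
        // 2. Store the sample in the buffer.
        buffer.push(reward);
        // 3. Optimization step over buffered samples (here: fit the mean).
        param = buffer.iter().sum::<f32>() / buffer.len() as f32;
        opt_steps += 1;
    }
    (opt_steps, param)
}

fn main() {
    let (steps, param) = train(5);
    assert_eq!(steps, 5);
    assert_eq!(param, -3.0); // mean of rewards -1..-5
}
```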

Evaluator

[Evaluator&lt;E, P&gt;] is used to evaluate the performance of a policy (P) in an environment (E). An object of this type is given to the Trainer object to evaluate the policy during training. [DefaultEvaluator&lt;E, P&gt;] is a default implementation of [Evaluator&lt;E, P&gt;]. This evaluator runs the policy in the environment for a certain number of episodes. At the start of each episode, the environment is reset using [Env::reset_with_index()] to control the evaluation conditions.
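The episode loop of such an evaluator can be sketched as follows (all functions here are invented stand-ins: `reset_with_ix` mimics index-based resets, `act` is a fixed deterministic policy, and episodes have a fixed length):

```rust
/// Stand-in for an index-based reset: episode `ix` starts at position ix+1.
fn reset_with_ix(ix: usize) -> f32 {
    (ix + 1) as f32
}

/// Toy deterministic policy: step toward the origin, then stop.
fn act(pos: f32) -> f32 {
    if pos > 0.0 { -1.0 } else { 0.0 }
}

/// Run one fixed-length episode and accumulate the return (reward -|pos|).
fn run_episode(mut pos: f32) -> f32 {
    let mut ret = 0.0;
    for _ in 0..3 {
        pos += act(pos);
        ret += -pos.abs();
    }
    ret
}

fn main() {
    // Evaluate over two episodes with controlled initial conditions
    // and report the mean return, as a DefaultEvaluator-style loop would.
    let n_episodes = 2;
    let mean_return: f32 = (0..n_episodes)
        .map(|ix| run_episode(reset_with_ix(ix)))
        .sum::<f32>()
        / n_episodes as f32;
    assert_eq!(mean_return, -0.5);
}
```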

Dependencies

~4–6MB
~123K SLoC