5 releases
| Version | Released |
|---|---|
| 0.0.7 | Sep 1, 2024 |
| 0.0.6 | Sep 19, 2023 |
| 0.0.5 | Jan 22, 2022 |
| 0.0.4 | Jul 18, 2021 |
| 0.0.3 | May 30, 2021 |
#17 in #reinforcement
59 downloads per month
Used in 10 crates
87KB, 2K SLoC
Core components for reinforcement learning.
Observation and action
The [Obs] and [Act] traits are abstractions of observations and actions in environments.
These traits can hold two or more samples at once, which makes vectorized environments possible,
although no vectorized environment is currently implemented.
Environment
The [Env] trait is an abstraction of environments. It has four associated types:
Config, Obs, Act and Info. Obs and Act are the concrete types of
the environment's observations and actions.
These types must implement the [Obs] and [Act] traits, respectively.
An environment that implements [Env] generates a [Step<E: Env>] object
at every interaction step via the [Env::step()] method.
Info stores additional information about each interaction step between an agent and
the environment. It can be empty (a zero-sized struct). Config represents
the configuration of the environment and is used to build it.
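The relationship between the four associated types and the step object can be sketched as follows. This is a minimal, self-contained illustration: the trait and struct below are simplified stand-ins that reuse the names from the text, while the method signatures, the fields of `Step`, and the toy `CounterEnv`, `CounterConfig` and `NoInfo` types are assumptions made for the example, not the crate's actual definitions.

```rust
// Simplified stand-ins for illustration only; not the crate's actual definitions.

/// Result of one interaction step (hypothetical, reduced set of fields).
pub struct Step<E: Env> {
    pub obs: E::Obs,
    pub reward: f32,
    pub is_done: bool,
    pub info: E::Info,
}

/// Reduced environment abstraction with the four associated types.
pub trait Env: Sized {
    type Config;
    type Obs;
    type Act;
    type Info;

    /// Builds the environment from its configuration.
    fn build(config: &Self::Config) -> Self;

    /// Resets the environment and returns the initial observation.
    fn reset(&mut self) -> Self::Obs;

    /// Applies an action and returns the resulting step object.
    fn step(&mut self, act: &Self::Act) -> Step<Self>;
}

/// Configuration of the toy environment (hypothetical).
pub struct CounterConfig {
    pub limit: u32,
}

/// Zero-sized info type: no extra information is recorded per step.
pub struct NoInfo;

/// Toy environment: a counter whose episode ends once it reaches `limit`.
pub struct CounterEnv {
    count: u32,
    limit: u32,
}

impl Env for CounterEnv {
    type Config = CounterConfig;
    type Obs = u32;
    type Act = u32;
    type Info = NoInfo;

    fn build(config: &CounterConfig) -> Self {
        CounterEnv { count: 0, limit: config.limit }
    }

    fn reset(&mut self) -> u32 {
        self.count = 0;
        self.count
    }

    fn step(&mut self, act: &u32) -> Step<Self> {
        // The action is the increment; the reward equals the increment.
        self.count += *act;
        Step {
            obs: self.count,
            reward: *act as f32,
            is_done: self.count >= self.limit,
            info: NoInfo,
        }
    }
}
```

Because `Info` is a unit struct here, carrying it in every step costs no memory, which is the "empty" case mentioned above.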
Policy
[Policy<E: Env>] represents a policy. [Policy::sample()] takes E::Obs and
generates E::Act. A policy can be either probabilistic or deterministic.
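As a rough sketch of the deterministic and probabilistic cases, the example below defines a reduced `Policy` trait over a reduced `Env` trait. The signatures are assumptions for illustration and may differ from the crate's; `LineEnv`, `TowardOrigin` and `RandomWalk` are hypothetical names, and the tiny xorshift generator is only there to avoid external dependencies.

```rust
// Hypothetical, reduced traits for illustration; the crate's real signatures may differ.
pub trait Env {
    type Obs;
    type Act;
}

pub trait Policy<E: Env> {
    /// Maps an observation to an action.
    fn sample(&mut self, obs: &E::Obs) -> E::Act;
}

/// Toy environment: a position on a line, actions are -1 or +1.
pub struct LineEnv;

impl Env for LineEnv {
    type Obs = f32;
    type Act = i8;
}

/// Deterministic policy: always move toward the origin.
pub struct TowardOrigin;

impl Policy<LineEnv> for TowardOrigin {
    fn sample(&mut self, obs: &f32) -> i8 {
        if *obs > 0.0 { -1 } else { 1 }
    }
}

/// Probabilistic policy: ignores the observation and moves at random.
pub struct RandomWalk {
    state: u64,
}

impl RandomWalk {
    pub fn new(seed: u64) -> Self {
        // xorshift must not start from zero.
        RandomWalk { state: seed.max(1) }
    }

    fn next_bit(&mut self) -> bool {
        // xorshift64 update; the low bit decides the direction.
        self.state ^= self.state << 13;
        self.state ^= self.state >> 7;
        self.state ^= self.state << 17;
        self.state & 1 == 1
    }
}

impl Policy<LineEnv> for RandomWalk {
    fn sample(&mut self, _obs: &f32) -> i8 {
        if self.next_bit() { 1 } else { -1 }
    }
}
```

Taking `&mut self` in `sample` is a design choice of this sketch: it lets a probabilistic policy advance its random state on each call.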
Agent
In this crate, [Agent<E: Env, R: ReplayBufferBase>] is defined as a trainable
[Policy<E: Env>]. It is in either training or evaluation mode. In training mode,
the agent's policy might be probabilistic for exploration, while in evaluation mode,
the policy might be deterministic.
The [Agent::opt()] method performs a single optimization step. The definition of an
optimization step varies for each agent. A single optimization step may consist of
multiple stochastic gradient steps. Samples for training are taken from
R: ReplayBufferBase.
This trait also has methods for saving and loading the parameters of the trained policy in a directory.
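The idea that one optimization step may contain several gradient steps can be illustrated with a toy agent. The sketch below is deliberately reduced: the `Agent` trait here drops the environment parameter and the save/load methods, and every signature, along with `ScalarBuffer` and `MeanAgent`, is an assumption made for the example rather than the crate's API.

```rust
// Hypothetical, reduced traits for illustration; not the crate's actual API.
pub trait ReplayBufferBase {
    type Batch;
    /// Returns a batch of `size` samples, or None if too few are stored.
    fn batch(&mut self, size: usize) -> Option<Self::Batch>;
}

pub trait Agent<R: ReplayBufferBase> {
    fn train(&mut self);
    fn eval(&mut self);
    fn is_train(&self) -> bool;
    /// One optimization step; may run several gradient steps internally.
    fn opt(&mut self, buffer: &mut R);
}

/// Toy buffer holding scalar regression targets.
pub struct ScalarBuffer {
    pub data: Vec<f32>,
}

impl ReplayBufferBase for ScalarBuffer {
    type Batch = Vec<f32>;

    fn batch(&mut self, size: usize) -> Option<Vec<f32>> {
        if self.data.len() < size {
            return None;
        }
        // Deterministic "sampling" keeps the sketch reproducible.
        Some(self.data[..size].to_vec())
    }
}

/// Toy agent fitting one parameter to the mean of the targets.
pub struct MeanAgent {
    pub theta: f32,
    pub lr: f32,
    pub grad_steps: usize,
    training: bool,
}

impl MeanAgent {
    pub fn new(lr: f32, grad_steps: usize) -> Self {
        MeanAgent { theta: 0.0, lr, grad_steps, training: false }
    }
}

impl Agent<ScalarBuffer> for MeanAgent {
    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_train(&self) -> bool {
        self.training
    }

    fn opt(&mut self, buffer: &mut ScalarBuffer) {
        // A single optimization step made of `grad_steps` gradient steps.
        for _ in 0..self.grad_steps {
            if let Some(batch) = buffer.batch(4) {
                // Gradient of 0.5 * (theta - y)^2, averaged over the batch.
                let grad = batch.iter().map(|y| self.theta - y).sum::<f32>()
                    / batch.len() as f32;
                self.theta -= self.lr * grad;
            }
        }
    }
}
```

With a learning rate of 0.5 and targets all equal to 1.0, each gradient step halves the remaining error, so one call to `opt` with three gradient steps moves `theta` from 0 to 0.875.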
Batch
TransitionBatch is a trait representing a batch of transitions (o_t, r_t, a_t, o_t+1).
This trait is used to train Agents with an RL algorithm.
Replay buffer and experience buffer
The ReplayBufferBase trait is an abstraction of replay buffers.
One of its associated types, ReplayBufferBase::Batch, represents samples taken from
the buffer for training Agents. Agents must implement the [Agent::opt()] method,
where ReplayBufferBase::Batch has an appropriate type or trait bound(s) to train
the agent.
As explained above, the ReplayBufferBase trait can generate batches
of samples with which agents are trained. The ExperienceBufferBase
trait, on the other hand, can store samples. [ExperienceBufferBase::push()] is used to push
samples of type ExperienceBufferBase::Item, which might be obtained via interaction
steps with an environment.
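The split between storing samples and generating batches can be sketched with a single type that implements both roles. The two traits below reuse the names from the text, but their method signatures are assumptions; `Transition` and `RingBuffer` are hypothetical, and the round-robin batching stands in for the random sampling a real replay buffer would typically use.

```rust
// Hypothetical, reduced traits; names mirror the text but signatures are assumptions.
pub trait ExperienceBufferBase {
    type Item;
    /// Stores one sample.
    fn push(&mut self, item: Self::Item);
    /// Number of samples currently stored.
    fn len(&self) -> usize;
}

pub trait ReplayBufferBase {
    type Batch;
    /// Generates a batch of `size` samples for training.
    fn batch(&mut self, size: usize) -> Option<Self::Batch>;
}

/// One transition with scalar observation and discrete action.
#[derive(Clone, Debug, PartialEq)]
pub struct Transition {
    pub obs: f32,
    pub act: i32,
    pub reward: f32,
    pub next_obs: f32,
}

/// Fixed-capacity ring buffer; the oldest sample is overwritten when full.
pub struct RingBuffer {
    items: Vec<Transition>,
    capacity: usize,
    next: usize,
    cursor: usize,
}

impl RingBuffer {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        RingBuffer { items: Vec::with_capacity(capacity), capacity, next: 0, cursor: 0 }
    }
}

impl ExperienceBufferBase for RingBuffer {
    type Item = Transition;

    fn push(&mut self, item: Transition) {
        if self.items.len() < self.capacity {
            self.items.push(item);
        } else {
            self.items[self.next] = item;
        }
        self.next = (self.next + 1) % self.capacity;
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}

impl ReplayBufferBase for RingBuffer {
    type Batch = Vec<Transition>;

    fn batch(&mut self, size: usize) -> Option<Vec<Transition>> {
        if self.items.is_empty() || size == 0 {
            return None;
        }
        // Round-robin selection keeps the sketch deterministic;
        // random sampling would be the usual choice in practice.
        let mut out = Vec::with_capacity(size);
        for _ in 0..size {
            out.push(self.items[self.cursor % self.items.len()].clone());
            self.cursor = (self.cursor + 1) % self.items.len();
        }
        Some(out)
    }
}
```

Keeping the two traits separate means code that only collects experience can depend on `push` alone, while training code can depend only on `batch`.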
A reference implementation
SimpleReplayBuffer<O, A> implements both ReplayBufferBase and ExperienceBufferBase.
This type has two parameters, O and A, which are the representations of
observations and actions in the replay buffer. O and A must implement
BatchBase, which provides the functionality of storing samples, like Vec<T>,
for observations and actions. The associated types Item and Batch
are the same type, GenericTransitionBatch, representing sets of (o_t, r_t, a_t, o_t+1).
SimpleStepProcessor<E, O, A> can be used with SimpleReplayBuffer<O, A>.
It converts E::Obs and E::Act into BatchBases of the respective types
and generates a GenericTransitionBatch. The conversion relies on the trait bounds
O: From<E::Obs> and A: From<E::Act>.
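The role of the `From` bounds can be shown with a small generic function. This is only a sketch of the conversion idea: `process`, `TransitionBatch`, `GridEnv` and the batch types are hypothetical stand-ins, and the real step processor in the crate may be structured differently.

```rust
// Illustrative only: all types below are hypothetical stand-ins.
pub trait Env {
    type Obs;
    type Act;
}

/// A batch of transitions with generic observation and action storage.
pub struct TransitionBatch<O, A> {
    pub obs: O,
    pub act: A,
    pub reward: Vec<f32>,
    pub next_obs: O,
}

/// Converts one environment transition into a batch of size one,
/// relying only on the `From` bounds for the conversion.
pub fn process<E, O, A>(
    prev_obs: E::Obs,
    act: E::Act,
    reward: f32,
    obs: E::Obs,
) -> TransitionBatch<O, A>
where
    E: Env,
    O: From<E::Obs>,
    A: From<E::Act>,
{
    TransitionBatch {
        obs: prev_obs.into(),
        act: act.into(),
        reward: vec![reward],
        next_obs: obs.into(),
    }
}

/// Toy environment with its own observation and action types.
pub struct GridEnv;
pub struct GridObs(pub [f32; 2]);
pub struct GridAct(pub u8);

impl Env for GridEnv {
    type Obs = GridObs;
    type Act = GridAct;
}

/// Vec-backed storage used inside the buffer.
pub struct ObsBatch(pub Vec<[f32; 2]>);
pub struct ActBatch(pub Vec<u8>);

impl From<GridObs> for ObsBatch {
    fn from(o: GridObs) -> Self {
        ObsBatch(vec![o.0])
    }
}

impl From<GridAct> for ActBatch {
    fn from(a: GridAct) -> Self {
        ActBatch(vec![a.0])
    }
}
```

With this arrangement, supporting a new environment only requires `From` implementations for its observation and action types; the generic function itself does not change.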
Trainer
Trainer manages the training loop and related objects. The Trainer object is
built from a configuration of training parameters, such as the maximum number of
optimization steps and the model directory in which the agent's parameters are saved during training.
The Trainer::train method executes online training of an agent on an environment.
In the training loop of this method, the agent interacts with the environment to
collect samples and performs optimization steps. Some metrics are recorded at the same time.
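The shape of such an online training loop can be sketched in a few lines. The example below is heavily reduced and uses closures in place of the environment and agent: `TrainerConfig`, its `max_opts` and `opt_interval` fields, and the signature of `train` are all assumptions for illustration, not the crate's Trainer.

```rust
// Hypothetical sketch of an online training loop; not the crate's Trainer.
pub struct TrainerConfig {
    /// Maximum number of optimization steps.
    pub max_opts: usize,
    /// Number of environment steps between optimization steps.
    pub opt_interval: usize,
}

pub struct Trainer {
    config: TrainerConfig,
}

impl Trainer {
    pub fn build(mut config: TrainerConfig) -> Self {
        // Guard against a zero interval, which would never trigger optimization.
        config.opt_interval = config.opt_interval.max(1);
        Trainer { config }
    }

    /// `interact` produces one sample per environment step.
    /// `opt` performs one optimization step on the stored samples
    /// and returns a metric to record.
    pub fn train<S>(
        &self,
        mut interact: impl FnMut() -> S,
        mut opt: impl FnMut(&[S]) -> f32,
    ) -> Vec<(usize, f32)> {
        let mut buffer: Vec<S> = Vec::new();
        let mut records = Vec::new();
        let mut env_steps = 0usize;
        let mut opt_steps = 0usize;

        while opt_steps < self.config.max_opts {
            // Interaction: collect one sample and store it.
            buffer.push(interact());
            env_steps += 1;

            // Optimization: run every `opt_interval` environment steps.
            if env_steps % self.config.opt_interval == 0 {
                let metric = opt(&buffer[..]);
                opt_steps += 1;
                records.push((opt_steps, metric));
            }
        }
        records
    }
}
```

A call could look like `Trainer::build(TrainerConfig { max_opts: 3, opt_interval: 2 }).train(|| 1.0f32, |samples| samples.len() as f32)`, which performs three optimization steps over six environment steps and returns one recorded metric per optimization step.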
Evaluator
[Evaluator<E, P>] is used to evaluate the performance of a policy (P) in an environment (E).
An object of this type is given to the Trainer object to evaluate the policy during training.
[DefaultEvaluator<E, P>] is a default implementation of [Evaluator<E, P>].
This evaluator runs the policy in the environment for a certain number of episodes.
At the start of each episode, the environment is reset using [Env::reset_with_index()]
to control specific conditions for evaluation.
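An evaluation loop of this kind, with an index-based reset at the start of every episode, might look like the sketch below. The reduced `Env` and `Policy` traits, the tuple returned by `step`, the free function `evaluate`, and the toy `Countdown` and `StepOne` types are assumptions for illustration; the crate's evaluator is a type with its own interface.

```rust
// Hypothetical, reduced traits for illustration; not the crate's actual API.
pub trait Env {
    type Obs;
    type Act;
    /// Resets using an episode index so evaluation conditions are reproducible.
    fn reset_with_index(&mut self, ix: usize) -> Self::Obs;
    /// Returns (next observation, reward, done).
    fn step(&mut self, act: &Self::Act) -> (Self::Obs, f32, bool);
}

pub trait Policy<E: Env> {
    fn sample(&mut self, obs: &E::Obs) -> E::Act;
}

/// Runs the policy for `n_episodes` episodes and returns the mean episode return.
pub fn evaluate<E: Env, P: Policy<E>>(env: &mut E, policy: &mut P, n_episodes: usize) -> f32 {
    let mut total = 0.0f32;
    for ix in 0..n_episodes {
        // The episode index selects the initial condition.
        let mut obs = env.reset_with_index(ix);
        loop {
            let act = policy.sample(&obs);
            let (next, reward, done) = env.step(&act);
            total += reward;
            obs = next;
            if done {
                break;
            }
        }
    }
    total / n_episodes.max(1) as f32
}

/// Toy environment: episode `ix` starts at position `ix + 1` and ends at 0.
/// Every step costs a reward of -1.
pub struct Countdown {
    pub pos: usize,
}

impl Env for Countdown {
    type Obs = usize;
    type Act = usize;

    fn reset_with_index(&mut self, ix: usize) -> usize {
        self.pos = ix + 1;
        self.pos
    }

    fn step(&mut self, act: &usize) -> (usize, f32, bool) {
        self.pos = self.pos.saturating_sub(*act);
        (self.pos, -1.0, self.pos == 0)
    }
}

/// Policy that always moves one unit toward zero.
pub struct StepOne;

impl Policy<Countdown> for StepOne {
    fn sample(&mut self, _obs: &usize) -> usize {
        1
    }
}
```

Because the reset depends only on the episode index, repeated evaluations during training see the same set of initial conditions and their scores stay comparable.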
Dependencies
~4–6MB
~124K SLoC