5 releases
Version | Release date |
---|---|
0.0.7 | Sep 1, 2024 |
0.0.6 | Sep 19, 2023 |
0.0.5 | Jan 22, 2022 |
0.0.4 | Jul 18, 2021 |
0.0.3 | May 30, 2021 |
#17 in #reinforcement
59 downloads per month
Used in 10 crates
87KB, 2K SLoC
Core components for reinforcement learning.
Observation and action
[`Obs`] and [`Act`] traits are abstractions of observations and actions in environments.
These traits can handle two or more samples at once, which allows vectorized environments to be implemented,
although there is currently no implementation of a vectorized environment.
Environment
[`Env`] trait is an abstraction of environments. It has four associated types:
`Config`, `Obs`, `Act` and `Info`. `Obs` and `Act` are the concrete types of
observation and action of the environment.
These types must implement the [`Obs`] and [`Act`] traits, respectively.
An environment that implements [`Env`] generates a [`Step<E: Env>`] object
at every environment interaction step with the [`Env::step()`] method.
`Info` stores some information at every step of interaction between an agent and
the environment. It could be empty (a zero-sized struct). `Config` represents
the configuration of the environment and is used to build it.
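The shape of such a trait can be sketched as follows. This is a minimal illustration under simplified assumptions, not the crate's actual definitions: the method signatures, the fields of `Step`, and the toy `CountEnv`, `CountEnvConfig`, `NoInfo` and `run_episode` names are all hypothetical.

```rust
/// Hypothetical, simplified environment trait with four associated types.
trait Env: Sized {
    type Config;
    type Obs;
    type Act;
    type Info;

    /// Builds the environment from its configuration.
    fn build(config: &Self::Config) -> Self;
    /// Resets the environment and returns the initial observation.
    fn reset(&mut self) -> Self::Obs;
    /// Advances the environment by one interaction step.
    fn step(&mut self, act: &Self::Act) -> Step<Self>;
}

/// Result of one interaction step (field names are assumptions).
struct Step<E: Env> {
    obs: E::Obs,
    reward: f32,
    is_done: bool,
    info: E::Info,
}

/// Zero-sized info type: nothing extra is reported per step.
struct NoInfo;

/// Configuration of the toy environment.
struct CountEnvConfig {
    max_steps: u32,
}

/// Toy environment: a counter that terminates once it reaches `max_steps`.
struct CountEnv {
    count: u32,
    max_steps: u32,
}

impl Env for CountEnv {
    type Config = CountEnvConfig;
    type Obs = u32;
    type Act = u32;
    type Info = NoInfo;

    fn build(config: &Self::Config) -> Self {
        CountEnv { count: 0, max_steps: config.max_steps }
    }

    fn reset(&mut self) -> u32 {
        self.count = 0;
        self.count
    }

    fn step(&mut self, act: &u32) -> Step<Self> {
        self.count += *act;
        Step {
            obs: self.count,
            reward: 1.0,
            is_done: self.count >= self.max_steps,
            info: NoInfo,
        }
    }
}

/// Runs one episode with a constant action of 1 and returns
/// the final observation and the sum of rewards.
fn run_episode(env: &mut CountEnv) -> (u32, f32) {
    let mut last_obs = env.reset();
    let mut total_reward = 0.0;
    loop {
        let step = env.step(&1);
        let _info: NoInfo = step.info; // empty, carried only to show the type
        last_obs = step.obs;
        total_reward += step.reward;
        if step.is_done {
            return (last_obs, total_reward);
        }
    }
}

fn main() {
    let mut env = CountEnv::build(&CountEnvConfig { max_steps: 3 });
    let (obs, ret) = run_episode(&mut env);
    println!("final observation = {}, return = {}", obs, ret);
}
```

Keeping `Info` as a zero-sized struct, as in this sketch, means the per-step information costs nothing at runtime when an environment has nothing to report.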
Policy
[`Policy<E: Env>`] represents a policy. [`Policy::sample()`] takes `E::Obs` and
generates `E::Act`. The policy could be probabilistic or deterministic.
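As a rough illustration of the two kinds of policy, the sketch below defines a stripped-down `Env` and `Policy` pair and implements one deterministic and one probabilistic policy. The trait signatures are simplified assumptions, and `ToyEnv`, `ThresholdPolicy` and `CoinFlipPolicy` are hypothetical names; the small linear congruential generator only stands in for a real random number generator.

```rust
/// Hypothetical minimal environment trait: only the types a policy needs.
trait Env {
    type Obs;
    type Act;
}

/// Hypothetical policy trait: maps an observation to an action.
trait Policy<E: Env> {
    fn sample(&mut self, obs: &E::Obs) -> E::Act;
}

/// Toy environment with a scalar observation and a binary action.
struct ToyEnv;

impl Env for ToyEnv {
    type Obs = f32;
    type Act = bool;
}

/// Deterministic policy: acts whenever the observation exceeds a threshold.
struct ThresholdPolicy {
    threshold: f32,
}

impl Policy<ToyEnv> for ThresholdPolicy {
    fn sample(&mut self, obs: &f32) -> bool {
        *obs > self.threshold
    }
}

/// Probabilistic policy: acts with probability `p`, ignoring the observation.
struct CoinFlipPolicy {
    p: f32,
    state: u64,
}

impl CoinFlipPolicy {
    /// Returns a pseudo-random number in [0, 1) from a linear congruential generator.
    fn next_uniform(&mut self) -> f32 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // The top 24 bits fit exactly into an f32 mantissa.
        ((self.state >> 40) as f32) / ((1u64 << 24) as f32)
    }
}

impl Policy<ToyEnv> for CoinFlipPolicy {
    fn sample(&mut self, _obs: &f32) -> bool {
        self.next_uniform() < self.p
    }
}

fn main() {
    let mut det = ThresholdPolicy { threshold: 0.5 };
    let mut sto = CoinFlipPolicy { p: 0.3, state: 42 };
    println!("deterministic: {}", det.sample(&0.7));
    println!("probabilistic: {}", sto.sample(&0.7));
}
```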
Agent
In this crate, [`Agent<E: Env, R: ReplayBufferBase>`] is defined as a trainable
[`Policy<E: Env>`]. It is in either training or evaluation mode. In training mode,
the agent's policy might be probabilistic for exploration, while in evaluation mode,
the policy might be deterministic.
The [`Agent::opt()`] method performs a single optimization step. The definition of an
optimization step varies for each agent; a single optimization step might consist of
multiple stochastic gradient steps. Samples for training are taken from
`R: ReplayBufferBase`.
This trait also has methods for saving/loading parameters of the trained policy in a directory.
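The interplay between the two modes and the optimization step can be sketched with a toy agent. Everything below is a hypothetical simplification: the trait signatures are assumptions rather than the crate's API, `VecBuffer` and `MeanAgent` are illustrative names, the fixed `noise` offset stands in for real exploration noise, and saving/loading is omitted.

```rust
/// Hypothetical stand-in for a replay buffer that yields training batches.
trait ReplayBufferBase {
    type Batch;
    fn batch(&mut self, size: usize) -> Option<Self::Batch>;
}

/// Hypothetical agent trait: a policy that can also be trained.
trait Agent<R: ReplayBufferBase> {
    fn train(&mut self);
    fn eval(&mut self);
    fn is_train(&self) -> bool;
    fn sample(&mut self, obs: f32) -> f32;
    /// Performs one optimization step with samples drawn from `buffer`.
    fn opt(&mut self, buffer: &mut R);
}

/// Toy buffer holding scalar targets.
struct VecBuffer {
    data: Vec<f32>,
}

impl ReplayBufferBase for VecBuffer {
    type Batch = Vec<f32>;

    fn batch(&mut self, size: usize) -> Option<Vec<f32>> {
        if self.data.len() < size {
            return None;
        }
        // Takes the most recent samples; a real buffer would sample randomly.
        Some(self.data[self.data.len() - size..].to_vec())
    }
}

/// Toy agent that learns a single scalar estimate.
struct MeanAgent {
    estimate: f32,
    lr: f32,
    grad_steps: usize,
    batch_size: usize,
    noise: f32,
    training: bool,
}

impl Agent<VecBuffer> for MeanAgent {
    fn train(&mut self) {
        self.training = true;
    }

    fn eval(&mut self) {
        self.training = false;
    }

    fn is_train(&self) -> bool {
        self.training
    }

    fn sample(&mut self, _obs: f32) -> f32 {
        if self.training {
            self.estimate + self.noise // exploratory action
        } else {
            self.estimate // deterministic action
        }
    }

    fn opt(&mut self, buffer: &mut VecBuffer) {
        if let Some(batch) = buffer.batch(self.batch_size) {
            let mean = batch.iter().sum::<f32>() / batch.len() as f32;
            // One optimization step made of several gradient steps.
            for _ in 0..self.grad_steps {
                self.estimate += self.lr * (mean - self.estimate);
            }
        }
    }
}

fn main() {
    let mut buffer = VecBuffer { data: vec![2.0; 4] };
    let mut agent = MeanAgent {
        estimate: 0.0,
        lr: 0.5,
        grad_steps: 2,
        batch_size: 4,
        noise: 0.1,
        training: false,
    };
    agent.train();
    agent.opt(&mut buffer);
    agent.eval();
    println!("training mode: {}, action: {}", agent.is_train(), agent.sample(0.0));
}
```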
Batch
`TransitionBatch` is a trait for a batch of transitions `(o_t, r_t, a_t, o_t+1)`.
This trait is used to train `Agent`s using an RL algorithm.
Replay buffer and experience buffer
`ReplayBufferBase` trait is an abstraction of replay buffers.
One of its associated types, `ReplayBufferBase::Batch`, represents samples taken from
the buffer for training `Agent`s. Agents must implement the [`Agent::opt()`] method,
where `ReplayBufferBase::Batch` has an appropriate type or trait bound(s) to train
the agent.
As explained above, the `ReplayBufferBase` trait can generate batches
of samples with which agents are trained. On the other hand, the `ExperienceBufferBase`
trait can store samples. [`ExperienceBufferBase::push()`] is used to push
samples of type `ExperienceBufferBase::Item`, which might be obtained via interaction
steps with an environment.
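The split between storing and sampling can be sketched with one type that implements both roles. The two traits below are simplified assumptions (for instance, they return plain values rather than any error type), and `Transition` and `RingBuffer` are hypothetical names; the linear congruential generator is only a stand-in for proper random sampling.

```rust
/// Hypothetical trait for the storing side of a buffer.
trait ExperienceBufferBase {
    type Item;
    fn push(&mut self, item: Self::Item);
    fn len(&self) -> usize;
}

/// Hypothetical trait for the sampling side of a buffer.
trait ReplayBufferBase {
    type Batch;
    fn batch(&mut self, size: usize) -> Option<Self::Batch>;
}

/// A single transition (o_t, r_t, a_t, o_t+1).
#[derive(Clone, Debug)]
struct Transition {
    obs: f32,
    act: i32,
    reward: f32,
    next_obs: f32,
}

/// Fixed-capacity ring buffer that overwrites its oldest entry when full.
struct RingBuffer {
    capacity: usize,
    items: Vec<Transition>,
    next: usize,
    rng_state: u64,
}

impl RingBuffer {
    fn new(capacity: usize) -> Self {
        RingBuffer { capacity, items: Vec::new(), next: 0, rng_state: 7 }
    }
}

impl ExperienceBufferBase for RingBuffer {
    type Item = Transition;

    fn push(&mut self, item: Transition) {
        if self.items.len() < self.capacity {
            self.items.push(item);
        } else {
            self.items[self.next] = item;
        }
        self.next = (self.next + 1) % self.capacity;
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}

impl ReplayBufferBase for RingBuffer {
    type Batch = Vec<Transition>;

    /// Samples `size` transitions with replacement; `None` if the buffer is empty.
    fn batch(&mut self, size: usize) -> Option<Vec<Transition>> {
        if self.items.is_empty() {
            return None;
        }
        let mut out = Vec::with_capacity(size);
        for _ in 0..size {
            self.rng_state = self
                .rng_state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let ix = (self.rng_state >> 33) as usize % self.items.len();
            out.push(self.items[ix].clone());
        }
        Some(out)
    }
}

fn main() {
    let mut buf = RingBuffer::new(2);
    for i in 0..3 {
        buf.push(Transition { obs: i as f32, act: i, reward: 1.0, next_obs: (i + 1) as f32 });
    }
    let batch = buf.batch(4).expect("buffer is not empty");
    println!("stored = {}, sampled = {:?}", buf.len(), batch);
}
```

Separating the two traits lets the code that collects experience depend only on `push()`, while the agent's optimization step depends only on `batch()`.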
A reference implementation
`SimpleReplayBuffer<O, A>` implements both `ReplayBufferBase` and `ExperienceBufferBase`.
This type has two parameters, `O` and `A`, which are the representations of
observation and action in the replay buffer. `O` and `A` must implement
`BatchBase`, which has the functionality of storing samples, like `Vec<T>`,
for observation and action. The associated types `Item` and `Batch`
are the same type, `GenericTransitionBatch`, representing sets of `(o_t, r_t, a_t, o_t+1)`.
`SimpleStepProcessor<E, O, A>` might be used with `SimpleReplayBuffer<O, A>`.
It converts `E::Obs` and `E::Act` into `BatchBase`s of the respective types
and generates a `GenericTransitionBatch`. The conversion process relies on the trait bounds
`O: From<E::Obs>` and `A: From<E::Act>`.
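The role of the `From` bounds can be illustrated with a small generic function. This is a sketch of the conversion idea only, not the crate's step processor: `PointObs`, `DiscreteAct`, `ObsBatch`, `ActBatch`, the local `TransitionBatch` struct and the `process` function are all hypothetical names.

```rust
/// Environment-side observation type.
struct PointObs {
    x: f32,
    y: f32,
}

/// Environment-side action type.
struct DiscreteAct(i32);

/// Buffer-side observation storage: one row per sample.
#[derive(Debug, PartialEq)]
struct ObsBatch(Vec<[f32; 2]>);

/// Buffer-side action storage: one entry per sample.
#[derive(Debug, PartialEq)]
struct ActBatch(Vec<i32>);

impl From<PointObs> for ObsBatch {
    fn from(o: PointObs) -> Self {
        ObsBatch(vec![[o.x, o.y]])
    }
}

impl From<DiscreteAct> for ActBatch {
    fn from(a: DiscreteAct) -> Self {
        ActBatch(vec![a.0])
    }
}

/// A batch of transitions (o_t, r_t, a_t, o_t+1) in buffer-side types.
struct TransitionBatch<O, A> {
    obs: O,
    act: A,
    reward: Vec<f32>,
    next_obs: O,
}

/// Converts one environment transition into a one-element batch.
/// The conversion is driven entirely by the `From` bounds.
fn process<EO, EA, O, A>(obs: EO, act: EA, reward: f32, next_obs: EO) -> TransitionBatch<O, A>
where
    O: From<EO>,
    A: From<EA>,
{
    TransitionBatch {
        obs: obs.into(),
        act: act.into(),
        reward: vec![reward],
        next_obs: next_obs.into(),
    }
}

fn main() {
    let batch: TransitionBatch<ObsBatch, ActBatch> = process(
        PointObs { x: 0.0, y: 1.0 },
        DiscreteAct(2),
        0.5,
        PointObs { x: 1.0, y: 1.0 },
    );
    println!(
        "{:?} {:?} {:?} {:?}",
        batch.obs, batch.act, batch.reward, batch.next_obs
    );
}
```

Because the function is generic over the buffer-side types, the same processing code works for any storage representation that knows how to build itself from the environment's types.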
Trainer
`Trainer` manages the training loop and related objects. The `Trainer` object is
built with configurations of training parameters such as the maximum number of
optimization steps, the model directory in which to save parameters of the agent during training, etc.
The `Trainer::train` method executes online training of an agent on an environment.
In the training loop of this method, the agent interacts with the environment to
take samples and perform optimization steps. Some metrics are recorded at the same time.
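A loop of this kind might look roughly like the sketch below. It is a toy illustration under stated assumptions, not the crate's `Trainer`: the configuration fields `max_opts`, `opt_interval` and `record_interval`, the `ToyEnv` and `ToyAgent` types, and the free function `train` are hypothetical, and a plain `Vec` stands in for the replay buffer. Both intervals are assumed to be greater than zero.

```rust
/// Hypothetical training configuration.
struct TrainerConfig {
    /// Maximum number of optimization steps.
    max_opts: usize,
    /// Number of environment steps between optimization steps.
    opt_interval: usize,
    /// Number of optimization steps between metric records.
    record_interval: usize,
}

/// Toy environment: the reward is 1.0 when the action matches a hidden target.
struct ToyEnv {
    target: i32,
}

impl ToyEnv {
    fn step(&mut self, act: i32) -> f32 {
        if act == self.target { 1.0 } else { 0.0 }
    }
}

/// Toy agent: cycles through actions 0..4 and remembers the best one seen.
struct ToyAgent {
    best_act: i32,
    best_reward: f32,
    next_try: i32,
}

impl ToyAgent {
    fn sample(&mut self) -> i32 {
        let act = self.next_try;
        self.next_try = (self.next_try + 1) % 4;
        act
    }

    fn opt(&mut self, buffer: &[(i32, f32)]) {
        for &(act, reward) in buffer {
            if reward > self.best_reward {
                self.best_act = act;
                self.best_reward = reward;
            }
        }
    }
}

/// Runs the loop and returns the recorded metrics as
/// (optimization step, mean reward of stored samples).
fn train(config: &TrainerConfig, env: &mut ToyEnv, agent: &mut ToyAgent) -> Vec<(usize, f32)> {
    let mut buffer: Vec<(i32, f32)> = Vec::new();
    let mut metrics = Vec::new();
    let mut env_steps = 0;
    let mut opt_steps = 0;
    while opt_steps < config.max_opts {
        // Interaction: take one sample from the environment.
        let act = agent.sample();
        let reward = env.step(act);
        buffer.push((act, reward));
        env_steps += 1;

        // Optimization: every `opt_interval` environment steps.
        if env_steps % config.opt_interval == 0 {
            agent.opt(&buffer);
            opt_steps += 1;
            if opt_steps % config.record_interval == 0 {
                let mean = buffer.iter().map(|&(_, r)| r).sum::<f32>() / buffer.len() as f32;
                metrics.push((opt_steps, mean));
            }
        }
    }
    metrics
}

fn main() {
    let config = TrainerConfig { max_opts: 4, opt_interval: 2, record_interval: 2 };
    let mut env = ToyEnv { target: 2 };
    let mut agent = ToyAgent { best_act: 0, best_reward: f32::NEG_INFINITY, next_try: 0 };
    let metrics = train(&config, &mut env, &mut agent);
    println!("best action = {}, metrics = {:?}", agent.best_act, metrics);
}
```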
Evaluator
[`Evaluator<E, P>`] is used to evaluate the policy's (`P`) performance in the environment (`E`).
An object of this type is given to the `Trainer` object to evaluate the policy during training.
[`DefaultEvaluator<E, P>`] is a default implementation of [`Evaluator<E, P>`].
This evaluator runs the policy in the environment for a certain number of episodes.
At the start of each episode, the environment is reset using [`Env::reset_with_index()`]
to control specific conditions for evaluation.
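An episode-based evaluator along these lines can be sketched as follows. The traits here are heavily simplified assumptions (scalar observations and actions, a tuple instead of a step object), and `EpisodeEvaluator`, `LineEnv` and `StepRight` are hypothetical names rather than the crate's types. The sketch returns the mean episode return and assumes `n_episodes` is greater than zero.

```rust
/// Hypothetical minimal environment trait for evaluation.
trait Env {
    /// Resets the environment; `ix` selects the evaluation condition.
    fn reset_with_index(&mut self, ix: usize) -> f32;
    /// Returns (next observation, reward, done flag).
    fn step(&mut self, act: f32) -> (f32, f32, bool);
}

/// Hypothetical minimal policy trait.
trait Policy {
    fn sample(&mut self, obs: f32) -> f32;
}

/// Hypothetical evaluator trait.
trait Evaluator<E, P> {
    fn evaluate(&mut self, policy: &mut P) -> f32;
}

/// Runs the policy for a fixed number of episodes and averages the returns.
struct EpisodeEvaluator<E> {
    env: E,
    n_episodes: usize,
}

impl<E: Env, P: Policy> Evaluator<E, P> for EpisodeEvaluator<E> {
    fn evaluate(&mut self, policy: &mut P) -> f32 {
        let mut total = 0.0;
        for ix in 0..self.n_episodes {
            // The episode index makes the starting condition reproducible.
            let mut obs = self.env.reset_with_index(ix);
            loop {
                let act = policy.sample(obs);
                let (next_obs, reward, done) = self.env.step(act);
                total += reward;
                if done {
                    break;
                }
                obs = next_obs;
            }
        }
        total / self.n_episodes as f32
    }
}

/// Toy environment: walk along a line until the goal is reached, paying -1 per step.
struct LineEnv {
    pos: f32,
    goal: f32,
}

impl Env for LineEnv {
    fn reset_with_index(&mut self, ix: usize) -> f32 {
        self.pos = ix as f32; // the index fixes the start position
        self.pos
    }

    fn step(&mut self, act: f32) -> (f32, f32, bool) {
        self.pos += act;
        (self.pos, -1.0, self.pos >= self.goal)
    }
}

/// Toy policy: always moves one unit to the right.
struct StepRight;

impl Policy for StepRight {
    fn sample(&mut self, _obs: f32) -> f32 {
        1.0
    }
}

fn main() {
    let mut evaluator = EpisodeEvaluator { env: LineEnv { pos: 0.0, goal: 3.0 }, n_episodes: 3 };
    let mut policy = StepRight;
    let score = evaluator.evaluate(&mut policy);
    println!("mean return over 3 episodes: {}", score);
}
```

Resetting with the episode index, rather than randomly, means every evaluation run sees the same set of starting conditions, which keeps scores comparable across training checkpoints.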
Dependencies
~4–6MB
~124K SLoC