PPO Agent playing LunarLander-v2
This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library.
Model Description
This model is a Proximal Policy Optimization (PPO) agent trained to solve the LunarLander-v2 environment from Gymnasium (the maintained successor to OpenAI Gym). The agent learns to land a lunar module between the landing flags by controlling its main engine and side thrusters while managing fuel consumption and landing precision.
Model Details
- Algorithm: Proximal Policy Optimization (PPO)
- Policy Network: Multi-Layer Perceptron (MlpPolicy)
- Framework: Stable-Baselines3
- Environment: LunarLander-v2 (Gymnasium)
Hyperparameters
The model was trained with the following PPO hyperparameters:
| Parameter | Value |
|---|---|
| Policy | MlpPolicy |
| n_steps | 1024 |
| batch_size | 64 |
| n_epochs | 4 |
| gamma (discount factor) | 0.999 |
| gae_lambda | 0.98 |
| ent_coef (entropy coefficient) | 0.01 |
Performance
Evaluation Results:
- Mean Reward: 262.07 ± 21.35 (mean ± standard deviation over the evaluation episodes)
This performance indicates the agent has successfully learned to land the lunar module:
- A score above 200 is the commonly used threshold for a successful landing in LunarLander-v2
- The mean reward sits well above that threshold across evaluation episodes
- The relatively low standard deviation suggests stable, reliable performance
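The exact evaluation protocol (number of episodes, seeding) is not stated in this card. As a rough illustration, the sketch below shows how a comparable mean/std reward can be computed with Stable-Baselines3's `evaluate_policy` helper; the local checkpoint filename and the 10-episode count are assumptions, not necessarily the settings behind the numbers above.

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Load a local copy of the checkpoint (see the Usage section below for
# downloading it from the Hub); the path here is an assumption.
model = PPO.load("ppo-LunarLander-v2.zip")

# Monitor records per-episode rewards, which evaluate_policy aggregates.
eval_env = Monitor(gym.make("LunarLander-v2"))

mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=10, deterministic=True
)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")
eval_env.close()
```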
Environment Details
LunarLander-v2 Environment:
- Observation Space: 8-dimensional vector (x/y position, x/y velocity, angle, angular velocity, and two leg-contact flags)
- Action Space: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine)
- Success Criteria: Land between the flags with minimal fuel consumption and impact velocity
- Reward Range: unbounded in principle; typical episodes score between roughly -200 (crash) and +300 (efficient landing)
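As a quick sanity check, the observation and action spaces described above can be printed directly from the environment:

```python
import gymnasium as gym

env = gym.make("LunarLander-v2")

# 8-dimensional observation: x/y position, x/y velocity, angle,
# angular velocity, and two leg-contact flags.
print(env.observation_space)  # Box(..., (8,), float32); exact bounds depend on the Gymnasium version

# 4 discrete actions: do nothing, fire left engine, fire main engine, fire right engine.
print(env.action_space)       # Discrete(4)

env.close()
```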
Training Information
- Training Framework: Stable-Baselines3
- Environment Wrapper: Monitor (for episode statistics tracking)
- Vectorized Environment: DummyVecEnv
- Render Mode: rgb_array (for video recording)
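A minimal sketch of the wrapping described above, assuming a single environment rather than the exact training script:

```python
import gymnasium as gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

def make_env():
    # rgb_array rendering exposes frames that can later be used for video recording.
    env = gym.make("LunarLander-v2", render_mode="rgb_array")
    # Monitor tracks episode rewards and lengths for logging.
    return Monitor(env)

# DummyVecEnv presents the single environment through the vectorized
# interface that Stable-Baselines3 expects.
vec_env = DummyVecEnv([make_env])
```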
Usage
```python
import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hugging Face Hub;
# load_from_hub returns the local path to the .zip file
checkpoint = load_from_hub(
    repo_id="Adilbai/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)
model = PPO.load(checkpoint)

# Create environment
env = gym.make("LunarLander-v2", render_mode="human")

# Run the trained agent
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```
Model Architecture
The PPO agent uses a Multi-Layer Perceptron (MLP) policy network that:
- Takes 8-dimensional observations as input
- Outputs action probabilities for 4 discrete actions
- Includes separate policy and value function heads
- Uses the default Stable-Baselines3 MLP layers with Tanh activations
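The layer sizes are not listed in this card; by default, Stable-Baselines3's `MlpPolicy` uses two hidden layers of 64 units each for the policy and value networks. You can confirm the exact layout by printing the loaded policy, as sketched below (the checkpoint path is an assumption):

```python
from stable_baselines3 import PPO

# See the Usage section above for downloading the checkpoint from the Hub.
model = PPO.load("ppo-LunarLander-v2.zip")

# Prints the actor-critic network, including the separate policy (action)
# and value-function heads.
print(model.policy)
```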
Limitations and Considerations
- Environment Specific: This model is specifically trained for LunarLander-v2 and may not generalize to other environments
- Deterministic Evaluation: Performance metrics are based on deterministic policy evaluation
- Sample Efficiency: PPO is an on-policy algorithm, so it is less sample-efficient than off-policy methods; reaching this level of performance typically takes several hundred thousand environment steps (the reproduction script below uses 500,000)
Training Reproducibility
To reproduce this model's training:
```python
from stable_baselines3 import PPO
import gymnasium as gym

env = gym.make("LunarLander-v2")

model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1,
)

model.learn(total_timesteps=500000)  # Adjust based on your training duration
```
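After training, saving the model produces the same kind of .zip checkpoint that the Usage section loads; this line continues the snippet above, and the filename is an assumption:

```python
# Save the trained policy as ppo-LunarLander-v2.zip; the file can then be
# uploaded to the Hub and loaded back as shown in the Usage section.
model.save("ppo-LunarLander-v2")
```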
Citation
If you use this model, please cite:
```bibtex
@misc{ppo_lunarlander_2024,
  title={PPO Agent for LunarLander-v2},
  author={[Your Name]},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-LunarLander-v2}
}
```