October 22, 2021 06:47 am GMT

Competitive self-play with Unity ML-Agents

An overview of self-play

Competitive self-play involves training an agent against itself. It was used in famous systems such as AlphaGo and OpenAI Five (Dota 2). By playing against increasingly strong versions of itself, an agent can discover new and better strategies.

In this post, we walk through using competitive self-play in Unity ML-Agents to train agents to play volleyball. This article is also part 5 of the series 'A hands-on introduction to deep reinforcement learning using Unity ML-Agents'.

The case for self-play

We previously trained agents using PPO with the following setup:

Symmetric environment
Both agents shared the same policy
Observations: velocity, rotation, and position vectors of the agent and ball
Reward function: +1 for hitting the ball over the net

This resulted in agents that were able to successfully volley the ball back-and-forth after ~20M training steps.

These agents make 'easy' passes by aiming the ball towards the centre of the court. This is because we set the reward function to incentivize keeping the ball in play.

Our aim now is to train competitive agents that are rewarded for winning (i.e. landing the ball in the opponent's court). We expect this will lead to agents that learn interesting strategies and make passes that are harder to return.

Self-play setup in ML-Agents

To follow along with this section, you will need:

Unity ML-Agents Release 18+ (getting started instructions)
The latest version of the Ultimate Volleyball repo (or, you can use your own volleyball environment if you've been following the tutorial series)

Step 1: Put the agents on opposing teams

Open the Ultimate Volleyball environment in Unity
Open Assets > Prefabs > 2PVolleyballArea.prefab
Select either the PurpleAgent or BlueAgent object
In Inspector > Behavior Parameters, set TeamId to 1 (the actual value doesn't matter, as long as the PurpleAgent and BlueAgent have different Team IDs).

Step 2: Set up the self-play reward function

Our previous reward function was +1 for hitting the ball over the net.

For self-play, we'll switch to:

+1 to the winning team
-1 to the losing team

Open VolleyballEnvController.cs and add the rewards to the ResolveEvent() method:

case Event.HitBlueGoal:
    // blue wins
    blueAgent.AddReward(1f);
    purpleAgent.AddReward(-1f);

    // turn floor blue
    StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.blueGoalMaterial, RenderersList, .5f));

    // end episode
    blueAgent.EndEpisode();
    purpleAgent.EndEpisode();
    ResetScene();
    break;

case Event.HitPurpleGoal:
    // purple wins
    purpleAgent.AddReward(1f);
    blueAgent.AddReward(-1f);

    // turn floor purple
    StartCoroutine(GoalScoredSwapGroundMaterial(volleyballSettings.purpleGoalMaterial, RenderersList, .5f));

    // end episode
    blueAgent.EndEpisode();
    purpleAgent.EndEpisode();
    ResetScene();
    break;

Remove the AddReward() calls from the other cases.
You can also set penalties for hitting the ball out of the court (in case Event.HitOutOfBounds). In my experience, adding these penalties can mean the agents take longer to learn to hit the ball.

Step 3: Add self-play training parameters to the trainer config

Create a new .yaml file and copy in the following:

behaviors:
  Volleyball:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0002
      beta: 0.003
      epsilon: 0.15
      lambd: 0.93
      num_epoch: 4
      learning_rate_schedule: constant
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 2
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.96
        strength: 1.0
    keep_checkpoints: 5
    max_steps: 80000000
    time_horizon: 1000
    summary_freq: 20000
    self_play:
      window: 10
      play_against_latest_model_ratio: 0.5
      save_steps: 20000
      swap_steps: 10000
      team_change: 100000

Explaining self-play parameters

During self-play, one of the agents will be set as the learning agent and the other as the fixed policy opponent.

Every save_steps=20000 steps, a snapshot of the learning agent's existing policy will be taken. Up to window=10 snapshots will be stored. When a new snapshot is taken, the oldest one is discarded. These past versions of itself become the 'opponents' that the learning agent trains against.
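The snapshot window can be pictured as a fixed-size queue. The following is a minimal Python sketch of that idea only, not the ML-Agents implementation: the function name snapshots_kept is hypothetical, and each snapshot is represented by the step at which it was taken rather than by real policy weights.

```python
from collections import deque

SAVE_STEPS = 20_000  # steps between snapshots (save_steps in the config)
WINDOW = 10          # maximum number of snapshots kept (window in the config)


def snapshots_kept(total_steps, save_steps=SAVE_STEPS, window=WINDOW):
    """Return the step numbers of the snapshots still stored after total_steps.

    Hypothetical helper for illustration; a real snapshot would hold a copy
    of the policy weights instead of a step number.
    """
    pool = deque(maxlen=window)  # appending to a full deque drops the oldest item
    for step in range(save_steps, total_steps + 1, save_steps):
        pool.append(step)
    return list(pool)


early = snapshots_kept(100_000)  # fewer than `window` snapshots exist so far
late = snapshots_kept(300_000)   # 15 snapshots taken, only the newest 10 remain
print(early)
print(late)
```

Early in training the pool holds fewer than window snapshots; once it is full, each new snapshot pushes out the oldest one.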

Every swap_steps=10000 steps, the opponent's policy will be swapped with a different snapshot. With probability play_against_latest_model_ratio=0.5, the learning agent plays against the latest policy (i.e. the strongest opponent); otherwise, it plays against one of the older stored snapshots. Mixing in older opponents helps to prevent overfitting to a single opponent playstyle.
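To make the sampling rule concrete, here is a small Python sketch of how an opponent might be chosen at each swap. It is an illustration under assumptions, not ML-Agents code: pick_opponent is a hypothetical helper, the policies are plain strings, and older snapshots are assumed to be chosen uniformly at random.

```python
import random


def pick_opponent(snapshots, latest, latest_ratio=0.5, rng=random):
    """Choose the fixed opponent for the next swap interval.

    With probability `latest_ratio` return the latest policy, otherwise
    return a uniformly sampled older snapshot (uniform choice is an
    assumption made for this sketch).
    """
    if not snapshots or rng.random() < latest_ratio:
        return latest
    return rng.choice(snapshots)


rng = random.Random(0)  # seeded so the simulation is repeatable
snapshots = [f"snapshot_{i}" for i in range(10)]
picks = [pick_opponent(snapshots, "latest", 0.5, rng) for _ in range(10_000)]
latest_share = picks.count("latest") / len(picks)
print(f"share of swaps against the latest policy: {latest_share:.2f}")
```

Raising play_against_latest_model_ratio towards 1.0 means the agent mostly faces its newest self; lowering it exposes the agent to a wider range of older playstyles.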

Every team_change=100000 steps, the learning agent and opponent teams will be switched.
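It can help to see how these intervals relate to each other. The arithmetic below is a rough sketch using the values from the trainer config in this post. It assumes a 1-vs-1 game and treats every parameter as counting the same kind of step, which is a simplification (the official documentation describes exactly which steps each parameter counts).

```python
# Values copied from the trainer config in this post.
SAVE_STEPS = 20_000
SWAP_STEPS = 10_000
TEAM_CHANGE = 100_000
WINDOW = 10
MAX_STEPS = 80_000_000

# Simplifying assumption: all parameters count the same kind of step.
swaps_per_team_change = TEAM_CHANGE // SWAP_STEPS      # 10 opponent swaps
snapshots_per_team_change = TEAM_CHANGE // SAVE_STEPS  # 5 new snapshots
history_span = WINDOW * SAVE_STEPS                     # 200000 steps of history
team_changes_in_run = MAX_STEPS // TEAM_CHANGE         # 800 team switches

print(swaps_per_team_change, snapshots_per_team_change)
print(history_span, team_changes_in_run)
```

Under these assumptions, the learning agent faces about ten different opponents between team switches, and the snapshot pool covers roughly the last 200,000 steps of its own history.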

Feel free to play around with these default hyperparameters (more information available in the official ML-Agents documentation).

Training with self-play

Training with self-play in ML-Agents is done the same way as any other form of training:

Activate the virtual environment containing your installation of ml-agents.
Navigate to your working directory, and run in the terminal:

mlagents-learn <path to config file> --run-id=VB_1 --time-scale=1

When you see the message "Start training by pressing the Play button in the Unity Editor", click the Play button in the Unity Editor.
In another terminal window, run tensorboard --logdir results from your working directory to observe the training process.

Self-play training results

In a stable training run, you should see the ELO (a rating of the agent's skill relative to its opponents) gradually increase.
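For intuition about why this number rises as the agent improves, here is a sketch of the standard Elo update in Python. It shows the textbook formula only; the K-factor of 16 and the starting rating of 1200 are assumptions for the example, and the exact constants ML-Agents uses may differ.

```python
def expected_score(rating, opponent_rating):
    """Textbook Elo expected score (win probability) for `rating`."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def update_elo(rating, opponent_rating, score, k=16.0):
    """Return the new rating after one game.

    `score` is 1.0 for a win, 0.5 for a draw and 0.0 for a loss.
    `k` (assumed to be 16 here) controls how far one result moves the rating.
    """
    return rating + k * (score - expected_score(rating, opponent_rating))


rating = 1200.0    # assumed starting rating for the learning agent
opponent = 1200.0  # an equally rated snapshot of itself

after_win = update_elo(rating, opponent, 1.0)
after_loss = update_elo(rating, opponent, 0.0)
print(after_win, after_loss)
```

Beating an equally rated snapshot raises the rating by half of K, while beating a much weaker one changes it very little. A steadily rising curve therefore suggests the agent keeps beating recent versions of itself.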

In the diagram below, the three inflexion points correspond to the agent:

Learning to serve
Learning to return the ball
Learning more competitive shots

Compared to our previous training results, I found that even after ~80M steps, the agents trained using self-play don't serve or return the ball as reliably. However, they do learn some interesting shots, such as aiming the ball towards the edge of the court.

If you discover any other interesting playstyles, let me know!

Wrap-up

Thanks for reading! I hope you found this post useful.

If you have any feedback or questions, feel free to post them on the Ultimate Volleyball Repo.

Original Link: https://dev.to/joooyz/competitive-self-play-with-unity-ml-agents-1nh6
