Blog: How AlphaStar became a StarCraft grandmaster


The following blog post, unless otherwise noted, was written by a member of Gamasutra’s community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.


 

AI and Games is a crowdfunded YouTube series that explores research and applications of artificial intelligence in video games.  You can support this work by visiting my Patreon page.


One of the biggest headlines in AI research for 2019 was the unveiling of AlphaStar – Google DeepMind’s project to create the world’s best player of Blizzard’s real-time strategy game StarCraft II.  After shocking the world in January when the system defeated two high-ranking players in closed competition, an updated version was revealed in November that had achieved grandmaster status: ranking among the top 0.15% of Europe’s roughly 90,000 active players.  So let’s look at how AlphaStar works, the underpinning technology and theory that drives it, the truth behind the media sensationalism and how it achieved grandmaster rank in online multiplayer.


You might be wondering why DeepMind is so interested in building a StarCraft bot.  Ultimately it’s because games – be they card games, board games or video games – provide nice simplifications or abstractions of real-world problems, and by solving these problems in games, there is potential for the solutions to be applied in other areas of society.  This led AI researchers to explore games such as Chess and Go, given they’re incredibly challenging problems to master.  This is largely due to the number of unique configurations of the board in each game – often referred to as a state space.  Research in Go estimates there are 2 × 10^170 valid layouts of the board.  Meanwhile StarCraft is even more ambitious given the map size, unit types and the range of actions at both micro level for unit movement and macro level for build behaviours and strategy – a topic I recently discussed when exploring the AI of Halo Wars 2.  Even naive estimates by researchers of the number of valid individual configurations of a game of StarCraft suggest it is around 2 × 10^1685.  In the case of both Go and StarCraft, these are – on paper – incredibly difficult problems for an AI system to achieve expert-level knowledge of.
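Those magnitudes are easier to appreciate side by side.  The figures 2 × 10^170 (Go) and 2 × 10^1685 (StarCraft) are the estimates quoted above; everything else here is just simple arithmetic on Python's arbitrary-precision integers:

```python
# State-space estimates quoted in the text; Python ints can hold these exactly.
go_states = 2 * 10**170
starcraft_states = 2 * 10**1685

# Compare orders of magnitude by counting decimal digits.
gap = len(str(starcraft_states)) - len(str(go_states))
print(f"StarCraft's estimated state space is ~10^{gap} times larger than Go's")
```

In other words, StarCraft’s state space isn’t just bigger than Go’s – it’s bigger by over 1,500 orders of magnitude.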

Universities around the world have seen the potential of StarCraft as an AI research problem for well over ten years now.  It’s a game that has no clearly definable best strategy to win, requires you to think about the effectiveness of your decisions in the long term and without complete knowledge of the game world.  Plus you have to think about all of this and react in realtime.  Hence research competitions such as the StarCraft AI Competition and the Student StarCraft AI Competition (SSCAIT) have operated for the best part of a decade to try and solve the problem.  However, neither of these competitions has had the support of StarCraft‘s creator Blizzard.

Since its inception DeepMind has directed a tremendous amount of money and effort into researching two specific disciplines within artificial intelligence: deep learning and reinforcement learning.  To explain as simply as possible, deep learning is a process where a large neural network attempts to learn how to solve a task based upon existing training data.  It reconfigures parts of the network such that it gives the correct answer to the training data it’s tested against with very high accuracy.  Meanwhile, reinforcement learning is a process where an AI learns to get better at a particular task through experience.  These experiences then update the knowledge the system stores about how good a particular decision is to make at that point in time.  Often you can use these techniques together: first the networks learn from good examples already recorded, and then the reinforcement learning kicks in to improve that existing knowledge by solving new problems it comes up against.  These approaches have proven very effective in a variety of games projects for DeepMind: first creating AI that can play a variety of classic Atari games, then defeating 9-dan professional Go player Lee Sedol with AlphaGo, and the creation of AlphaZero, which achieved grandmaster status in Chess, Shogi and Go.  The next step was to take their expertise in these areas and apply it to StarCraft II.

The AlphaStar project is led by Professor David Silver and Dr Oriol Vinyals – a former Spanish StarCraft champion and co-creator of the Zerg bot Overmind at the StarCraft AI Competition in 2010.  The team behind AlphaStar comprises over 40 academic researchers, not to mention additional support throughout DeepMind and Blizzard to build the tools and systems needed for the AI to interface with the game.  Once again, it’s a massive endeavour with Google money helping to support it.  So let’s walk through how AlphaStar works and how it achieved grandmaster status.


AlphaStar has – at the time of writing – had two major releases, unveiled in January and November of 2019.  The core design of how AlphaStar is built and learns is fairly consistent across both versions.  But AlphaStar isn’t just one AI: it’s several that learn from one another.  Each AI is a deep neural network that reads information from the game’s interface and then outputs numeric values that can be translated into actions within the game, such as moving units, issuing build or attack commands and so on.  It’s configured in a very specific way such that it can handle the challenges faced in parsing the visual information of a StarCraft game alongside making decisions that have long-term ramifications.  For anyone who isn’t familiar with machine learning, this is a highly specific and complex set of decisions that I’ll refrain from discussing here, but for those who want to know more, all the relevant links are in the bibliography below.

Now typically when you start training neural networks, they’ll be given completely random configurations, which means the resulting AI will make completely random decisions, and it will take time during training for them to learn how to change that anarchic random behaviour into something even modestly intelligent.  When you’re making an RTS bot, that means figuring out even the most basic of micro behaviours for units, much less the ability to build them or coordinate strategies using groups of them at a time.  So the first set of AlphaStar bots, or agents, are trained using deep learning on real-world replay data from human StarCraft matches provided by Blizzard.  Their goal is to reproduce the specific behaviours they observe in the replay data to a high level of accuracy.  Essentially they learn to imitate the players’ behaviour – both micro actions and macro tactics – by watching these replays.  Naturally, the replays are based on high-level play within the game, but the data is anonymised, so we don’t know who these players are.  Once training is completed, these AlphaStar agents can already defeat the Elite AI built by Blizzard for StarCraft 2 in 95% of matches played.
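The idea of imitating replays can be sketched in a few lines.  AlphaStar’s real policy is a large neural network trained on Blizzard’s replay dataset; here a toy tabular policy stands in for it, and the state and action names are entirely made up for illustration:

```python
from collections import Counter, defaultdict

# Toy replay data: (game state, action the human player took).
# These names are illustrative, not from the actual StarCraft II API.
replays = [
    ("low_minerals", "build_worker"),
    ("low_minerals", "build_worker"),
    ("low_minerals", "expand"),
    ("enemy_spotted", "attack"),
    ("enemy_spotted", "attack"),
]

# "Training": count which action humans took in each state...
counts = defaultdict(Counter)
for state, action in replays:
    counts[state][action] += 1

# ...and at play time, imitate the most common human choice.
def policy(state):
    return counts[state].most_common(1)[0][0]

print(policy("low_minerals"))   # build_worker – the majority human choice
```

A neural network does the same job statistically – predicting the action a human would most likely take in a given state – but generalises to states it has never seen in the replays, which a lookup table cannot.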

But learning from the human data is just the start of the learning process.  It’s not enough to replicate our behaviour; the agents need to find a way to surpass it – and they do that by playing against each other and learning from the experience.  The technique adopted, population-based reinforcement learning, embraces a common principle in computational intelligence algorithms: you can improve the best solutions to a problem by having them compete with one another – effectively creating an arms-race dynamic between multiple competing agents.  To achieve this, DeepMind created the AlphaStar League, where several of these pre-trained AlphaStar agents battle it out, enabling them to learn from one another.

But there is an inherent danger that a machine learning algorithm can accidentally convince itself it has found the best StarCraft AI, especially if it evaluates how good it is by playing against other StarCraft AI that are also training.  It might be a good player, but it might have lost good knowledge along the way, because all its competitors play very similarly while it’s trying to find a new strategy or tactic that will give it an edge.  Hence you could have three StarCraft AI bots A, B and C that are stuck in a situation where A can defeat B in a match, B can defeat C, but C defeats A – because the strategy behind A is so specialised it’s only good against a certain type of opponent.  This was evident in early training, where ‘cheese’ strategies such as rushing with Photon Cannons or Dark Templars dominated the learning process: it’s a risky move that won’t always work.  Hence there’s a need for such dominant strategies to be overcome within the training process.
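The A-beats-B-beats-C-beats-A trap above is just rock-paper-scissors in disguise, and a tiny payoff table makes it concrete.  The strategy names here are illustrative stand-ins, loosely inspired by the cheese strategies mentioned:

```python
# Non-transitive match-ups: each row entry beats the column entry.
# Illustrative strategy names, not actual AlphaStar agents.
beats = {
    ("cannon_rush", "macro_play"): 1,  # rush punishes a greedy economy
    ("macro_play", "turtle"): 1,       # a bigger army cracks static defence
    ("turtle", "cannon_rush"): 1,      # static defence blunts the rush
}

def winner(a, b):
    if (a, b) in beats:
        return a
    if (b, a) in beats:
        return b
    return None  # mirror match

# Every strategy wins one match-up and loses another: training that only
# chases the current champion just cycles around this loop forever.
print(winner("cannon_rush", "macro_play"))  # cannon_rush
print(winner("macro_play", "turtle"))       # macro_play
print(winner("turtle", "cannon_rush"))      # turtle
```

No single strategy dominates, so an algorithm that only ever optimises against the current best agent can chase its own tail indefinitely – which is exactly what the league structure is designed to break.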

DeepMind address this by creating two distinct groups of AI agents in the league – the Main Agents and the Exploiters.  The Main Agents are the ones I’ve already been talking about: AI that are learning to become the best StarCraft 2 players.  New main agents are added to the league based on learned experience, while existing ones are kept around to help ensure information isn’t lost along the way, and are pitted against the new players periodically in combat.  Meanwhile, exploiters are AI agents in the league whose job isn’t to become the best StarCraft player, but to learn the weaknesses that exist within the Main Agents and beat them.  By doing so, the Main Agents are forced to learn how to overcome any weakness found by the exploiters, which improves their overall ability and prevents them from settling on weird specialist strategies that would prove useless in the long run.  There are two types of exploiter: the Main Exploiters, which target the latest batch of Main Agents to be created, and the League Exploiters, whose goal is to find exploits across the entire league and punish them accordingly.
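The division of labour between these roles can be sketched as a matchmaking rule.  The real system weights opponents by win rate (a technique DeepMind call prioritised fictitious self-play), which this toy version – with made-up agent names – deliberately omits; it only captures who each kind of agent is allowed to face:

```python
import random

# Hypothetical league roster; names are illustrative.
league = {
    "main":             ["main_v1", "main_v2", "main_v3"],  # old + current
    "main_exploiter":   ["main_exp_a"],
    "league_exploiter": ["league_exp_a"],
}

def pick_opponent(role, rng=random):
    if role == "main":
        # Main agents train against everyone, old mains included,
        # so knowledge from earlier generations isn't forgotten.
        pool = (league["main"] + league["main_exploiter"]
                + league["league_exploiter"])
    elif role == "main_exploiter":
        # Main exploiters only target the newest main agent.
        pool = [league["main"][-1]]
    else:
        # League exploiters hunt for weaknesses across the whole league.
        pool = league["main"]
    return rng.choice(pool)

print(pick_opponent("main_exploiter"))  # always the latest main agent
```

The asymmetry is the point: exploiters never need to be good all-rounders, they just need to keep finding holes for the main agents to patch.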

The entire AlphaStar League process is trained using Google’s distributed compute power running in the cloud, using their proprietary Tensor Processing Units, or TPUs.  The actual training is broken up into two batches, one for each version of AlphaStar that DeepMind have published.  So now that we know the inner workings, let’s look at each version: how it was evaluated and what differentiates them from one another.


The first version of AlphaStar was trained within the league for 14 days, using 16 TPUs per agent, resulting in 600 agents being built.  Each agent experienced the human equivalent of 200 years of StarCraft playtime – already surpassing anything a human could achieve.  To test them out, DeepMind invited two professional players to their London offices in December 2018: first Dario Wünsch aka “TLO“, followed by Grzegorz Komincz, known as “MaNa“, both of whom play for the esports organisation Team Liquid.

Playing a typical 1v1 match-up of 5 games under pro match conditions, AlphaStar defeated both TLO and MaNa handsomely – making it the first time that an AI successfully defeated a professional StarCraft player.  While this was a significant achievement, there were still a lot of improvements that needed to be made to the system, given the many concessions made in the design of AlphaStar and the test matches at that time.

First of all, version 1 of AlphaStar was only trained to play as Protoss and was evaluated against Protoss-playing opponents.  As StarCraft players will know, while a given pro player will typically focus on only one species, they do need to be aware of – and able to counteract – strategies from Terran and Zerg players.  As a result, TLO was at a disadvantage during these first test matches, given that while he ranks at grandmaster level with Protoss, he plays professionally as Zerg.  However, this was mitigated somewhat by MaNa, who is one of the strongest professional Protoss players outside of South Korea.

Secondly, version 1 of AlphaStar could only play on one map of the game: Catalyst LE.  This means the system had not learned how to generalise the strategies it was learning such that it could apply them across different maps.   The third alteration was that the original AlphaStar bots did not look at the game through the camera: they had their own separate vision system that allowed them to see the entire map.  While fog of war was still enabled, this did give AlphaStar an advantage over human players by letting it see all of the visible world at once.  Now DeepMind insists this was actually a negligible feature, given that the bots’ behaviour suggests they largely focussed on areas of the map much like a human would, but it was still something that needed to be removed for version 2.

What is undoubtedly the cheekiest part of this whole experiment is that, in order to keep TLO and MaNa on their toes, they never played the same bot twice across the 5 matches.  As I mentioned earlier, AlphaStar is technically a collection of bots learning within the league.  Hence after each match, the AlphaStar bot was cycled out, with each of them at that time optimised for a specific strategy.  This meant that TLO and MaNa couldn’t exploit weaknesses they’d spotted in a previous match.

But interestingly, despite all these advantages over TLO and MaNa, there is one area where many would anticipate the bot having the upper hand but where it didn’t: actions per minute, or APM – the number of valid and useful actions a player can execute in one minute of gameplay.  During these matches MaNa’s APM averaged out at around 390, while TLO’s was just shy of 680, but AlphaStar’s mean was 277.  This is significantly lower, for two reasons: first, because it learns from humans, it largely duplicates their APM.  In addition, AlphaStar’s ability to observe the world and then act carries a delay of around 350ms on average, compared to the average human reaction time of 200ms.


With the first version unveiled and its success noted, the next phase was to eliminate the limitations of the system and have it play entirely like a human would.  This includes being able to play on any map, with any race, and using the main camera interface as human players would.  With some further improvements to the underlying design, the AlphaStar League ran once again, but instead of running for 14 days, this time around it ran for 44 days, resulting in over 900 unique AlphaStar bots being created during the learning process.  During training, the best three main agents – one per race: Terran, Protoss or Zerg – were always retained, with three main exploiters (again, one for each race) and six league exploiters (two for each race) forcing the main agents to improve.  But this time, instead of playing off against professional e-sports players, the bots being trained would face off against regular players in StarCraft II’s online ranked multiplayer.

Three sets of AlphaStar agents were chosen to play off against humans.  The first batch were the bots that had only completed the supervised learning from the anonymised replay data – referred to as AlphaStar Supervised – followed by two sets of Main Agents that were trained in the AlphaStar League, called AlphaStar Mid and AlphaStar Final.  AlphaStar Mid are the Main Agents from the league after being trained for a total of 27 days, while AlphaStar Final is the final set after 44 days of training.  Given each Main Agent only plays as one race, this allows a separate Match Making Rating, or MMR, to be recorded for each AI.

To play off against humans, AlphaStar snuck into the online multiplayer lobbies of Battle.net – Blizzard’s online service – and through ranked matchmaking would face off against an unassuming human player, provided they were playing on a European server.  While Blizzard and DeepMind announced that players could opt in to play against DeepMind after patch 4.9.2 for StarCraft II, the matches were held under blind conditions: AlphaStar didn’t know who it was playing against, but human players were also not told they were playing against the AI, just that there was a possibility it could happen to them when playing online.  This anonymity was largely to prevent humans recognising it was AlphaStar, discovering its weaknesses and then chasing it down in the matchmaking to exploit that knowledge, which could really knock it down in the ratings – given it can’t learn for itself outside of the league.  To establish their MMR, the supervised agents played a minimum of 30 matches each, while the mid agents ran for 60 games.  The final agents took the mid agents’ MMR as a baseline, then played an additional 30 games.
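Blizzard doesn’t publish its exact MMR formula, so as a rough illustration of how a few dozen blind ladder games can pin down a rating, here is a classic Elo-style update.  The starting rating and opponent ratings are invented for the example (the 3700 starting point loosely echoes where the supervised agents landed):

```python
# Elo-style rating update: an illustrative stand-in for StarCraft II's
# (unpublished) MMR system, not Blizzard's actual formula.

def expected_score(rating, opponent):
    # Probability of winning implied by the rating gap.
    return 1 / (1 + 10 ** ((opponent - rating) / 400))

def update(rating, opponent, won, k=32):
    # Move the rating towards the result: big jumps for upsets,
    # small ones for expected outcomes.
    return rating + k * ((1 if won else 0) - expected_score(rating, opponent))

mmr = 3700.0  # hypothetical starting estimate
results = [(4200, True), (4500, True), (4400, True), (4600, False)]
for opponent, won in results:
    mmr = update(mmr, opponent, won)
print(round(mmr))  # climbs sharply after upset wins over stronger players
```

Winning against higher-rated opponents drags the estimate up quickly, while an expected loss barely moves it – which is why a relatively small number of placement games is enough to converge on a stable rating.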

The AlphaStar Supervised bots wound up with an average MMR of 3699, which puts them above 84% of all human players – and that’s without any use of reinforcement learning.  However, the big talking point is that AlphaStar Final’s MMR for each race places it within grandmaster rank on the StarCraft 2 European servers: 5835 for Zerg, 6048 for Terran and 6275 for Protoss.  Of the approximately 90,000 players that play StarCraft 2 on the European servers, this places AlphaStar within the top 0.15% of all ranked play.


All that said, there is still work to be done.  There are challenges in ensuring the learning algorithm can continue to adapt and grow with further interactions with humans, in addressing the need for human play data, and in ironing out some of the quirkier strategies that AlphaStar has exhibited.  Sadly I’m no StarCraft expert, so I can’t really speak to that in detail, but it sounds like there are still plenty of future avenues for this research to take – not to mention taking the challenge to the South Korean e-sports scene, which is significantly stronger than in the rest of the world.  Who knows, we may well see new experiments and innovations being stress tested against the player base in the future.

But in the meantime I hope you’ve got a clearer idea of how AlphaStar works and why it’s such a big deal for AI research. This write-up is – in many respects – a massive simplification of the system and I have provided links to the research papers and resources that cover the topic in a lot more detail!