Thrasybulos
As I've already said, Sullla's AI Survivor (btw, let me add mine to the thanks for running that) is a concept which mixes several aspects:
- It's an AI competition, aiming at ultimately providing a ranking of the AIs.
- It's a prediction contest.
- It's a Live Show.
The Alternate Histories for Sullla's AI Survivor series aim to show the different ways a game could have gone, and to determine how predictable the outcome actually was.
In other words, they're about the prediction contest: they are replays of the exact same setups, which leaves essentially unanswered the question of whether the results should be explained in terms of AI strength or in terms of the context of the game (starting position, neighbours).
That led me to the notion of providing another set of Alternate Histories, but this time with a focus on the first aspect: ranking the AIs.
In order to achieve that, I'll be replaying the maps... while shuffling around the AIs on the map.
As with the Alternate Histories, I'll be playing 20 iterations of each game (same map, same AIs), but with a different permutation of the starting positions for each game.
That should provide an idea of the relative strength of each leader in that game's subset, regardless of its starting position and specific set of neighbours.
It'll also fulfill a secondary objective: determining how balanced (or unbalanced) each map was, and thus how much that influenced the game outcome.
Spoiler :
Now, if my maths are not too rusty, there are 6! = 720 possible permutations for a 6-player game, and 7 times as many for a 7-player game.
So 20 games is only a small subset of the possibilities, but I'll try to make it as fair as possible.
Doing 18 reruns of each 6-player game and 21 reruns of each 7-player game would allow each leader to play each position exactly 3 times, but that would cause issues with the first objective, so I'll leave it at 20 runs and make the best of it.
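One possible way to keep 20 seatings "as fair as possible" is sketched below. It is not the method used for these runs (the post doesn't describe one); it is only an illustration, and the function name `balanced_permutations` and the placeholder leader names are made up for the example. It stacks rotations of freshly shuffled orders, so each leader's count at any starting position differs by at most one.

```python
import random
from collections import Counter

def balanced_permutations(leaders, runs=20, seed=0):
    """Return `runs` seatings; seating[i] is the leader placed at start position i."""
    rng = random.Random(seed)
    n = len(leaders)
    seatings = []
    while len(seatings) < runs:
        base = list(leaders)
        rng.shuffle(base)
        # The n rotations of one shuffled order form a Latin square:
        # every leader occupies every position exactly once per full block.
        for shift in range(n):
            if len(seatings) == runs:
                break
            seatings.append(base[shift:] + base[:shift])
    return seatings

leaders = ["A", "B", "C", "D", "E", "F"]  # placeholder names, not actual leaders
seatings = balanced_permutations(leaders)

# How many times does each leader start at each position?
counts = Counter((leader, pos) for s in seatings for pos, leader in enumerate(s))
print(min(counts.values()), max(counts.values()))
```

One caveat with this sketch: within a block of rotations, the cyclic order of the leaders is preserved, so it balances starting positions better than it balances sets of neighbours.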
Instead of Sullla's Power Rating (5 pts for a win / 2 pts for a runner-up result / 1 pt per kill), I'll use an Elo rating to rank the AIs.
Spoiler :
Now, of course, the Elo System is far from a perfect fit:
- it is designed for 1v1 games, these are FFA games
- it assumes the outcome of a game depends solely on the strengths of the players involved, while here we know that external/random factors (starting position, opponents' peaceweights, religion spreads, etc.) play a big role.
For the second point, well, let's just hope that playing a lot of games with different contexts will somewhat make those external and random elements cancel each other.
As for the first point, a lot of multiplayer game system designers have provided their own answers to that. I looked at what they did, and came up with the implementation described below.
Why not simply use Sullla's system?
Having another system will make it possible to compare the two (I'll still be keeping track of the Power Rating). Also, there are potential shortcomings with the Power Rating: the main one is that it doesn't take the opponents' strength into account, but it's also very punishing for opening round eliminations, and then there are the infamous kill steals galore...
Spoiler Elo Implementation :
Each match (or "run": one of the 20 replays of an AI Survivor game) is scored like a mini closed tournament, where each participant gets a result (win = 1, loss = 0, or a point split for an intermediate result) vs each other participant.
The Elo system provides the expected result between two opponents, based on their rating.
There is an arbitrary coefficient involved, set at 400 for Chess for instance. This is the "sensitivity" of the system. While I expect to confirm that some AIs perform better than others, I don't expect the performance level differences to be drastic. So in order to have a wider range of values, I used 800 for that coefficient here.
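As a rough illustration, here is the standard Elo expected-score formula with the coefficient exposed as a parameter. That the 800 simply replaces the usual 400 in this formula is my assumption; the function name `expected_score` is made up for the example.

```python
def expected_score(rating_a, rating_b, scale=800):
    """Expected result (between 0 and 1) of A against B under the logistic Elo model.

    `scale` is the coefficient discussed above: 400 in Chess, 800 here (assumed).
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

# Two equally rated AIs are each expected to score 0.5 against the other.
print(expected_score(1600, 1600))
# With scale=800, an 800-point gap gives the same expectation (about 0.91)
# that a 400-point gap gives in Chess.
print(expected_score(2400, 1600))
```

A larger coefficient means that a given gap in expected results translates into a larger rating difference, hence the wider range of values.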
To determine whether a player's rating needs to be adjusted, the expected result is compared with the actual result of the game... which we need to calculate.
Actual result (score) calculation:
The winner of a match gets a win (1-0) vs each of its opponents.
The other leaders which don't get eliminated get a win vs each of the eliminated civs.
That leaves us with scoring the survivors amongst themselves, and the eliminated civs amongst themselves.
I could have simply gone for a draw (0.5-0.5), and honestly, that probably would have been enough.
But since the point split doesn't have to be 50/50, I thought it would be nice to distinguish between a civ which made it to the end with a couple of tundra cities and a civ which became an overwhelming juggernaut well on its way to Domination but lost to a successful Culture attempt by a much smaller civ.
While not perfect, the in-game score could certainly serve for that purpose.
So say two civs survive in addition to the winner, one with 1,000 points and the other with 3,000 points: the result point will be split 0.25-0.75 (1,000/(1,000+3,000) vs 3,000/(1,000+3,000)), which seems fair enough.
Now how about the dead civs? Ideally we'd use something like the highest in-game score they'd reached, but getting that information would require a python mod or a tool to extract it from the replay file... way too much effort. What we have is the turn of elimination. But the extreme values tend to be too close (a T100 elimination vs a T300 elimination would result in a 0.25-0.75 split, while it should be more pronounced). So I basically squared the values before the comparison(*), which yields results closer to what I would expect.
(*) The exact calculation sets a "turn score" by adding the turn number each turn to the score.
So the turn score = 1 + 2 + 3 + ... + Turn of elimination = (T Elim)*(T Elim + 1)/2.
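To make the effect concrete, here is the T100 vs T300 example computed with the turn score formula above. The helper names `turn_score` and `split` are made up for the example.

```python
def turn_score(turn_eliminated):
    """1 + 2 + ... + T, i.e. T * (T + 1) / 2."""
    return turn_eliminated * (turn_eliminated + 1) // 2

def split(a, b):
    """Share one result point proportionally to two non-negative values."""
    total = a + b
    return a / total, b / total

# Raw turn numbers: the 0.25-0.75 split mentioned above.
print(split(100, 300))
# Turn scores (5050 vs 45150): roughly 0.10-0.90, much more pronounced.
print(split(turn_score(100), turn_score(300)))
```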
So:
- Game winner gets a perfect score.
- Survivors get a win vs eliminated civs, score with other survivors based on their in-game score.
- Eliminated civs score amongst themselves based on their turn of elimination.
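The three rules above can be sketched as a small scoring function. This is only my reading of the rules, not the actual Excel macro; the dictionary layout, the status labels and the civ names are invented for the example, and the 0.5 fallback for a zero denominator is an extra assumption.

```python
def turn_score(turn):
    return turn * (turn + 1) // 2

def pair_points(a, b):
    """Points earned by civ `a` against civ `b` (b gets 1 minus that)."""
    if a["status"] == "winner":
        return 1.0
    if b["status"] == "winner":
        return 0.0
    if a["status"] != b["status"]:
        # One survivor, one eliminated civ: the survivor takes the full point.
        return 1.0 if a["status"] == "survivor" else 0.0
    if a["status"] == "survivor":
        wa, wb = a["score"], b["score"]  # in-game scores at the end
    else:
        wa, wb = turn_score(a["turn"]), turn_score(b["turn"])
    return wa / (wa + wb) if wa + wb else 0.5  # fallback is an assumption

def score_run(civs):
    """Total points of each civ against all of its opponents in one run."""
    return {
        name: sum(pair_points(a, b) for other, b in civs.items() if other != name)
        for name, a in civs.items()
    }

run = {  # hypothetical 6-player run
    "W":  {"status": "winner"},
    "S1": {"status": "survivor", "score": 1000},
    "S2": {"status": "survivor", "score": 3000},
    "E1": {"status": "eliminated", "turn": 100},
    "E2": {"status": "eliminated", "turn": 300},
    "E3": {"status": "eliminated", "turn": 200},
}
points = score_run(run)
print(points)
```

In this example the winner gets 5, the survivors 3.25 and 3.75, and the points of all six civs add up to 15, one per pairing.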
Elo progression:
A player's Elo rating is adjusted according to the difference between its actual score vs its expected score.
Every AI will have an initial Elo rating = 1600.
Winning a 6-player game means a score of 5, vs an expected score of 2.5.
So in that case, it's an Elo gain of (5 - 2.5) * K.
"K" is an arbitrary factor.
The bigger it is, the faster a player reaches its "true" rating, but the more volatile the system is.
Since there are 6-player games (5 opponents) and 7-player games (6 opponents), I had initially gone for K=6 in 6-player games, and K=5 for 7-player games, yielding an identical value (total K = 30 in both cases).
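A quick check of the arithmetic above, before any of the later adjustments. The helper names are made up for the example.

```python
def k_per_opponent(num_players, total_k=30):
    """K=6 for 6 players (5 opponents), K=5 for 7 players (6 opponents)."""
    return total_k / (num_players - 1)

def elo_delta(actual, expected, k):
    """Rating change for one run: (actual score - expected score) * K."""
    return (actual - expected) * k

# Winner of a run between equally rated AIs:
print(elo_delta(5, 2.5, k_per_opponent(6)))  # 6-player game
print(elo_delta(6, 3.0, k_per_opponent(7)))  # 7-player game
```

Both calls give the same gain of 15 points, which is the point of tying K to the number of opponents.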
After the season, and in light of the results, I made the following adjustments:
- I halved the value for the opening round and wildcard pool games.
- I used a quarter of the value for the playoff games.
- I used a fifth of the value for the championship.
Also, the new Elo for an AI is bulk-calculated after the 20 runs of a game, and not adjusted after each run.
I initially used the latter method (which has the advantage of limiting dramatic gains or losses), but that led to an AI with a better total score having a lower rating than another which had had a better late run. And while that would make sense for human players or true AIs, it makes little sense here, where the "AIs" are just a bunch of static parameters with no capacity for learning whatsoever.
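Here is a minimal sketch of what a bulk update could look like, assuming expected scores are computed once from the pre-game ratings and the differences are summed over all runs. The stage labels, the function names and the way the adjustment factors are applied to K are my assumptions, not the actual macro.

```python
STAGE_FACTOR = {  # adjustments listed above; the key names are invented
    "opening_round": 1 / 2,
    "wildcard_pool": 1 / 2,
    "playoffs": 1 / 4,
    "championship": 1 / 5,
}

def expected_score(rating_a, rating_b, scale=800):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def bulk_update(ratings, run_points, stage, total_k=30):
    """ratings: name -> rating before the game.
    run_points: one dict per run, name -> points scored against all opponents."""
    k = total_k / (len(ratings) - 1) * STAGE_FACTOR[stage]
    # Expected total vs all opponents, fixed for the whole set of runs.
    expected = {
        a: sum(expected_score(ratings[a], ratings[b]) for b in ratings if b != a)
        for a in ratings
    }
    return {
        a: ratings[a] + k * sum(run[a] - expected[a] for run in run_points)
        for a in ratings
    }

ratings = {name: 1600 for name in "ABCDEF"}
runs = [  # two hypothetical runs, 15 points shared out in each
    {"A": 5, "B": 2, "C": 2, "D": 2, "E": 2, "F": 2},
    {"A": 2, "B": 5, "C": 2, "D": 2, "E": 2, "F": 2},
]
new_ratings = bulk_update(ratings, runs, "opening_round")
print(new_ratings)
```

Because the expected scores don't move between runs, the result only depends on the total score over the runs, not on the order in which good and bad runs happened.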
Tournament Format:
Season 1 will be based on Season 4 of AI Survivor.
I'll be following AI Survivor's format, with some changes:
- No change to the Opening Round.
- The Playoffs will have different participants (those who score best here, as opposed to those which made it in the live games).
- The Championship will have different participants and use a different map.
- The bigger change concerns the Wildcard game: for Season 1, it's replaced with a Wildcard League. The 36 AIs which don't make it to the playoffs are split into two pools of 18 AIs. For each pool, three 6-player games are played, and the best two from each game move on to a 6-player Pool Finals. The winner of that game gets a wildcard to the Playoffs.
The idea is to get more play-time for the weaker AIs, and to send two "Elo-bags" to the playoffs.
Tournament Rules:
I'll use the same settings and rules as AI Survivor, with two exceptions.
- The big one: No UN.
Diplomatic Victory is disabled.
Spoiler Rationale :
Four reasons:
1- This is an attempt at ranking the AIs, and the UN is a feature they can't use. If there is any kind of logic programmed there, it's virtually indistinguishable from an RNG call.
The AI won't call for the UN victory when it could win, it'll keep calling for the vote when it cannot win. It won't pursue any kind of gameplan involving a Diplo win. And it'll randomly put resolutions up for voting, and cast its vote randomly on them (voting "No" for Free Speech when running the Culture Slider and still in Bureaucracy?? Voting "No" to end a war where it's getting slaughtered??).
2- In the same vein, the UN shouldn't help make up for an AI's bad decisions. If it's choking under unhealthiness because it researched and built every source of pollution while skipping Biology and Medicine, it shouldn't get bailed out by the UN.
3- What's that nonsense about UN peacekeeping and banning nukes? We want blood!
4- Technical reason: on my ancient PC, pop-up dialog calls at the end of the AIs' turn processing slow the game down dramatically. I think a war has been declared, but it's just another Open Borders request... or a UN vote call followed by its result.
No UN ends up speeding up games which run late by a noticeable margin.
- Enforced peace when an AI is at war with an opponent reduced to a single city hidden behind another civ with closed borders.
Spoiler Rationale :
This is a bug, plain and simple, and it can completely alter the outcome of a game.
The AI won't sign peace, it won't plot another war, it won't launch an amphibious assault (same landmass): it gets stuck.
That situation had me consider disabling barbs altogether (not the only reason, but it happens 90-95% of the time because of an early barb city capture behind enemy lines).
In the end, I left them in because I think that no barbs could significantly alter the AI's performance.
And I got lucky: the situation didn't arise as frequently as with my standard Alternate Histories runs. But then a frankly ludicrous instance happened (with *two* AIs locked in war with a single-city AI), and I decided to broker peace through the worldbuilder, and to make it a rule henceforth.
"Test Protocol":
- Each game is run from the worldbuilder save (for practical reasons, and to get new peaceweights each time).
- Permutations are performed by changing the team number for the AIs, not by moving their units around. That means turn order is tied to each starting position, not to each AI.
- No Great Spy infiltrations to unlock demographics. OK, they do have an impact (I seem to observe far fewer instances of an AI completely tanking its early eco, and I guess being free of that 20% spending on Espionage helps; conversely, those AIs which choose to spend on Espionage target their actual opponents), but probably nothing major. The main reason is that they're a hassle: since I'm running the game from the wb file, I would have to re-add them each time.
They serve two purposes:
- Contact with the AIs: done through the wb file instead.
- Enabling graphs: done through a simple change to CIV4EspionageMissionInfo.xml, assigning 0 to the cost of the see demographics mission, which permanently enables them as long as you have at least one EP spent vs a civ. So I just run the espionage slider for the first turn of the game, and I'm done.
For each game, I'll also provide an archive containing:
- An Excel file with the detailed game results (the macro is used for the Elo calculations).
- A second Excel file with graphs about the game.
- Minimap pictures of the game start and end.
- The worldbuilder files used for the 20 runs.
- The replay files for each run.