Don't trust the "likely outcome" combat predictor

CaptainPatch

I play Marathon games, so I have LOTS of combats. In my current game, I paid closer attention than I usually do, and it struck me that the actual outcomes were seldom as good as what the forecast predicted. So I decided to do an experiment.

I was conducting a battle where it was predicted that I would inflict 49 Damage against a barbarian defender. [A wounded (@45%) Swordsman vs Composite Bowman in an encampment.] Save game. Conduct combat to see the actual outcome. Load the Save and conduct the battle again. Rinse and repeat. For the first nine runthroughs, the outcome ranged from as low as 40 -- 40! Nine under??? -- to 47. It wasn't until the tenth runthrough that I finally got a value of 49. It took another three attempts before I finally exceeded the predicted 49.

Now I always thought that the predicted value was the average outcome; that it was the tippy-top of a bell curve distribution. That would mean that values farther below or farther above the predicted value get progressively harder to roll the more standard deviations you move away from it (left or right on the bell curve). It would also imply that the likelihood of getting a 58 was just as great as the likelihood of getting that 40, and that a 51 would be just as likely as a 47. Yet in my 13 random samplings, 11 results were left of center while only ONE result was right of center.

I conclude that either the utilized Random Number Generator has some unforeseen bias that skews the outcome bell curve to be significantly left of center -- most results WILL be less than what was predicted -- or else the programmers deliberately lied to us by making all likely outcome predictions overly optimistic.
 
I'm quite certain that I've seen many instances of the damage exceeding the predicted. I don't know what to tell you though, as I didn't record the numbers in some sort of experiment. There are certain instances though where I do think the game overestimates the expected damage output.

One case where I'm quite certain this is the case is when attacking an embarked land unit with a ranged attack, especially with planes. Frequently, it will show that predicted damage = 100 or something, and then it winds up doing a much more reasonable amount of damage like 30.

It's entirely possible that there are other such scenarios. I understand that attacking with a wounded unit (as you did with a 45-HP swordsman) affects the damage output somehow. Maybe the predictor underestimates that effect, which results in the bell-curve shifting slightly to the left.

But I want to emphasize that your experiment is hardly conclusive. First of all, 10 is a VERY small sample size. Even with a perfectly fair coin, if you flip it 10 times there's a 1% chance that you get tails 9 times. Now 1% is not a lot but maybe you got that 1%. Secondly, you only tested one scenario. What about attacking with a fully-healed swordsman? What about ranged attacks? Naval warfare? Sea-to-land bombardments? City bombardment of embarked units? Air-to-air interception? Bombing runs against a city? Hwach'a vs. Turtle Ship? There are so many scenarios, and your conclusion makes a claim about all of them, when it really only tested one scenario.
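If anyone wants to sanity-check that coin-flip figure, here's a quick calculation; it's just standard binomial math, nothing Civ-specific:

```python
from math import comb

# Sanity check of the coin-flip claim above: the chance of getting
# at least 9 tails in 10 flips of a perfectly fair coin.
p = sum(comb(10, k) for k in range(9, 11)) / 2 ** 10
print(f"P(at least 9 tails in 10 flips) = {p:.4f}")  # ~0.0107, i.e. about 1%
```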

Lastly, your suggestion that the programmers deliberately "lied" is such an unreasonable assertion that it defies my comprehension. You have scant evidence of even the much milder accusation of a slight bug on the damage predictor, and there is not even a proposed motive for such a "lie". Do you imagine the programmers kicking back in the office, laughing at all the "suckers" who believed their damage predictor? How ridiculous is that? They want to sell their product, and tricking their consumers with some silly ruse does nothing to accomplish their goal.
 
I did not conduct any experiment, but here is my hypothesis. Since I played many games on Emperor, Immortal and Deity in a very short period of time, at least 10 of each in less than 3 months, I can see a tendency, and the impression it left is pretty fresh in my memory.

The higher the level you play, the more often the damage you deal will be under the prediction; the lower the level, the more often it will be over.

If you collect data from enough fights, let's say 25, over all 8 levels, you will end up with 200 entries that will form a perfect bell-shaped graph.
 
The thing about actual results differing from predicted results being a function of the Difficulty setting is understandable. It's the stated prediction that the game/programmers present that I find objectionable. If the calculations get skewed because of a higher difficulty, then the mechanism of the game offering a predicted outcome should have been altered to reflect those more likely lower results. The whole idea of even having the game show a prediction was/is to give the player useful information upon which to make decisions. If the actual outcomes fairly consistently are substantially deviated from the numbers given to the player, then that "data" becomes increasingly less worthwhile.

It seems to me that the AI has been given the tools necessary to make a prediction in the first place. But if what the AI predicts only applies to whatever passes for the Normal or Average Difficulty setting, then that is the ONLY place predictions should be made. It would only be fair for the prediction routine to make its calculations, and state its predicted (average likely) outcomes, based on the current Difficulty setting.
 
But if what the AI predicts only applies to whatever passes for the Normal or Average Difficulty setting, then that is the ONLY place predictions should be made.

You've hit the nail on the head. That's exactly how it works.

On Prince your attacks have a +30% to -30% damage range, and the prediction is on average correct.

On Settler you get a +30% to -10% range, but it still shows you the prediction for 0%, so the actual result is on average about 10% above what's shown.

On Deity you get a +10% to -30% range, but it still shows you the prediction for 0%, so the actual result is on average about 10% below what's shown.

It is a known problem that the correct prediction isn't shown, but if you know about it, just play with that in mind.
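To make that concrete, here's a minimal sketch of the idea, using the ranges quoted above; both the ranges and the uniform roll inside each range are assumptions, not something pulled from the game's actual code:

```python
import random

# The damage ranges quoted above (a claim from this thread, not verified game code),
# expressed as min/max multipliers applied to the displayed prediction.
DIFFICULTY_RANGES = {
    "Settler": (-0.10, +0.30),
    "Prince":  (-0.30, +0.30),
    "Deity":   (-0.30, +0.10),
}

def mean_actual(displayed, low, high, trials=100_000):
    """Average the actual damage over many rolls, assuming a uniform roll
    between the low and high multipliers (the uniform shape is a guess)."""
    total = 0.0
    for _ in range(trials):
        total += displayed * (1.0 + random.uniform(low, high))
    return total / trials

displayed = 49  # the predicted damage from the opening post
for name, (low, high) in DIFFICULTY_RANGES.items():
    print(f"{name:8s} displayed {displayed}, actual mean ~ {mean_actual(displayed, low, high):.1f}")
```

With those assumptions, a displayed 49 averages out near 54 on Settler, 49 on Prince, and 44 on Deity, which is exactly the skew described above.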
 
If a likely outcome gives you a 70% chance of victory, keep in mind that there is still 30% chance of being defeated.

It doesn't magically mean your unit will crush the enemy while taking little damage. It can easily get routed; as long as there's even a 1% chance of defeat remaining, there is no 100% guarantee.

This is what most people fail to grasp.

Good chance of winning, yes, but still a real chance of getting defeated. If you get defeated, chalk it up to Murphy's law occurring in that battle and move on.
 
If a likely outcome gives you a 70% chance of victory, keep in mind that there is still 30% chance of being defeated.
The thing is that the calculator does NOT show the percentage chance of winning or losing. It shows the most probable amount of loss of strength of both attacker and defender.

When it is all said and done, the computer is a calculator. It CAN take a formula, factor in the variables and changes introduced by the Difficulty setting, and produce a value that is, in fact, the TRUE average (most likely) outcome. Offering TRUE averages of the likely outcome would make the "advice" being provided significantly more useful. Players should NOT be required to know on their own what the skew factors are for each Difficulty setting. For one thing, that information is NOT available in any of the game's literature; it's available only to those (relatively) few players who can actually open and examine the raw code to see what the effects of the Difficulty changes are.

The bottom line is that the programmers did have the ability to tweak the program to make the combat outcome predictor be the actual Average, no matter what the setting is. They chose NOT to do that. That to me makes it appear like the programmers intentionally allow deceptive info to be presented.

**************
Something that I just recently noticed. Prior to combat, you can hover the mouse cursor over the target unit and you are given a value. But after you initiate combat, if you once again place the cursor over the target unit, the predicted values change. Usually lowering both attacker and defender losses by 2-3 points. (On the Emperor setting anyway.)
 
Nah, an estimation is still an estimation of what will happen. Crazy things can happen when fighting starts.

It is only deceptive when it gives you a 100% guarantee and then the opposite happens.

100 Riflemen vs 100 Musketmen.
40 of your riflemen survive the fight because 40 musketmen went above and beyond the call of duty and each brought down a rifleman with them, while the other 60 musketmen could only bring down 20 riflemen between them.

http://armylive.dodlive.mil/index.p...who-served-above-and-beyond-the-call-of-duty/

And there was stuff happening in real life like where one guy kicked the ass out of 60+ enemy soldiers on his own and such during WW2. Or making an entire platoon surrender to a single soldier.

Generally, it's better to bring 3-to-1 odds to your fight if you want to crush the enemy.
 
Nah, an estimation is still an estimation of what will happen. Crazy things can happen when fighting starts.

It is only deceptive when it gives you a 100% guarantee and then the opposite happens.
You are missing the point. The combat predictor is NOT giving a guaranteed outcome; it's (supposedly) showing what the average probable outcome is likely to be. The actual outcome is VERY likely to vary from the average. However, since it is supposedly the average, the probability of a final result being X points over should be the same as the probability of being X points under the posted predicted average outcome. But that is NOT what the results are indicating. What seems to be happening is that the predicted results are shown for when the game is played on the Prince (?) Difficulty level. If the setting is higher, the actual results tend to be lower than predicted = the player inflicts less damage but also tends to suffer greater damage in return. Alternatively, if the Difficulty setting is lower than Prince, the actual results tend to be greater than predicted. No matter what the setting, there is still a range of possible outcomes, which would allow for something like a Rats of Tobruk scenario.

My point is that the program could very easily have been showing accurate average outcome estimates, no matter what the Difficulty setting. The prediction is based on a mathematical formula. Changing the Difficulty setting simply tweaks the formula in one direction or the other, depending on which way the Difficulty setting was shifted. It is just as easy for the program to ascertain how the tweaks affect Probable Outcome values, for each setting.

So why didn't the programmers bother to show those adjusted values rather than just showing the Prince-setting values for ALL Difficulty settings? I can only think of two reasons, neither of which suggests Good Things about the programmers and their QA Department.
 
The combat predictor is NOT giving a guaranteed outcome
True, and it doesn't claim to. It's useful in showing you approximate results and, more importantly, what factors lead to that result. Too many times I've asked, "Why is my rifle stalemating with that longsword... Oh, that river extends one tile beyond what the map seems to suggest."
You use it for what it is, a ballpark estimate of what will happen, and don't make decisions that will get you in trouble. If the city has a sliver of health left, don't attack it with your melee unit; hit it one last time with a ranged unit, even if it means waiting an extra turn (it's just more experience). Or bring an extra unit to change the result from "possible" to "definite." Plus, I find that in the vast majority of cases where the variance attributed to the combat calculator makes the difference in the combat, even success is bad because it puts my unit in a precarious position.
 
Even if the programmers do calculate a "true" average, they probably used a normal (Gaussian) distribution to vary the result around that average. The width of the distribution means many values are likely, and I doubt 13 observations is enough to "prove" they lied. That's just being dramatic. You'll need a much more thorough test to prove this, as what you saw could still plausibly be due to chance.

I'm only gonna be convinced the average is biased if you run enough samples of different battles. 13 tries is not nearly enough for your observation to be statistically significant. You said it went as low as 9 below? If that's right, there are at least 19 possibilities per roll. You only tested 13 times. In general you want your sample size to be larger than the number of possibilities, otherwise you have no clue what distribution your results represent. I'd recommend running at least 100 runs for 19 possibilities.

You saw 11 outcomes under the prediction and only 2 at or above it, in a single test. That sounds like a perfectly plausible outcome. How much were they below? How much above? Then I could tell you more.

You saw variation up to 9, so assume the value can be + or - 9 on each side. (It probably scales with the damage mean, really; variation is lower for small numbers.) Now there are lots of possibilities, many of which are lower. In fact, with such a wide distribution I wouldn't be surprised if there were something like a 45% chance of scoring lower on any given roll. So is 11/13 unlikely to observe? Not really.

Without the real numbers for each run I can't compute a proper p-value to tell you how likely something is, but in biostatistics we only call an observation significant if the p-value is under 1%. I'm not gonna go into the math here, but basically this means that there is only a 1% chance that what we observe could be due to just random chance/noise (the null hypothesis). The reason we want it to be so low is that even at a 1% threshold, 1 out of every 100 tests would show that result just due to chance. The lower the better.

You saw 11 values under the prediction but didn't give me all the values, so here's a simpler sign test that assumes the prediction is the true mean (non-discretized):

likelihood below: 50%
likelihood above: 50%

so we have 2 possibilities for 13 observations. What is the probability we see at least 11 below, and 2 at or above?

Well, there are 2 options for every roll with this test, so we have 2^13 = 8192 permutations.

how many would look like yours or even worse?

seeing 13 below: 1
seeing 12 below: 13 (the one above could be any of the 13)
seeing 11 below: 12+11+10+...+1 = 78 (the 2 at-or-above results can land in any of the 78 possible pairs of positions)

so the likelihood of seeing what you saw or worse is approximately: 92/8192 ≈ 1.12% (pretty low, but it could still happen)
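For anyone who wants to verify that number, here's the same tail probability computed directly:

```python
from math import comb

# The tail probability for the experiment in this thread: 13 reload trials,
# and we ask how likely it is that at least 11 land below the prediction if
# each trial independently has a 50% chance of landing below (sign-test assumption).
n = 13
p_tail = sum(comb(n, k) for k in range(11, n + 1)) / 2 ** n
print(f"P(at least 11 of {n} below) = {p_tail:.4f}")  # 92/8192 ~ 0.0112
```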

This considers all the combinations you could see across 13 samples and the likelihood of a particular combination GLOBALLY, but in reality, every single time you run the test there is a 50% chance it'll be lower. It is quite common to see long strings of one result or the other in random numbers. It often doesn't mean anything.

Granted, this is a very rough calculation, but I hope it gets the point across. There is roughly a 1.1% chance that what you saw is pure chance even if the mean is perfectly even (50/50 above/below). That means roughly 1 in 90 players would see the same thing when testing. You need a lot more tests before angrily blaming the programmers for messing up the code. I would also recommend sampling from many different KINDS of battles. If there is a coding error, it probably isn't there all the time but only under special situations.

It's possible the predictor, which is designed to work with a TON of factors, is just a little off in that one battle because it estimates a factor badly (or neglects it). There are many things that affect combat: rough terrain, unit typing, rivers, barb modifiers, civ modifiers, policy modifiers, promotions, unhappiness in the empire... etc. So calculating a simple "mean" is not as straightforward as you think, especially if each component has some randomness built in. Also, you have to remember this is a computer, so they have to "round" to an integer. Roundoff error often creates a small bias in things like games.
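Here's a tiny illustration of that rounding point; this is NOT the game's actual damage formula, just a demonstration that truncating fractional damage pulls the realized average down:

```python
import random

# Purely illustrative: if fractional damage were truncated to an integer before
# being applied, the realized average would end up about half a point below
# the "true" average of the underlying formula.
random.seed(1)
raw = [49 * random.uniform(0.7, 1.3) for _ in range(100_000)]
truncated = [int(x) for x in raw]  # int() drops the fractional part
print(sum(raw) / len(raw))                  # ~49.0
print(sum(truncated) / len(truncated))      # ~48.5
```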

So what is the takeaway here? Well, first off, running 13 attacks, seeing 11 underneath the estimate, and running here to complain just makes you look silly. Just because something happened once doesn't mean anything significant. If you see a similar proportion underneath on a second and third test of similar length, you'd have a much stronger case that there is a bias toward results below the prediction. As I calculated with simple math, your result might happen in 1 out of every 90 tests. Getting more technical: you could increase your sample size to something like 100, because with a high attack of 49 and a spread of 9, you ran fewer tests than there are possibilities, which means your result can't be used to plot a probability distribution and tell you anything. Lastly, you need to run this test in more than one battle before you claim the creators messed up the predictor for all battles. You are cherry-picking: you noticed it being off once and reported it.
 
You use it for what it is, a ballpark estimate of what will happen, and don't make decisions that will get you in trouble.
Just how are you defining "ballpark estimate"? To me that means "Probably about that, give or take X." To me that means "the average is likely to be _."

Now, the combat predictor is giving you its "ballpark estimate". What would you conclude if, upon closer examination, you discovered that the actual results were predominantly lower (or higher)? Wouldn't you expect that a ballpark estimate should sit close to what the actual outcomes are, meaning there would tend to be approximately as many over outcomes as under outcomes? If instead the actual results run 80% under to just 20% over, there is obviously some bias in play that the combat predictor is not taking into account. But whatever that bias is, it could have been factored in such a way that the value the predictor renders sits between approximately equal numbers of overs and unders.

@danaphanous

I understand Statistics pretty well. (Math minor) I realize that 13 samples are really quite insignificant. But I have been paying close attention to the variance between predicted outcomes and actual outcomes and what I see is that on the Emperor setting, the actual outcomes are trending towards @80% under. But my real question to you is, just how many samples would you say constitutes a "significant" sampling? Hundreds? Thousands? Tens of thousands? Hundreds of thousands? If you demand a high enough sample size, you guarantee that no one would exert that much effort in proving their argument.

Of course, any sample would have to be for just ONE situation, because each situation involves differing variables (terrain adjustments, strength adjustments, promotion adjustments, etc.).

So, how many samples (with documentation provided, of course) would it take to convince you that there is a deliberate bias between the predicted outcomes and the actual likely outcomes?
 
I realize it's difficult time-wise due to how slow reloading is, but I feel like it's fair to the developers to do a proper test. If you've done more that you haven't reported and feel confident in the observed effect, then fair enough. I just thought the 13 you talked about was far too low given the number of possible outcomes.

I'd probably want to test a few different types of combat with different modifiers involved. Then we can narrow the error down a bit if there is one. Maybe 5 different battles, making sure each one had a few different kinds of modifiers. I'd be interested in if the error is higher for certain battles. Maybe the error is for specific kinds of situations.

If your variance is as large as you said then we'd probably want at least 40 samples per battle. That would be about twice the number of possible outcomes you observed (9*2+1 = 19). Otherwise I doubt a true normal distribution will emerge from the samples. But it's all hypothetical really; I'm guessing the width of the distribution scales with the damage, so it might only be significant with high-damage units.

If we plot the results and the distribution mean is clearly off the prediction, we can compute a p-value to see how significant the difference is. If the result is 1% or less, I'd say you're probably right. Then again, I don't feel like doing this so it probably won't be done :D
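For whoever does end up collecting the data, here's a minimal sketch of the kind of check I mean; the sample numbers are made up purely for illustration:

```python
import statistics as stats

# Sketch of the proposed analysis. For one battle, record (actual - predicted)
# damage over ~40 reloads, then see whether the mean offset is clearly away
# from zero. The numbers below are invented, not real game data.
def mean_offset_t(diffs):
    """Return the mean offset and a one-sample t statistic against a mean of 0."""
    n = len(diffs)
    mean = stats.mean(diffs)
    stderr = stats.stdev(diffs) / n ** 0.5
    return mean, mean / stderr

sample = [-9, -6, -5, -4, -3, -2, -2, -1, -1, -1, 0, 0, 2]  # hypothetical offsets
m, t = mean_offset_t(sample)
print(f"mean offset {m:.2f}, t = {t:.2f}")  # compare |t| against a t-table with n-1 df
```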

What's your major, out of curiosity? Mine is engineering.
 
Maybe 5 different battles, making sure each one had a few different kinds of modifiers.

If your variance is as large as you said then we'd probably want at least 40 samples per battle.

What's your major out of curiosity? Mine is engineering.
Doable.

B.S. degrees in Business Mgt and English. Minors in Behavioral Science, Business Admin, Communications, Computer Programming (Fortran), Econ, Math, Personnel, Phys Ed, and Psych. [I took an extra year, averaged 18 credits a semester, and fleshed out the General classes requirement into multiple minors.] I also snuck in Coaching Certificates in seven different sports. [Yes, you could say I was a "professional student" for five years.] I started an MBA five years later, but had to stop after three semesters because my father was diagnosed with lung cancer and the family needed every cent I could contribute towards his out-of-pocket expenses. Never got around to finishing that up.
 
The bottom line is that the programmers did have the ability to tweak the program to make the combat outcome predictor be the actual Average, no matter what the setting is. They chose NOT to do that. That to me makes it appear like the programmers intentionally allow deceptive info to be presented.

These situations are (usually) negligence rather than an overt effort to deceive. Sometimes the negligence is pretty extreme (i.e. ranged attack =/= ranged attack in vanilla), but it's hard to believe they outsourced the programming to the I Wanna Be the Guy staff.

If a likely outcome gives you a 70% chance of victory, keep in mind that there is still 30% chance of being defeated.

Irrelevant to OP's test case and description.

It is only deceptive when it gives you a 100% guarantee and then the opposite happens.

This isn't correct. If I give you an "80% probability of winning", and you run 1000 trials, you expect something around 80%. If you instead win only 300 battles out of 1000 trials, it is wrong to conclude that I "wasn't deceptive because I didn't show 100%". The displayed probability in that case was *grossly* distorted from the actual probability.
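For scale, here's a rough check of those hypothetical numbers (a genuine 80% win chance against a 300-of-1000 record), done in log space so the tiny terms don't underflow:

```python
from math import lgamma, log, exp

# How likely is winning 300 or fewer of 1000 independent battles when the
# true win probability really is 80%? (Hypothetical numbers from the example above.)
def log_binom_pmf(k, n, p):
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

n, p = 1000, 0.8
log_terms = [log_binom_pmf(k, n, p) for k in range(301)]
peak = max(log_terms)
tail = exp(peak) * sum(exp(t - peak) for t in log_terms)  # log-sum-exp for stability
print(tail)  # vanishingly small (far below 1e-100): such a record is not just bad luck
```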

OP's post, while the sample size is small, would demonstrate a dishonest UI if his experiment holds over more trials...and that's not ok in a strategy title.

I'm only gonna be convinced the average is biased if you run enough samples of different battles. 13 tries is not nearly enough for your observation to be statistically significant. You said it went as low as 9 below? If that's right, there are at least 19 possibilities per roll. You only tested 13 times. In general you want your sample size to be larger than the number of possibilities, otherwise you have no clue what distribution your results represent. I'd recommend running at least 100 runs for 19 possibilities.

Your sampling can be limited if the results are really lopsided. For example, if he had 15 straight outcomes of exactly 40 (and it wasn't due to a non-random seed), we should already be very, very suspicious of whether the displayed "average" outcome is accurate. He didn't, so we'd want a larger size, but it's worth pointing out that if he gets in the neighborhood of 40+ trials and 95% of them are to the left of the projected bell curve estimate, that estimate is probably wrong and we should have low confidence in it already.

Note that in EU IV, for a long time the displayed % for claim fabrication really was wrong. We were getting caught something like 5 to 10 times as often as the displayed odds suggested. They actually patched it later, but there were a lot of people coming to its defense with "confirmation bias" even when that conclusion was completely inappropriate :p.

Civ V's UI history, while better than that nonsense, is not exactly stellar. This *is* the same game that would move your catapult rather than attacking with it in vanilla, after all, despite telling you it would attack. Bad UI display programming wouldn't be the most shocking outcome in the world.
 
I think what this discussion boils down to is that the presentation of an argument matters as much as the argument itself. An error in the UI, whether in the underlying calculation or in the visual output (or both!) is entirely plausible given the complexity of a game like Civilization V and the inevitability of human error. What is not entirely plausible is that the error was intentional or malicious in nature, as implied by the OP, or that the small number of trials can be used to make any substantive statement.

Had the OP read something like, "Hey, I noticed an interesting discrepancy with the combat predictor. I did some preliminary empirical research and got X result from Y trials in Z situation, which seems odd.", people would probably have responded by saying, "Whoa, that's interesting, somebody should do more research to see if that is consistent across N many trials." Instead the OP was, bluntly, melodramatic, using bolded and underlined words and making unfounded assertions. That tends to lead to a certain degree of backlash.

I agree that the data presented here are interesting, and hopefully somebody with more spare time than me can put a large sample size together for a similar situation. Ideally CaptainPatch could upload a save file so that someone could test under the exact same conditions.
 
This isn't correct. If I give you an "80% probability of winning", and you run 1000 trials, you expect something around 80%. If you instead win only 300 battles out of 1000 trials, it is wrong to conclude that I "wasn't deceptive because I didn't show 100%". The displayed probability in that case was *grossly* distorted from the actual probability.
Man, I wish this were true... because then I'd be rich. I used to play poker, and my biggest strength was getting people to put all their chips in when they had Kings and I had Aces (an 81% favorite). I'll concede that many card players are subject to "selective memory," but I was very good at putting myself in the heavy favorite (75-90% favorable) position, yet I won less than a third of them. Plus, it's completely possible, however unlikely, to have a coin flip "heads" 900 times out of 1000, even 990 out of 1000. This doesn't mean it's "wrong" or "deceptive" to say that a coin flip is a 50/50 shot.

Regarding the reply to the "ballpark estimate" post, yeah, I see what you're saying. I think they even mentioned earlier that as you progress in difficulty, the range of favorable to unfavorable results shifts toward the unfavorable results. Is this deceptive? I don't think you can definitively conclude so, especially since you can't effectively deceive someone who knows you're trying to deceive them. It may make things harder, which frankly someone who selects options like "Immortal" or "Deity" when options like "Prince" are available has removed him/herself from being able to appropriately gripe about. My original point is that if this bothers anyone to such an extent, then that person must feel that they are "owed" a certain result, which implies that there's some level of "guarantee" in the predictor, and there isn't. Even TMIT's response (which, by the way, big fan, best LPer bar none, and I freely admit you're much better at the game than I am) seems to suggest some level of "being owed a certain result."

As for the sample set discussion, I agree that 11 is a little weak, but I advise caution in how you proceed. Frankly, it sounds like you guys are about to spend a lot of time and effort for very marginal benefit.

As to whether it should have more accurate predictions... meh. I mean maybe there should be no RNG in combat at all - A strength 18 unit will dispatch a strength 12 unit in exactly 3 turns and incur exactly 50% damage in the process... But where's the fun in that?
 
You can't really be owed a result in probability, but it's reasonable to expect a probability estimate to be as accurate as possible. The problem is the misleading UI. If your average expected damage is known to be lower in one case than another (i.e. by difficulty), then an identical display between the two is bad design: simple fake difficulty in a strategy title.

Contrast with just updating the UI for that... how does that change anything? All it does is remove trial-and-error play just to learn the rules.

And I do have doubts that in 1000 games where you had pocket aces vs kings, you lost 2/3 :).
 
... it's completely possible, however unlikely, to have a coin flip "heads" 900 times out of 1000, even 990 out of 1000. This doesn't mean it's "wrong" or "deceptive" to say that a coin-flip is a 50/50 shot.
Here's an anecdote that really doesn't prove anything, but does demonstrate how Probability could be bent.

Back during my college days in the early '70s, we had a game club that played a wide variety of mostly strategy games. We had one member that had the uncanny ability to roll whatever number he needed at critical points in his games. It was so uncanny that we suspected that he was somehow manipulating the dice. For his part, he insisted that he never knowingly did anything to make sure he got the numbers he needed. Because of him we went through a variety of "hands off" dice-rolling measures: Pop-a-matics, dice cups, dice jars, etc. None of those altered his ability to get whatever number he needed. The classic example of mind-over-matter was the $5 test: "Here, Lance, is a $5 bill. It's yours if you can roll six sixes in a row."

"What's the catch? What happens if I fail to roll six sixes?"

"No catch. The only 'penalty' is that if you fail to roll six sixes, you don't get the $5."

He then proceeded to roll six sixes in a row. During the sequence, we changed dice twice for the hand rolls, and ended with first the Pop-a-matic and finally with the dice jar.

I guess he really needed $5 at the time.

There was one big Napoleonic miniatures battle we had with another club. We decided to use Lance as our "secret weapon". He was put in charge of ONE unit in the game, a howitzer unit that we spray-painted gold just for the occasion. All he had to do to ruin our opponents' day was to roll a 5 or a 6 every time he fired his howitzer. He had the opportunity a dozen times during the match, but only managed to roll one 5 and one 6 during the game.

We really should have known better. Lance really was not "into" Napoleonic miniatures. (One of the reasons we only gave him the one unit.)
 