Reloading in GOTMs

Status
Not open for further replies.
"False positives"? We don't make exclusion decisions lightly. False positives are unlikely - borderline cases get sent a "warning"
A false positive is not the same as a borderline case, Ainwood.

The first is when suspicion is raised - beyond borderline - while in reality there was no fault, the second is when it is close to where a line was drawn.
What you do once suspicion has been raised is another matter.

Going back and applying new techniques to old results makes sorry reading.
Yes, well, this was always obvious, and we didn't need any techniques to know. I remember a discussion about a Russian site that was so plagued by cheating to the point that the staff just quit and people here took it for granted that it was 'a Russian thing'. That struck me as pretty ignorant at the time.
 
Ribannah said:
I remember a discussion about a Russian site that was so plagued by cheating to the point that the staff just quit and people here took it for granted that it was 'a Russian thing'.
I haven't heard anything about that. The topic of cheating is raised on our site from time to time (just as on CFC), but I never heard any mention of a cheating 'plague'. Everything is pretty much the same as here, I think.
 
I totally support Ainwood's attempts to detect replaying and other suspicious moves. I agree with those who are saying that there is no point to the GOTM if you can cheat the rules. Easy for me to say as I'm an odd gamer in that I rarely replay sections of games anyway. I find that I get confused about what I've actually done in *this* game rather than the previous one(s) :blush: Maybe replaying is a bad habit that people get into? As someone said previously (Dynamic Spirit?), I've had some excellent games trying to recover from a bad position - not just in Civ but in other strategy games and RPGs too.

Despite that, though, I've still received a 'warning' message.:( I have only played 1 GOTM and I'm quite sure that I never replayed at all. Not sure whether to send private email to find exactly what the problem was, as I expect they have more than enough work to do anyway. I struggled with HOF mod at first, didn't switch on many features till late-game. Also had lots of short sessions. As a total newbie I probably made a stupid mistake somewhere. I'm not defending my game, as I'm sure the HOF trace is accurate, but I do need to avoid being disqualified through ignorance.

The point is that I believe it's my responsibility to prove that I played fair, so I shall be *very* careful to be above reproach when I start the next game. It's a small effort, really, compared to the efforts of the GOTM organisers.
 
I've still received a 'warning' message.:( I have only played 1 GOTM and I'm quite sure that I never replayed at all. Not sure whether to send private email to find exactly what the problem was, as I expect they have more than enough work to do anyway. I struggled with HOF mod at first, didn't switch on many features till late-game. Also had lots of short sessions. As a total newbie I probably made a stupid mistake somewhere. I'm not defending my game, as I'm sure the HOF trace is accurate, but I do need to avoid being disqualified through ignorance.

I received a polite warning in GOTM6 which was due to lots of short sessions. In subsequent GOTMs I've tried to play longer sessions and haven't received a warning since. That was probably it.
 
A false positive is not the same as a borderline case, Ainwood.
I wasn't saying that they were.

The first is when suspicion is raised - beyond borderline - while in reality there was no fault, the second is when it is close to where a line was drawn.
I would consider a false positive to be where we have considered that a person has cheated, when, in fact, they have not. As I said, we are fairly certain of our evidence before we make a judgement.

I am fairly confident that we are 'drawing the line' in such a place that we won't get any false positives - because where there is doubt, we give people the benefit.
 
My general observations on all this discussion:

1) "Do-overs" aren't allowed in real life generally, sporting events in particular (Coach: "The defense anticipated that pass play and intercepted the ball. I'm throwing my "do-over" flag and calling a running play") or Civ IV multiplayer games. Accordingly, I find no merit to any contention that reloading should be allowed.

2) A flat rule of "no reloading" (with a small exception for computer crashes) is easiest for the staff to implement. You should keep in mind that the staff are volunteers. If their work gets too complicated, they may burn out and we won't have a GOTM.

3) A flat "no reloading" rule is also fairest to the vast majority of players who don't reload. Allowing some "reloading" or comparing reloaded results to non-reloaded results is unworkable since there are different degrees of reloading. How do you distinguish between a game where one or two lost battles are reloaded (which may or may not be critical battles) and a game where numerous lost battles are reloaded? By the way, when is the last time anyone reloaded because their suicide cat killed the defending longbow despite the 1% chance of success?

4) While false positives are certainly a possibility, we should trust the staff to only disqualify those games which they have reasonable certainty have been reloaded and provide a warning if they have a suspicion. If a game is disqualified and the player doesn't believe it justified, I'm sure the staff carefully re-reviews the game in light of the player's explanation. Ainwood's and AlanH's posts on the policy set forth a reasonable policy in my view. If possible to do so without disclosing the staff's methods for detecting reloading, I would appreciate a list of the types of innocent behavior (such as a lot of short sessions) which can look like reloading. Given the lack of anyone publicly claiming their game was unjustly disqualified, I seriously doubt there are many, if any, false positives.

5) Some of the disqualifications are probably first time GOTM players who don't realize there is a "no reloading" rule. I expect that for the vast majority of those players, they have no problem after being notified of the rule.

6) As a couple of others have posted, I used to reload a lot before discovering the GOTMs. I've found that following the "no reloading" rule is more enjoyable for me for two reasons. First, when I win, I have greater satisfaction that I prevailed without reloading. I've realized that a reloaded victory is tainted. Second, it has greatly improved my game. When a bad result occurs, I have to look for ways around it. Further, if a strategic or tactical decision led to the bad result, I'm less likely to repeat the mistake.

7) I upgraded the RAM in my computer to play Civ IV.

8) For those who think they can't live, especially on higher levels, without reloading, you'll survive. Just do your best, read the spoiler threads and pick up pointers on how to improve your game. Set yourself a goal of doing just a little bit better each time. For me, I'm currently 191st in the Civ IV GOTM global rankings and my goal is to move up each game. You can move up the global rankings even with a low scoring game. In GOTM9 I went from 203rd to 191st with a 1,842 point game after resigning because I had made serious strategic and tactical mistakes and Gandhi was about to overrun my empire with his highly technologically advanced military. The key, I think, is playing and submitting each month as there are a lot of players who don't submit losses.
 
The nature of a false positive is that there is no doubt, hence the word positive. It is often not the certainty of the evidence that causes this, but the conclusions that you draw. The measurement may be absolutely perfect, but it may not reflect what you think it reflects.

If you rely on evidence there will always be false positives. Even in the best judicial systems this is so. There are innocent people sent to jail every single day. This is usually accepted if a judicial system is, for all its imperfections, serving 'the greater good', an argument also used by the 3OTM staff when they introduced a cut-off point for the number of turns per session. However, in view of what you now know about the 3OTM, that argument must be re-evaluated.

Load detection by itself does more damage than good. It is indirect evidence. In order to draw conclusions with a high standard of validity, you need direct evidence.
 
My general observations on all this discussion:

1) "Do-overs" aren't allowed in real life generally, sporting events in particular (Coach: "The defense anticipated that pass play and intercepted the ball. I'm throwing my "do-over" flag and calling a running play") or Civ IV multiplayer games. Accordingly, I find no merit to any contention that reloading should be allowed.

2) A flat rule of "no reloading" (with a small exception for computer crashes) is easiest for the staff to implement. You should keep in mind that the staff are volunteers. If their work gets too complicated, they may burn out and we won't have a GOTM.

3) A flat "no reloading" rule is also fairest to the vast majority of players who don't reload. Allowing some "reloading" or comparing reloaded results to non-reloaded results is unworkable since there are different degrees of reloading. How do you distinguish between a game where one or two lost battles are reloaded (which may or may not be critical battles) and a game where numerous lost battles are reloaded? By the way, when is the last time anyone reloaded because their suicide cat killed the defending longbow despite the 1% chance of success?

4) While false positives are certainly a possibility, we should trust the staff to only disqualify those games which they have reasonable certainty have been reloaded and provide a warning if they have a suspicion. If a game is disqualified and the player doesn't believe it justified, I'm sure the staff carefully re-reviews the game in light of the player's explanation. Ainwood's and AlanH's posts on the policy set forth a reasonable policy in my view. If possible to do so without disclosing the staff's methods for detecting reloading, I would appreciate a list of the types of innocent behavior (such as a lot of short sessions) which can look like reloading. Given the lack of anyone publicly claiming their game was unjustly disqualified, I seriously doubt there are many, if any, false positives.

5) Some of the disqualifications are probably first time GOTM players who don't realize there is a "no reloading" rule. I expect that for the vast majority of those players, they have no problem after being notified of the rule.

6) As a couple of others have posted, I used to reload a lot before discovering the GOTMs. I've found that following the "no reloading" rule is more enjoyable for me for two reasons. First, when I win, I have greater satisfaction that I prevailed without reloading. I've realized that a reloaded victory is tainted. Second, it has greatly improved my game. When a bad result occurs, I have to look for ways around it. Further, if a strategic or tactical decision led to the bad result, I'm less likely to repeat the mistake.

7) I upgraded the RAM in my computer to play Civ IV.

8) For those who think they can't live, especially on higher levels, without reloading, you'll survive. Just do your best, read the spoiler threads and pick up pointers on how to improve your game. Set yourself a goal of doing just a little bit better each time. For me, I'm currently 191st in the Civ IV GOTM global rankings and my goal is to move up each game. You can move up the global rankings even with a low scoring game. In GOTM9 I went from 203rd to 191st with a 1,842 point game after resigning because I had made serious strategic and tactical mistakes and Gandhi was about to overrun my empire with his highly technologically advanced military. The key, I think, is playing and submitting each month as there are a lot of players who don't submit losses.
Very good post, I agree with pretty much everything and it's well written. It also occurred to me that, as you said, "[g]iven the lack of anyone publicly claiming their game was unjustly disqualified, I seriously doubt there are many, if any, false positives." Although we don't know exactly what the staff uses for vetting, it does seem that they are doing a good job of distinguishing intentional replays from those resulting from accidents. And I also added more RAM and a new graphics card to play Civ 4.
 
The nature of a false positive is that there is no doubt, hence the word positive.

This is entirely wrong. "False positive" is a standard term in engineering and statistics, with a standard meaning. It means that a particular test reports a positive result, when the truth is negative. Wikipedia has a good, elementary summary of the usage:

http://en.wikipedia.org/wiki/Type_I_and_type_II_errors

When using any test that can generate false positives, there is always doubt when the test generates a positive result, because the positive result might be either a false positive or a true positive. Some tests generate many more false positives than true positives (for example, explosives screening at airports generates thousands of false positives for every true positive, because very few of the passengers are actually carrying explosives). The usual course of action when a positive result is generated is to use a more sensitive test (e.g., search the passenger's luggage) to determine whether the positive result is a false positive or a true positive.
 
The nature of a false positive is that there is no doubt, hence the word positive. It is often not the certainty of the evidence that causes this, but the conclusions that you draw. The measurement may be absolutely perfect, but it may not reflect what you think it reflects.
"False positive" was not my term.

Yes, there can be 'false positives', but my point is that I think we have managed the risk of this very well to the point where the number of 'false positives' will be acceptable.
 
Just a few observations:

"False positives"? We don't make exclusion decisions lightly. False positives are unlikely - borderline cases get sent a "warning" - in fact, warning is probably a bit harsh a term. They get sent a note that we have some concerns about their submission, offering some options to try and make their submissions more robust (request that they take more care, technical support, setting autosaves to every turn, reminder of what the rules are etc). They have a chance to explain themselves.

Exclusions:
We are confident (after a *lot* of testing) that the system we are using is robust. We will still listen to people's explanations, and reconcile their explanations against our evidence.
If you are saying that the warnings (yes, I got one too) are more in the line of friendly advice than a message that "we think you cheated but we can't prove it", that is a useful clarification for the recipients. Given the way exclusions and warnings were discussed in close proximity in the first post and the email text, their more benign nature was not so apparent (at least to me). The more harsh impression of their meaning, and posts suggesting they were pretty common, made me worry about false positive issues. Especially with people posting about two strikes and you are banned. If warnings carry no implication of guilt, then false positive is not an issue for the warnings.

I am also reassured to hear that you have looked at the specificity of the system (assuming that is what you mean by robust) regarding exclusions.

In my descriptions of five types of games submitted, the first two (cheaters, I wish we could find them all) and the last one (clean games that look clean) are straightforward.

The fourth type, players who have not replayed any decision in their game, but somehow look suspicious, I would hope is a theoretical group only, and that someone who has not replayed anything always is identified as a clean game.

It is the replays without any attempt to modify outcomes, usually due to crashes, that seem to be the other difficult area. I am assuming that my warning came from my crash nightmare in GOTM 11, which is well documented in this thread: http://forums.civfanatics.com/showthread.php?p=4700014#post4700014 But since the warning covers potentially three games, at present I don't know for sure. I don't know what would have been an issue in 10 or 12. In retrospect, as crashes became more regular after 1950, I probably should have stopped play until I had it resolved (perhaps a future specific recommendation, if not rule, for GOTM?).

My reaction of saving at end of each turn manually, and even within turns (so there would be less to replay if there was a crash), may have made it look worse. If so, maybe a rule of no intra-turn saves?

Addendum: There is another way to deal with crashes late in the game on less powerful computers, and that is to save, exit and reboot periodically (maybe every 10 turns). By doing this under the player's control, nothing needs to be replayed (although the game is reloaded a lot). But might that show up as suspicious behavior? Hard to know whether that is a viable option or not. If crashes are less of an issue in SGOTM, maybe the fact that turnsets are only 20 turns (or 10 later in game) is protective there (proof of principle?).

Mandating save on exit in HOF would eliminate reloads for exiting without saving.

If you don't want to raise some of these to rule, then perhaps at least to the "ignore at your peril" level ...

dV
 
This is entirely wrong. "False positive" is a standard term in engineering and statistics, with a standard meaning. It means that a particular test reports a positive result, when the truth is negative.

@DaviddesJ: I am glad you and I have found one thing we can agree on ;) Your description is perfect.


When using any test that can generate false positives, there is always doubt when the test generates a positive result, because the positive result might be either a false positive or a true positive. Some tests generate many more false positives than true positives (for example, explosives screening at airports generates thousands of false positives for every true positive, because very few of the passengers are actually carrying explosives). The usual course of action when a positive result is generated is to use a more sensitive test (e.g., search the passenger's luggage) to determine whether the positive result is a false positive or a true positive.
Since we are teaching here, I think you meant to say that the second test needs to be more specific? And the ratio of true positives to false positives is also highly determined by how rare the thing you are testing for is. Even a good test, applied to a population where the object of the test is rare, creates lots more false positives than true positives.

dV
 
Since we are teaching here, I think you meant to say that the second test needs to be more specific?

Ideally, you would want the second test to be more specific, and have higher power. A test that always says "no" is more specific, but it's not much good because it has no power. And, depending on the degree of correlation, it's not even necessarily important that the second test be more specific. E.g., suppose you have an explosives screening test that has a false-positive rate of 1/1000 (and a relatively low false-negative rate). It would be perfectly good to put the false positives through a second test that has a false-positive rate of 1/100 (i.e., less specific), as long as the second test is relatively uncorrelated with the first test (so that it excludes almost all of your false positives).
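
To put numbers on the two-test example above, here is a minimal sketch. It assumes the two tests err completely independently (an idealisation of "relatively uncorrelated") and that a passenger is only treated as positive if both tests flag them. The function name is made up for illustration.

```python
from fractions import Fraction


def combined_false_positive_rate(fpr_first, fpr_second):
    """False-positive rate when an innocent passenger must be flagged by BOTH tests.

    Assumption: the two tests make their mistakes independently, so the
    chance of both being wrong about the same person is the product.
    """
    return fpr_first * fpr_second


fpr_first = Fraction(1, 1000)   # first screening test, from the example
fpr_second = Fraction(1, 100)   # second, less specific test, from the example

combined = combined_false_positive_rate(fpr_first, fpr_second)
print(combined)  # 1/100000
```

The less the tests are correlated, the closer the real combined rate gets to this product. If the second test tends to be fooled by the same innocent bags as the first, the product overstates the benefit.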
 
Ideally, you would want the second test to be more specific, and have higher power. A test that always says "no" is more specific, but it's not much good because it has no power. And, depending on the degree of correlation, it's not even necessarily important that the second test be more specific. E.g., suppose you have an explosives screening test that has a false-positive rate of 1/1000 (and a relatively low false-negative rate). It would be perfectly good to put the false positives through a second test that has a false-positive rate of 1/100 (i.e., less specific), as long as the second test is relatively uncorrelated with the first test (so that it excludes almost all of your false positives).
The test that always says no has 0% sensitivity, so as you say is not good for anything. Your explosives example already starts at 99.9% specificity (= 1 per 1000 false positive rate), a rate I would drool over in diagnostic tests. But suppose you screen 10 million and 10 passengers, where 1 in a million has a bomb, and sensitivity is 100%. You get 10 true positives, and 10,000 false positives.

Now if you run your 99% specificity test (1/100 false positive rate) on the 10,010 positives, let's still give it 100% sensitivity, you get 10 true positives and still 100 false positives. Unless you run a test 3, are you going to arrest the 100 innocent? Of your 110 positives now, only 9% are guilty. If your second test had a 1/10,000 false positive rate, then you would only have 1 false positive, and of 11 positive now 91% are guilty.

Problem is finding a test with 1/10,000 that does not have poor sensitivity. Because in my example, once sensitivity gets down to 90%, you will have one false negative. And that is one blown up airplane.
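
For anyone who wants to check the arithmetic in this example, here is a small sketch that reruns it. The passenger counts and rates are the ones from the post, the function names are just illustrative, and the second test is assumed to be independent of the first.

```python
from fractions import Fraction


def screen(carriers, innocents, sensitivity, false_positive_rate):
    """Return (true positives, false positives) for one screening pass."""
    true_pos = carriers * sensitivity
    false_pos = innocents * false_positive_rate
    return true_pos, false_pos


def share_guilty(true_pos, false_pos):
    """Fraction of all positives that are real (the positive predictive value)."""
    return true_pos / (true_pos + false_pos)


carriers, innocents = 10, 10_000_000

# Test 1: 100% sensitivity, 1/1000 false-positive rate.
tp1, fp1 = screen(carriers, innocents, Fraction(1), Fraction(1, 1000))

# Test 2 on the positives from test 1: 1/100 false-positive rate.
tp2, fp2 = screen(tp1, fp1, Fraction(1), Fraction(1, 100))

# Alternative test 2 with a 1/10,000 false-positive rate.
tp2b, fp2b = screen(tp1, fp1, Fraction(1), Fraction(1, 10_000))

# If sensitivity of the first test drops to 90%, some carriers are missed.
missed = carriers - screen(carriers, innocents, Fraction(9, 10), Fraction(1, 1000))[0]

print(fp1, fp2, fp2b, missed)                    # 10000 100 1 1
print(f"{float(share_guilty(tp2, fp2)):.0%}")    # 9%
print(f"{float(share_guilty(tp2b, fp2b)):.0%}")  # 91%
```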

I think you will like this question: what is the specificity and sensitivity of a coin toss as a diagnostic test? :crazyeye:

Do we need a statistics thread ...:D

dV

Explosives,... airplanes,... should we be saying that out loud?
Don't worry, it's just a harmless academic discussion ...
FREEZE! HANDS IN THE AIR!
What the ...?
It's ... the SWAT team !!
That's just great! Harmless my ....

With homage to Compromise
 
But suppose you screen 10 million and 10 passengers, where 1 in a million has a bomb

There were 750 million airline boardings in the US this year. If 1 in a million had a bomb, we would be in a lot of trouble!

Now if you run your 99% specificity test (1/100 false positive rate) on the 10,010 positives, let's still give it 100% sensitivity, you get 10 true positives and still 100 false positives. Unless you run a test 3, are you going to arrest the 100 innocent?

No, of course not. You just do additional tests, that add more and more information. If you see what looks like a bomb in an x-ray, and the passenger also tests positive for chemical traces of explosives, then you're certainly going to examine his bags really carefully!
 
This thread just proved my beliefs again that we have experts or, at least, doctoral students in almost all areas of science and technology in this GOTM forum.

I was going to suggest a method to penalize player rankings based on the number of turns they actually reloaded, thereby giving a small tolerance for keyboard mis-strokes (such as mis-declaring on Monty, who is much stronger than you), barbarians taking your capital at 1.1% odds, and other small odd events that could really discourage a casual player from playing through. But now it seems we have too high a moral code in the community to allow for such trade-offs.

So, only one question now: after the added workload of reloading judgments, does our respected staff still think that we can publish the results in a timely manner?
 
There were 750 million airline boardings in the US this year. If 1 in a million had a bomb, we would be in a lot of trouble!
Indeed! Picked 1 in a million to make my math easy, not as a representation of an empirical finding. But if there are even fewer bombs, the positive predictive value (what percent of test positives are true positives) gets worse and worse ...
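
A rough sketch of how the positive predictive value falls as bombs get rarer, assuming a single test with 100% sensitivity and the 1/1000 false-positive rate used earlier. The prevalence values and the function name are only illustrative.

```python
from fractions import Fraction


def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Share of test positives that are true positives."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


sensitivity = Fraction(1)
false_positive_rate = Fraction(1, 1000)

# Illustrative prevalences: 1 in a thousand, 1 in a million, 1 in a billion.
prevalences = [Fraction(1, 10**3), Fraction(1, 10**6), Fraction(1, 10**9)]
ppvs = [
    positive_predictive_value(p, sensitivity, false_positive_rate)
    for p in prevalences
]

for p, ppv in zip(prevalences, ppvs):
    print(f"prevalence {p}: PPV = {float(ppv):.2e}")
```

With the test held fixed, each drop in prevalence drags the PPV down with it, which is why a single test, however good, is not enough on its own for very rare events.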

No, of course not. You just do additional tests, that add more and more information. If you see what looks like a bomb in an x-ray, and the passenger also tests positive for chemical traces of explosives, then you're certainly going to examine his bags really carefully!
Absolutely! And the point of adding all of these tests is to have a series of tests that has enough specificity to not be arresting innocent people, while still not letting any bombs through (perfect sensitivity). Quite the challenge to achieve !

And the coin toss ...?

dV
 
So, only one question now, after the added workload on reloading judgments, do our respected staffs still think that we can publish the results in a timely manner?
We've set-up processes to make them faster.

I have a few things to do first, but I hope to publish to get us up-to-date by the end of the weekend.

After that catch-up, we'll publish within 5 days of the event closing.
 
Originally Posted by Ribannah: "The nature of a false positive is that there is no doubt, hence the word positive."
This is entirely wrong. "False positive" is a standard term in engineering and statistics, with a standard meaning. It means that a particular test reports a positive result, when the truth is negative.
What we have here is a situation where the measurement is entirely correct, without statistical error, but the model may not be. Statistical error types are not an issue here, but regularity conditions are.
 
Yes, there can be 'false positives', but my point is that I think we have managed the risk of this very well to the point where the number of 'false positives' will be acceptable.
What kind of number would be acceptable, weighed against what is gained?
 