We have been setting up systems to allow us to publish GOTM / WOTM results in a timely manner. In parallel with this, we have been improving our systems to detect people reloading.
It is great to improve your ability to detect replaying (more precisely, as discussed above in the thread), as long as you don't increase collateral damage in the process (aka increase false positives). Those familiar with sensitivity, specificity, and positive and negative predictive values know where I am heading with this. Once we start to attach meaningful consequences to the results of the detection system, we need to look at its performance.
But first, it appears there are five types of games submitted to GOTM:
1. Games with cheating, by players who care about score and try to be covert
2. Games with cheating, by players who don't care about score, and are perfectly comfortable coming here and telling us that they cheat
3. Games with replaying due to some inadvertent circumstance, either out of the player's control (a crash) or some oops! (like forgetting to save), where no attempt to change outcomes has been made.
4. Games with no replaying, but there is some feature that makes the staff suspicious of replaying (perhaps frequent save and RESUME events?)
5. Games with no replaying that look clean.
So step one is to decide which of these types of games we want to include in the valid submission category.
Next, we need to determine how well our detection (diagnostic) system performs. In particular, it must be able to separate a resume event (nothing replayed) from a replay event.
Suppose that our detection system can correctly identify 90% of cheating games as cheating games. That would be 90% sensitivity. 10% of cheating games go undetected, for a 10% false negative rate.
Suppose our system correctly identifies 90% of valid games as valid. That is 90% specificity. The other 10% of valid games get labeled as cheating; that is the false positive rate. Not great if we are banning on strike 2.
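In symbols, these are the standard screening-test quantities (the notation is mine, not the mods'; TP, FP, TN, FN are the counts of true and false positives and negatives):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{false positive rate} = \frac{FP}{FP + TN} = 1 - \text{specificity}.
\]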
It gets worse ... now suppose that out of 110 submissions, 10 are in truth cheaters and 100 are valid. What will our 90% sensitive, 90% specific system tell us?
9 cheaters are labeled cheaters, and 1 slips by. 90 valid games are labeled valid, and 10 valid games are labeled cheaters. And because there are 10 times more valid games than cheaters in reality, of the 19 games labeled cheaters, only about half really are cheaters!!
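To spell out that last step (standard diagnostic-test algebra, nothing specific to our detection system): the chance that a flagged game really is a cheat is the positive predictive value,

\[
\text{PPV} = \frac{TP}{TP + FP} = \frac{0.9 \times 10}{0.9 \times 10 + 0.1 \times 100} = \frac{9}{19} \approx 47\%,
\]

or in general, with cheater prevalence \(p\) (here \(p = 10/110 \approx 9\%\)),

\[
\text{PPV} = \frac{\text{sensitivity} \cdot p}{\text{sensitivity} \cdot p + (1 - \text{specificity}) \cdot (1 - p)}.
\]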
So we really need a system with a false positive rate of 1% or less (99% or more specificity) to avoid penalizing the innocent. Do we have that?
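Here is a minimal sketch of that arithmetic, in case anyone wants to play with the numbers. This is purely my own illustration of the statistics above, not the staff's actual detection code, and the 10-cheaters-per-110-submissions prevalence is just the assumption from this post.

```python
# Screening-test arithmetic for replay detection (illustration only).
# Assumed inputs: 10 true cheaters and 100 valid games per 110 submissions.

def flagged_breakdown(n_cheaters, n_valid, sensitivity, specificity):
    """Return (true positives, false positives, PPV) for a screening test."""
    true_pos = sensitivity * n_cheaters        # cheaters correctly flagged
    false_pos = (1 - specificity) * n_valid    # valid games wrongly flagged
    ppv = true_pos / (true_pos + false_pos)    # fraction of flagged games that really cheated
    return true_pos, false_pos, ppv

for spec in (0.90, 0.99):
    tp, fp, ppv = flagged_breakdown(10, 100, sensitivity=0.90, specificity=spec)
    print(f"specificity {spec:.0%}: {tp:.0f} cheaters flagged, "
          f"{fp:.0f} innocents flagged, PPV = {ppv:.0%}")
```

At 90% specificity this prints a PPV of 47% (the half-innocent pool above); at 99% it rises to 90%, i.e. one innocent player flagged for every nine real cheaters. That is the difference between a system you can ban on and one you can't.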
It would be in everyone's best interest to reduce false positives. If there are avoidable behaviors that raise red flags (such as a manual save within a turn, perhaps?), then players ought to know what they are. If more than two or three replays due to a crash won't be accepted, that should be known, so that on the fourth crash one knows not to submit (risk of a strike). If crash incidents need to be reported, that needs to be explicit (it seems vague at present).
Yes, this is all a pain in the ...

But if the mods choose to go to intensive enforcement with harsh consequences, I think they bring this duty on themselves. If we figure this out in detail once and post it for all to see, it might avoid rehashing it over and over again, which seems to be the usual course of events now (and thus be more efficient in the long run?).
dV