Well, I'm sorry, but I don't really know what the parameters are for testing with 10,000 Units vs 10,000 Units. Were they all individually produced? Were the successful Units then used a second time? Did someone reload a save 10,000 times, or uniquely produce 10,000 individual Units? Etc., etc.
AFAIK, the various CFCers who tested this set up combat-testing scenarios using the Editor, ran the battles, then varied the unit parameters one by one, rerunning the same tests each time, to see how (if at all) those parameter changes altered the battle outcomes.
Using the Editor, you can easily set up a test where each side starts with, say, 10,000 preplaced Warriors (with zero unit-maintenance cost so there are no funding difficulties), and then pit them against each other on various different terrain types and/or city sizes (1,000 battles per parameter test should be sufficient). This would only take one 'turn' -- and you could use the results to count not just outright victories, but also HP remaining for the victorious units, promotions awarded, etc. Then for the next test you can give each side e.g. 10,000 Archers and do the same thing again (or e.g. pit Archers vs. Spears). If you comb through the Civ3 utilities section of CFC, I believe there is actually such a combat-testing scenario available to download -- I've read about it somewhere, but I couldn't find it quickly myself.
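To give a rough idea of what such a test tallies, here is a minimal Python sketch of a batch of battles. It is not the game's code. It assumes the commonly quoted model in which the attacker wins each combat round with probability A/(A+D) and the loser of a round loses 1 HP; the function names, the 3 HP default and the `defence_bonus` parameter are all made up for illustration.

```python
import random

def fight(attack, defence, att_hp, def_hp, rng):
    """Run one battle; return (attacker_won, hp_left_of_winner).

    ASSUMPTION: each round the attacker wins with chance A / (A + D)
    and the round's loser drops 1 HP. This is a stand-in model, not
    something read out of the Civ3 program.
    """
    p_round = attack / (attack + defence)
    while att_hp > 0 and def_hp > 0:
        if rng.random() < p_round:
            def_hp -= 1
        else:
            att_hp -= 1
    return att_hp > 0, max(att_hp, def_hp)

def run_test(attack, defence, defence_bonus=0.0,
             att_hp=3, def_hp=3, battles=10_000, seed=0):
    """Fight `battles` independent battles and tally the outcomes."""
    rng = random.Random(seed)
    # Hypothetical terrain/city modifier: scales the defence value.
    effective_defence = defence * (1.0 + defence_bonus)
    wins = 0
    hp_left_total = 0
    for _ in range(battles):
        won, hp_left = fight(attack, effective_defence, att_hp, def_hp, rng)
        if won:
            wins += 1
            hp_left_total += hp_left
    return {
        "battles": battles,
        "attacker_wins": wins,
        "avg_hp_left_when_winning": hp_left_total / wins if wins else 0.0,
    }

if __name__ == "__main__":
    # Warrior vs Warrior (1/1) on flat ground, then with a 50% bonus.
    print(run_test(1, 1))
    print(run_test(1, 1, defence_bonus=0.5))
```

Varying one argument at a time (attack, defence, bonus, hitpoints) and rerunning mirrors the one-parameter-at-a-time procedure described above, just with a simulated battle in place of the real game.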
Given that the Civ3 program is a fixed (non-evolving) system, and given that the testers knew what (most of) the inputs were (A/D stats, hitpoints, terrain defence bonuses, etc.) and what outputs they'd measured, deriving one or more formulae that reproduce the observed outputs from those inputs would be relatively straightforward for anyone with sufficient knowledge of calculus and statistics. Once the formulae are derived, predictions can be made for as-yet untested parameter values, and appropriate tests set up and run accordingly. If the predicted results are actually observed in the tests (within acceptable error limits), it is reasonable to assume that the formulae are correct.
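The predict-then-check step can be sketched in Python as well. This again assumes the commonly quoted per-round model (attacker wins a round with probability A/(A+D), loser drops 1 HP), from which the chance of winning a whole battle follows as a negative-binomial sum. The "observed" numbers here come from a simulation of that same assumed model, not from real Civ3 test data, and all function names are hypothetical.

```python
import math
import random

def predicted_win_chance(attack, defence, att_hp, def_hp):
    """Chance the attacker wins def_hp rounds before losing att_hp rounds.

    ASSUMPTION: rounds are independent, attacker wins each with
    p = A / (A + D). Sums over k = rounds the attacker loses on the way.
    """
    p = attack / (attack + defence)
    q = 1.0 - p
    return sum(math.comb(def_hp - 1 + k, k) * p**def_hp * q**k
               for k in range(att_hp))

def consistent(wins, battles, predicted, z=3.0):
    """True if the observed win rate lies within z standard errors."""
    std_err = math.sqrt(predicted * (1.0 - predicted) / battles)
    return abs(wins / battles - predicted) <= z * std_err

def simulate_wins(attack, defence, att_hp, def_hp, battles, seed=0):
    """Stand-in for a real test run: count attacker wins under the model."""
    rng = random.Random(seed)
    p = attack / (attack + defence)
    wins = 0
    for _ in range(battles):
        a, d = att_hp, def_hp
        while a > 0 and d > 0:
            if rng.random() < p:
                d -= 1
            else:
                a -= 1
        wins += a > 0
    return wins

if __name__ == "__main__":
    # Predict an as-yet "untested" matchup (attack 2 vs defence 1, 3 HP
    # each), then compare the prediction with the simulated tally.
    predicted = predicted_win_chance(2, 1, 3, 3)
    wins = simulate_wins(2, 1, 3, 3, battles=10_000)
    print(predicted, wins / 10_000, consistent(wins, 10_000, predicted))
```

With real test results you would pass the counted victories to `consistent` instead of the simulated ones; a formula that keeps passing this check across many untested parameter values is the kind of evidence described above.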
Yes, doing all this would have required lots of time and patience on the testers' part -- but since they put in that effort, I think their results can be considered reliable.