I don't think some of you guys realize how erroneous such an approach is. This isn't how proper statistics is done, especially regression analysis. You do not start with a belief (i.e. Filthy is a skilled player, thus it's best to use his games only), then collect data based on it. Even if it were true, there could still be big differences in gameplay between skilled players, yet we would have no way of knowing this. A large dataset from different sources could even show that the correlation of starting luxuries to wins is statistically insignificant for the vast majority of players, and we would also have no way of knowing.
I recognize the point you are trying to make (and agree with you to an extent!), but to imply that trying to learn something from a particular sample dataset or case study is erroneous or wrong is a bit ridiculous. As I've already stated, I did not start with the belief "Filthy is a skilled player, so I'll use his data"; I took what is honestly the only viable dataset available. No other player has uploaded all their wins and losses from a set of games that were all played under the same conditions (I even wrote to him to check that the games were complete!). We have the choice of this data... or nothing!
I want to address your point about selection bias in detail, and I'll use the mountain result as an example.
Based on the analysis I did, the correct conclusion to draw is:
- Starting next to a mountain increased the chance of FilthyRobot winning games
If I wanted to say:
- Starting next to a mountain increases the chance of most players winning games
That would technically be a hypothesis, as we only have data from one player.
Selection bias might mean we are wrong. Let's imagine that FilthyRobot is the only player clever enough to build an observatory when next to a mountain (a crude and ridiculous example but hopefully you get the point). We would therefore be wrong to assume that the data applies to all players!
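To make that thought experiment concrete, here is a toy simulation. Everything in it is made up for illustration: the win rates, the size of the mountain bonus, and the two "players" are assumptions, not numbers from my dataset. It only shows how an effect that is real for one player could be absent for another.

```python
import random

def simulate_win_rates(n_games, base_rate, mountain_bonus, rng):
    """Simulate one player; return (win rate with mountain, win rate without)."""
    wins = {True: 0, False: 0}
    games = {True: 0, False: 0}
    for i in range(n_games):
        mountain = (i % 2 == 0)  # alternate starts so both groups are the same size
        p_win = base_rate + (mountain_bonus if mountain else 0.0)
        games[mountain] += 1
        if rng.random() < p_win:
            wins[mountain] += 1
    return wins[True] / games[True], wins[False] / games[False]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical players: the "expert" exploits the mountain, the "typical" one does not.
expert = simulate_win_rates(20000, base_rate=0.5, mountain_bonus=0.2, rng=rng)
typical = simulate_win_rates(20000, base_rate=0.3, mountain_bonus=0.0, rng=rng)

expert_gap = expert[0] - expert[1]
typical_gap = typical[0] - typical[1]
print(f"expert gap:  {expert_gap:+.3f}")
print(f"typical gap: {typical_gap:+.3f}")
```

If we only ever saw the expert's games, we would measure a clear gap and have no way of telling from the data alone whether the typical player shares it.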
So in that sense you are correct, and up to here I agree with you, but the rational thing to do is to place the results we see in the context of what we know about Civilization V. We all know about the big bonus from observatories, so it is not ridiculous or wrong to propose a broader conclusion that applies to all players. Now, if my article were a formal scientific article, I could state these precise conclusions in the results section and then explore what they might mean in the discussion. But my article isn't (and nor should it be), so I chose to try and integrate my results within the bigger picture to make it an enjoyable read that people could relate to. After all, trying to understand what results from a particular study might mean is part of the scientific process!
As an example from biology, John Gurdon showed that if you take the nucleus from a differentiated cell of a tadpole (an intestinal cell in his original experiments) and put it in a frog's egg cell whose own nucleus has been removed, the egg is viable and able to turn into an adult frog. He could have stopped his paper there and said "We conclude this works in frogs, we cannot say anything about other animals or cell types". However, he went on to propose that this underpins all complex life, and also a much broader conclusion that DNA contains sufficient information to make an organism (this was a big deal at the time).
Although that might seem off topic, I'm trying to illustrate that case studies of particular examples can still tell us something, and that a valuable part of the process is placing findings in the bigger picture. With the things my article identified, none of them are exactly surprising, right? If the analysis had said settling next to a mountain makes you more likely to lose, then you might have a point in questioning the validity of the approach, but does anything in the results really suggest to you that the conclusions don't apply to most players?
Just to make sure we're on the same page, I'm also very clearly not proposing "the player that starts next to a mountain is more likely to win"; that would be ridiculous. I am saying that, all other things being equal, I expect the same player to be more likely to win starting next to a mountain than not.
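For anyone who wants to see what "the same player" means in practice, here is a minimal sketch of a within-player comparison. The records are hypothetical (players "A" and "B" and their results are invented, since no second dataset exists), and the function name is my own. The point is only that each player's mountain games are compared with that same player's non-mountain games, never with someone else's.

```python
from collections import defaultdict

def mountain_gap_by_player(games):
    """games: iterable of (player, mountain_start, won) tuples.

    Returns {player: win rate with mountain - win rate without}.
    """
    # tally[player][mountain_start] = [wins, games played]
    tally = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for player, mountain, won in games:
        cell = tally[player][mountain]
        cell[0] += int(won)
        cell[1] += 1
    gaps = {}
    for player, cells in tally.items():
        (w_m, n_m), (w_o, n_o) = cells[True], cells[False]
        if n_m and n_o:  # skip players missing one of the two conditions
            gaps[player] = w_m / n_m - w_o / n_o
    return gaps

# Hypothetical records, for illustration only.
games = (
    [("A", True, True)] * 3 + [("A", True, False)]
    + [("A", False, True), ("A", False, False)]
    + [("B", True, True), ("B", True, False)]
    + [("B", False, True)] + [("B", False, False)] * 3
)
gaps = mountain_gap_by_player(games)
print(gaps)
```

With data from only one player, this table has a single row, which is exactly the limitation being discussed.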
Once again, we have the choice of this data... or nothing, and I'd choose trying to learn something (whatever the caveats) over doing no analysis every time.
