Just want to point out here that it's a somewhat shallow interpretation to say that the Aztecs are bad because of their lower winrate. Warmongers and aggressive AIs as a whole do worse, likely from antagonizing their neighbors and getting ganged up on: offense leader flavors have moderate negative correlations with winrate, and a graph of combined offense flavors against winrate shows a pretty clear pattern. Warmongers exist to make trouble and add challenge for a human player, but this AI playstyle does not do a lot of winning. This is something you mention as well, which is accurate and, I think, a fairly desirable trait for a 4X game.
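For anyone who wants to check this kind of relationship on their own logs, here is a minimal sketch of a Pearson correlation between a combined offense flavor score and winrate. The numbers in it are made-up placeholders, not values from the actual AI games, and the variable names are hypothetical; swap in real per-civ figures before reading anything into the result.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# PLACEHOLDER values for illustration only (one entry per civ):
# combined offense flavor score and share of games won.
offense_flavor = [4, 6, 9, 12, 15, 18]
winrate = [0.16, 0.14, 0.15, 0.11, 0.09, 0.10]

r = pearson(offense_flavor, winrate)
print(f"correlation between offense flavor and winrate: {r:.2f}")
```

A negative `r` only says that higher offense flavors go together with lower winrates in the sample; it does not by itself show that aggression causes the losses.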
Overall I agree with this sentiment regarding civ balance. AI winrate data is useful for identifying exceptional cases and overall trends, but it certainly provides a very narrow view into civ balance from a human perspective. For this reason, I think it's more useful in combination with human feedback and other metrics than as a standalone measure.
However, I would argue that there are many other facets of the AI data that are more or less directly applicable to the human player experience and either have been or are currently being used to improve the mod from a more quantitative point of view:
- Victory Conditions: If a certain victory type is too easy or much faster than the others, this is a negative for the player. For example, when culture victories made up about 70% of all games, this limited player choice in difficult spots (since culture might be the only attainable option) and made games containing CV-focused civs much more difficult than those without.
- Technology Research Times: When some eras are much longer or shorter than others, this is also a negative for the player. Not having the opportunity to use cool units because their upgrade comes in 5 turns was a common problem when atomic+ techs were a third of the cost they are now. Sure, the AI average won't exactly match the science output of a human player, but it gives a much better and less arbitrary baseline for science output at various points of the game.
- AI Handicap Bonuses: While obviously not directly applicable to humans, AI bonuses are very visible to a human player and manifest in different ways, such as extremely quick city growth, being exceptionally ahead or behind in techs or policies, or just certain civs consistently having way more yields to work with (particularly those that focus on great people). Being able to analyze handicap and instant yields by source and amount is invaluable for balancing bonuses across triggers and smoothing out their power over the course of a game and between civs, making for a more consistent and fair experience for a human player.
That being said, I think there's also some data from AI games that has little value for balancing for a human player, policy choices being a great example. Policy winrates have more to do with which civs usually pick them (e.g. Tradition has high diplomatic victory percentages, and the top diplo civs, Siam, Austria, and Netherlands, go Tradition every game. Coincidence?) or with the policies contained therein being an engine for churning out more AI handicap yields, both of which are completely inapplicable to balancing policies for a player.
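To make the civ-selection effect concrete, here is a rough sketch that computes a policy tree's diplomatic victory rate twice: once over all games, and once with the diplo-focused civs left out. The game records, the record layout, and the function name are all hypothetical, invented purely for illustration; they are not taken from the real AI game logs.

```python
from collections import defaultdict

# HYPOTHETICAL records: (civ, opening policy tree, victory type or None).
games = [
    ("Siam", "Tradition", "Diplomatic"),
    ("Austria", "Tradition", "Diplomatic"),
    ("Netherlands", "Tradition", None),
    ("Siam", "Tradition", None),
    ("Egypt", "Tradition", None),
    ("Egypt", "Tradition", "Cultural"),
    ("Rome", "Authority", None),
    ("Rome", "Authority", "Domination"),
]


def diplo_rate_by_policy(records, exclude_civs=frozenset()):
    """Share of games per policy tree that ended in a diplomatic victory."""
    wins = defaultdict(int)
    total = defaultdict(int)
    for civ, policy, victory in records:
        if civ in exclude_civs:
            continue  # drop civs whose own kit drives the result
        total[policy] += 1
        if victory == "Diplomatic":
            wins[policy] += 1
    return {policy: wins[policy] / total[policy] for policy in total}


# Civs named in the post as the top diplo civs.
DIPLO_CIVS = frozenset({"Siam", "Austria", "Netherlands"})

overall = diplo_rate_by_policy(games)
without_diplo_civs = diplo_rate_by_policy(games, exclude_civs=DIPLO_CIVS)

print("all civs:           ", overall)
print("diplo civs excluded:", without_diplo_civs)
```

If a tree's rate collapses once the civs that always pick it are removed, the headline number is mostly describing those civs rather than the policies themselves. Note that when a civ picks the same tree every game, there is no within-civ comparison to fall back on, which is part of why this data is hard to use for policy balance.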