Is the forum dying or in a renaissance: a statistical survey

While I might agree with some of your conclusions, the presentation raises doubts about the validity of the train of thought leading me to that agreement. In less time than it took to read the text, 3 fundamental problems with the graph were apparent. The cumulative effect of those problems is to disguise or distort what you want to show. This weakens the argument. Even worse, there are statements in the text that contradict the data as it is visually represented.

Please post the actual data used to make the graph. Posting the numbers would strengthen your argument a lot if it can counteract the shortcomings of the presentation.

It would be unfortunate to lose support for a worthy cause by polemical defense of easily corrected shortcomings of presentation. I truly hope that we can avoid the difficulty some people had last year in distinguishing between critiques of presentation, disagreements of opinion and personal attacks.

What are the specific issues you have? I have an idea what the 3 fundamental problems are from your second post, but would've had nothing but rather poor guesses to go off of from only this post... which isn't very helpful at improving it in the future.

Knowing where in particular you believe the statements contradict the data would also be helpful. My conclusions are only based on what I'm aware of, and the data isn't perfect. I'd hoped that some other forumgoers could fill in some of the gaps, and indeed Cyc has helped do that for the Democracy Games in particular.

The original actual data I used was attached to the first post, and I've now updated it to the latest I have. The differences are two more recent data points, and some additional calculations to make the info in the latest post already available. Only one person downloaded the initial version, however, so the demand for it seemed mediocre at best.

Well, I agree with what appears to be the stats for OT. As Blue Monkey states, they may not be presented properly, but in essence, they do show the mirror effect to the other activity of the forums. It is kind of funny to me how the OT went static there at the release of Civ5, when players were trying to figure out the game. And then when Civ5 failed, so did OT. And then, as you say, with the elections at hand, voicing one's opinion of them became predominant, especially with no new Civ game to follow.

I still think it's a shame they took NES out of Civ3, but money-wise, it was probably a good move to emphasize the rise of Civ4. Politics....

I'm glad my conclusions aren't totally bonkers!

I'm curious about the NES's - I don't know a whole lot about them. To me, most of them didn't seem particularly related to Civ3 - some more than others, and perhaps most were inspired by Civ3 (and some directly based on a game), but they seemed distinct from most of the stories and tales, which usually had a direct connection to a Civ3 game. However, I never got into the NES's as I did the stories and tales, so perhaps the grouping was more logical than I realized.

It's impossible to draw any valid conclusions from the graph as displayed. Nothing can be determined about relationships of subforums within CFC, let alone correlate that with external events.

Statistics are numbers, not colored lines. So far we don't have the numbers posted. There is no evidence whatsoever of a causal relationship of any kind. Since the data points displayed do not align with the intervals on the x-axis, without the numbers there is not even firm evidence of a correlation.

People have every right to their opinions. Mistaking opinion for fact is quite another thing. This doesn't imply any ill intent - no suggestion of manipulation for propagandistic goals. OTOH drawing conclusions from specious evidence can only lead to misunderstanding.

I can make a graph comparing the various subfora, but would rather not do so if it's going to just result in a large amount of negative energy. The data for the various subfora is all in the .xlsx, as it has been.

The x-axis is partially a result of not being an Excel expert, and could be improved by someone with more expertise in Excel graphs. But as the data is at irregular intervals, I'm not sure how practical it would be to make a label on the x-axis for every point with data. Interpolation, as well as using the .xlsx when more precision is needed, was used in trying to figure out what the likely causes of changes in behavior were.

The evidence is rather limited to begin with, so it may never be possible to make it sufficiently non-specious. However, I think it's still interesting to draw conclusions from what we do have. The data points have been available since the initial post for anyone looking for facts; the conclusions I've drawn are of course opinions. I've posted them in part for those that aren't aware of what happened when - someone might not be aware that Beyond the Sword came out when the Civ4 spike occurred, and that's likely useful information to know. It might not actually be the cause, but I find it interesting to learn what others think may have caused spikes in traffic, and assumed others might as well. Someone else might have a competing theory, which would also be interesting to know.

I'd love to know how to mark particular events in time on the graph that may not be in the data being graphed (such as a way to indicate when Conquests was released, or the presidential election started), but am not savvy enough in Excel to know how to do that.

As a non-Civ-related analogy, there was a large spike in searches for waffles in 2004, according to Google Trends. My suspicion is that was due to John Kerry's presidential campaign (and perceptions thereof), but I can't prove that as a fact. Perhaps it was actually due to a bunch of waffle shows on the Food Network. Similarly, the evidence isn't fantastic, but I think it's sufficient to suggest, "I think John Kerry's candidacy had something to do with this" if asked, rather than replying "I don't know."

I should perhaps re-iterate that I am not a statistician, and the goal is to capture the essence, as Cyc alluded to, not make a research paper out of it (any progress on that, Delta Strife?). You're welcome to contribute your expertise, including new content and improved presentations.

----------------------------

On a possibly unrelated note, I'm still uncertain whether posting this in a more general forum would be good (more perspectives, might be interesting to more people), or bad (stirring up too much "My Civ version is better than your Civ version" sentiment). And if the good would outweigh the bad, which forum that would be. OT is the most Civ-neutral, but might also result in a lot of potentially interested parties missing out, and this is more about the Civ forums than OT.
 
Comments on constructing line charts

I didn't include the specifics simply because I wasn't sure this was the appropriate place to get into what amounts to a lesson on visual display of quantitative information. There's a short book called How To Lie With Statistics that is actually a good and relatively quick way to learn how to accurately display data graphically. It's written for the lay person. It's been around long enough (1954) that many public libraries have it.

The three fundamental problems:
  1. Proportions: The vertical axis is elongated compared to the horizontal. If it was drawn on graph paper the intervals would appear to be rectangles rather than squares. This has the effect of exaggerating slopes - peaks and drops appear more extreme.
  2. Intervals: The majority of data points don't line up with the labels on the indices. For example look at all the data points that fall between 400,000 & 500,000 (vertical axis). This makes it difficult to compare points from one line segment to another. The labelled points on the horizontal axis are at irregular intervals - not the same time elapsed between each. This distortion compounds the effect of the x/y misproportion. For valid analysis the intervals need to be uniform. The data itself appears to have been collected at irregular intervals - which is okay - but the intervals on the chart need to be evenly spaced.
  3. Data Set: the C5 line has only six data points - understandable since it did not exist over the course of the whole data set. However, more effective analysis is limited to data sets where each line has approximately the same number of points. Compare C5 to either (a) the six data points on the other lines which are in the same calendar time frame, or (b) the six data points of other lines covering the same part of a game development cycle - prerelease, release, etc. In the case of (a) the chart would consist of only the lines from C5's first data point on to the present. In the case of (b) the C5 & C4 lines would be shifted so that their first six data points line up. To do the kind of trend analysis attempted here more than six data points on the C5 line would be needed.

These sorts of pitfalls are common to all situations involving line charts. It's my understanding that the data available to work with has some inherent problems such as being collected at irregular intervals. That can be rectified by careful construction of the chart itself.

Then there are the problems of analysis. These are to be found in the text rather than the chart.
  • Correlation = in this case things happening at the same time. That's all it means. It's correct to note that one line dipped and another rose over the same period of time. There is no supposition of any cause/effect relationship.
  • Causality = one thing causes another. This requires a lot more evidence than correlation. It's a common mistake to assume a causal relationship when there is only a correlation demonstrated. This is often framed as "the sun came up because the rooster crowed". A rise in OT posts at the same time as the presidential elections is only a correlation. Causality could only be demonstrated by something like a content analysis - how many posts (proportionally) were about the election. It's still only a correlation, but a more reasonable supposition of causality would be something specific to CFC - such as release of a new version of the game. True causality is demonstrated by, for example, the dip in C3 when NES was moved.
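To make the distinction above concrete, here is a minimal sketch of what a correlation figure does and does not tell you. The two series are made-up numbers, not CFC data, and `pearson` is just a hand-rolled helper for illustration. Two things that both happen to grow over the same period score close to 1 even though neither has anything to do with the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up figures for illustration only: two quantities that both rise
# over the same five periods but have no connection to each other.
series_a = [10, 12, 15, 19, 24]
series_b = [3, 4, 6, 7, 9]

r = pearson(series_a, series_b)
print(f"r = {r:.3f}")  # a value near 1 says "moved together", nothing more
```

A high value here is only the "happened at the same time" observation; showing cause would still need something like the content analysis described above.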

This may seem very nit-picky. But in the absence of hard data any conclusions are mere opinion. Anything not found within the data set should not be the basis of interpretations of causality. Also it's important to be careful not to ignore data that contradicts the intended analysis. Looking at the graph it appears that C3 rises and falls with approximately the same rhythm as C4 & C5. The characterization of the C3 forums as in slow decline relative to other subforums also does not take account of the similar trends for C4 & OT.

I'm not suggesting that anyone needs to do an exhaustively precise data analysis. This would involve a lot of data collection, such as tracking the shift of established members' posts to new subforums when a game is released, and correcting for population as suggested above, not to mention things like content analysis to help determine causality. I'm only suggesting that some basic precautions be taken, such as careful chart construction and being more reserved in drawing conclusions from the data. Placing analysis and interpretation/conclusions in separate paragraphs helps on that last point.
 
Charts can be aids to analysis or discussion. But the numbers are the facts on which to base that analysis.

I've looked at the file attached to the OP. I didn't see a chart like the one in post 37. I'm not clear about which data from the spreadsheet were used to construct that chart.

Is it the columns titled "CivX Total" - X being the game version? If that's the case the numbers belie the chart. For example, according to the spreadsheet C3 has seen a slow but steady increase since 4/5/08 (1,547,673 posts) and as of 2/24/13 is higher (1,841,954 posts) than the previous peak on 10/11/07 (1,818,084 posts).

Or is it the numbers on the "Trends" spreadsheet - which shows some negative numbers? Unexplained negative numbers on some but not all of the forums looks suspicious.
 
Starting with this post since it's shorter...

The latest chart is constructed from the "Trends" spreadsheet, which in turn is based on the "CFC" spreadsheet, and the chart is on the "New Post Rate" worksheet, which was supposed to live after "Trends" but currently resides before.

The CivX Total chart uses the CFC spreadsheet, which shows posts as of a certain date, which is cumulative over all of CFC's existence, and thus always increasing (except the NES move, deleted posts, and posts moved out of public view).

The Trends spreadsheet shows differences between intervals of data collection, weighted to posts per year so that varying interval sizes have less of an impact. The percentages on that chart aren't meaningful (they were copy-pasted), and aside from that, I believe the only negative data point is Civ3 on 04/05/08, due to the NES move, which I haven't attempted to correct for.
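For anyone who wants to check the posts-per-year idea without opening the spreadsheet, here is a rough sketch of the calculation as I understand it from the description above. It is not the spreadsheet's actual formula. The three Civ3 totals are the ones quoted earlier in the thread, with the dates read as month/day/year; the function name and the 365.25-day year are my own assumptions.

```python
from datetime import date

# Cumulative Civ3 post totals quoted earlier in the thread,
# with dates read as month/day/year (an assumption).
civ3_totals = [
    (date(2007, 10, 11), 1_818_084),
    (date(2008, 4, 5), 1_547_673),
    (date(2013, 2, 24), 1_841_954),
]

def posts_per_year(totals):
    """Convert cumulative totals into an annualized rate per interval.

    Each result is (end date of interval, new posts per year), so intervals
    of different lengths can be compared on the same footing.
    """
    rates = []
    for (start, count_start), (end, count_end) in zip(totals, totals[1:]):
        days = (end - start).days
        rates.append((end, (count_end - count_start) / days * 365.25))
    return rates

rates = posts_per_year(civ3_totals)
for end, rate in rates:
    print(end.isoformat(), round(rate))
```

With these three points the first interval comes out negative, which is the NES move showing up as a drop in the cumulative total, and the second comes out positive.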

I've added that book to my Amazon wish list so I don't forget it - might pick it up from a local library, or the next time I make an order that qualifies for free super saver shipping. Looks like it could be a useful book, and at an affordable price and length.

I hadn't thought about proportions, although you're right that proportions certainly can tell a different tale. I'm always wary when charts don't start at zero on the vertical axis (if the data is always positive); perhaps I'm less sensitive otherwise. I'd just used what Excel gave me by default and didn't attempt to fine-tune it. As it turns out, there are about 20 more pixels between vertical tick marks than horizontal ones. I don't know off the top of my head how to correct this in Excel. Although even if I could, couldn't it still be manipulated by making the x-intervals cover longer periods of time, and thus making the slopes steeper? Even with that caveat, I see how it would appear odd and suspicious on graph paper to have non-square rectangles; perhaps I'm less scrutinizing of virtual graph paper.

The intervals aren't ideal. You're right that there are a lot of points between the few vertical labels; I've probably erred on the side of too few labels given the relatively large size of the chart. I've now figured out how to display five times as many gridlines (slightly more gray ones every 20,000 between the 100,000s), which does help in comparing points. The horizontal ones I'm aware are not ideal. Although technically they're all about one year, four months, and 14 days apart (sometimes a day or two off, since some months are short), the presentation is not very good. I'm not sure why Excel isn't giving me date-friendly options for that axis; I'd much prefer to have, for example, every-12-months major intervals with 3-month tick marks (depending on what worked well with the timespan and vertical tick marks).

The data set remark makes sense. Although I think it might actually be more complex, since the first six points from Civ3/Civ4, beginning with a similar pre-release time, would likely cover a different amount of time, and there's also the factor that Civ5's expansion pack came out at a different interval than Civ3/Civ4, which followed a similar approximately-yearly cadence after the vanilla versions. Despite that, I like the idea of moving the data over to correspond with release dates.

It appears I was a bit casual with correlations between the data and Civ expansions, implying they had an impact in many cases and essentially saying Beyond the Sword did. Which I suppose is suspect statistically. I still suspect there was a causal relationship in many of these cases, but indeed it can't be proven. However, it probably is reasonable to keep trends and what I expect caused them in separate paragraphs in the future.

All-in-all, it does seem rather nitpicky, but I can see now many of the places where I may have unintentionally led astray. I find I prefer the specifics approach; I learned from this post, whereas the two previous ones were rather vague criticism and thus frustrating.
 
I think you've made a valiant effort to cope with a faulty data set. You've certainly been more open-minded & perceptive than some others here at CFC that were involved in a discussion about survey design. The following remarks - as with the previous posts - are not meant to criticize your intent or the work put into it. Take them as part of an ongoing process of improving all of our understanding of trends over the life of CFC.
The latest chart is constructed from the "Trends" spreadsheet, which in turn is based on the "CFC" spreadsheet, and the chart is on the "New Post Rate" worksheet, which was supposed to live after "Trends" but currently resides before.
My question was meant to be about which specific column(s) were used to construct the most recent chart. Knowing that I could have made a couple of small charts to illustrate the longer explanatory post. Would have made it easier to understand.

The CivX Total chart uses the CFC spreadsheet, which shows posts as of a certain date, which is cumulative over all of CFC's existence, and thus always increasing (except the NES move, deleted posts, and posts moved out of public view).

The Trends spreadsheet shows differences between intervals of data collection, weighted to posts per year so that varying interval sizes have less of an impact. The percentages on that chart aren't meaningful (they were copy-pasted), and aside from that, I believe the only negative data point is Civ3 on 04/05/08, due to the NES move, which I haven't attempted to correct for.
Cumulative figures add on to but don't remove from a total. Subtracting NES at the time of the move gives a false data point. The correction would really need to be done by going back & subtracting from each data point the NES posts from the equivalent period. Not correcting for it blows any trend analysis involving C3. The whole C3 line prior to the removal would be lower rather than dropping at a single point. Since NES posts varied from interval to interval, as the figures stand there's no way to tell if C3 was getting more or less traffic at any given point prior to the move.
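A small sketch of the correction described above, using made-up numbers rather than anything from the spreadsheet. It assumes the NES share of the C3 total is known at each collection date, which may not be true of the data actually collected; the field names `c3_total` and `nes_in_c3` are hypothetical.

```python
# Hypothetical snapshots in collection order. "c3_total" is the cumulative C3
# figure as recorded; "nes_in_c3" is how many of those posts were NES posts.
# After the move the recorded total no longer contains any NES posts.
snapshots = [
    {"label": "before move, first", "c3_total": 1000, "nes_in_c3": 200},
    {"label": "before move, second", "c3_total": 1500, "nes_in_c3": 350},
    {"label": "after move", "c3_total": 1300, "nes_in_c3": 0},
]

def deltas(values):
    """New posts between consecutive snapshots."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Uncorrected: NES posts vanish at a single point, producing a false drop.
raw = [s["c3_total"] for s in snapshots]

# Corrected: NES posts are removed from every snapshot, not just the last.
corrected = [s["c3_total"] - s["nes_in_c3"] for s in snapshots]

print("raw deltas:      ", deltas(raw))
print("corrected deltas:", deltas(corrected))
```

In this toy case the uncorrected series shows a fall after the move while the corrected one keeps rising, which is the sense in which the single-point subtraction gives a false data point.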



I've added that book to my Amazon wish list so I don't forget it - might pick it up from a local library, or the next time I make an order that qualifies for free super saver shipping. Looks like it could be a useful book, and at an affordable price and length.
One of the things I like about the book is the illustrations of well and poorly designed charts for the same data.

I'm always wary when charts don't start at zero on the vertical axis (if the data is always positive); perhaps I'm less sensitive otherwise. ... As it turns out, there are about 20 more pixels between vertical tick marks than horizontal ones. ... Although even if I could, couldn't it still be manipulated by making the x-intervals cover longer periods of time, and thus making the slopes steeper?
... there are a lot of points between the few vertical labels; I've probably erred on the side of too few labels given the relatively large size of the chart. I've now figured out how to display five times as many gridlines (slightly more gray ones every 20,000 between the 100,000s), which does help in comparing points. ... (depending on what worked well with the timespan and vertical tick marks).
It's OK to start from a non-zero base line. As long as the starting point allows for all data points and the intervals are equal. For example an axis marked off at intervals of 100 could start at 5000 rather than zero if the lowest data point on that axis was 5000. In your chart it would make sense to drop out C1 & C2 - their numbers are an order of magnitude lower than the others. Put them in a separate chart. Scale it differently but comparably - maybe by hundreds rather than thousands of posts. Add notes to indicate important points such as the release of new versions of Civ. That would allow them to be useful for an overall trend analysis rather than just being nearly flat lines at the bottom of the chart.


... I'd much prefer to have, for example, every-12-months major intervals with 3-month tick marks ... Although I think it might actually be more complex, since the first six points from Civ3/Civ4, beginning with a similar pre-release time, would likely cover a different amount of time, and there's also the factor that Civ5's expansion pack came out at a different interval than Civ3/Civ4, which followed a similar approximately-yearly cadence after the vanilla versions. Despite that, I like the idea of moving the data over to correspond with release dates.
Imho it would be okay to add together the collected data to get more even intervals. The important thing is that the intervals on the chart be even. That might mean that one line skips a data point - no data was collected at that time - so that a line is straight where the other lines jog. It's still more accurate than varying the scale on an axis.
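One possible way to get evenly spaced points out of irregularly collected cumulative totals is sketched below. Note that it uses linear interpolation between the two nearest collection dates, which is a swapped-in technique rather than the adding-together suggested above, and it assumes posting was steady within each collection interval. The sample figures and both function names are hypothetical.

```python
from datetime import date, timedelta

def interpolate_total(samples, when):
    """Linearly interpolate a cumulative total at `when`.

    `samples` is a date-sorted list of (date, total). Returns None when
    `when` falls outside the collected range, so nothing is extrapolated.
    """
    for (d0, n0), (d1, n1) in zip(samples, samples[1:]):
        if d0 <= when <= d1:
            fraction = (when - d0).days / (d1 - d0).days
            return n0 + (n1 - n0) * fraction
    return None

def even_grid(start, end, step_days):
    """Dates from `start` to `end` inclusive, `step_days` apart."""
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += timedelta(days=step_days)
    return dates

# Hypothetical cumulative totals collected at irregular dates.
samples = [
    (date(2010, 1, 1), 1000),
    (date(2010, 3, 2), 1600),   # 60 days after the first sample
    (date(2010, 7, 10), 2900),  # 130 days after the second sample
]

grid = even_grid(date(2010, 1, 1), date(2010, 7, 10), 30)
resampled = [(d, interpolate_total(samples, d)) for d in grid]
for d, total in resampled:
    print(d.isoformat(), round(total))
```

Plotting `resampled` gives equal spacing on the time axis. The cost is that the interpolated points are estimates rather than collected figures, so they should be labelled as such.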

It's a bit trickier to minimize distortions but the time intervals don't need to be measured by days in the calendar: ... at release, at expansion announcement, at expansion release, at new version announcement, at new version release, ... So long as the same intervals are used for each line on the chart and the intervals on the chart are equally spaced.
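The simplest version of lining the games up by release is to re-express each data point as days since that game's release, sketched below. The release dates are the North American ones as I recall them and should be double-checked; the post-rate figures are made up, and `days_since_release` is a hypothetical helper. Measuring by events (announcement, expansion release, and so on) would need a table of those event dates instead.

```python
from datetime import date

# North American release dates as I recall them (verify before relying on them).
RELEASE = {
    "Civ4": date(2005, 10, 25),
    "Civ5": date(2010, 9, 21),
}

# Made-up (collection date, posts per year) points, for illustration only.
series = {
    "Civ4": [(date(2005, 10, 25), 100), (date(2006, 10, 25), 300)],
    "Civ5": [(date(2010, 9, 21), 120), (date(2011, 9, 21), 250)],
}

def days_since_release(game, points):
    """Re-key each point by days elapsed since the game's release."""
    return [((when - RELEASE[game]).days, value) for when, value in points]

aligned = {game: days_since_release(game, pts) for game, pts in series.items()}
for game, pts in aligned.items():
    print(game, pts)
```

Once both lines share the same "days since release" axis they can be drawn on one chart with equal spacing, and the same part of each game's life cycle sits at the same horizontal position.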

I still suspect there was a causal relationship in many of these cases, but indeed it can't be proven. However, it probably is reasonable to keep trends and what I expect caused them in separate paragraphs in the future.
It's reasonable to speculate on causality within the data space. Especially when a repeating pattern can be shown. Such as dips in posts when a newer version is released. If outside events such as the election are suggestive then it would be important to show a similar pattern during each election cycle. It would still be only a correlation, but at least suggestive. The trick would be excluding other possible factors.


All-in-all, it does seem rather nitpicky, but I can see now many of the places where I may have unintentionally led astray. I find I prefer the specifics approach; I learned from this post, whereas the two previous ones were rather vague criticism and thus frustrating.
I'm happy to be more specific. When I've tried on other occasions I felt castigated for expecting accuracy. Always feel free to ask for further explanation. I'm also open to having my own mistakes pointed out.

The advantage of being careful with data/charts is that it makes your case stronger. You can't be accused of massaging/manipulating the evidence. I'm not a statistician by the way. Statistics was required in grad school. My field is visual semiotics (analysis of form & meaning). Critiques of things like charts presented with ulterior motives in newspapers, etc. were part of my early research.

There may be ways to get around things like the NES complication. If we continue the discussion we may find a way to get a better chart. Maybe I can help, but I'll need some specifics on things like which columns of data were used to make it.
 