Based on the number of edits, Wikipedia appears to be written by a small number of people. Aaron Swartz did some testing and came to the conclusion most content in the final revision comes from people who don’t even have a login.
I have been looking for a way to determine the percentage each author contributed to a particular page in Wikipedia or MediaWiki, something I assumed would be trivial, but it turns out is much more complicate. For performance reasons, MediaWiki (the software that powers Wikipedia) saves the entire text of a new revision–not just what has changed so every edit results in a completely new copy of the page being saved. This means to see who contributed what, requires going back through each edit and comparing it to see what was included in the final version.
I started looking at Aaron’s method to see if it might be useful. What he did is briefly described here. To the best of my understanding this is the basic process he used:
Find the longest matching string between the first revision and the final revision from the Wikipedia dumpfile. Mark that string in the final revision as having come from the author of the first revision. Continue this with the first revision and next largest matching string until there are no more matching strings that haven’t been marked. Move to the second revision and repeat the process. When you finish, you should have a version where every character is marked based on where it came from.
So lets look at an example:
Bob Revision 1: The fox jumped the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.
So our final version is:
The quick brown fox jumped over the lazy hound.
Now lets find the longest matching string between the final version and Bob’s initial edit. We’ll mark Bob’s contributions in Red:
The fox jumped the hound.
The quick brown fox jumped over the lazy hound.
Ok so that is the longest string, now lets find the second longest:
The fox jumped the hound.
The quick brown fox jumped over the lazy hound.
Repeat:
The fox jumped the hound.
The quick brown fox jumped over the lazy hound.
And once more:
The fox jumped the hound.
The quick brown fox jumped over the lazy hound.
Going through the same process marking Joe’s revision in blue produces:
The quick brown fox jumped over the lazy hound.
So we can easily see that Joe contributed 18 non space characters and Bob contributed 21 non space characters that made it into the final revision. This is just fine if you are simply adding information. It gets a bit more tricky when you are removing words because parts of words that have been removed will match words that have been added later.
Consider this scenario:
Bob Revision 1: The icky quaint rowdy fox jumped over the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.
Now if we do the same process we get:
The quick brown fox jumped over the lazy hound.
Even though the words “icky”, “quaint” and “rowdy” added by Bob have been removed in the final version, they are still matching parts of words. ICK of “icky” is matching the last part of quICK. The QU of “quaint” is matching the first part of QUick, etc.
This approach is strongly biased toward the person who started the article or contribute early on–particularly if they added a lot of text that was later removed. What would happen if the person who originally edited the article also pasted in a copy of the alphabet several hundred times at the bottom of their text? It would match everything added later regardless of who added it. Now people probably aren’t doing that, but if they add a bunch of text that eventually gets removed, their removed text will still match a lot of text in subsequent revisions.
Still, this approach isn’t unreasonable and probably gives fair results as long as someone isn’t specifically trying to game the system. Aaron’s method of recursively looking for the longest string is important if you need to see who did what. If you just need to know how many characters each person contributed (and you are fine with the level of accuracy discussed above), there is a much more efficient approach.
More efficient method
The trick is to realize that this method is going to attribute any character in the final revision to the earliest revision to introduce that character. So if revision 1 has three a’s, three a’s of the final revision will be credited to the author of the first revision–regardless of where they occur. Because of this, there is no advantage of recursively matching the longest string unless you are trying to produce an annotated version showing who wrote what block of text. Even then you run into the problem shown above where subsequent edits don’t get full credit.
In other words, we can get the same results simply by counting the number of times each letter appears in each revision starting with the earliest revision. That revision then receives credit for the occurrence of those letters in the final version as long as those letters haven’t been already credited to another revision.
Here is a simple example:
Revision 1: AB BA BAB
Revision 2: AB BAB BAB BABA
Revision 1: A = 3 B = 4
Revision 2: A = 5 B = 7
So the first version gets credit for:
A = 3 B = 4 total: 7 characters or 7/12th of the final version
Revision 2 gets credit for:
A = 2 B = 3 total: 5 characters or 5/12ths of the final version
More accurate methods
As shown, this approach gives an extraordinary amount of weight to early contributors–particularly if they are verbose–regardless of how much of their actual content makes its way into the final version. Its greatest strength is that it handles cases where text is moved from one position to another. It gets this strength at the expense of giving “false credit” to people who contribute in earlier revisions.
Levenshtein Distance
Another possibility would be to use the Levenshtein Distance. This basically counts the number of changes necessary to convert one string into another. I did some testing using Levenshtein Distance doing the following:
Starting at the oldest revision, calculate the Levenshtein Distance from the final revision. This number represents how many characters from the revision appear in the final. Moving to the next oldest revision, calculate the Levenshtein distance, but discount any credit already given to previous revisions.
This handles inserts, where words are inserted between existing words, but it doesn’t handle situations where words, sentences or paragraphs have their positions substituted. If you have ABC, the revision that changes it to CBA gets credit for the contents of C and A even though all it did was move text around.
History Flow and the sentenced based method
To get a more accurate picture you have to use a slightly longer unit than characters. The two simplest ways would be to use words or sentences instead of individual characters. IBM has done some analysis using a tool called History Flow that uses the sentence as the fundamental unit. As they point out, a revision that adds a comma will get credit for the entire sentence that contains the comma.
Line based approach
Jeff Atwood uses the “line” as his fundamental unit. This is a very reasonable approach if you are working with code or something where you are likely to have a lot of new lines. However, for long paragraphs it is a bit problematic. Either they get treated as a single line and you have the same issue with the comma as IBM’s method applied to an entire paragraph or you break the paragraph into lines at specific points and adding in content can reorder the entire paragraph making it all appear new.
Word based approach
A good balance might be to use individual words as the fundamental unit being compared. This drastically reduces the “false credit” problem associated with character based matching while minimizing the “comma problem”. There is still going to be a bit of “false credit” especially for common words. If someone writes 500 words to start an article, all of their original text is deleted and new text added, they are going to get credit for a number of words like “the”, “and”, etc. If what they wrote was one topic, they will get credit for even more because they are likely to have used a lot of keywords that will be in the final revision. Simply pasting in the dictionary a few times would give them significant credit for text that will not appear in the final revision. Still it represents a very reasonable approach, particularly if people aren’t trying to game the system.
Spam and methods
In this type of analysis of a Wikipedia or a different MediaWiki, one crucial thing to consider is spam. Character based and word based analysis is going to be heavily skewed by spam entries–even if they are immediately reverted. Sentence based approaches are probably going to be more accurate if revisions contain spam while word based methods are likely to be more accurate in closed systems where spam isn’t an issue.