Katy and the Wolf

When Katy was two we were watching a Planet Earth DVD where a white wolf catches a terrified Caribou calf. When the wolf caught the baby reindeer we were concerned about what Katy would think. Turns out she wasn’t worried about the deer. She was sitting there cheering for the wolf!

Advanced Link Manager Review

Content may be king, but without any links you won’t get any traffic to your site. A link manager can help you do the following:

  • Identify websites that are currently linking to you.
  • Keep track of reciprocal links and notify you if a link disappears.
  • Track the link popularity of incoming links based on their Page Rank and Alexa value.

I have been using Advanced Link Manager by Caphyon, the same company that makes Advanced Web Ranking, which I reviewed previously.

ALM does a surprisingly good job of finding incoming links.  You are probably familiar with using Google and Yahoo to find links by searching for something like:

link:www.example.com/page.html

If you’ve ever used this, you’ll also know that the search engines don’t like to show you everything they know.  Google in particular only shows a small number of links, which can make it very difficult to know exactly where your links are coming from. ALM gives you a single consolidated view of the links identified by all of the major search engines.  This gives you a much more accurate view of your existing incoming links. In addition, ALM will verify each link by visiting the website and making sure that there is indeed a link to your page, keep track of the follow/nofollow status of the link, and note the anchor text.  This information is very useful because it helps you identify not only who is linking to you, but also how they are linking and what terms are being used.

ALM is also very well suited for link building.  It lets you keep track of who you have sent link requests to and tracks the results.  It even has a feature to send emails requesting links, something that seems very prone to abuse but could have legitimate uses if used sparingly. It also includes a POP3 client so you can manage incoming emails and track responses directly in the application.  This type of capability would probably be especially useful with the server version, where multiple people may be working on link building activities for the same domain at the same time.  The tracking and email features can help ensure that multiple employees aren’t contacting the same person over and over again.

Reciprocal Link Tracking

The reciprocal link tracking lets you track pretty much any type of link exchange.  It goes well beyond a simple “you link to my site and I’ll link back to yours.”  You can track agreements like: “I will link my site A to your site B, if you will link your site X to my site Y.”

Link Partners

ALM has a feature in the enterprise version that will help you identify other sites related to your topic (link partners) where you may be able to obtain links or do a link exchange. This feature allows you to sort links by various attributes and then add them to a list of sites you are targeting in an attempt to obtain links. The other tools let you track your efforts and results toward getting those links, and the built-in email client means you don’t even have to leave the tool for your entire link building process.

Link Growth Over Time Graphs

One thing that both Advanced Link Manager and Advanced Web Ranking get right is the idea of tracking things over time. Without this feature it is very hard to tell whether you are improving or not.  The graphs showing your rank (for AWR) and your link count (for ALM) make it very easy to measure your progress, especially if you only use the tools once every few days or are tracking a large number of sites and pages where it is easy to get confused.

So how do you use a link manager to help increase your traffic?  Obviously the link exchanges mentioned above are one way, but the tool can be very useful even for someone who isn’t interested in exchanging links.  Keeping track of who links to you can help you develop relationships with people interested in your site and in your niche by identifying people who have already linked to you. Establishing a relationship, even with something as simple as a thank you email, can help make sure your link stays active and can encourage more links in the future.

Who Writes Wikipedia Algorithm

Based on the number of edits, Wikipedia appears to be written by a small number of people.  Aaron Swartz did some testing and came to the conclusion that most of the content in the final revision comes from people who don’t even have a login.

I have been looking for a way to determine the percentage each author contributed to a particular page in Wikipedia or MediaWiki, something I assumed would be trivial, but it turns out to be much more complicated. For performance reasons, MediaWiki (the software that powers Wikipedia) saves the entire text of each new revision, not just what has changed, so every edit results in a completely new copy of the page being saved.  This means that seeing who contributed what requires going back through each edit and comparing it with the final version to see what was included.

I started looking at Aaron’s method to see if it might be useful.  What he did is briefly described here. To the best of my understanding this is the basic process he used:

Find the longest matching string between the first revision and the final revision from the Wikipedia dumpfile.  Mark that string in the final revision as having come from the author of the first revision.  Continue this with the first revision and next largest matching string until there are no more matching strings that haven’t been marked.  Move to the second revision and repeat the process.  When you finish, you should have a version where every character is marked based on where it came from.

So let’s look at an example:

Bob Revision 1: The fox jumped the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.

So our final version is:

The quick brown fox jumped over the lazy hound.

Now let’s find the longest matching string between the final version and Bob’s initial edit.  We’ll mark Bob’s contributions with [square brackets]:

The [fox jumped] the hound.
The quick brown [fox jumped] over the lazy hound.

OK, so that is the longest string. Now let’s find the second longest:

The [fox jumped] the [hound.]
The quick brown [fox jumped] over the lazy [hound.]

Repeat:

[The] [fox jumped] the [hound.]
[The] quick brown [fox jumped] over the lazy [hound.]

And once more:

[The] [fox jumped] [the] [hound.]
[The] quick brown [fox jumped] over [the] lazy [hound.]

Going through the same process and marking Joe’s contributions with {curly braces} produces:

[The] {quick brown} [fox jumped] {over} [the] {lazy} [hound.]

So we can easily see that Joe contributed 18 non-space characters and Bob contributed 21 non-space characters that made it into the final revision.  This is just fine if you are simply adding information.  It gets trickier when you are removing words, because parts of words that have been removed will match words that were added later.

Consider this scenario:

Bob Revision 1: The icky quaint rowdy fox jumped over the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.

Now if we do the same process we get:

The quick brown fox jumped over the lazy hound.

Even though the words “icky”, “quaint” and “rowdy” added by Bob have been removed in the final version, they are still matching parts of words.  ICK of “icky” is matching the last part of quICK.  The QU of “quaint” is matching the first part of QUick, etc.
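The matching process and both fox examples can be sketched in a few lines of Python. This is only my reading of the method described above, not Aaron’s actual code. The function names are made up for illustration, matching is case-sensitive, and spaces are allowed to match but are left out of the final counts.

```python
from collections import Counter


def longest_unclaimed_match(rev, used, final, owner):
    """Longest common substring that uses only unused revision characters
    and unclaimed final characters. Returns (length, rev_start, final_start)."""
    best = (0, 0, 0)
    prev = [0] * (len(final) + 1)
    for i, ch in enumerate(rev):
        cur = [0] * (len(final) + 1)
        if not used[i]:
            for j, fch in enumerate(final):
                if ch == fch and owner[j] is None:
                    # Extend the match that ended at rev[i-1], final[j-1].
                    cur[j + 1] = prev[j] + 1
                    if cur[j + 1] > best[0]:
                        length = cur[j + 1]
                        best = (length, i - length + 1, j - length + 1)
        prev = cur
    return best


def attribute(revisions):
    """revisions: list of (author, text), oldest first; the last is the final.
    Returns one author per character of the final text."""
    final = revisions[-1][1]
    owner = [None] * len(final)
    for author, text in revisions:
        used = [False] * len(text)
        while True:
            length, r, f = longest_unclaimed_match(text, used, final, owner)
            if length == 0:
                break
            for k in range(length):
                used[r + k] = True
                owner[f + k] = author
    return owner


def credit(revisions):
    """Count the non-space characters of the final text credited to each author."""
    final = revisions[-1][1]
    counts = Counter()
    for ch, who in zip(final, attribute(revisions)):
        if not ch.isspace():
            counts[who] += 1
    return dict(counts)


final_text = "The quick brown fox jumped over the lazy hound."
print(credit([("Bob", "The fox jumped the hound."), ("Joe", final_text)]))
print(credit([("Bob", "The icky quaint rowdy fox jumped over the hound."),
              ("Joe", final_text)]))
```

On the first example this gives Bob 21 and Joe 18. On the second it gives Bob 36 and Joe only 3 (the b, l and z), even though Joe wrote “quick”, “brown” and “lazy”.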

This approach is strongly biased toward the person who started the article or contributed early on, particularly if they added a lot of text that was later removed.  What would happen if the person who originally edited the article also pasted in a copy of the alphabet several hundred times at the bottom of their text?  It would match everything added later regardless of who added it. Now people probably aren’t doing that, but if they add a bunch of text that eventually gets removed, their removed text will still match a lot of text in subsequent revisions.

Still, this approach isn’t unreasonable and probably gives fair results as long as someone isn’t specifically trying to game the system.  Aaron’s method of recursively looking for the longest string is important if you need to see who did what.  If you just need to know how many characters each person contributed (and you are fine with the level of accuracy discussed above), there is a much more efficient approach.

More efficient method

The trick is to realize that this method is going to attribute any character in the final revision to the earliest revision to introduce that character.  So if revision 1 has three a’s, three a’s of the final revision will be credited to the author of the first revision, regardless of where they occur. Because of this, there is no advantage to recursively matching the longest string unless you are trying to produce an annotated version showing who wrote what block of text. Even then you run into the problem shown above where subsequent edits don’t get full credit.

In other words, we can get the same results simply by counting the number of times each letter appears in each revision starting with the earliest revision.  That revision then receives credit for the occurrence of those letters in the final version as long as those letters haven’t been already credited to another revision.

Here is a simple example:

Revision 1: AB BA BAB
Revision 2: AB BAB BAB BABA

Revision 1: A = 3 B = 4
Revision 2: A = 5 B = 7

So the first version gets credit for:
A = 3 B = 4 total: 7 characters or 7/12ths of the final version

Revision 2 gets credit for:
A = 2 B = 3 total: 5 characters or 5/12ths of the final version
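
The counting method is short enough to sketch in Python. This is a minimal version under my own assumptions: the function name is made up, spaces are ignored, and revisions are passed oldest first with the final revision last.

```python
from collections import Counter


def credit_by_char_count(revisions):
    """revisions: list of (author, text), oldest first; the last is the final.
    Each revision is credited with the characters of the final text that
    earlier revisions have not already claimed."""
    remaining = Counter(c for c in revisions[-1][1] if not c.isspace())
    credit = Counter()
    for author, text in revisions:
        counts = Counter(c for c in text if not c.isspace())
        for ch, n in counts.items():
            take = min(n, remaining[ch])  # can't claim more than is left
            if take:
                credit[author] += take
                remaining[ch] -= take
    return dict(credit)


print(credit_by_char_count([
    ("Revision 1", "AB BA BAB"),
    ("Revision 2", "AB BAB BAB BABA"),
]))
# {'Revision 1': 7, 'Revision 2': 5}
```

Run on the fox example from earlier, it produces the same 21 and 18 as the longest string approach, without any string matching at all.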

More accurate methods

As shown, this approach gives an extraordinary amount of weight to early contributors–particularly if they are verbose–regardless of how much of their actual content makes its way into the final version. Its greatest strength is that it handles cases where text is moved from one position to another.  It gets this strength at the expense of giving “false credit” to people who contribute in earlier revisions.

Levenshtein Distance

Another possibility would be to use the Levenshtein Distance.  This basically counts the number of changes necessary to convert one string into another. I did some testing using Levenshtein Distance doing the following:

Starting at the oldest revision, calculate the Levenshtein Distance from the final revision.  This number indicates how much of the revision survives in the final: the fewer changes needed, the more of its characters appear there.  Moving to the next oldest revision, calculate the Levenshtein Distance, but discount any credit already given to previous revisions.

This handles inserts, where words are inserted between existing words, but it doesn’t handle situations where words, sentences or paragraphs have their positions swapped.  If you have ABC, the revision that changes it to CBA gets credit for the contents of C and A even though all it did was move text around.
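
For reference, here is the standard dynamic programming way to compute the distance itself, as a minimal Python sketch. It is not the code I used for my testing and it only covers the distance, not the crediting step. The function name is just for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]


# Pure inserts: the 22 characters Joe added to Bob's sentence.
print(levenshtein("The fox jumped the hound.",
                  "The quick brown fox jumped over the lazy hound."))  # 22

# Moved text: swapping A and C counts as two substitutions.
print(levenshtein("ABC", "CBA"))  # 2
```

The second call shows the problem with moved text: the distance treats a swap as new content at both positions.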

History Flow and the sentence-based method

To get a more accurate picture you have to use a slightly longer unit than characters.  The two simplest ways would be to use words or sentences instead of individual characters.   IBM has done some analysis using a tool called History Flow that uses the sentence as the fundamental unit. As they point out, a revision that adds a comma will get credit for the entire sentence that contains the comma.
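
To make the comma problem concrete, here is a rough Python sketch of sentence-level crediting. It is not how History Flow is actually implemented. It assumes a naive sentence splitter and credits each sentence in the final revision to the first revision that contains exactly that sentence. The function names are made up.

```python
import re


def split_sentences(text):
    """Very naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def credit_by_sentence(revisions):
    """revisions: list of (author, text), oldest first; the last is the final.
    Returns (sentence, author) pairs for the final text."""
    first_seen = {}
    for author, text in revisions:
        for sentence in split_sentences(text):
            first_seen.setdefault(sentence, author)  # earliest author wins
    return [(s, first_seen[s]) for s in split_sentences(revisions[-1][1])]


revisions = [
    ("Bob", "The quick brown fox jumped over the lazy hound. The hound kept sleeping."),
    ("Joe", "The quick, brown fox jumped over the lazy hound. The hound kept sleeping."),
]
for sentence, author in credit_by_sentence(revisions):
    print(author, "|", sentence)
```

Joe only added a comma, but the whole first sentence is now credited to him. Bob keeps credit only for the second sentence.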

Line based approach

Jeff Atwood uses the “line” as his fundamental unit. This is a very reasonable approach if you are working with code or something where you are likely to have a lot of line breaks.  However, for long paragraphs it is a bit problematic.  Either a paragraph gets treated as a single line, and you have the same issue with the comma as IBM’s method but applied to an entire paragraph, or you break the paragraph into lines at specific points, and adding content can reflow the entire paragraph, making it all appear new.

Word based approach

A good balance might be to use individual words as the fundamental unit being compared.  This drastically reduces the “false credit” problem associated with character based matching while minimizing the “comma problem”. There is still going to be a bit of “false credit”, especially for common words. If someone writes 500 words to start an article and all of their original text is later deleted and replaced with new text, they are still going to get credit for a number of words like “the”, “and”, etc.  If what they wrote was on topic, they will get credit for even more because they are likely to have used a lot of keywords that will be in the final revision.  Simply pasting in the dictionary a few times would give them significant credit for text that will not appear in the final revision. Still, it represents a very reasonable approach, particularly if people aren’t trying to game the system.
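
As a rough illustration, the same counting idea can be applied to words instead of characters. This Python sketch rests on my own assumptions: words are lowercased, punctuation is dropped by a simple regular expression, and the function names are made up.

```python
import re
from collections import Counter


def words(text):
    """Lowercased words with punctuation stripped (a crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def credit_by_word_count(revisions):
    """revisions: list of (author, text), oldest first; the last is the final.
    Each revision is credited with the words of the final text that
    earlier revisions have not already claimed."""
    remaining = Counter(words(revisions[-1][1]))
    credit = Counter()
    for author, text in revisions:
        for word, n in Counter(words(text)).items():
            take = min(n, remaining[word])
            if take:
                credit[author] += take
                remaining[word] -= take
    return dict(credit)


print(credit_by_word_count([
    ("Bob", "The icky quaint rowdy fox jumped over the hound."),
    ("Joe", "The quick brown fox jumped over the lazy hound."),
]))
# {'Bob': 6, 'Joe': 3}
```

Here Joe gets full credit for “quick”, “brown” and “lazy”, which the character based method gave almost entirely to Bob.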

Spam and methods

In this type of analysis of Wikipedia or another MediaWiki site, one crucial thing to consider is spam. Character based and word based analyses are going to be heavily skewed by spam entries, even if they are immediately reverted.  Sentence based approaches are probably going to be more accurate if revisions contain spam, while word based methods are likely to be more accurate in closed systems where spam isn’t an issue.