Advanced Link Manager Review

Content may be king, but without any links you won’t get any traffic to your site. A link manager can help you do the following:

  • Identify websites that are currently linking to you.
  • Keep track of reciprocal links and notify you if a link disappears.
  • Track the link popularity of incoming links based on their PageRank and Alexa value.

I have been using Advanced Link Manager by Caphyon, the same company that makes Advanced Web Ranking, which I reviewed previously.

ALM does a surprisingly good job of finding incoming links.  You are probably familiar with using Google and Yahoo to find links by searching for something like:

link:www.example.com/page.html

If you’ve ever used this, you’ll also know that the search engines don’t like to show you everything they know.  Google in particular only shows a small number of links, which can make it very difficult to know exactly where your links are coming from. ALM gives you one single consolidated view of the links identified by all of the major search engines, which gives you a much more accurate picture of your existing incoming links. In addition, ALM will verify each link by visiting the website and making sure that there is indeed a link to your page, keep track of the link’s follow/nofollow status, and note the anchor text.  This information is very useful because it helps you identify not only who is linking to you, but how they are linking and what terms are being used.

ALM is also very well suited for link building.  It lets you keep track of who you have sent link requests to and tracks the results.  It even has a feature to send emails requesting links, something that seems very prone to abuse, but could have legitimate uses if used sparingly. It also includes a POP3 client so you can manage incoming emails and track responses directly in the application.  This type of capability would probably be especially useful with the server version, where multiple people may be working on link building activities for the same domain at the same time.  The tracking and email features can help ensure that multiple employees aren’t contacting the same person over and over again.

Reciprocal Link Tracking

The reciprocal link tracking lets you track pretty much any type of link exchange.  It goes well beyond a simple “you link to my site and I’ll link back to yours.”  You can track agreements like: “I will link my site A to your site B, if you will link your site X to my site Y.”

Link Partners
Link Partners Screen

ALM has a feature in the enterprise version that will help you identify other sites related to your topic (link partners) where you may be able to obtain links or do a link exchange. This feature allows you to sort links by various attributes and then add them to a list of sites you are targeting in an attempt to obtain links. The tool also lets you track your efforts and results toward getting those links, and the built-in email client means you don’t even have to leave the application for your entire link building process.

Link Growth Over Time Graphs

One thing that both Advanced Link Manager and Advanced Web Ranking get right is the idea of tracking things over time. Without this feature it is very hard to tell whether you are improving or not.  The graphs showing your rank (for AWR) and your link count (for ALM) make it very easy to measure your progress, especially if you only go in and use the tools once every few days or are tracking a large number of sites and pages where it is easy to get confused.

So how do you use a link manager to help increase your traffic?  Obviously the link exchanges mentioned above are one way, but the tool can be very useful even for someone who isn’t interested in exchanging links.  Keeping track of who links to you can help you develop relationships with people interested in your site and in your niche by identifying people who have already linked to you. Establishing a relationship, even something as simple as sending a thank-you email, can help make sure your link stays active and can encourage more links in the future.

Who Writes Wikipedia Algorithm

Based on the number of edits, Wikipedia appears to be written by a small number of people.  Aaron Swartz did some testing and came to the conclusion that most content in the final revision comes from people who don’t even have a login.

I have been looking for a way to determine the percentage each author contributed to a particular page in Wikipedia or MediaWiki, something I assumed would be trivial but turns out to be much more complicated. For performance reasons, MediaWiki (the software that powers Wikipedia) saves the entire text of a new revision, not just what has changed, so every edit results in a completely new copy of the page being saved.  This means that seeing who contributed what requires going back through each edit and comparing it to see what was included in the final version.

I started looking at Aaron’s method to see if it might be useful.  What he did is briefly described here. To the best of my understanding this is the basic process he used:

Find the longest matching string between the first revision and the final revision from the Wikipedia dumpfile.  Mark that string in the final revision as having come from the author of the first revision.  Continue this with the first revision and next largest matching string until there are no more matching strings that haven’t been marked.  Move to the second revision and repeat the process.  When you finish, you should have a version where every character is marked based on where it came from.
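To make the process concrete, here is a minimal Python sketch of that procedure as I understand it (the brute-force substring search, the function names, and the (author, text) revision format are illustrative choices, not Aaron’s actual code):

    # Sketch of the longest-matching-string attribution described above.
    # Each character of the final revision is credited to the earliest
    # revision that can still supply it; matched characters are "used up"
    # on both sides so they cannot be claimed twice.

    def longest_unclaimed_match(rev, rev_used, final, final_used):
        """Return (length, rev_start, final_start) of the longest run of
        characters that match and are unclaimed on both sides."""
        best = (0, -1, -1)
        for i in range(len(rev)):
            for j in range(len(final)):
                k = 0
                while (i + k < len(rev) and j + k < len(final)
                       and not rev_used[i + k] and not final_used[j + k]
                       and rev[i + k] == final[j + k]):
                    k += 1
                if k > best[0]:
                    best = (k, i, j)
        return best

    def attribute(revisions, final):
        """revisions: list of (author, text) pairs, oldest first.  Returns a
        per-character list naming who is credited with each character of the
        final text (None if nobody matched it)."""
        credit = [None] * len(final)
        final_used = [False] * len(final)
        for author, text in revisions:
            rev_used = [False] * len(text)
            while True:
                length, i, j = longest_unclaimed_match(text, rev_used, final, final_used)
                if length == 0:
                    break
                for k in range(length):
                    rev_used[i + k] = True
                    final_used[j + k] = True
                    credit[j + k] = author
        return credit

    revisions = [("Bob", "The fox jumped the hound."),
                 ("Joe", "The quick brown fox jumped over the lazy hound.")]
    final = revisions[-1][1]
    who = attribute(revisions, final)
    for author, _ in revisions:
        print(author, sum(1 for ch, w in zip(final, who)
                          if w == author and not ch.isspace()))
    # Bob 21
    # Joe 18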

So let’s look at an example:

Bob Revision 1: The fox jumped the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.

So our final version is:

The quick brown fox jumped over the lazy hound.

Now let’s find the longest matching string between the final version and Bob’s initial edit.  We’ll mark Bob’s contributions in brackets:

The [fox jumped] the hound.
The quick brown [fox jumped] over the lazy hound.

OK, so that is the longest string; now let’s find the second longest:

The [fox jumped] the [hound.]
The quick brown [fox jumped] over the lazy [hound.]

Repeat:

[The] [fox jumped] the [hound.]
[The] quick brown [fox jumped] over the lazy [hound.]

And once more:

[The] [fox jumped] [the] [hound.]
[The] quick brown [fox jumped] over [the] lazy [hound.]

Going through the same process, marking Joe’s contributions in braces, produces:

[The] {quick brown} [fox jumped] {over} [the] {lazy} [hound.]

So we can easily see that Joe contributed 18 non-space characters and Bob contributed 21 non-space characters that made it into the final revision.  This is just fine if you are simply adding information.  It gets a bit trickier when you are removing words, because parts of words that have been removed will match words that have been added later.

Consider this scenario:

Bob Revision 1: The icky quaint rowdy fox jumped over the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.

Now if we do the same process we get:

[The] [qu][ick] b[row]n [fox jumped over the] lazy [hound.]

Even though the words “icky”, “quaint” and “rowdy” added by Bob have been removed in the final version, they are still matching parts of words.  ICK of “icky” is matching the last part of quICK.  The QU of “quaint” is matching the first part of QUick, etc.

This approach is strongly biased toward the person who started the article or contributed early on, particularly if they added a lot of text that was later removed.  What would happen if the person who originally edited the article also pasted in a copy of the alphabet several hundred times at the bottom of their text?  It would match everything added later, regardless of who added it. Now, people probably aren’t doing that, but if they add a bunch of text that eventually gets removed, their removed text will still match a lot of text in subsequent revisions.

Still, this approach isn’t unreasonable and probably gives fair results as long as someone isn’t specifically trying to game the system.  Aaron’s method of recursively looking for the longest string is important if you need to see who did what.  If you just need to know how many characters each person contributed (and you are fine with the level of accuracy discussed above), there is a much more efficient approach.

More efficient method

The trick is to realize that this method is going to attribute any character in the final revision to the earliest revision that introduced it.  So if revision 1 has three a’s, three a’s of the final revision will be credited to the author of the first revision, regardless of where they occur. Because of this, there is no advantage to recursively matching the longest string unless you are trying to produce an annotated version showing who wrote each block of text. Even then you run into the problem shown above, where subsequent edits don’t get full credit.

In other words, we can get the same results simply by counting the number of times each letter appears in each revision starting with the earliest revision.  That revision then receives credit for the occurrence of those letters in the final version as long as those letters haven’t been already credited to another revision.

Here is a simple example:

Revision 1: AB BA BAB
Revision 2: AB BAB BAB BABA

Revision 1: A = 3 B = 4
Revision 2: A = 5 B = 7

So the first version gets credit for:
A = 3 B = 4 total: 7 characters or 7/12ths of the final version

Revision 2 gets credit for:
A = 2 B = 3 total: 5 characters or 5/12ths of the final version
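A rough Python sketch of this counting shortcut (collections.Counter does the tallying; the two revisions are the example above, and spaces are ignored so the numbers line up with the 7/12ths and 5/12ths figures):

    from collections import Counter

    def credit_by_character_counts(revisions, final):
        """revisions: list of (label, text) pairs, oldest first.  Each revision
        claims characters of the final text that earlier revisions have not
        already claimed; spaces are ignored."""
        remaining = Counter(c for c in final if not c.isspace())
        credit = {}
        for label, text in revisions:
            claimed = Counter(c for c in text if not c.isspace()) & remaining
            credit[label] = sum(claimed.values())
            remaining -= claimed      # those characters are now spoken for
        return credit

    revisions = [("Revision 1", "AB BA BAB"),
                 ("Revision 2", "AB BAB BAB BABA")]
    print(credit_by_character_counts(revisions, revisions[-1][1]))
    # {'Revision 1': 7, 'Revision 2': 5}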

More accurate methods

As shown, this approach gives an extraordinary amount of weight to early contributors–particularly if they are verbose–regardless of how much of their actual content makes its way into the final version. Its greatest strength is that it handles cases where text is moved from one position to another.  It gets this strength at the expense of giving “false credit” to people who contribute in earlier revisions.

Levenshtein Distance

Another possibility would be to use the Levenshtein Distance.  This basically counts the number of changes necessary to convert one string into another. I did some testing using Levenshtein Distance doing the following:

Starting at the oldest revision, calculate the Levenshtein distance from the final revision.  Subtracting this distance from the length of the final revision gives a rough measure of how many characters from that revision survive in the final.  Moving to the next oldest revision, calculate the Levenshtein distance again, but discount any credit already given to previous revisions.
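A minimal sketch of how such a test could look (the discount bookkeeping is one reading of the description above, and the edit-distance routine is a standard textbook implementation rather than any particular library):

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance (insert, delete, substitute)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # delete ca
                                curr[j - 1] + 1,      # insert cb
                                prev[j - 1] + cost))  # substitute
            prev = curr
        return prev[-1]

    def credit_by_levenshtein(revisions, final):
        """Oldest revision first: credit roughly len(final) - distance
        characters, minus whatever has already been handed out."""
        handed_out = 0
        credit = {}
        for author, text in revisions:
            surviving = len(final) - levenshtein(text, final)
            credit[author] = max(0, surviving - handed_out)
            handed_out += credit[author]
        return credit

    revisions = [("Bob", "The fox jumped the hound."),
                 ("Joe", "The quick brown fox jumped over the lazy hound.")]
    print(credit_by_levenshtein(revisions, revisions[-1][1]))
    # {'Bob': 25, 'Joe': 22}   (spaces are counted here)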

This handles inserts, where words are inserted between existing words, but it doesn’t handle situations where words, sentences or paragraphs have their positions swapped.  If you have ABC, a revision that changes it to CBA gets credit for the contents of C and A even though all it did was move text around.

History Flow and the sentence based method

To get a more accurate picture you have to use a slightly longer unit than characters.  The two simplest ways would be to use words or sentences instead of individual characters.   IBM has done some analysis using a tool called History Flow that uses the sentence as the fundamental unit. As they point out, a revision that adds a comma will get credit for the entire sentence that contains the comma.

Line based approach

Jeff Atwood uses the “line” as his fundamental unit. This is a very reasonable approach if you are working with code or something else where you are likely to have a lot of new lines.  For long paragraphs, however, it is a bit problematic.  Either the paragraph gets treated as a single line, in which case you have the same comma issue as IBM’s method but applied to an entire paragraph, or you break the paragraph into lines at specific points, in which case adding content can reflow the entire paragraph and make it all appear new.

Word based approach

A good balance might be to use individual words as the fundamental unit being compared.  This drastically reduces the “false credit” problem associated with character based matching while minimizing the “comma problem”. There is still going to be a bit of “false credit”, especially for common words. If someone writes 500 words to start an article and all of their original text is later deleted and replaced, they are still going to get credit for a number of words like “the”, “and”, etc.  If what they wrote was on topic, they will get credit for even more, because they are likely to have used a lot of keywords that appear in the final revision.  Simply pasting in the dictionary a few times would give them significant credit even though none of that text appears in the final revision. Still, it represents a very reasonable approach, particularly if people aren’t trying to game the system.
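A word-level version of the earlier counting sketch, run against the second example, shows the difference (the simple regex tokenizer and the lowercasing are just illustrative choices):

    import re
    from collections import Counter

    def credit_by_word_counts(revisions, final):
        """Same counting idea as before, but the unit is a whole word."""
        def tokenize(text):
            return re.findall(r"\w+", text.lower())
        remaining = Counter(tokenize(final))
        credit = {}
        for author, text in revisions:
            claimed = Counter(tokenize(text)) & remaining
            credit[author] = sum(claimed.values())
            remaining -= claimed
        return credit

    revisions = [("Bob", "The icky quaint rowdy fox jumped over the hound."),
                 ("Joe", "The quick brown fox jumped over the lazy hound.")]
    print(credit_by_word_counts(revisions, revisions[-1][1]))
    # {'Bob': 6, 'Joe': 3} -- Bob no longer gets credit for fragments of
    # "quick" or "brown", though he still picks up shared words like "the"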

Spam and methods

In this type of analysis of Wikipedia or any other MediaWiki installation, one crucial thing to consider is spam. Character based and word based analyses are going to be heavily skewed by spam entries, even if they are immediately reverted.  Sentence based approaches are probably going to be more accurate when revisions contain spam, while word based methods are likely to be more accurate in closed systems where spam isn’t an issue.

Google Apps Premier SLA

The paid version of Google Apps includes a service level agreement that guarantees that the applications will be up 99.9% of the time.  It works like this: if you are down for more than 0.1% of the time, they will add some days to the end of your contract, if you ask for it. The maximum amount you can get in a month is 15 days for 5% downtime.

5% downtime translates into about 36 hours.  So if the service is down for more than 1.5 days in a month, you’ll get an extra half a month.  If the service is down for that long every month and you ask for the credit, your contract will last 50% longer.
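The back-of-the-envelope math, assuming a 30-day month:

    hours_in_month = 30 * 24                  # 720 hours
    max_downtime = 0.05 * hours_in_month      # 5% downtime = 36 hours (1.5 days)
    max_credit_days = 15                      # the most you can get back
    extension = max_credit_days / 30          # 0.5, i.e. the contract runs 50% longer
    print(max_downtime, extension)            # 36.0 0.5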

So basically if you are down for 100% of the time, you’ll only get an extra 15 days of service.  Of course they are unlikely to be down for that long, but it does point out that the SLA really isn’t something that costs them anything to offer. That doesn’t mean it is a bad idea.  The guarantee helps establish expectations and so far Google seems to do a reasonable job of meeting those expectations.

Smart Pricing

Smart pricing is the way Google keeps advertisers happy with using AdWords on properties outside of search.  AdWords lets advertisers put a small image on their conversion page in order to track which ads produce sales. Google uses this information to determine how much to pay publishers who are displaying AdSense on their sites.  Basically, Google will adjust the amount they pay downward for AdSense accounts that don’t convert very well into customers for the AdWords advertisers.

There isn’t a lot of information explaining smart pricing, but it appears to work like this.  Let’s say Google normally charges an advertiser $1.00 for a click on an ad from your site and splits the money with you 50/50. (We don’t know what this split actually is, so this is just a hypothetical number.) However, none of that traffic has converted to sales.  Google may then only charge the advertiser $0.50 and split it with you 75/25.  So now you may only get $0.125 from a click, whereas you’d be getting $0.50 under the original scenario.
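Putting those hypothetical numbers side by side (again, the click price and both revenue splits are guesses, not published figures):

    # Hypothetical smart-pricing example; none of these numbers come from Google.
    normal_payout = 1.00 * 0.50        # $1.00 charged, 50% to the publisher -> $0.50
    smart_priced_payout = 0.50 * 0.25  # $0.50 charged, 25% to the publisher -> $0.125
    print(normal_payout, smart_priced_payout)   # 0.5 0.125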

Originally smart pricing was said to work on an account basis.  So a bunch of low quality traffic on Site A that produces clicks and no sales can reduce the revenue you receive from Site B that has high quality traffic that produces sales. Google usually tries to do things in real time, so it wouldn’t surprise me if smart pricing has become a lot smarter.  Possible changes:

  • Per page or per site smart pricing.
  • Smart pricing based on where the traffic comes from or other attributes.
  • More dynamic changes to smart pricing — if every Tuesday your traffic doesn’t convert to sales, maybe you’ll be paid a lot less on that particular day.

If you want to get an idea of what is possible, look at the Insights section of your Google Analytics account.  It shows the types of comparisons with historical trends that are probably going to be used as part of the smart pricing calculations.

Wildcard DNS and Rackspace Cloud

I’ve been using Rackspace Cloud Sites for a while and so far I’ve been very happy. However, their biggest problem seems to be their inability to support wildcard DNS. Basically, if you want to have *.domain.com all handled by their cloud servers, you can’t do it. I think this is going to start hurting them because WordPress 3.0 has multisite capabilities built in. For example, if you register a domain for your city like gotham.com, you could let people create their own subdomain blogs automatically. So you could have joker.gotham.com and batman.gotham.com just like Blogger and WordPress.com, but targeted to your specific audience.
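For comparison, on a host that does support it, the setup comes down to a single wildcard record in the DNS zone (the record below is a sketch in BIND-style zone file syntax; the IP is a placeholder from the documentation range and gotham.com is just the example above):

    ; Any subdomain of gotham.com resolves to the same server, so
    ; joker.gotham.com and batman.gotham.com need no extra configuration.
    *.gotham.com.   3600    IN      A       203.0.113.10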

However, to do this in Rackspace Cloud Sites, you have to manually create an alias for each new domain; you can’t let it handle things automatically. I’ve gone round and round with them, asking them to consider changing this. At first they told me that they couldn’t, because there would be no way to know which node should handle an incoming request since it could be handled by a number of different physical machines. They were saying that not only did the system not support it in its current form, but there was no way to change it, so it wasn’t even worth requesting that they consider adding the capability in the future.

I didn’t quite buy this, but it sounded like they were saying they couldn’t do it because a single IP address handles requests for multiple domains. I had heard that enabling SSL on the account (another $20 per month) gives you a dedicated IP address for your website. I asked if this was true and they confirmed it. So I asked if wildcard DNS would be possible if we added the SSL capability.

They said it was still impossible. Keep in mind, I wasn’t asking if it would work today; I wanted to know what was possible if they were willing to make changes to their system, so I could ask them to consider an enhancement to their service. I find it hard to believe that there is no possible way to make it handle wildcard DNS, even with a dedicated IP address.

WordPress can handle multisite on their service if you don’t mind using the same domain for everyone. So instead of joker.gotham.com you’d use gotham.com/joker. For what I’m trying to do, I prefer the subdomain approach.