Tapestry 5 Links

I’ve been getting back into Tapestry and wanted to compile a list of the links that have been useful to me.

Mailing Lists

Wooki

Wooki is a project written in Tapestry that lets you collaboratively create online books. The full source is available, and it makes a great reference for seeing how things are set up and configured. The authors have also created some modules for installing Tapestry applications, managing database migrations, and more. Spread The Source is the blog associated with the project and offers Tapestry code and news.

Tapestry Hotel Booking

The authors of Wooki are working on a reference application using Tapestry that will recreate the Seam reference application. They are also writing an online book detailing how it was created. The app will be a hotel booking program, and the code is available.

Component Demos

  • Tapestry5Demo
  • Jumpstart – Demonstrates how to do various things and also can be used to “jumpstart” a Tapestry project by providing a working app with basic user management and security.
  • Appspot Component Test – A version of the app1 test application that is part of Tapestry core, deployed on Google App Engine.

Startup Apps

These are some applications that make it easier to get started.  They help set everything up for you and sometimes give you basic user management or security.

  • Jumpstart – As previously mentioned.
  • AppFuse – Lets you create a basic web application by choosing between a number of different web frameworks.

Modules and Components for Tapestry

Blogs

Books

  • Tapestry 5: Building Web Applications – A nice step-by-step guide, but it is geared toward Tapestry 5.0 and there have been a lot of improvements since. Still well worth reading.
  • Tapestry 5 – This book is in German, but it is being translated and should be part of Manning's Early Access Program very soon.

Hudson EC2

Hudson is a continuous integration server that watches a source code repository and automatically builds and tests the code whenever it is updated. There is an interesting plugin that lets it integrate with Amazon EC2. If you need more server resources to build and test the software, Hudson will simply bring up a new instance on EC2, use it, and then shut it down.

Since the source code in a project typically doesn't require much bandwidth to move around, this works a lot better than trying to use EC2 for other on-demand workloads, like video rendering, that are both processor and bandwidth intensive.

Tapestry Run on Startup

Tapestry has an @Startup annotation that lets you mark methods in your AppModule so they run at startup. Here is an example: I needed an application to check whether any users exist and, if not, create an admin user. The code below does this. The only difficulty was getting Hibernate to commit the transaction, since the @CommitAfter annotation does not work within the AppModule. Tapestry uses a HibernateSessionManager, and once I had a reference to it, I was able to commit the transaction.

@Startup
public static void createAdminUser(Session session, HibernateSessionManager hsm) {
  //figure out how many users are contained in the system
  int numberOfUserAccounts = session.createCriteria(User.class).list().size();
  //if there are no users, go ahead and create the admin user
  if(numberOfUserAccounts < 1) {
    User user = new User();
    user.setUsername("admin");
    user.setPassword("adminpassword");
    user.setRoles("user,admin");
    session.save(user);
    //commit the hibernate transaction
    hsm.commit();
  }
}

Good and Bad Programming Languages

Often when someone says a particular programming language is bad, they are referring more to the common practice associated with that language than the language itself. Many times they are really complaining about their own poor programming habits more than the specific language. Sometimes these habits are shared by the entire culture built around a particular language.

Perl is a good example of this. People complain about how difficult it is to read and then proceed to write awful unreadable code. Perl can be very readable, but its terseness makes it easy for people to write huge lines of code that do 10 or 11 different things. You can do the same thing in Java, but most people try to avoid a single line that is 500 characters long because it is a pain to scroll back and forth sideways to read the code.

Sometimes the lack of a particular restraint can inspire horrible code.

Who Writes Wikipedia Algorithm

Based on the number of edits, Wikipedia appears to be written by a small number of people.  Aaron Swartz did some testing and came to the conclusion that most of the content in the final revision comes from people who don't even have a login.

I have been looking for a way to determine the percentage each author contributed to a particular page in Wikipedia or MediaWiki, something I assumed would be trivial but turns out to be much more complicated. For performance reasons, MediaWiki (the software that powers Wikipedia) saves the entire text of each new revision, not just what has changed, so every edit results in a completely new copy of the page being saved.  This means that seeing who contributed what requires going back through each edit and comparing it to see what was included in the final version.

I started looking at Aaron’s method to see if it might be useful.  What he did is briefly described here. To the best of my understanding this is the basic process he used:

  • Find the longest matching string between the first revision and the final revision from the Wikipedia dump file, and mark that string in the final revision as having come from the author of the first revision.
  • Continue with the first revision and the next-largest matching string until there are no matching strings left that haven't been marked.
  • Move to the second revision and repeat the process.
  • When you finish, you should have a version where every character is marked based on where it came from.
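
Before walking through an example, here is a rough Java sketch of that process. The class and method names are mine (not from Aaron's write-up), and the brute-force matching is only meant to make the idea concrete, not to be efficient:

import java.util.Arrays;

/**
 * Rough sketch of the recursive longest-match attribution described above.
 * For each character of the final revision it records the index of the
 * earliest revision that can claim it via a still-unclaimed matching substring.
 */
public class LongestMatchAttribution {

    public static int[] attribute(String[] revisions, String finalText) {
        int[] credit = new int[finalText.length()];
        Arrays.fill(credit, -1);                        // -1 = not yet attributed

        // Process revisions oldest first; each repeatedly claims its longest
        // remaining match against the unclaimed parts of the final text.
        for (int rev = 0; rev < revisions.length; rev++) {
            boolean[] usedInRev = new boolean[revisions[rev].length()];
            while (true) {
                Match m = longestUnclaimedMatch(revisions[rev], usedInRev, finalText, credit);
                if (m == null || m.length == 0) {
                    break;                              // nothing left to match for this revision
                }
                for (int i = 0; i < m.length; i++) {
                    credit[m.finalStart + i] = rev;     // mark characters in the final text
                    usedInRev[m.revStart + i] = true;   // don't let revision text be reused
                }
            }
        }
        return credit;
    }

    private static class Match {
        int revStart, finalStart, length;
    }

    // Brute-force longest common substring restricted to unclaimed characters.
    private static Match longestUnclaimedMatch(String revText, boolean[] usedInRev,
                                               String finalText, int[] credit) {
        Match best = null;
        for (int i = 0; i < revText.length(); i++) {
            for (int j = 0; j < finalText.length(); j++) {
                int len = 0;
                while (i + len < revText.length() && j + len < finalText.length()
                        && !usedInRev[i + len] && credit[j + len] == -1
                        && revText.charAt(i + len) == finalText.charAt(j + len)) {
                    len++;
                }
                if (best == null || len > best.length) {
                    best = new Match();
                    best.revStart = i;
                    best.finalStart = j;
                    best.length = len;
                }
            }
        }
        return best;
    }
}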

So let's look at an example:

Bob Revision 1: The fox jumped the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.

So our final version is:

The quick brown fox jumped over the lazy hound.

Now let's find the longest matching string between the final version and Bob's initial edit.  We'll mark Bob's contributions in [brackets]:

The [fox jumped] the hound.
The quick brown [fox jumped] over the lazy hound.

OK, that is the longest string; now let's find the second longest:

The [fox jumped] the [hound.]
The quick brown [fox jumped] over the lazy [hound.]

Repeat:

The [fox jumped] [the] [hound.]
The quick brown [fox jumped] over [the] lazy [hound.]

And once more:

[The] [fox jumped] [the] [hound.]
[The] quick brown [fox jumped] over [the] lazy [hound.]

Going through the same process and marking Joe's contributions with {braces} produces:

[The] {quick brown} [fox jumped] {over} [the] {lazy} [hound.]

So we can easily see that Joe contributed 18 non-space characters and Bob contributed 21 non-space characters that made it into the final revision.  This works just fine if you are simply adding information.  It gets a bit trickier when you are removing words, because parts of words that have been removed will still match words that are added later.

Consider this scenario:

Bob Revision 1: The icky quaint rowdy fox jumped over the hound.
Joe Revision 2: The quick brown fox jumped over the lazy hound.

Now if we do the same process we get:

[The] [qu][ick] b[row]n [fox jumped over the] lazy [hound.]

Even though the words "icky", "quaint" and "rowdy" that Bob added have been removed from the final version, they still match parts of other words.  The ICK of "icky" matches the last part of quICK, the QU of "quaint" matches the first part of QUick, the ROW of "rowdy" matches the middle of bROWn, and so on.

This approach is strongly biased toward the person who started the article or contributed early on, particularly if they added a lot of text that was later removed.  What would happen if the person who originally edited the article also pasted in a copy of the alphabet several hundred times at the bottom of their text?  It would match everything added later, regardless of who added it. Now, people probably aren't doing that, but if they add a bunch of text that eventually gets removed, their removed text will still match a lot of text in subsequent revisions.

Still, this approach isn’t unreasonable and probably gives fair results as long as someone isn’t specifically trying to game the system.  Aaron’s method of recursively looking for the longest string is important if you need to see who did what.  If you just need to know how many characters each person contributed (and you are fine with the level of accuracy discussed above), there is a much more efficient approach.

More efficient method

The trick is to realize that this method attributes any character in the final revision to the earliest revision that introduced that character.  So if revision 1 has three a's, three a's in the final revision will be credited to the author of the first revision, regardless of where they occur. Because of this, there is no advantage to recursively matching the longest string unless you are trying to produce an annotated version showing who wrote which block of text. Even then, you run into the problem shown above, where subsequent edits don't get full credit.

In other words, we can get the same results simply by counting the number of times each letter appears in each revision, starting with the earliest revision.  That revision then receives credit for the occurrences of those letters in the final version, as long as those letters haven't already been credited to another revision.

Here is a simple example:

Revision 1: AB BA BAB
Revision 2: AB BAB BAB BABA

Revision 1 counts: A = 3 B = 4
Revision 2 counts: A = 5 B = 7

So Revision 1 gets credit for:
A = 3 B = 4 total: 7 characters or 7/12ths of the final version

Revision 2 gets credit for:
A = 2 B = 3 total: 5 characters or 5/12ths of the final version
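
Here is a rough Java sketch of that counting shortcut; the class and method names are mine. Run on the two revisions above, it credits Revision 1 with 7 characters and Revision 2 with 5, matching the fractions worked out by hand:

import java.util.HashMap;
import java.util.Map;

/**
 * Counting shortcut: each revision, oldest first, is credited with the
 * characters it contains that are still unclaimed in the final revision.
 * Whitespace is ignored to match the "non-space characters" convention above.
 */
public class CharacterCountAttribution {

    public static int[] credit(String[] revisions, String finalText) {
        Map<Character, Integer> remaining = countChars(finalText);  // unclaimed characters in the final text
        int[] credits = new int[revisions.length];

        for (int rev = 0; rev < revisions.length; rev++) {
            for (Map.Entry<Character, Integer> e : countChars(revisions[rev]).entrySet()) {
                int available = remaining.getOrDefault(e.getKey(), 0);
                int claimed = Math.min(available, e.getValue());
                credits[rev] += claimed;
                remaining.put(e.getKey(), available - claimed);
            }
        }
        return credits;
    }

    private static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
            if (!Character.isWhitespace(c)) {
                counts.merge(c, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] revisions = { "AB BA BAB", "AB BAB BAB BABA" };
        int[] credits = credit(revisions, revisions[1]);
        System.out.println(credits[0] + " and " + credits[1]);      // prints "7 and 5"
    }
}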

More accurate methods

As shown, this approach gives an extraordinary amount of weight to early contributors, particularly verbose ones, regardless of how much of their actual content makes its way into the final version. Its greatest strength is that it handles cases where text is moved from one position to another.  It gets this strength at the expense of giving "false credit" to people who contributed in earlier revisions.

Levenshtein Distance

Another possibility would be to use the Levenshtein distance, which counts the minimum number of single-character edits needed to convert one string into another. I did some testing with Levenshtein distance using the following approach:

Starting at the oldest revision, calculate the Levenshtein distance between it and the final revision; the fewer edits needed, the more of that revision survives into the final, so the distance gives a rough measure of how many of its characters made it in.  Then move to the next oldest revision, calculate the Levenshtein distance again, and discount any credit already given to previous revisions.
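
For reference, here is the standard dynamic-programming formulation of Levenshtein distance in Java. The per-revision crediting and discounting described above would sit on top of something like this; since that step is only sketched in the text, it is left out here:

/**
 * Classic two-row dynamic-programming Levenshtein distance: the minimum
 * number of single-character insertions, deletions and substitutions
 * needed to turn string a into string b.
 */
public class Levenshtein {

    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;                                // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;                                // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev;                           // reuse the two rows to keep memory small
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}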

This handles inserts, where words are added between existing words, but it doesn't handle situations where words, sentences or paragraphs are moved around.  If a page contains the blocks ABC and a revision changes it to CBA, that revision gets credit for the contents of C and A even though all it did was move text around.

History Flow and the sentence-based method

To get a more accurate picture you have to use a slightly longer unit than characters.  The two simplest ways would be to use words or sentences instead of individual characters.   IBM has done some analysis using a tool called History Flow that uses the sentence as the fundamental unit. As they point out, a revision that adds a comma will get credit for the entire sentence that contains the comma.

Line-based approach

Jeff Atwood uses the "line" as his fundamental unit. This is a very reasonable approach if you are working with code or anything else that naturally has a lot of line breaks.  For long paragraphs, however, it is a bit problematic.  Either the paragraph gets treated as a single line, in which case you have the same comma issue as IBM's method but applied to an entire paragraph, or you break the paragraph into lines at fixed points, in which case adding content can reflow the entire paragraph and make all of it appear new.

Word-based approach

A good balance might be to use individual words as the fundamental unit of comparison.  This drastically reduces the "false credit" problem associated with character-based matching while minimizing the "comma problem". There is still going to be a bit of "false credit", especially for common words. If someone writes 500 words to start an article and all of their original text is later deleted and replaced, they are still going to get credit for a number of words like "the", "and", etc.  If what they wrote was on topic, they will get credit for even more, because they are likely to have used a lot of keywords that appear in the final revision.  Simply pasting in the dictionary a few times would give them significant credit for text in the final revision that they never actually wrote. Still, it represents a very reasonable approach, particularly if people aren't trying to game the system.
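
A word-based version of the earlier counting sketch might look like the following. The whitespace tokenizer and lower-casing are simplifications of mine; swapping sentences in for words as the tokens would give a History Flow-style sentence unit instead:

import java.util.HashMap;
import java.util.Map;

/**
 * Word-level counting: each revision, oldest first, is credited with the
 * words it contains that are still unclaimed in the final revision.
 */
public class WordCountAttribution {

    public static int[] credit(String[] revisions, String finalText) {
        Map<String, Integer> remaining = countWords(finalText);     // unclaimed words in the final text
        int[] credits = new int[revisions.length];

        for (int rev = 0; rev < revisions.length; rev++) {
            for (Map.Entry<String, Integer> e : countWords(revisions[rev]).entrySet()) {
                int available = remaining.getOrDefault(e.getKey(), 0);
                int claimed = Math.min(available, e.getValue());
                credits[rev] += claimed;
                remaining.put(e.getKey(), available - claimed);
            }
        }
        return credits;
    }

    private static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}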

Spam and methods

In this type of analysis of Wikipedia or any other MediaWiki installation, one crucial thing to consider is spam. Character-based and word-based analyses are going to be heavily skewed by spam entries, even if they are immediately reverted.  Sentence-based approaches are probably going to be more accurate when revisions contain spam, while word-based methods are likely to be more accurate in closed systems where spam isn't an issue.