Inventing in software

To invent, you need a good imagination and a pile of junk.

— Thomas Edison

This is what is so fascinating about programming. Your “pile of junk” consists of digital assets instead of physical material, so the raw materials are not limited by the normal laws of supply and demand. In software, you are limited only by your imagination.

Does Size Matter to Search Engines?

Yahoo and Google are trying to one-up each other on whose database contains more pages. Does it really matter? Isn’t relevancy more important? It depends on who your user is.

If most of your queries on search engines return hundreds or thousands of results, then it probably doesn’t matter who has the biggest index. As long as Google and Yahoo get the popular pages, you’ll most likely find what you are looking for. If you get a bunch of results from your search, you probably aren’t looking for a specific document on the web, so it doesn’t matter which page you get as long as it has the information you are after.

Most internet users fall in the above category. I don’t. I find that many of my searches return fewer than 20 results, and sometimes only 2 or 3 or even 0. For people like me, the number of pages searched is much more important. In fact, for those types of searches, the method used to order the results (determine the relevancy) isn’t really important. With a small number of pages, it is easy to scan through the list and find the most relevant entry.

Today I was working on a piece of equipment and it started giving me a less-than-helpful error message. I typed the manufacturer (in quotes) and the error message (also in quotes) into Google, hoping to find out how to fix my problem. There were zero results. I tried it on Yahoo and got the same thing. After poking around in some forums, I was able to find a post that described the problem on a page that was missing from both Yahoo’s and Google’s databases. The post contained all the phrases that I had searched for; it just wasn’t in either search engine’s index.

Five years ago I taught a community college class about the internet. I used Geocities to put up the tests, class outline, etc. Today I tried searching for:
site:geocities "juco internet class"
In Google I get zero results. With Yahoo I get a single result that links to the page I was searching for.

I’m sure that there are other items that I could find with Google that won’t show up in Yahoo. I’m not trying to say one engine is better than the other. The point is that the index size does matter if you are looking for a specific document. If Yahoo decides that they are going to maintain a larger index than Google, then there are going to be pages that can be found in the Yahoo index, but not in Google’s. If you are looking for one of these pages then size is very important.

Backing Up Subversion Automatically

Subversion is great, but like any data repository, it must be backed up regularly. Many people have tried to implement version control without really understanding how it works, only to later discover that their backup strategy wasn’t working.

The svn backup script I use runs every night as a cron job. Each morning I get an email telling me whether everything went OK. Here is a list of what I want to happen with each backup:

  1. Dump all the data out of the repository
  2. Name the file with a date and time stamp in the filename. Something like YYYYMMDD-HHMM will work.
  3. gzip the file to save space
  4. Move a copy of the file to another server using scp

Seems pretty basic, but when I’m doing a backup by hand, I like to go a step further and verify the backup by creating a new repository, filling it with the backed up data and then checking it out. This lets me verify that my backup works and that I can get my code back if necessary. So for this verification stage I want to do the following:

  1. Pull the zipped file back down from the remote server
  2. Unzip it.
  3. Create a new repository
  4. Load all of my content into the new repository
  5. Check out a copy of trunk into a directory
  6. Clean up

The following Perl script accomplishes everything I need in an svn backup script. When it is run with cron, I get a short email every day telling me that it completed. The output is intentionally terse. If I get a long email I know something went wrong, but I don’t have to wade through a bunch of logging information if everything went as planned. If you want more output, take the -q off of the Subversion commands. The emails that cron sends me look like this if nothing went wrong:

Dumping Subversion repo /var/svn to my_backup-20050921-0100...
Backing up through revision 340...

Compressing dump file...

Created /home/admin/backups/my_backup-20050921-0100.gz

my_backup-20050921-0100.gz transfered to my.server.com

---------------------------------------
Testing Backup
---------------------------------------
Downloading my_backup-20050921-0100.gz from my.server.com
Unzipping my_backup-20050921-0100.gz
Creating test repository
Loading repository
Checking out repository
Cleaning up

If you want to use this on Windows, you’ll need to make a few changes. First, the way we generate the date and time stamp for the file name will need to be changed. You’ll probably want to use something other than scp and gzip as well.
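
One way to handle the date and time stamp without the Unix date command is Perl's core POSIX module. This is only a sketch of that one piece, not part of the script I run; the helper name backup_name is made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Build a name such as "my_backup-20050921-0100" without shelling out
# to `date`. backup_name is a hypothetical helper, not part of the
# original script. The optional second argument is an epoch time.
sub backup_name {
    my ($prefix, $epoch) = @_;
    $epoch = time unless defined $epoch;
    return $prefix . strftime("%Y%m%d-%H%M", localtime($epoch));
}

print backup_name("my_backup-"), "\n";
```

Because strftime returns the stamp without a trailing newline, no chomp is needed afterwards.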

Here is the script. I hope some people find it useful.

#!/usr/bin/perl
use strict;
use warnings;

# Settings: change these to match your environment.
my $svn_repo = "/var/svn";
my $bkup_dir = "/home/backup_user/backups";
my $bkup_file = "my_backup-";
my $tmp_dir = "/home/backup_user/tmp";
my $bkup_svr = "my.backup.com";
my $bkup_svr_login = "backup";

# Append a YYYYMMDD-HHMM stamp to the file name.
$bkup_file = $bkup_file . `date +%Y%m%d-%H%M`;
chomp $bkup_file;
# Latest revision in the repository, used only for the report.
my $youngest = `svnlook youngest $svn_repo`;
chomp $youngest;

# Dump the repository, compress the dump, and copy it to the backup server.
my $dump_command = "svnadmin -q dump $svn_repo > $bkup_dir/$bkup_file";
print "\nDumping Subversion repo $svn_repo to $bkup_file...\n";
print `$dump_command`;
print "Backing up through revision $youngest... \n";
print "\nCompressing dump file...\n";
print `gzip -9 $bkup_dir/$bkup_file`;
my $zipped_file = $bkup_dir . "/" . $bkup_file . ".gz";
print "\nCreated $zipped_file\n";
print `scp $zipped_file $bkup_svr_login\@$bkup_svr:/home/backup/`;
print "\n$bkup_file.gz transfered to $bkup_svr\n";

# Test the backup: fetch it, load it into a scratch repository, check it out.
print "\n---------------------------------------\n";
print "Testing Backup";
print "\n---------------------------------------\n";
print "Downloading $bkup_file.gz from $bkup_svr\n";
print `scp $bkup_svr_login\@$bkup_svr:/home/backup/$bkup_file.gz $tmp_dir/`;
print "Unzipping $bkup_file.gz\n";
print `gunzip $tmp_dir/$bkup_file.gz`;
print "Creating test repository\n";
print `svnadmin create $tmp_dir/test_repo`;
print "Loading repository\n";
print `svnadmin -q load $tmp_dir/test_repo < $tmp_dir/$bkup_file`;
print "Checking out repository\n";
print `svn -q co file://$tmp_dir/test_repo $tmp_dir/test_checkout`;
# Remove everything the test created.
print "Cleaning up\n";
print `rm -f $tmp_dir/$bkup_file`;
print `rm -rf $tmp_dir/test_checkout`;
print `rm -rf $tmp_dir/test_repo`;
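
The script above relies on the commands themselves complaining when something breaks, and it keeps going after a failure. If you would rather stop at the first failed step, a small wrapper around the backticks can check the exit status. This is a sketch that assumes a Unix shell; run_or_die is a name I made up for illustration and is not in the script above.

```perl
use strict;
use warnings;

# Run a shell command, return whatever it printed to stdout, and die
# with a short message if it could not be started, was killed by a
# signal, or exited with a non-zero status.
# run_or_die is a hypothetical helper name.
sub run_or_die {
    my ($command) = @_;
    my $output = `$command`;
    if ($? == -1) {
        die "Could not run '$command': $!\n";
    }
    if ($? & 127) {
        die sprintf("'%s' was killed by signal %d\n", $command, $? & 127);
    }
    if ($? >> 8) {
        die sprintf("'%s' failed with exit status %d\n", $command, $? >> 8);
    }
    return $output;
}

print run_or_die("echo backup step ok");
```

To use it, each print `...` line would become print run_or_die("..."). The die message then shows up in the cron email, and the script never reports a transfer or a successful test that did not actually happen.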

Eric Wilhelm has another Subversion backup method that is worth checking out as well. His method is based on dumping out a backup after every X commits instead of after a specific period of time. This has some advantages, particularly with large repositories that don’t change very often.
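
I haven't reproduced his script here, but the general idea of backing up by commit count can be sketched in a few lines. Everything below is my own illustration: the helper names, the threshold of 25, and the output path are assumptions. The one real thing it depends on is that svnadmin dump accepts a revision range (-r) and --incremental.

```perl
use strict;
use warnings;

# True when at least $every revisions have been committed since the
# last revision that was backed up. (Hypothetical helper.)
sub backup_due {
    my ($youngest, $last_backed_up, $every) = @_;
    return ($youngest - $last_backed_up) >= $every ? 1 : 0;
}

# Build a dump command that covers only the revisions committed since
# the last backup. (Hypothetical helper; it only builds the string.)
sub incremental_dump_command {
    my ($repo, $last_backed_up, $youngest, $out_file) = @_;
    my $first = $last_backed_up + 1;
    return "svnadmin dump $repo -r $first:$youngest --incremental -q > $out_file";
}

# Example values: the last backup ended at revision 300, the
# repository is now at 340, and we back up every 25 commits.
if (backup_due(340, 300, 25)) {
    print incremental_dump_command("/var/svn", 300, 340, "/tmp/r301-340.dump"), "\n";
}
```

In a real script, the youngest revision would come from svnlook youngest as in the script above, and the last backed-up revision would need to be stored somewhere between runs, such as a small text file.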

Storing your Maven Repository in CVS/Subversion

Brett Porter has hacked together a tool that will let you use a CVS or Subversion repository as your Maven repository.

Brett Porter – Storing your Maven Repository in CVS/Subversion
It’s pretty rough, but is a working prototype that makes Maven 1.1/2.0 downloads a checkout/update, and deploy is an add/commit. I see this would be useful for snapshot repositories, where you could use one filename instead of transforming the version, so getting the latest would literally be an svn update.

If you are using Subversion with Apache, it is pretty easy to achieve most of this. The problem I face is that Maven can’t handle repositories that use SSL and a login.

Currently, I’m using a separate server to host our Maven repository because the Subversion server is using SSL. I hope that Maven will eventually come up with a way to work around this, but right now it looks like most of their efforts are being spent on Maven 2.

Ignoring Build Problems

I ran across this blog post that is probably typical of many people who are managing software projects.

Musings of a Software Development Manager » Blog Archive » CruiseControl Warnings
I get about 48 email messages from Cruisecontrol each day for one of our projects. This is not something I’m proud of since this situation has existed for at least 4 weeks now, we’ve had a broken build. The problem stems from some nasty functional tests that no one wants to investigate and we’ve sort of let our process slip.

There is a simple solution to this. Turn off the tests that are failing. People’s first reaction to this is “Oh no, we can’t turn off the tests! They indicate that something is wrong. Eventually we’ll have time to fix it.”

If you are actually going to fix it, go ahead, but if something has been broken for more than a week, chances are no one is going to fix it any time soon. You should turn it off so the build runs without errors again.

Why is this better? If your team gets 10 emails each day saying that something is broken, they are going to ignore them. No one is really responsible for all of the problems, so no individual really works on fixing them. However, if the build is working correctly and someone checks in code that breaks a unit test and everyone gets an email, that person is probably going to try to fix it because it shows that he is responsible for the problem.

Think of it another way. Let’s say I have 3 smoke alarms, 1 gas alarm, 1 CO alarm, and a flooded basement alarm in my house, and they all sound pretty much the same. Now let’s say that the flooded basement alarm goes off and I decide that it isn’t important enough to fix the cause, so I just let the alarm go off. How likely do you think I am to notice if another alarm goes off once I get used to ignoring the first alarm?

If I’m not going to fix the problem, the best thing I can do is disable the flooded basement alarm until I have a chance to fix it. After a week of ignoring the alarm with nothing bad happening, I’m not suddenly going to notice it and decide I should do something about it.

One of the first things I did when I started at my current job was to go through and rename every test that failed our automatic build process as “pending”. By the time the tests would run, I had disabled about 2/3 of them. Since they were failing, we ignored them anyway, so marking them as pending didn’t change anything. Before they were turned off, it would have been impossible to notice if one of the tests that were previously working broke because of a change.

Over time we’ve turned most of the pending tests back on one at a time as we’ve had more time to fix the code or fix the test.

When your tests fail, it should be unusual. I set up our builds to break if any test fails. I’ve got a lava lamp above my cubicle and everyone in the company knows what it means. If something breaks, people start asking the developers about it until it gets fixed.