I was recently working on a simple application where users enter famous quotations. We want to avoid duplicates, so I needed a way to check for substantially similar quotations before a new quote was added to the database.
The idea was to show the five most similar quotes before letting the user save the new quotation to the database. I used Lucene for this, which let me punt on the harder problem of deciding whether two quotes are similar. I left that up to Lucene and only had to worry about getting my data into and out of Lucene in a usable form.
Below is the interesting method, which uses Lucene to build an in-memory index of all the quotes in the system and then returns the five quotes most similar to the new quote text. Creating a new index each time a quote is added isn’t particularly efficient, but it makes the technique easier to demonstrate, and processor efficiency isn’t much of an issue for this particular task.
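//Imports this method relies on (Lucene 3.1; MoreLikeThis lives in the
//contrib "queries" jar). quote, session and logger are fields of the
//enclosing class.
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public List<Quote> getSimilarQuotes() throws CorruptIndexException, IOException {
    String quoteText = quote.getText();
    logger.info("creating RAMDirectory");
    RAMDirectory idx = new RAMDirectory();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31));
    IndexWriter writer = new IndexWriter(idx, indexWriterConfig);
    List<Quote> quotes = session.createCriteria(Quote.class).list();
    //Create a Lucene document for each quote and add it to the
    //RAMDirectory index. We include the db id so we can retrieve the
    //similar quotes before returning them to the client.
    for (Quote existingQuote : quotes) {
        Document doc = new Document();
        doc.add(new Field("contents", existingQuote.getText(), Field.Store.YES, Field.Index.ANALYZED));
        //The id is an identifier, not text, so it is not run through the analyzer
        doc.add(new Field("id", existingQuote.getId().toString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
    }
    //We are done writing documents to the index at this point
    writer.close();
    //Open the index
    IndexReader ir = IndexReader.open(idx);
    logger.info("ir has " + ir.numDocs() + " docs in it");
    //Search through the reader we just opened instead of opening a second one
    IndexSearcher is = new IndexSearcher(ir);
    MoreLikeThis mlt = new MoreLikeThis(ir);
    //Compare on the quote text only, not on the id field
    mlt.setFieldNames(new String[] {"contents"});
    //Lower some settings so MoreLikeThis will work with very short
    //quotations
    mlt.setMinTermFreq(1);
    mlt.setMinDocFreq(1);
    //We need a Reader to create the Query so we'll create one
    //using the string quoteText.
    Reader reader = new StringReader(quoteText);
    //Create the query that we can then use to search the index
    Query query = mlt.like(reader);
    //Search the index using the query and get the top 5 results
    TopDocs topDocs = is.search(query, 5);
    logger.info("found " + topDocs.totalHits + " topDocs");
    //Create a list to hold the quotes we are going to
    //pass back to the client
    List<Quote> foundQuotes = new ArrayList<Quote>();
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        //This retrieves the actual Document from the index using
        //the document number (scoreDoc.doc is an int that is the
        //doc's internal Lucene id, not our database id).
        Document doc = is.doc(scoreDoc.doc);
        //Get the database id that we previously stored in the document
        //and parse it back to a long.
        String idField = doc.get("id");
        long id = Long.parseLong(idField);
        //Retrieve the quote from Hibernate so we can pass
        //back a list of actual Quote objects.
        Quote thisQuote = (Quote) session.get(Quote.class, id);
        //Add the quote to the list we'll pass back to the client
        foundQuotes.add(thisQuote);
    }
    //Release the searcher and the reader now that we have our results
    is.close();
    ir.close();
    return foundQuotes;
}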
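To give a feel for the kind of work Lucene is doing for me here, below is a minimal sketch in plain Java that ranks quotes by how many words they share with the new quote (Jaccard overlap). This is not what Lucene does internally: MoreLikeThis picks out the more distinctive terms and weighs them by how rare they are in the index, while this toy version treats every word equally. The class name QuoteSimilarity and its methods are made up for illustration and are not part of my application or of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy word-overlap ranking. Hypothetical helper, not Lucene's scoring. */
final class QuoteSimilarity {

    /** Lowercases the text and returns its distinct word tokens. */
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Jaccard similarity: shared words divided by all distinct words. */
    static double similarity(String a, String b) {
        Set<String> termsA = terms(a);
        Set<String> termsB = terms(b);
        Set<String> union = new HashSet<>(termsA);
        union.addAll(termsB);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> shared = new HashSet<>(termsA);
        shared.retainAll(termsB);
        return (double) shared.size() / union.size();
    }

    /** Returns up to limit existing quotes that share at least one word, best match first. */
    static List<String> topSimilar(String candidate, List<String> existing, int limit) {
        List<String> matches = new ArrayList<>();
        for (String quote : existing) {
            if (similarity(candidate, quote) > 0.0) {
                matches.add(quote);
            }
        }
        // Sort by descending similarity to the candidate quote
        matches.sort(Comparator.comparingDouble(
                (String quote) -> similarity(candidate, quote)).reversed());
        return new ArrayList<>(matches.subList(0, Math.min(limit, matches.size())));
    }
}
```

Calling QuoteSimilarity.topSimilar(newQuoteText, existingQuoteTexts, 5) plays the same role as the MoreLikeThis query plus is.search(query, 5) above: a near-duplicate that shares most of its words ranks ahead of a quote that merely shares a common word like "to", and quotes with no words in common are dropped. Equal weighting of every word is exactly the weakness that makes handing the job to Lucene attractive.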