Part 3: Implementing parallel remote search
Initially I’d hoped to make use of Lucene’s support for parallel and remote (RMI) based search. With promising class names like RemoteSearchable and ParallelMultiSearcher things were looking well and, indeed, my first attempts at implementing remote search seemed to work well enough.
Search queries were sent over RPC and Hits objects (Lucene’s container for searchs results) were sent back. I expanded on this theme by using Lucene’s ParallelMultiSearcher class which uses several threads to query several Searchables in parallel. Pretty soon, however, I came across two problems when testing this setup:
- This setup is not very robust. When a search slave fails, it is pretty much impossible to get ParallelMultiSearcher to let you know which slave failed. This makes recovery difficult or at least inefficient.
- Hits objects use caching to improve performance. This means that one must maintain a connection to an open IndexReader if one wants to access the contents of a Hits object. This can be very wasteful over RPC and tends to break very easily especially in a system which has to reload indexes often.
In my solution I tried to address both these problems and in addition make SearchSlave easier to control and monitor.
Searching
Step 1: I defined a new interface for remote search, dubbed Controllable. This interface mimics Lucene’s Searchable interface but adds a few additional methods. Both Controllable and Searchable extend java’s Remote interface (the interface that allows remote access over RPC) but Controllable adds a few methods lacking in Searchable that make remote control of search slaves easier.
- Methods like ping() and status() allow for remote monitoring of slaves. These methods are usually accessed by the Servlet to verify the status of remote search slaves.
- Methods like close() and reload() allow for remote control of search slaves. These are used by the new class ControlSlave to shut down slaves or to have a slave reload its index.
- The rest of the methods are just copied over from Lucene’s Searchable, meant to be a minimal set of functions necessary for remote search.
Step 2: I created a modified version of ParallelMultiSearcher, called PRMSearcher (for Parallel, Remote, Multisearch) that is aware of the need to monitor remote search slaves and exposes the collection of remote searchers to its owner. This allows for monitoring individual slaves and recovery of an individual slave in case one should fail.
Step 3: I created the SimpleHit class and its corresponding collection SimpleHits. This is a version of Hits that does not employ caching. Yes, this probably means a hit on my performance as all hits must be read from the index but it also saves access over the network to get hit contents and makes the whole process less prone to failure. It also allows me to reload the IndexReader as often as I want without worrying about open Hits objects breaking.
Indexing
Making search parallel took some work on the indexing side as well. I opted to go with a partitioned design where the index is partitioned into several non-overlapping partitions. This allows me to run several search slaves in parallel on different machines and should, in theory at least, allow for close to linear scaling in size of index with constant performance. Another advantage of this solution is its relative simplicity. The next step up from that would improve robustness by having some overlap between partitions so that the entire index is still available if one search slave happens to go down. This solution, however, would require more complex handling of the incoming search results which is already a possible bottle-neck. For now, simple is good.
The IndexMaster in the initial design ran as part of the web application. Since the application is designed to run on several servers, some configuration control was needed to make sure that only one instance of the application would ever write to the index. This instance was dubbed the Index Master. Communication between the application and the Index Master is done by creating Search Jobs.
Search Jobs are simple database entries that let the Index Master know that new content is ready to be indexed. Later those same entries can be used as a log to track performance of the indexing process. The Index Master periodically checks for new search jobs which it then performs as a batch. Batch indexing can be a huge gain in performance on Lucene. Based on the afore-mentioned advice from Doug Cutting the Index Master performs a checkpoint on the index every so often, causing the index to be copied to a new directory from which the various search slaves can remote copy the relevant partition of the index.
Partitioning is done in a very simple manner. An IndexBalancer object is both a collection of indexes and a mechanism for deciding the index partition into which a specific piece of information should go. I started out with a random balancer which worked pretty well but soon switched to a more deterministic approach based on the modulus of a hash of the object’s ID. This makes accessing objects by ID more efficient, a necessary operation when trying to update or delete an object in the index.
One of the problems in this design is the multiplicity of asynchronous processes. By decoupling the indexing process from the main application, it becomes easier to control and recover from a failure but it is also much harder to debug as some processes are time dependent and hard to predict. I ended up creating a few method calls that allow direct access into the bowels of the indexing process just to make testing more efficient.
Next: Rethinking indexing.