[Gluster-devel] Duplicate request cache (DRC)

Mon Aug 20 14:27:58 UTC 2012

The RHS 2.1 road-map includes the DRC (hereafter, the cache), which has the following requirements in Docspace:

NFSv3 Duplicate reply cache for non-idempotent ops,
cluster aware, persistent across reboots. (performance, correctness)

* For persistence across reboots, the DRC has to cache the replies in files. However,
  this will significantly degrade the overall performance of non-idempotent operations
  (write, rename, create, unlink, etc.).
  An in-memory cache eliminates the overhead of having to write each reply
  to persistent storage, but at the obvious cost of losing the DRC on crashes and reboots.
  AFAIK, the Linux kernel's implementation is currently in-memory only.
  As such, we need to measure the actual impact on performance and weigh it against
  the advantages of having a persistent cache.

* A cluster-aware DRC is one where, if a server (say, A) goes down, another server (say, B)
  takes over A's cache to serve requests on behalf of A. For this, both A
  and B need shared persistent storage for the DRC, along the lines of ctdb.
  One way of achieving shared persistent storage would be to simply use a gluster volume.

* Cache writes to disk or a gluster volume could be done in the usual two ways: write-back and write-through.
  a. write-back: this avoids the delay of waiting for synchronous writes to the cache,
     which would be significant given that we need to do one for every request, and each
     one goes to glusterfs anyway (1 network round-trip). It does leave a small window for
     failure: if the writing server goes down while a cache write is still in transit, that write is lost.
  b. write-through: this would essentially add at least one more network round-trip to
     every non-idempotent request. Implementing this, IMO, is not worth the performance loss
     incurred.
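
To make the difference between the two policies concrete, here is a minimal Python sketch (not GlusterFS code). `ReplyStore` is a hypothetical stand-in for the persistent backend (local disk or a gluster volume), and the class and method names are assumptions made for illustration only.

```python
import queue
import threading


class ReplyStore:
    """Hypothetical persistent backend; a dict stands in for disk/gluster."""

    def __init__(self):
        self.saved = {}

    def write(self, key, reply):
        self.saved[key] = reply


class WriteThroughCache:
    """put() returns only after the reply has reached the backend."""

    def __init__(self, store):
        self.store = store
        self.mem = {}

    def put(self, key, reply):
        self.store.write(key, reply)  # caller waits for this (the extra round-trip)
        self.mem[key] = reply


class WriteBackCache:
    """put() returns immediately; a background thread persists the reply later."""

    def __init__(self, store):
        self.store = store
        self.mem = {}
        self.pending = queue.Queue()
        worker = threading.Thread(target=self._flush_loop, daemon=True)
        worker.start()

    def put(self, key, reply):
        self.mem[key] = reply
        self.pending.put((key, reply))  # anything still queued is lost on a crash

    def _flush_loop(self):
        while True:
            key, reply = self.pending.get()
            self.store.write(key, reply)
            self.pending.task_done()

    def flush(self):
        """Block until every queued reply has been written to the backend."""
        self.pending.join()
```

In the write-back variant, whatever is still sitting in `pending` when the server dies is exactly the failure window described in (a).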

We could implement the DRC this way:
1. Have the DRC turned OFF by default.
2. Implement the DRC in three modes (five, counting the write policies):
    * in-memory,
    * local disk cache (cache local to the server) and
    * cluster-aware cache (using glusterfs),
   the last two of which could each be write-back or write-through.
3. We also need to empirically derive an optimal default value for the cache size for each mode.
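
The resulting configuration space can be written down as a small Python sketch. The option names and the `DrcConfig` type are hypothetical, chosen only to show how three backends turn into five modes and that the DRC is off unless enabled; the cache size is left out because its default still has to be derived empirically.

```python
from dataclasses import dataclass
from typing import Optional

BACKENDS = ("in-memory", "local-disk", "cluster-aware")
WRITE_POLICIES = ("write-back", "write-through")


@dataclass(frozen=True)
class DrcConfig:
    enabled: bool = False               # DRC is OFF by default
    backend: str = "in-memory"
    write_policy: Optional[str] = None  # only meaningful for persistent backends

    def __post_init__(self):
        if self.backend not in BACKENDS:
            raise ValueError("unknown backend: %r" % (self.backend,))
        if self.backend == "in-memory":
            if self.write_policy is not None:
                raise ValueError("in-memory mode takes no write policy")
        elif self.write_policy not in WRITE_POLICIES:
            raise ValueError("persistent backends need a write policy")


def all_modes():
    """Enumerate every valid (backend, write_policy) combination."""
    modes = [("in-memory", None)]
    for backend in BACKENDS[1:]:
        for policy in WRITE_POLICIES:
            modes.append((backend, policy))
    return modes
```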

Choice of data structures I have in mind:
1. For in-memory:
   Two tables/hash tables of pointers to cached replies, one sorted/hashed on the {XID, client hostname/IP} pair,
   the other on time (for LRU eviction of cached replies). Considering that
   n(cache look-ups) >> n(cache hits), we need the fastest look-ups possible. I need suggestions
   for faster data structures for look-ups.

2. For on-disk storage of cached replies, I was thinking of a per-client directory with each
   reply being stored in a separate file and the XID being the file name. This makes it easy for
   the fail-over server(s) to retrieve cached replies.
   One problem with this approach in a cluster-aware DRC is that if two clients from
   the same machine are connected to different servers, XIDs may collide.
   This can be avoided by appending the server IP/FQDN to the XID in the file name.
   Also, having to cache multiple replies in one single file would be cumbersome.
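
Both choices above can be illustrated with a short Python sketch, under a few assumptions. For the in-memory part, a single `OrderedDict` (hashed look-up plus recency order) is swapped in for the two separate tables. For the on-disk part, the file name is the XID in hex with the server address appended, and client and server names are assumed to be safe as path components. All class, function and parameter names are hypothetical.

```python
import os
from collections import OrderedDict


class InMemoryDRC:
    """Replies keyed on (xid, client); the least recently used entry is evicted first."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries  # placeholder; a real default must be measured
        self._entries = OrderedDict()

    def lookup(self, xid, client):
        key = (xid, client)
        reply = self._entries.get(key)
        if reply is not None:
            self._entries.move_to_end(key)  # mark as most recently used
        return reply

    def insert(self, xid, client, reply):
        key = (xid, client)
        self._entries[key] = reply
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used entry


def reply_path(root, client, server, xid):
    """<root>/<client>/<xid in hex>.<server>: one file per cached reply."""
    return os.path.join(root, client, "%08x.%s" % (xid, server))


def store_reply(root, client, server, xid, reply):
    path = reply_path(root, client, server, xid)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(reply)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # rename into place so readers never see a partial reply


def load_reply(root, client, server, xid):
    """Return the cached reply bytes, or None if nothing is cached."""
    try:
        with open(reply_path(root, client, server, xid), "rb") as f:
            return f.read()
    except FileNotFoundError:
        return None
```

With this layout, a fail-over server B would look up A's replies by passing A's address as `server`, so the same XID arriving via two different servers maps to two different files.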

We will start with the in-memory implementation and proceed to the further modes.
I look forward to suggestions for changes and improvements on the design.

Thanks & Regards, 
Rajesh Amaravathi, 
Software Engineer, GlusterFS 
Red Hat