[Gluster-devel] Minimal Design Doc for Name Space Cache

Sat Feb 17 15:45:21 UTC 2007

,----[ Krishna Srinivas writes: ]
| Minimal Design Doc for Name Space Cache:
| ----------------------------------------------------------------
| 
| Name Space Cache is a directory tree structure of the GlusterFS but
| with empty files (not empty, file will contain the IP address of the
| server where the file exists)
| This helps us in two issues:
| 1) In case a server goes down, there are chances that a duplicate
| file name is created (creation will be allowed as the original file
| is not seen as the server has gone down) But if we have NSC info, we
| can make sure that duplicate file is not created.
| 2) open() will be faster as the NSC will have the info of the server
| where the file exists.
`----
Let's do a distributed implementation. Call it
distributed-redundant-name-space-cache (DRNSC). Every group of bricks
(1,2 3,4 5,6 or 1,2,3 4,5,6 7,8, however the user configures it) will
cache its own group's name space alone. If any brick goes down, the
other members of its group still remember what files it held.
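
To make the grouping idea concrete, here is a rough sketch in Python
(not GlusterFS code). The Brick and Group classes, their fields and
the file names are all made up for illustration; the only point is
that every live member of a group records every name created in that
group, so a surviving member can still refuse a duplicate.

```python
# Hypothetical sketch: every name below is illustrative, not GlusterFS code.

class Brick:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.files = set()      # paths physically stored on this brick
        self.group_cache = {}   # path -> owning brick name, for the whole group


class Group:
    """A user-configured set of bricks that share one name space cache."""

    def __init__(self, bricks):
        self.bricks = bricks

    def record_create(self, path, owner):
        owner.files.add(path)
        # Every live member remembers the new name, not just the owner.
        for brick in self.bricks:
            if brick.alive:
                brick.group_cache[path] = owner.name

    def name_taken(self, path):
        # Any surviving member can answer for the whole group.
        return any(b.alive and path in b.group_cache for b in self.bricks)


b1, b2 = Brick("brick1"), Brick("brick2")
group = Group([b1, b2])
group.record_create("/docs/report.txt", owner=b1)

b1.alive = False   # brick1 goes down; brick2 still holds the group's names
taken = group.name_taken("/docs/report.txt")
free = not group.name_taken("/docs/other.txt")
print(taken, free)
```

With brick1 down, the create path can still ask the group whether
/docs/report.txt is taken before allowing a new file of that name.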

DRNSC can be entirely implemented as a translator. When the translator
is not loaded, the filesystem should disable new file creation in the
event of brick failures.

Since we call it a cache, I would hate to preserve it across restarts.
Even though a cache is not meta-data and losing it means no loss of
file system information (you can rebuild it), it is still scary to
preserve any state information, particularly state that is frequently
modified and growing. In the DRNSC model, the cache should be entirely
in memory. Every time you restart the file system, it should be
initialized in the background, in parallel, using the admin tool or a
daemon. Even if you only restart one brick, cache re-initialization
is required only for that brick's group, which is very fast.

It is also possible to initialize lazily. That is, the name space is
initialized for a dir only when it (or one of its files) is first
accessed. This is superfast, but the caveat is that when a brick goes
down, file creation is disabled for dirs that were never accessed
before. I am not in favor of this enhancement, because the background
initialization model is sufficient and completely fault tolerant.
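
A minimal sketch of the background initialization model, in Python and
under stated assumptions: the cache is a plain in-memory set, it is
rebuilt by walking each brick's export directory in a worker thread,
and file creation on a degraded group is refused until the rebuild has
finished. GroupCache, build_namespace and allow_create are
hypothetical names, not part of GlusterFS.

```python
import os
import tempfile
import threading


def build_namespace(export_root):
    """Walk one brick's export directory and return its relative paths."""
    names = set()
    for dirpath, dirnames, filenames in os.walk(export_root):
        for name in dirnames + filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), export_root)
            names.add("/" + rel.replace(os.sep, "/"))
    return names


class GroupCache:
    """In-memory name space cache for one group; never written to disk."""

    def __init__(self):
        self.names = set()
        self.ready = threading.Event()

    def rebuild_in_background(self, export_roots):
        def worker():
            fresh = set()
            for root in export_roots:
                fresh |= build_namespace(root)
            self.names = fresh
            self.ready.set()

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return thread

    def allow_create(self, path, group_degraded):
        # Without a complete cache we cannot rule out a duplicate name
        # while a brick of the group is down, so refuse the create.
        if group_degraded and not self.ready.is_set():
            return False
        return path not in self.names


cache = GroupCache()
before = cache.allow_create("/new.txt", group_degraded=True)

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a"))
    open(os.path.join(root, "a", "b.txt"), "w").close()
    cache.rebuild_in_background([root]).join()

after = cache.allow_create("/new.txt", group_degraded=True)
print(before, after, sorted(cache.names))
```

The lazy variant would call build_namespace per directory on first
access instead, which is why creation in never-visited dirs would have
to stay disabled after a brick failure.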

,----
| NSC has two components, server and client. NSC Server can be run on
| one of the server nodes (nscd --root /home/nsc --port 7000) (or
| should it be a part of glusterfsd?). NSC client module can be a part
| of glusterfs. client vol spec can contain the line
| "name-space-cache-server <IP> <port>" in unify volume.
`----
The server side has a dictionary (hash table) service, which the DRNSC
translator can use to hold its name space cache. Index it by the full
file path string, with another sub-dict as the value. As of now, there
is no need to store anything in the sub-dict; it is just for future
use.
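
As an illustration only, the dictionary service could look roughly
like this in Python. DictionaryService and its method names are
assumptions made for the sketch; the real service would live on the
server side and be reached by the translator over the wire.

```python
class DictionaryService:
    """In-memory hash table keyed by full path; each value is a sub-dict."""

    def __init__(self):
        self._table = {}

    def add(self, path):
        if path in self._table:
            return False           # duplicate name, refuse the create
        self._table[path] = {}     # empty for now, reserved for future use
        return True

    def lookup(self, path):
        return self._table.get(path)   # None when the name is unknown

    def remove(self, path):
        return self._table.pop(path, None) is not None

    def rename(self, old, new):
        if old not in self._table:
            return False
        # Like rename(2), an existing entry under the new name is replaced.
        self._table[new] = self._table.pop(old)
        return True


svc = DictionaryService()
first = svc.add("/home/user/notes.txt")
second = svc.add("/home/user/notes.txt")   # same name again
print(first, second, svc.lookup("/home/user/notes.txt"))
```

The second add is rejected, which is the duplicate-name check the
cache exists for; the empty sub-dict is returned on lookup.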

,----
| NSC client module will give the following functions to glusterfs
| (these functions will mostly be used by unify xlator)
| nsc_init(IP, port)
| nsc_fini()
| nsc_query(path) - returns the IP addr of the node where file exists.
| nsc_create(path, IP) called during creation of file
| nsc_unlink(path)
| nsc_rename(oldpath, newpath, newIP)
| 
| Unify create() calls nsc_create()
| Unify unlink() will call nsc_unlink()
| Unify rename() will call nsc_rename()
| 
| Unify init() or glusterfs init() will call nsc_init()
| Unify fini() or glusterfs fini() will call nsc_fini()
| 
| Unify open() will call nsc_query() to get the IP address of the node
| where the file exists. Then it will query all its child xlator to see
| which of them is associated with that IP address and call open on
| that xlator. (This can be implemented by introducing an mop
| function?)
| 
| Comments and suggestions please.
`----
DRNSC should issue a parallel dictionary service lookup query to one
member from every group. Member selection within a group should be on
a round-robin basis, skipping dead bricks.
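
The lookup fan-out described here might be sketched as follows, in
Python with a thread pool standing in for whatever parallel call
mechanism the translator would really use. Member, Group, pick_member
and parallel_lookup are hypothetical names for this example.

```python
from concurrent.futures import ThreadPoolExecutor


class Member:
    def __init__(self, name, names, alive=True):
        self.name = name
        self.names = set(names)   # this member's copy of the group cache
        self.alive = alive

    def lookup(self, path):
        return path in self.names


class Group:
    def __init__(self, members):
        self.members = members
        self._next = 0            # round-robin cursor

    def pick_member(self):
        """Round-robin over members, skipping dead ones; None if all dead."""
        count = len(self.members)
        for offset in range(count):
            idx = (self._next + offset) % count
            if self.members[idx].alive:
                self._next = (idx + 1) % count
                return self.members[idx]
        return None


def parallel_lookup(groups, path):
    """Query one live member per group in parallel; return the names that hit."""
    chosen = [m for m in (g.pick_member() for g in groups) if m is not None]
    if not chosen:
        return []
    with ThreadPoolExecutor(max_workers=len(chosen)) as pool:
        hits = list(pool.map(lambda m: m.lookup(path), chosen))
    return [m.name for m, hit in zip(chosen, hits) if hit]


g1 = Group([Member("brick1", {"/a"}), Member("brick2", {"/a"}, alive=False)])
g2 = Group([Member("brick3", {"/b"}), Member("brick4", {"/b"})])

first = parallel_lookup([g1, g2], "/b")
second = parallel_lookup([g1, g2], "/b")
third = parallel_lookup([g1, g2], "/a")
print(first, second, third)
```

In the second group the queries alternate between brick3 and brick4,
while the first group keeps answering from brick1 because brick2 is
marked dead.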

We do not have to bump the major version now, because adding DRNSC
will not break compatibility with older clients.

The mailing list crowd is now bigger than the IRC one. Thanks to
Krishna for switching the discussion to the list.

Happy Hacking,
-- 
Anand Babu 
GPG Key ID: 0x62E15A31
Blog [http://ab.freeshell.org]              
The GNU Operating System [http://www.gnu.org]