[Gluster-devel] afr's ns-brick and posix-locks

Anand Avati avati at zresearch.com
Thu Dec 6 19:48:50 UTC 2007


Daniel,
 thanks for sharing your thoughts. our 1.4 roadmap has glusterfs allotting
inode numbers itself rather than using the posix fs inode numbers. the
design is in the early stages, but since the topic came up, i would like to
share the idea.

the aim of the design is (from the glusterfs internals point of view) -
1. access files/folders with the inode number as the handle rather than the
path.
2. a mechanism for high level translators (non-posix, essentially) to have a
say in which inode number is allotted to a new entry.
3. allow the namespace volume to be spread across nodes.

a better explained (though still incomplete) "document" is here

http://www.gluster.org/docs/index.php/GlusterFS_1.4_Discuss


files and folders will appear as they do today in the posix backend (meaning
files can still be accessed by path).
there will also be another way to access these files, based on inode
numbers. there will be a top level directory (say ".inodes/") and each file
will be hardlinked under that directory with its inode number as the filename.

so when you want to open a file, it can be accessed either as
/path/to/file or as /.inodes/<INODENUM>

the INODENUM of each file is allotted by glusterfs and written into an
extended attribute on the file. the posix inode number is not used directly
(it is only used by the posix xlator for sanity checks during self-heal,
like verifying that the file's xattr inode number and its /.inodes/NUM entry
are actually the same hardlink in the backend fs).
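
to make this concrete, here is a rough sketch (not actual glusterfs code;
the xattr key "trusted.glusterfs.inode" and the helper name are just made up
for illustration) of creating the .inodes/ hardlink and recording the
allotted number against the backend fs:

    /* rough sketch only; "trusted.glusterfs.inode" is an assumed xattr
     * key, not necessarily what the real posix xlator will use */
    #include <inttypes.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/xattr.h>

    int
    record_gfs_inode (const char *export, const char *path, uint64_t inum)
    {
            char file[4096], inolink[4096];

            snprintf (file, sizeof (file), "%s/%s", export, path);
            snprintf (inolink, sizeof (inolink), "%s/.inodes/%" PRIu64,
                      export, inum);

            /* .inodes/<INODENUM> is just another hardlink to the file */
            if (link (file, inolink) == -1)
                    return -1;

            /* remember the glusterfs-allotted number on the file itself */
            return setxattr (file, "trusted.glusterfs.inode",
                             &inum, sizeof (inum), 0);
    }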

now afr will create a file in the first subvolume, and that will return
an inode number. it will use this new inode number and ask the posix of the
2nd subvolume to create a file with the same inode number. since allotting
inode numbers is within the scope of glusterfs, it just has to set the right
hardlinks and write the xattrs.
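
again just as an illustration (the paths, the number and the xattr key are
all invented), the afr case is essentially the same sequence replayed with
one number on every subvolume:

    /* illustration only: one number, the same create replayed on both
     * subvolumes, so both backends end up with the same glusterfs inode */
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/xattr.h>

    int
    main (void)
    {
            const char *subvols[] = { "/export/brick1", "/export/brick2" };
            uint64_t    inum      = 4242;   /* allotted once by glusterfs */

            for (int i = 0; i < 2; i++) {
                    char file[4096], inolink[4096];
                    int  fd;

                    snprintf (file, sizeof (file), "%s/dir/newfile",
                              subvols[i]);
                    snprintf (inolink, sizeof (inolink),
                              "%s/.inodes/%" PRIu64, subvols[i], inum);

                    fd = open (file, O_CREAT | O_WRONLY, 0644);
                    if (fd != -1)
                            close (fd);

                    link (file, inolink);           /* .inodes/ hardlink */
                    setxattr (file, "trusted.glusterfs.inode",
                              &inum, sizeof (inum), 0);
            }
            return 0;
    }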

now unify can also merge inodes from its subvolumes with a static math
formula by allotting each subvolume its own inode space. (this is what amar
meant by distributing inodes, assuming the glusterfs inode framework is
in place.)
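
for instance (only one possible formula, nothing is decided yet), unify
could reserve the top few bits of the 64-bit number for the subvolume index:

    /* one possible (made-up) static formula: the top 8 bits pick the
     * subvolume, the rest is that subvolume's own inode space */
    #include <inttypes.h>
    #include <stdio.h>

    #define SUBVOL_BITS 8

    static uint64_t
    unify_global_inum (uint32_t subvol_idx, uint64_t local_inum)
    {
            return ((uint64_t) subvol_idx << (64 - SUBVOL_BITS)) |
                   (local_inum & ((1ULL << (64 - SUBVOL_BITS)) - 1));
    }

    int
    main (void)
    {
            /* inode 7 on subvolume 2 gets a globally unique number */
            printf ("%" PRIu64 "\n", unify_global_inum (2, 7));
            return 0;
    }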

there are a lot more details which i have not mentioned in the above
paragraphs, especially the nitty-gritty involved in self-heal (when two
subvolumes go out of sync, the algorithms to bring them back in sync in
AFR). we could discuss them offline if you are interested. there are
numerous other advantages to this new model that let the glusterfs
internals work more efficiently.

Another note is that 1.4 (this framework change) is not going to be brought
in immediately. the dev team has stopped new feature development for now and
the immediate goal is to have 1.3 rock solid. bug fixes are the highest
priority right now. this also gives us time to keep re-evaluating the 1.4
design.

I'm reading your design and still letting the idea sink into my head. in
the meantime, let us know your thoughts.

avati


2007/12/7, Daniel van Ham Colchete < daniel.colchete at gmail.com>:
>
> Well, the idea is ugly, but I think it works.
>
> The main problem with inode numbers is that we have no control at all
> (with POSIX's open() anyway) over what the inode number of a file will be.
> That is the problem. Amar proposed a distributed namespace cache algorithm a
> few months ago that, in my humble opinion, fails because of that. You cannot
> take the union of several 64-bit spaces and expect to fit all of them into a
> single 64-bit space. Meaning you cannot have a distributed namespace cache
> for the inode numbers using posix filesystems to store those files, because
> inode numbers can be anything between 1 and (2**64-1) on each filesystem and
> a union of two filesystems will have inode number collisions.
>
> The obvious solution is not to use posix filesystems to store the
> namespace cache. Just that. Everything below follows easily from that.
>
> Well, we'll store all the information the namespace brick needs in a
> database format made specifically for that purpose. We can use a modified
> version of ext3 or xfs or reiser, or make glusterfs's own. Think of it as a
> translator that has open(), close(), getxattr(), flock(), fcntl() and
> anything else necessary for each file's metadata, but always returns 0 on
> read() and write(). That database format will have a restricted inode number
> space (say 48 bits).
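>
> Just to sketch it (the field names and layout are only illustrative, not a
> real format), a record in that store would carry metadata and nothing else:
>
> /* illustrative only: one metadata record, no data blocks at all */
> #include <stdint.h>
> #include <sys/types.h>
>
> #define NS_INODE_BITS 48                /* restricted pre-inode space */
>
> struct ns_record {
>         uint64_t pre_inode;             /* only the low 48 bits used  */
>         mode_t   mode;                  /* file type + permissions    */
>         uid_t    uid;
>         gid_t    gid;
>         uint64_t size;                  /* reported size, no backing  */
>         char     name[256];             /* path component             */
>         /* xattrs, lock state, ... would hang off here */
> };
>
> /* pre-inode numbers come from a counter kept inside the 48-bit space */
> static uint64_t next_pre_inode = 1;
>
> uint64_t
> ns_allot_pre_inode (void)
> {
>         return next_pre_inode++ & ((1ULL << NS_INODE_BITS) - 1);
> }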
>
> To do the distributed magic we'll use the way IP addresses work. The
> first 16 bits of an inode number will be the namespace brick ID (maybe
> generated by the client each time glusterfs is mounted). The last 48
> bits will be the pre-inode number given out by the namespace brick. The
> namespace brick doesn't know what brick ID each client gave to it. Like
> unify, when open()ing a file, we look at all the namespace bricks to see
> which one has the file's metadata. It will return a pre-inode number
> with 48 bits. To get the file's external inode number, just shift the
> brick ID into the high bits and OR in the pre-inode number:
>
> INODE NUMBER = [BRICK ID - 16 BITS] [ PRE-INODE NUMBER - 48 BITS]
>
> The bit split should be evaluated further. Maybe 10 / 54 or 9 / 55...
>
> So, for any operation that takes an inode number, it is really fast
> to determine which namespace cache brick to send the message to.
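>
> Roughly (assuming the 16/48 split; all the names here are just for
> illustration), composing the number and finding the brick are one
> shift/mask each:
>
> /* sketch of the bit layout above: 16-bit brick id + 48-bit pre-inode */
> #include <inttypes.h>
> #include <stdio.h>
>
> #define PRE_INODE_BITS 48
> #define PRE_INODE_MASK ((1ULL << PRE_INODE_BITS) - 1)
>
> static uint64_t
> make_inode (uint16_t brick_id, uint64_t pre_inode)
> {
>         return ((uint64_t) brick_id << PRE_INODE_BITS) |
>                (pre_inode & PRE_INODE_MASK);
> }
>
> static uint16_t
> brick_of (uint64_t inode)
> {
>         return (uint16_t) (inode >> PRE_INODE_BITS);
> }
>
> int
> main (void)
> {
>         uint64_t ino = make_inode (3, 123456);
>         printf ("inode %" PRIu64 " lives on brick %u\n",
>                 ino, brick_of (ino));
>         return 0;
> }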
>
> I think we wouldn't be able to use AFR here. So, the redundancy would be
> implemented specifically for this kind of translator. First, we have to make
> sure that created files get the same inode number on all the replicated
> bricks. For healing there are two options:
> 1 - file by file self-heal, using the currently implemented algorithms
> 2 - database dump and restore when coming back online.
>
> In the end, the amount of data copied would be almost the same; the first
> takes more time to complete and leaves the system vulnerable to data loss
> for longer. The second blocks file creation during the healing phase and
> the brick would have to be aware of the replication scheme.
>
> So, the final features are:
>  - it is distributed and can scale to any performance need
>  - it doesn't limit the number of files gluster can handle, it can still
> handle 2**64 files (that's the Linux kernel's limitation).
>  - it allows replication and has no single point of failure.
>
> I think it is ugly, but do you think it could work?
>
> Best regards,
> Daniel
>
> On Dec 6, 2007 3:56 PM, Anand Avati < avati at zresearch.com> wrote:
>
> >
> >
> > >
> > > I've been a little bit out of GlusterFS lately but, what about the
> > > issue with inode numbers changing when the first server (in the AFR
> > > system) goes out, making fuse crazy? How are things going with the
> > > distributed namespace cache? I had an idea about this; it is ugly but
> > > it fixes the problem, if it hasn't been fixed already.
> >
> >
> > Currently we use inode-generation-based workarounds. I'm interested in
> > the idea :)
> >
> >
> > avati
> >
>
>


-- 
It always takes longer than you expect, even when you take into account
Hofstadter's Law.

-- Hofstadter's Law


