[Gluster-devel] GFID to Path Conversion

Aravinda avishwan at redhat.com
Thu Dec 10 11:58:32 UTC 2015

Some more analysis with respect to storage space:

"Since support was added to the Linux kernel, there is a hard limit of
64KiB for the size of each extended attribute value, however different
file systems impose additional constraints. For ext2/3/4 and btrfs,
each extended attribute is limited to a file system block (e.g. 4 KiB),
and all (including names and values) must fit together in a single
block. In XFS the names can be up to 256 bytes in length, terminated
by the first 0-byte, and the values can be up to 64KB of arbitrary
binary data. ReiserFS allows attributes of arbitrary size."

I created a shell script that sets 100 xattrs on a file, each with a
basename value of up to ~255 characters.

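# -------------------
# $file must point to the test file before running this loop.
for i in {1..100}; do
     # The file name is cut off in the archive; the original was ~255 characters.
     f="very very very very loooooooooooooooooong file ..."
     h=$(echo "$f" | md5sum | awk '{print $1}')
     setfattr -n "trusted.pgfid.3c3b44ab-f21f-4801-a0bc-5a337bd5047c.$h" -v "$f" "$file"
done
# -------------------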

Let me know if anybody thinks space could be an issue for storing this
information in xattrs.
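
A rough back-of-the-envelope check of the numbers involved, as a sketch
only. It assumes the key layout used in the script above
(trusted.pgfid.<PGFID>.<MD5 hex digest>), a worst-case basename of 255
bytes, and it ignores any per-entry overhead the file system adds. The
names xattr_bytes and BLOCK_SIZE are illustrative, not from any existing
code.

```python
# Rough size estimate for pgfid xattrs; ignores on-disk per-entry overhead.
PREFIX = "trusted.pgfid."
GFID_LEN = 36        # canonical GFID string, e.g. 3c3b44ab-f21f-4801-...
MD5_HEX_LEN = 32     # length of an MD5 hex digest
MAX_BASENAME = 255   # assumed worst-case basename length in bytes
BLOCK_SIZE = 4096    # example 4 KiB file system block (assumption)


def xattr_bytes(basename_len=MAX_BASENAME):
    """Bytes used by one xattr: key name plus basename value."""
    key_len = len(PREFIX) + GFID_LEN + 1 + MD5_HEX_LEN  # +1 for the dot
    return key_len + basename_len


per_link = xattr_bytes()
print("bytes per hardlink:", per_link)
print("bytes for 100 hardlinks:", 100 * per_link)
print("worst-case links per block:", BLOCK_SIZE // per_link)
```

Under these assumptions, 100 worst-case links need roughly 33 KiB, which
is well above a single 4 KiB block. Going by the limits quoted above,
that would be a concern mainly for file systems that require all xattrs
to fit in one block.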

Other experiments:
For the POC, I created two Python scripts: one to create the index and
the other to retrieve the value (GFID to path). I used MD5 as the hash
for the POC.


python pgfid_index.py <BRICK_PATH>           # Updates required xattrs for all files
python gfid_to_path.py <BRICK_PATH> <GFID>   # Returns path for given GFID

Note: This script uses `user.pgfid` prefix for xattr instead of
`trusted.pgfid` for POC.
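
For reference, the key handling can be sketched roughly as below. This
is a minimal illustration and not the actual POC scripts: the function
names are hypothetical, and it assumes the key layout
user.pgfid.<PGFID>.<MD5 of basename> with the basename stored as the
value.

```python
import hashlib

XATTR_PREFIX = "user.pgfid"  # POC prefix; the final code may use trusted.pgfid


def make_key(pgfid, basename):
    """Build the xattr key for one (parent GFID, basename) link."""
    digest = hashlib.md5(basename.encode("utf-8")).hexdigest()
    return "%s.%s.%s" % (XATTR_PREFIX, pgfid, digest)


def parse_key(key):
    """Return the parent GFID encoded in a key, or None for other xattrs."""
    prefix = XATTR_PREFIX + "."
    if not key.startswith(prefix):
        return None
    # GFIDs contain hyphens but no dots, so the last dot separates the hash.
    pgfid, _, _digest = key[len(prefix):].rpartition(".")
    return pgfid or None


def links_from_xattrs(xattrs):
    """Map {xattr key: value bytes} to sorted (parent GFID, basename) pairs."""
    links = []
    for key, value in xattrs.items():
        pgfid = parse_key(key)
        if pgfid is not None:
            links.append((pgfid, value.decode("utf-8")))
    return sorted(links)


# Example with the two hardlinks f1 and h1 from earlier in this thread.
example = {
    make_key("3c789e71-24b0-4723-92a2-7eb3c14b4114", "f1"): b"f1",
    make_key("6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed", "h1"): b"h1",
}
for pgfid, basename in links_from_xattrs(example):
    print(pgfid, basename)
```

On a brick, the dict could be filled with os.listxattr() and
os.getxattr() on the backend GFID file. Each parent GFID returned here
would still have to be converted to a directory path before joining it
with the basename.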

Once the design is finalized, I will update storage/posix code.

Backward compatibility:
The same interface will be used to retrieve the information, that is:

gluster volume set test build-pgfid on
getfattr -n glusterfs.ancestry.path -e text /mnt/testvol/.gfid/<GFID>


If any other component accesses the xattrs directly instead of using
the getfattr interface, then that component needs to be changed (for
example, glusterfind).

One more step will be introduced after `volume set` to build the
index. The current implementation heals pgfid xattrs on named lookup;
if we disable this feature, then we have to provide a separate interface
to heal (for example, getfattr -n pgfid.heal <PATH>).
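
As a rough sketch of what the index-building step could look like (not
the actual implementation): crawl the brick, skip the .glusterfs
directory, and compute one xattr per file name. plan_index and get_gfid
are hypothetical names; get_gfid stands for whatever returns the GFID
string of a directory, and the user.pgfid prefix and MD5 are simply the
POC choices.

```python
import hashlib
import os


def plan_index(brick_root, get_gfid, prefix="user.pgfid"):
    """Yield (file_path, xattr_key, xattr_value) for every file name on a brick.

    get_gfid is a caller-supplied function (hypothetical) that maps a
    directory path to its GFID string.
    """
    for dirpath, dirnames, filenames in os.walk(brick_root):
        if dirpath == brick_root and ".glusterfs" in dirnames:
            dirnames.remove(".glusterfs")  # do not descend into the handle tree
        pgfid = get_gfid(dirpath)
        for name in sorted(filenames):
            digest = hashlib.md5(name.encode("utf-8")).hexdigest()
            key = "%s.%s.%s" % (prefix, pgfid, digest)
            yield os.path.join(dirpath, name), key, name.encode("utf-8")
```

Each yielded tuple could then be applied with os.setxattr(path, key,
value). Keeping the crawl separate from the write also allows a dry run
that only reports what would be set.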


On 12/09/2015 11:17 AM, Aravinda wrote:
> Hi,
> Sharing a draft design for GFID to Path Conversion. (Directory GFID to
> Path is very easy in DHT v.1; this design may not work in case of DHT 2.0.)
> Performance and storage space impact are yet to be analyzed.
> Storing the required information
> -------------------------------
> Metadata information related to Parent GFID and Basename will reside
> with the file. PGFID and hash of Basename will become part of Xattr
> Key name and Basename will be saved as Value.
>     Xattr Key = meta.<PGFID>.<HASH(BASENAME)>
>     Xattr Value = <BASENAME>
> Non-crypto hash is suitable for this purpose.
> Number of Xattrs on a file = Number of Links
> Converting GFID to Path
> -----------------------
> Example GFID: 78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038
> 1. List all xattrs of GFID file in the brick backend.
> ($BRICK_ROOT/.glusterfs/78/e8/78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038)
> 2. If the Xattr Key starts with "meta", split it to get the parent GFID
> and collect the xattr value
> 3. Convert Parent GFID to path using recursive readlink till path.
> 4. Join Converted parent dir path and xattr value(basename)
> Recording
> ---------
> RENAME: Remove old xattr(PGFID1, BN1), Add new xattr(PGFID2, BN2)
> UNLINK: If Link count > 1 then Remove xattr(PGFID, BN)
> Heal on Lookup
> --------------
> Healing on lookup can be enabled if required, by default we can
> disable this option since this may have performance implications
> during read.
> Enabling the logging
> ---------------------
> This can be enabled using Volume set option. Option name TBD.
> Rebuild Index
> -------------
> Offline activity, crawls the backend filesystem and builds all the 
> required xattrs.
> Comments and Suggestions Welcome.
> regards
> Aravinda
> On 11/25/2015 10:08 AM, Aravinda wrote:
>> regards
>> Aravinda
>> On 11/24/2015 11:25 PM, Shyam wrote:
>>> There seem to be other interested consumers in gluster for the same
>>> information, and I guess we need a good base design to address this
>>> on disk change, so that it can be leveraged in the various use cases
>>> appropriately.
>>> Request a few folks to list out how they would use this feature and 
>>> also what performance characteristics they expect around the same.
>>> - gluster find class of utilities
>>> - change log processors
>>> - swift on file
>>> - inotify support on gluster
>>> - Others?
>> Debugging utilities for users/admins(Show path for GFIDs displayed in 
>> log files)
>> Retrigger Sync in Geo-replication(Geo-rep reports failed GFIDs in 
>> logs, we can retrigger sync if path is known instead of GFID)
>>> [3] is an attempt in XFS to do the same; possibly there is a later
>>> thread around the same that discusses later approaches.
>>> [4] slide 13 onwards talks about how cephfs does this. (see cephfs 
>>> inode backtraces)
>>> Aravinda, could you put up a design for the same, and how and where
>>> this information is added etc. It would help to review it from the
>>> perspective of other xlators (like existing DHT).
>>> Shyam
>>> [3] http://oss.sgi.com/archives/xfs/2014-01/msg00224.html
>>> [4] 
>>> http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf
>>> On 10/27/2015 10:02 AM, Shyam wrote:
>>>> Aravinda, List,
>>>> The topic is interesting and also relevant in the case of DHT2, where
>>>> we lose the hierarchy on a single brick (unlike the older DHT), and so
>>>> some of the thoughts here are along the same lines as what we are
>>>> debating w.r.t. DHT2 as well.
>>>> Here is another option that extends the current thought, that I would
>>>> like to put forward, that is pretty much inspired from the Linux kernel
>>>> NFS implementation (based on my current understanding of the same)
>>>> [1] [2].
>>>> If gluster server/brick processes handed out handles (which are
>>>> currently just the GFID (or inode #) of the file) that encode
>>>> pGFID/GFID, then on any handle based operation, we get the pGFID/GFID
>>>> for the object being operated on. This solves the first part of the
>>>> problem where we are encoding the pGFID in the xattr, and here we not
>>>> only do that but further hand out the handle with that relationship.
>>>> It also helps when an object is renamed and we still allow the older
>>>> handle to be used for operations. Not a bad thing in some cases, and
>>>> possibly not the best thing to do in some other cases (say access).
>>>> To further this knowledge back to a name, what you propose can be
>>>> stored on the object itself. Thus giving us a short dentry tree
>>>> creation ability of pGFID->name(GFID).
>>>> This of course changes the gluster RPC wire protocol, as we need to
>>>> encode/send pGFID as well in some cases (or this could be done by
>>>> adding it to the xdata payload).
>>>> Shyam
>>>> [1] http://nfs.sourceforge.net/#faq_c7
>>>> [2] https://www.kernel.org/doc/Documentation/filesystems/nfs/Exporting
>>>> On 10/27/2015 03:07 AM, Aravinda wrote:
>>>>> Hi,
>>>>> We have a volume option called "build-pgfid:on" to enable recording
>>>>> the parent GFID in the file xattr. This simplifies the GFID to Path
>>>>> conversion.
>>>>> Is it possible to save the base name also in the xattr along with the
>>>>> PGFID? It helps in converting GFID to Path easily without doing a crawl.
>>>>> Example structure,
>>>>> dir1 (3c789e71-24b0-4723-92a2-7eb3c14b4114)
>>>>>      - f1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>      - f2 (f1e7ad00-6500-4284-b21c-d02766ecc336)
>>>>> dir2 (6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed)
>>>>>      - h1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>> Where file f1 and h1 are hardlinks. Note the same GFID.
>>>>> Backend,
>>>>> .glusterfs
>>>>>       - 3c/78/3c789e71-24b0-4723-92a2-7eb3c14b4114
>>>>>       - 0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
>>>>>       - f1/e7/f1e7ad00-6500-4284-b21c-d02766ecc336
>>>>>       - 6c/3b/6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed
>>>>> Since f1 and h1 are hardlinks across directories, the file xattr will
>>>>> have two parent GFIDs. Xattr dump will be,
>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=1
>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=1
>>>>> Number shows number of hardlinks per parent GFID.
>>>>> If we know GFID of a file, to get path,
>>>>> 1. Identify which brick has that file using pathinfo xattr.
>>>>> 2. Get all parent GFIDs(using listxattr on backend gfid path
>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>> 3. Crawl those directories to find files with same inode as
>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
>>>>> Updating PGFID to be done when,
>>>>> 1. CREATE/MKNOD - Add xattr
>>>>> 2. RENAME - If moved to different directory, Update PGFID
>>>>> 3. UNLINK - If number of links is more than 1, reduce the link count;
>>>>> remove the respective parent PGFID
>>>>> 4. LINK - Add PGFID if link to different directory, Increment count
>>>>> Advantages:
>>>>> 1. Crawling is limited to a few directories instead of a full file
>>>>> system crawl.
>>>>> 2. Break early during the crawl when the search reaches the number of
>>>>> hardlinks given by the Xattr value.
>>>>> Disadvantages:
>>>>> 1. Crawling is expensive if a directory has a lot of files.
>>>>> 3. This method of conversion will not work if the file is deleted.
>>>>> We can improve performance of GFID to Path conversion if we record
>>>>> Basename also in file xattr.
>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=f1
>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=h1
>>>>> Note: Multiple base names delimited by zerobyte.
>>>>> What is the additional overhead compared to storing only PGFID?
>>>>> 1. Space
>>>>> 2. Number of xattrs will grow as number of hardlinks
>>>>> 3. Max size issue for xattr value?
>>>>> 4. Xattr needs updating even when renamed within the same directory.
>>>>> 5. Updating value of xattr involves parsing in case of multiple
>>>>> hardlinks.
>>>>> Are there any performance issues except during initial indexing?
>>>>> (Assume pgfid and basenames are populated by a separate script.)
>>>>> Comments and Suggestions Welcome.
>>>> _______________________________________________
>>>> Gluster-devel mailing list
>>>> Gluster-devel at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
