[Gluster-devel] GFID to Path Conversion

Venky Shankar vshankar at redhat.com
Wed Jan 6 01:24:37 UTC 2016



Shyam wrote:
> On 12/09/2015 12:47 AM, Aravinda wrote:
>> Hi,
>>
>> Sharing draft design for GFID to Path Conversion. (Directory GFID to
>> Path is very easy in DHT v.1; this design may not work in case of
>> DHT 2.0.)
>
> (current thought) DHT2 would extend the manner in which name,pGFID is
> stored for files to directories as well. So reverse path walking would
> leverage the same mechanism as explained below.
>
> Of course, as this would involve MDS hopping, the intention would be to
> *not* use this in IO critical paths, and rather use this in the tool set
> that needs reverse path walks to provide information to admins.
>
>>
>> Performance and Storage space impact yet to be analyzed.
>>
>> Storing the required information
>> -------------------------------
>> Metadata information related to Parent GFID and Basename will reside
>> with the file. PGFID and hash of Basename will become part of Xattr
>> Key name and Basename will be saved as Value.
>>
>> Xattr Key = meta.<PGFID>.<HASH(BASENAME)>
>> Xattr Value = <BASENAME>
>
> I would think we should keep the xattr name constant, and specialize the
> value, instead of encoding data in the xattr name itself. The issue is
> of course multiple xattr name:value pairs where name is constant is not
> feasible and needs some thought.

With DHT2, the "multi-value key" could possibly be stored efficiently in
some kvdb rather than in xattrs (when it does move there). With current
DHT, we're still stuck with using xattrs, where having a compounded
value would be rather inefficient.
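
To make the one-xattr-per-link layout concrete, here is a rough Python
sketch of how such a key could be built and parsed. The "meta" prefix and
the <PGFID>.<HASH(BASENAME)> layout come from the draft; CRC32 as the
non-crypto hash, the hex formatting and the function names are only my
assumptions for illustration, and a plain dict stands in for the xattrs
on the GFID file.

```python
import zlib
from typing import Optional

PREFIX = "meta"  # prefix used in the draft; the final xattr namespace is TBD


def link_xattr_key(pgfid: str, basename: str) -> str:
    """Build 'meta.<PGFID>.<HASH(BASENAME)>' for one hard link."""
    # CRC32 is only a stand-in for "some non-crypto hash" (assumption).
    digest = zlib.crc32(basename.encode("utf-8"))
    return f"{PREFIX}.{pgfid}.{digest:08x}"


def parent_gfid_from_key(key: str) -> Optional[str]:
    """Return the parent GFID encoded in a key, or None for other xattrs."""
    fields = key.split(".")
    if len(fields) != 3 or fields[0] != PREFIX:
        return None
    return fields[1]


# A dict standing in for the xattrs of one GFID file with two hard links.
xattrs = {}
for pgfid, basename in [
    ("3c789e71-24b0-4723-92a2-7eb3c14b4114", "f1"),
    ("6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed", "h1"),
]:
    xattrs[link_xattr_key(pgfid, basename)] = basename.encode("utf-8")
```

With this layout, adding or dropping a link touches exactly one key, which
is the part a compounded value under a constant name would make awkward.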

>
>>
>> Non-crypto hash is suitable for this purpose.
>> Number of Xattrs on a file = Number of Links
>>
>> Converting GFID to Path
>> -----------------------
>> Example GFID: 78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038
>
> Here is where we get into a bit of a problem, if a file has links. Which
> path to follow would be a dilemma. We could return all paths, but tools
> like glusterfind or backup related, would prefer a single file. One of
> the thoughts is, if we could feed a pGFID:GFID pair as input, this still
> does not solve a file having links within the same pGFID.

Why not just list all possible paths? I think that might be the correct
thing to do. In most cases the file has a single link, so there is
exactly one path to return. For the other cases (nlink > 1), the
higher-level code would need to do some sort of juggling: utilities such
as glusterfind (or other backup tools) could perform additional checks
before doing their job.

>
> Anyway, something to note or consider.
>
>>
>> 1. List all xattrs of GFID file in the brick backend.
>> ($BRICK_ROOT/.glusterfs/78/e8/78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038)
>> 2. If Xattr Key starts with “meta”, Split to get parent GFID and collect
>> xattr value
>> 3. Convert Parent GFID to path using recursive readlink till path.
>
> This is the part which should/would change with DHT2 in my opinion. Sort
> of repeating step (2) here instead of a readlink.
>
>> 4. Join Converted parent dir path and xattr value(basename)
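
The four steps above, returning every path for a GFID, could look roughly
like the Python sketch below. It is a model only: dicts stand in for the
brick backend (file_xattrs for the listxattr output, dir_links for what
the recursive readlink would yield), the hash suffixes in the keys are
placeholders, and all function names are made up for illustration.

```python
ROOT_GFID = "00000000-0000-0000-0000-000000000001"  # GFID of the volume root


def dir_gfid_to_path(dir_gfid, dir_links):
    """Step 3: walk parent pointers (stand-in for the recursive readlink)."""
    names = []
    while dir_gfid != ROOT_GFID:
        dir_gfid, name = dir_links[dir_gfid]
        names.append(name)
    return "/" + "/".join(reversed(names))


def gfid_to_paths(gfid, file_xattrs, dir_links):
    """Return all paths of a file GFID, one per 'meta' xattr (one per link)."""
    paths = []
    for key, value in file_xattrs[gfid].items():       # step 1: list xattrs
        fields = key.split(".")
        if len(fields) != 3 or fields[0] != "meta":    # step 2: filter/split
            continue
        parent = dir_gfid_to_path(fields[1], dir_links)
        # step 4: join parent directory path and basename
        paths.append(parent.rstrip("/") + "/" + value.decode("utf-8"))
    return sorted(paths)


# Model of the f1/h1 hardlink example from earlier in the thread.
DIR1 = "3c789e71-24b0-4723-92a2-7eb3c14b4114"
DIR2 = "6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed"
FILE = "0aa94a0a-62aa-4afc-9d59-eb68ad39f78c"
dir_links = {DIR1: (ROOT_GFID, "dir1"), DIR2: (ROOT_GFID, "dir2")}
file_xattrs = {
    FILE: {
        "trusted.gfid": b"(raw gfid bytes)",   # unrelated xattr, skipped
        f"meta.{DIR1}.00000001": b"f1",        # placeholder hash suffix
        f"meta.{DIR2}.00000002": b"h1",        # placeholder hash suffix
    }
}
print(gfid_to_paths(FILE, file_xattrs, dir_links))
```

For the modelled file this prints ['/dir1/f1', '/dir2/h1']; a caller that
wants a single path would have to pick one of them itself.
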
>>
>> Recording
>> ---------
>> MKNOD/CREATE/LINK/SYMLINK: Add new Xattr(PGFID, BN)
>
> Most of these operations as they exist today are not atomic, i.e. we
> create the file and then add the xattrs and then possibly hardlink the
> GFID, so by the time the GFID makes its presence, the file is all ready
> and (maybe) hence consistent.
>
> The other way to look at this is that we get the GFID representation
> ready, and then hard link the name into the name tree. Alternately we
> could leverage O_TMPFILE to create the file encode all its inode
> information and then bring it to life in the namespace. This is
> orthogonal to this design, but brings in the need to be consistent on
> failures.

IIRC, last time I checked, using O_TMPFILE was not portable, but this
can still be used wherever it's available.
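
A rough Python sketch of "use it where available, fall back otherwise":
the file is filled in while it has no name and is linked into the
namespace only at the end. The function names are hypothetical, the xattr
step is left as a comment, and the fallback (a named temp file that is
hard linked and then unlinked) is just one possible choice, not something
the design prescribes. Linking through /proc/self/fd is the approach
described in open(2) and assumes /proc is mounted.

```python
import os
import tempfile


def _publish_with_tmpfile(directory, name, data):
    """Linux-only path: unnamed file via O_TMPFILE, named via linkat()."""
    dirfd = os.open(directory, os.O_RDONLY)
    try:
        fd = os.open(directory, os.O_TMPFILE | os.O_RDWR, 0o644)
        try:
            os.write(fd, data)
            # The inode xattrs (e.g. PGFID/basename) would be set on fd here.
            # Passing the dir fds forces linkat() with AT_SYMLINK_FOLLOW.
            os.link(f"/proc/self/fd/{fd}", name,
                    src_dir_fd=dirfd, dst_dir_fd=dirfd, follow_symlinks=True)
        finally:
            os.close(fd)  # if linking failed, the unnamed inode just vanishes
    finally:
        os.close(dirfd)
    return "O_TMPFILE"


def create_then_publish(directory, name, data):
    """Create a fully prepared file, then make it visible under `name`."""
    if hasattr(os, "O_TMPFILE"):
        try:
            return _publish_with_tmpfile(directory, name, data)
        except (OSError, NotImplementedError):
            pass  # kernel, filesystem or /proc lacks support: fall back
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        os.write(fd, data)
        os.link(tmp_path, os.path.join(directory, name))
    finally:
        os.close(fd)
        os.unlink(tmp_path)  # the temporary name is never left behind
    return "fallback"
```

Either way a crash before the link step leaves no half-initialised name
in the directory, though the fallback can leave a stray temp file behind.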

>
> Either way, if a failure occurs midway, we have no way to recover the
> information for the inode and set it right. Thoughts?
>
>> RENAME: Remove old xattr(PGFID1, BN1), Add new xattr(PGFID2, BN2)
>> UNLINK: If Link count > 1 then Remove xattr(PGFID, BN)
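
The recording rules above can be modelled in a few lines of Python. This
is only a sketch: a dict stands in for the xattrs on the GFID file, CRC32
is an assumed stand-in for the non-crypto hash, and the function names
are invented for the example. Note that the rename is two separate
updates here, so it has the same non-atomicity concern raised above.

```python
import zlib


def key(pgfid, basename):
    """'meta.<PGFID>.<HASH(BASENAME)>' with CRC32 as an assumed hash."""
    return f"meta.{pgfid}.{zlib.crc32(basename.encode('utf-8')):08x}"


def record_link(xattrs, pgfid, basename):
    """MKNOD/CREATE/LINK/SYMLINK: add a new xattr for (PGFID, BN)."""
    xattrs[key(pgfid, basename)] = basename.encode("utf-8")


def record_rename(xattrs, old_pgfid, old_bn, new_pgfid, new_bn):
    """RENAME: remove the old (PGFID1, BN1) xattr, add (PGFID2, BN2)."""
    xattrs.pop(key(old_pgfid, old_bn), None)
    xattrs[key(new_pgfid, new_bn)] = new_bn.encode("utf-8")


def record_unlink(xattrs, pgfid, basename, nlink):
    """UNLINK: only needed while other links remain (link count > 1)."""
    if nlink > 1:
        xattrs.pop(key(pgfid, basename), None)


DIR1 = "3c789e71-24b0-4723-92a2-7eb3c14b4114"
DIR2 = "6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed"

xattrs = {}
record_link(xattrs, DIR1, "f1")                 # CREATE dir1/f1
record_link(xattrs, DIR2, "h1")                 # LINK   dir2/h1
links_after_link = len(xattrs)                  # one xattr per link
record_rename(xattrs, DIR2, "h1", DIR2, "h2")   # RENAME dir2/h1 -> dir2/h2
record_unlink(xattrs, DIR1, "f1", nlink=2)      # UNLINK dir1/f1
```

After these operations the dict holds a single entry whose value is "h2",
matching the one remaining link.
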
>>
>> Heal on Lookup
>> --------------
>> Healing on lookup can be enabled if required, by default we can
>> disable this option since this may have performance implications
>> during read.
>>
>> Enabling the logging
>> ---------------------
>> This can be enabled using Volume set option. Option name TBD.
>>
>> Rebuild Index
>> -------------
>> Offline activity, crawls the backend filesystem and builds all the
>> required xattrs.
>
> Frequency of the rebuild? I would assume this would be run when the
> option is enabled, and later almost never, unless we want to recover
> from some inconsistency in the data (how to detect the same would be an
> open question).
>
> Also I think once this option is enabled, we should prevent disabling
> the same (or at least till the packages are downgraded), as this would
> be a hinge that multiple other features may depend on, and so we
> consider this an on-disk change that is made once, and later maintained
> for the volume, rather than turn on/off.
>
> Which means the initial index rebuild would be a volume version
> conversion from current to this representation and may need additional
> thoughts on how we maintain volume versions.
>
>>
>> Comments and Suggestions Welcome.
>>
>> regards
>> Aravinda
>>
>> On 11/25/2015 10:08 AM, Aravinda wrote:
>>>
>>> regards
>>> Aravinda
>>>
>>> On 11/24/2015 11:25 PM, Shyam wrote:
>>>> There seem to be other interested consumers in gluster for the same
>>>> information, and I guess we need a good base design to address this
>>>> on-disk change, so that it can be leveraged in the various use cases
>>>> appropriately.
>>>>
>>>> Request a few folks to list out how they would use this feature and
>>>> also what performance characteristics they expect around the same.
>>>>
>>>> - gluster find class of utilities
>>>> - change log processors
>>>> - swift on file
>>>> - inotify support on gluster
>>>> - Others?
>>> Debugging utilities for users/admins(Show path for GFIDs displayed in
>>> log files)
>>> Retrigger Sync in Geo-replication(Geo-rep reports failed GFIDs in
>>> logs, we can retrigger sync if path is known instead of GFID)
>>>>
>>>> [3] is an attempt in XFS to do the same; possibly there is a more
>>>> recent thread around the same that discusses later approaches.
>>>>
>>>> [4] slide 13 onwards talks about how cephfs does this. (see cephfs
>>>> inode backtraces)
>>>>
>>>> Aravinda, could you put up a design for the same, and how and where
>>>> this information is added etc. Would help review it from other
>>>> xlators' perspective (like existing DHT).
>>>>
>>>> Shyam
>>>> [3] http://oss.sgi.com/archives/xfs/2014-01/msg00224.html
>>>> [4]
>>>> http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf
>>>>
>>>>
>>>> On 10/27/2015 10:02 AM, Shyam wrote:
>>>>> Aravinda, List,
>>>>>
>>>>> The topic is interesting and also relevant in the case of DHT2,
>>>>> where we lose the hierarchy on a single brick (unlike the older
>>>>> DHT), and so some of the thoughts here are along the same lines as
>>>>> what we are debating w.r.t. DHT2 as well.
>>>>>
>>>>> Here is another option that extends the current thought, that I would
>>>>> like to put forward, that is pretty much inspired from the Linux
>>>>> kernel
>>>>> NFS implementation (based on my current understanding of the same)
>>>>> [1] [2].
>>>>>
>>>>> If gluster server/brick processes handed out handles, (which are
>>>>> currently just GFID (or inode #) of the file), that encode pGFID/GFID,
>>>>> then on any handle based operation, we get the pGFID/GFID for the
>>>>> object
>>>>> being operated on. This solves the first part of the problem where we
>>>>> are encoding the pGFID in the xattr, and here we not only do that but
>>>>> further hand out the handle with that relationship.
>>>>>
>>>>> It also helps when an object is renamed and we still allow the older
>>>>> handle to be used for operations. Not a bad thing in some cases, and
>>>>> possibly not the best thing to do in some other cases (say access).
>>>>>
>>>>> To further this knowledge back to a name, what you propose can be
>>>>> stored
>>>>> on the object itself. Thus giving us a short dentry tree creation
>>>>> ability of pGFID->name(GFID).
>>>>>
>>>>> This of course changes the gluster RPC wire protocol, as we need to
>>>>> encode/send pGFID as well in some cases (or this could be done by
>>>>> adding it to the xdata payload).
>>>>>
>>>>> Shyam
>>>>>
>>>>> [1] http://nfs.sourceforge.net/#faq_c7
>>>>> [2] https://www.kernel.org/doc/Documentation/filesystems/nfs/Exporting
>>>>>
>>>>> On 10/27/2015 03:07 AM, Aravinda wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We have a volume option called "build-pgfid:on" to enable recording
>>>>>> parent gfid in file xattr. This simplifies the GFID to Path
>>>>>> conversion.
>>>>>> Is it possible to save base name also in xattr along with PGFID? It
>>>>>> helps in converting GFID to Path easily without doing crawl.
>>>>>>
>>>>>> Example structure,
>>>>>>
>>>>>> dir1 (3c789e71-24b0-4723-92a2-7eb3c14b4114)
>>>>>> - f1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>> - f2 (f1e7ad00-6500-4284-b21c-d02766ecc336)
>>>>>> dir2 (6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed)
>>>>>> - h1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>>
>>>>>> Where file f1 and h1 are hardlinks. Note the same GFID.
>>>>>>
>>>>>> Backend,
>>>>>>
>>>>>> .glusterfs
>>>>>> - 3c/78/3c789e71-24b0-4723-92a2-7eb3c14b4114
>>>>>> - 0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
>>>>>> - f1/e7/f1e7ad00-6500-4284-b21c-d02766ecc336
>>>>>> - 6c/3b/6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed
>>>>>>
>>>>>> Since f1 and h1 are hardlinks across directories, the file xattrs
>>>>>> will have two parent GFIDs. Xattr dump will be,
>>>>>>
>>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=1
>>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=1
>>>>>>
>>>>>> Number shows number of hardlinks per parent GFID.
>>>>>>
>>>>>> If we know GFID of a file, to get path,
>>>>>> 1. Identify which brick has that file using pathinfo xattr.
>>>>>> 2. Get all parent GFIDs(using listxattr on backend gfid path
>>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>> 3. Crawl those directories to find files with same inode as
>>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
>>>>>>
>>>>>> Updating PGFID to be done when,
>>>>>> 1. CREATE/MKNOD - Add xattr
>>>>>> 2. RENAME - If moved to different directory, Update PGFID
>>>>>> 3. UNLINK - If number of links is more than 1, reduce the number of
>>>>>> links, remove the respective parent PGFID
>>>>>> 4. LINK - Add PGFID if link to different directory, Increment count
>>>>>>
>>>>>> Advantages:
>>>>>> 1. Crawling is limited to a few directories instead of a full file
>>>>>> system crawl.
>>>>>> 2. Break early during crawl when the search reaches the number of
>>>>>> hardlinks given by the xattr value.
>>>>>>
>>>>>> Disadvantages:
>>>>>> 1. Crawling is expensive if a directory has a lot of files.
>>>>>> 2. PGFID has to be updated on CREATE/MKNOD/RENAME/UNLINK/LINK
>>>>>> 3. This method of conversion will not work if file is deleted.
>>>>>>
>>>>>> We can improve performance of GFID to Path conversion if we record
>>>>>> Basename also in file xattr.
>>>>>>
>>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=f1
>>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=h1
>>>>>>
>>>>>> Note: Multiple base names delimited by zerobyte.
>>>>>>
>>>>>> Additional overhead compared to storing only PGFID:
>>>>>> 1. Space
>>>>>> 2. Number of xattrs will grow with the number of hardlinks
>>>>>> 3. Max size issue for xattr value?
>>>>>> 4. Xattr has to be updated even when renamed within the same
>>>>>> directory.
>>>>>> 5. Updating value of xattr involves parsing in case of multiple
>>>>>> hardlinks.
>>>>>>
>>>>>> Are there any performance issues except during initial indexing?
>>>>>> (Assume pgfid and basenames are populated by a separate script.)
>>>>>>
>>>>>> Comments and Suggestions Welcome.
>>>>>>
>>>>> _______________________________________________
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel at gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>

