[Gluster-devel] GFID to Path Conversion
Shyam
srangana at redhat.com
Mon Jan 11 14:35:09 UTC 2016
On 01/05/2016 08:24 PM, Venky Shankar wrote:
>
>
> Shyam wrote:
>> On 12/09/2015 12:47 AM, Aravinda wrote:
>>> Hi,
>>>
>>> Sharing a draft design for GFID to Path conversion. (Directory GFID to
>>> path is very easy in DHT v1; this design may not work in the case of
>>> DHT 2.0.)
>>
>> (Current thought) DHT2 would extend to directories the manner in which
>> the name,pGFID pair is stored for files. So reverse path walking would
>> leverage the same mechanism as explained below.
>>
>> Of course, as this would involve MDS hopping, the intention would be to
>> *not* use this in IO critical paths, and rather use this in the tool set
>> that needs reverse path walks to provide information to admins.
>>
>>>
>>> Performance and Storage space impact yet to be analyzed.
>>>
>>> Storing the required information
>>> --------------------------------
>>> Metadata relating a file to its parent GFID and basename will reside
>>> with the file. The PGFID and a hash of the basename become part of the
>>> xattr key name, and the basename is saved as the value.
>>>
>>> Xattr Key = meta.<PGFID>.<HASH(BASENAME)>
>>> Xattr Value = <BASENAME>
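
For illustration, here is a minimal sketch of how such an xattr could be
recorded by a brick-side helper, assuming Python on Linux, a non-crypto
hash such as crc32, and a "trusted." namespace prefix on the key (the
names here are placeholders, not part of the draft):

    import os
    import zlib

    def record_entry(gfid_path, pgfid, basename):
        # Non-crypto hash of the basename, hex encoded
        bn_hash = format(zlib.crc32(basename.encode('utf-8')), '08x')
        key = "trusted.meta.%s.%s" % (pgfid, bn_hash)
        # One xattr per (parent GFID, basename) pair; the value is the basename
        os.setxattr(gfid_path, key, basename.encode('utf-8'))
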
>>
>> I would think we should keep the xattr name constant and specialize the
>> value, instead of encoding data in the xattr name itself. The issue, of
>> course, is that multiple xattr name:value pairs with a constant name are
>> not feasible, and this needs some thought.
>
> With DHT2, the "multi-value key" could possibly be stored efficiently in
> some kvdb rather than xattrs (when it does move there). With current
> DHT, we're still stuck with using xattrs, where having a compounded
> value would be rather inefficient.
With DHT2, if we get to the point of the on-disk xlator controlling the
d_off, then we could/may need that information as well for optimization
purposes (just stating).
>
>>
>>>
>>> Non-crypto hash is suitable for this purpose.
>>> Number of Xattrs on a file = Number of Links
>>>
>>> Converting GFID to Path
>>> -----------------------
>>> Example GFID: 78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038
>>
>> Here is where we get into a bit of a problem if a file has links: which
>> path to follow would be a dilemma. We could return all paths, but tools
>> like glusterfind or backup-related tools would prefer a single path. One
>> thought is that we could feed a pGFID:GFID pair as input, but this still
>> does not solve the case of a file having links within the same pGFID.
>
> Why not just list all possible paths? I think that might be the correct
> thing to do. In most cases, this would just be dealing with a single
> link count. For the other cases (nlink > 1), the higher level code would
> need to do some sort of juggling - utilities such as glusterfind (or
> other backup tools) could possibly perform additional checks before
> doing their job, but in most cases they would still be dealing with a
> single link count.
Hmmm... one way for sure. But let's say it is a backup application that
needs the files that changed; in the above example we would list a file
with hardlinks twice. Hence the thought about what we could do here.
<not for now tag>
Maybe we could pass the <index:pGFID:GFID> back as the handle, where the
index is a number uniquely identifying the base name in the inode KV
list. This way at least the changelog and other handle-based consumers
can work out which bname/index was being operated upon, etc.
<end tag>
>
>>
>> Anyway, something to note or consider.
>>
>>>
>>> 1. List all xattrs of the GFID file in the brick backend
>>> ($BRICK_ROOT/.glusterfs/78/e8/78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038).
>>> 2. If an xattr key starts with "meta", split it to get the parent GFID
>>> and collect the xattr value.
>>> 3. Convert the parent GFID to a path using recursive readlink on the
>>> .glusterfs directory symlinks until the full path is built.
>>
>> This is the part which should/would change with DHT2 in my opinion. Sort
>> of repeating step (2) here instead of a readlink.
>>
>>> 4. Join the converted parent directory path and the xattr value
>>> (basename).
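
For illustration, a rough sketch of steps 1-4 against a brick backend,
assuming Python, the trusted.meta.* key form from the earlier sketch, and
the usual .glusterfs layout where directory GFID entries are symlinks
that can be followed back to the root (paths and helper names are
placeholders):

    import os

    ROOT_GFID = '00000000-0000-0000-0000-000000000001'

    def handle_path(brick_root, gfid):
        return os.path.join(brick_root, '.glusterfs', gfid[0:2], gfid[2:4], gfid)

    def resolve_dir_gfid(brick_root, pgfid):
        # Directory GFID entries under .glusterfs are symlinks of the form
        # ../../aa/bb/<grandparent-gfid>/<dirname>; follow them up to the root
        parts = []
        while pgfid != ROOT_GFID:
            target = os.readlink(handle_path(brick_root, pgfid))
            parts.append(os.path.basename(target))
            pgfid = os.path.basename(os.path.dirname(target))
        return os.path.join(brick_root, *reversed(parts))

    def gfid_to_paths(brick_root, gfid):
        gfid_file = handle_path(brick_root, gfid)
        paths = []
        for key in os.listxattr(gfid_file):               # step 1
            if not key.startswith('trusted.meta.'):
                continue
            pgfid = key.split('.')[2]                      # step 2
            basename = os.getxattr(gfid_file, key).decode('utf-8')
            parent = resolve_dir_gfid(brick_root, pgfid)   # step 3
            paths.append(os.path.join(parent, basename))   # step 4
        return paths
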
>>>
>>> Recording
>>> ---------
>>> MKNOD/CREATE/LINK/SYMLINK: Add new Xattr(PGFID, BN)
>>
>> Most of these operations as they exist today are not atomic, i.e. we
>> create the file, then add the xattrs, and then possibly hardlink the
>> GFID, so by the time the GFID makes its appearance, the file is fully
>> ready and (maybe) hence consistent.
>>
>> The other way to look at this is that we get the GFID representation
>> ready, and then hard link the name into the name tree. Alternatively, we
>> could leverage O_TMPFILE to create the file, encode all its inode
>> information, and then bring it to life in the namespace. This is
>> orthogonal to this design, but it brings in the need to be consistent on
>> failures.
>
> IIRC, last time I checked, using O_TMPFILE was not portable, but this
> can still be used wherever it's available.
>
>>
>> Either way, if a failure occurs midway, we have no way to recover the
>> information for the inode and set it right. Thoughts?
>>
>>> RENAME: Remove old xattr(PGFID1, BN1), Add new xattr(PGFID2, BN2)
>>> UNLINK: If Link count > 1 then Remove xattr(PGFID, BN)
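
Continuing the same assumptions, a minimal sketch of the rename/unlink
maintenance on the brick (helper names are hypothetical; meta_key()
builds the same trusted.meta.<PGFID>.<HASH> key as the earlier sketch):

    import os
    import zlib

    def meta_key(pgfid, basename):
        bn_hash = format(zlib.crc32(basename.encode('utf-8')), '08x')
        return "trusted.meta.%s.%s" % (pgfid, bn_hash)

    def on_rename(gfid_path, old_pgfid, old_bn, new_pgfid, new_bn):
        # Drop the old (parent, basename) record and add the new one
        os.removexattr(gfid_path, meta_key(old_pgfid, old_bn))
        os.setxattr(gfid_path, meta_key(new_pgfid, new_bn), new_bn.encode('utf-8'))

    def on_unlink(gfid_path, pgfid, basename, nlink):
        # Only remove the record if other links remain; the last unlink
        # removes the inode (and all its xattrs) anyway
        if nlink > 1:
            os.removexattr(gfid_path, meta_key(pgfid, basename))
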
>>>
>>> Heal on Lookup
>>> --------------
>>> Healing on lookup can be enabled if required; by default we can
>>> disable this option since it may have performance implications
>>> during reads.
>>>
>>> Enabling the logging
>>> ---------------------
>>> This can be enabled using a volume set option. Option name TBD.
>>>
>>> Rebuild Index
>>> -------------
>>> An offline activity that crawls the backend filesystem and builds all
>>> the required xattrs.
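
For illustration, a rough sketch of such an offline crawl, assuming the
same trusted.meta.* key form, the record_entry() helper from the earlier
sketch, and that each directory's GFID is readable from its trusted.gfid
xattr (a real implementation would also need error handling and
throttling):

    import os
    import uuid

    def rebuild_index(brick_root):
        for parent, dirs, files in os.walk(brick_root):
            if '.glusterfs' in dirs:
                dirs.remove('.glusterfs')   # do not descend into the handle store
            # The parent GFID is stored as a 16-byte trusted.gfid xattr
            pgfid = str(uuid.UUID(bytes=os.getxattr(parent, 'trusted.gfid')))
            for name in files:
                record_entry(os.path.join(parent, name), pgfid, name)
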
>>
>> Frequency of the rebuild? I would assume this would be run when the
>> option is enabled, and later almost never, unless we want to recover
>> from some inconsistency in the data (how to detect the same would be an
>> open question).
>>
>> Also I think that once this option is enabled, we should prevent
>> disabling it (or at least until the packages are downgraded), as this
>> would be a hinge that multiple other features may depend on. So we would
>> consider this an on-disk change that is made once and later maintained
>> for the volume, rather than something that is turned on/off.
>>
>> Which means the initial index rebuild would be a volume version
>> conversion from the current representation to this one, and may need
>> additional thought on how we maintain volume versions.
>>
>>>
>>> Comments and Suggestions Welcome.
>>>
>>> regards
>>> Aravinda
>>>
>>> On 11/25/2015 10:08 AM, Aravinda wrote:
>>>>
>>>> regards
>>>> Aravinda
>>>>
>>>> On 11/24/2015 11:25 PM, Shyam wrote:
>>>>> There seem to be other interested consumers in gluster for the same
>>>>> information, and I guess we need a good base design to address this
>>>>> on-disk change, so that it can be leveraged in the various use cases
>>>>> appropriately.
>>>>>
>>>>> Request a few folks to list out how they would use this feature and
>>>>> also what performance characteristics they expect around the same.
>>>>>
>>>>> - gluster find class of utilities
>>>>> - change log processors
>>>>> - swift on file
>>>>> - inotify support on gluster
>>>>> - Others?
>>>> Debugging utilities for users/admins (show the path for GFIDs
>>>> displayed in log files)
>>>> Retrigger sync in Geo-replication (Geo-rep reports failed GFIDs in
>>>> logs; we can retrigger the sync if the path is known instead of only
>>>> the GFID)
>>>>>
>>>>> [3] is an attempt in XFS to do the same; there is possibly a later
>>>>> thread around the same that discusses more recent approaches.
>>>>>
>>>>> [4] slide 13 onwards talks about how cephfs does this. (see cephfs
>>>>> inode backtraces)
>>>>>
>>>>> Aravinda, could you put up a design for the same, and how and where
>>>>> this information is added, etc.? It would help to review it from the
>>>>> perspective of other xlators (like the existing DHT).
>>>>>
>>>>> Shyam
>>>>> [3] http://oss.sgi.com/archives/xfs/2014-01/msg00224.html
>>>>> [4]
>>>>> http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf
>>>>>
>>>>>
>>>>>
>>>>> On 10/27/2015 10:02 AM, Shyam wrote:
>>>>>> Aravinda, List,
>>>>>>
>>>>>> The topic is interesting and also relevant in the case of DHT2
>>>>>> where we
>>>>>> lose the hierarchy on a single brick (unlike the older DHT) and so
>>>>>> some
>>>>>> of the thoughts here are along the same lines as what we are debating
>>>>>> w.r.t DHT2 as well.
>>>>>>
>>>>>> Here is another option, extending the current thought, that I would
>>>>>> like to put forward; it is pretty much inspired by the Linux kernel
>>>>>> NFS implementation (based on my current understanding of the same)
>>>>>> [1] [2].
>>>>>>
>>>>>> If gluster server/brick processes handed out handles that encode
>>>>>> pGFID/GFID (handles are currently just the GFID (or inode #) of the
>>>>>> file), then on any handle-based operation we get the pGFID/GFID for
>>>>>> the object being operated on. This solves the first part of the
>>>>>> problem, where we are encoding the pGFID in the xattr; here we not
>>>>>> only do that but further hand out the handle with that relationship.
>>>>>>
>>>>>> It also helps when an object is renamed and we still allow the older
>>>>>> handle to be used for operations. Not a bad thing in some cases, and
>>>>>> possibly not the best thing to do in some other cases (say access).
>>>>>>
>>>>>> To tie this knowledge back to a name, what you propose can be stored
>>>>>> on the object itself, thus giving us the ability to create a short
>>>>>> dentry tree of pGFID->name(GFID).
>>>>>>
>>>>>> This of course changes the gluster RPC wire protocol, as we need to
>>>>>> encode/send the pGFID as well in some cases (or this could be done
>>>>>> by adding it to the xdata payload).
>>>>>>
>>>>>> Shyam
>>>>>>
>>>>>> [1] http://nfs.sourceforge.net/#faq_c7
>>>>>> [2]
>>>>>> https://www.kernel.org/doc/Documentation/filesystems/nfs/Exporting
>>>>>>
>>>>>> On 10/27/2015 03:07 AM, Aravinda wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have a volume option, "build-pgfid:on", to enable recording the
>>>>>>> parent GFID in a file xattr. This simplifies the GFID to path
>>>>>>> conversion.
>>>>>>> Is it possible to also save the basename in the xattr along with
>>>>>>> the PGFID? It would help in converting a GFID to a path easily
>>>>>>> without doing a crawl.
>>>>>>>
>>>>>>> Example structure,
>>>>>>>
>>>>>>> dir1 (3c789e71-24b0-4723-92a2-7eb3c14b4114)
>>>>>>> - f1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>>> - f2 (f1e7ad00-6500-4284-b21c-d02766ecc336)
>>>>>>> dir2 (6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed)
>>>>>>> - h1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>>>
>>>>>>> Where file f1 and h1 are hardlinks. Note the same GFID.
>>>>>>>
>>>>>>> Backend,
>>>>>>>
>>>>>>> .glusterfs
>>>>>>> - 3c/78/3c789e71-24b0-4723-92a2-7eb3c14b4114
>>>>>>> - 0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
>>>>>>> - f1/e7/f1e7ad00-6500-4284-b21c-d02766ecc336
>>>>>>> - 6c/3b/6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed
>>>>>>>
>>>>>>> Since f1 and h1 are hardlinks across directories, the file xattr
>>>>>>> will have two parent GFIDs. The xattr dump will be:
>>>>>>>
>>>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=1
>>>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=1
>>>>>>>
>>>>>>> The number shows the number of hardlinks per parent GFID.
>>>>>>>
>>>>>>> If we know the GFID of a file, to get the path:
>>>>>>> 1. Identify which brick has that file using the pathinfo xattr.
>>>>>>> 2. Get all parent GFIDs (using listxattr on the backend gfid path
>>>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c).
>>>>>>> 3. Crawl those directories to find files with the same inode as
>>>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c.
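
For illustration, a rough sketch of steps 2 and 3 using the existing
build-pgfid xattrs, assuming Python and reusing the handle_path() and
resolve_dir_gfid() helpers from the earlier sketch (names are
placeholders):

    import os

    def gfid_to_paths_via_pgfid(brick_root, gfid):
        gfid_file = handle_path(brick_root, gfid)
        target_ino = os.stat(gfid_file).st_ino
        paths = []
        for key in os.listxattr(gfid_file):
            if not key.startswith('trusted.pgfid.'):
                continue
            pgfid = key[len('trusted.pgfid.'):]
            parent_dir = resolve_dir_gfid(brick_root, pgfid)
            # Crawl the parent directory for entries sharing the same inode
            for name in os.listdir(parent_dir):
                full = os.path.join(parent_dir, name)
                if os.lstat(full).st_ino == target_ino:
                    paths.append(full)
        return paths
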
>>>>>>>
>>>>>>> Updating the PGFID is to be done when:
>>>>>>> 1. CREATE/MKNOD - Add the xattr.
>>>>>>> 2. RENAME - If moved to a different directory, update the PGFID.
>>>>>>> 3. UNLINK - If the number of links is more than 1, reduce the link
>>>>>>> count, removing the respective parent PGFID xattr when it drops to
>>>>>>> zero.
>>>>>>> 4. LINK - Add a PGFID xattr if the link is in a different
>>>>>>> directory; increment the count.
>>>>>>>
>>>>>>> Advantages:
>>>>>>> 1. Crawling is limited to a few directories instead of a full
>>>>>>> filesystem crawl.
>>>>>>> 2. The crawl can break early once the search has found as many
>>>>>>> hardlinks as the xattr value records.
>>>>>>>
>>>>>>> Disadvantages:
>>>>>>> 1. Crawling is expensive if a directory has a lot of files.
>>>>>>> 2. The PGFID must be updated on CREATE/MKNOD/RENAME/UNLINK/LINK.
>>>>>>> 3. This method of conversion will not work if the file is deleted.
>>>>>>>
>>>>>>> We can improve the performance of GFID to path conversion if we
>>>>>>> also record the basename in the file xattr.
>>>>>>>
>>>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=f1
>>>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=h1
>>>>>>>
>>>>>>> Note: multiple basenames are delimited by a zero byte.
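
A small sketch of how such a multi-valued xattr could be read and
updated, assuming zero-byte delimited basenames as described (helper
names are hypothetical):

    import os

    def get_basenames(gfid_path, pgfid):
        # The value is a list of basenames separated by zero bytes
        raw = os.getxattr(gfid_path, 'trusted.pgfid.' + pgfid)
        return [b.decode('utf-8') for b in raw.split(b'\0') if b]

    def add_basename(gfid_path, pgfid, basename):
        try:
            names = set(get_basenames(gfid_path, pgfid))
        except OSError:              # xattr not present yet
            names = set()
        names.add(basename)
        value = b'\0'.join(n.encode('utf-8') for n in sorted(names))
        os.setxattr(gfid_path, 'trusted.pgfid.' + pgfid, value)
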
>>>>>>>
>>>>>>> Additional overhead compared to storing only the PGFID:
>>>>>>> 1. Space.
>>>>>>> 2. The number of xattrs will grow with the number of hardlinks.
>>>>>>> 3. Possible max-size issue for the xattr value?
>>>>>>> 4. The xattr must be updated even when a file is renamed within
>>>>>>> the same directory.
>>>>>>> 5. Updating the value of the xattr involves parsing in the case of
>>>>>>> multiple hardlinks.
>>>>>>>
>>>>>>> Are there any performance issues except during the initial
>>>>>>> indexing? (Assume pgfid and basenames are populated by a separate
>>>>>>> script.)
>>>>>>>
>>>>>>> Comments and Suggestions Welcome.
>>>>>>>
>>>>
>>>