[Gluster-devel] GFID to Path Conversion

Tue Jan 12 06:38:44 UTC 2016

regards
Aravinda

On 01/11/2016 08:00 PM, Shyam wrote:
> On 01/06/2016 04:46 AM, Aravinda wrote:
>>
>> regards
>> Aravinda
>>
>> On 01/06/2016 02:49 AM, Shyam wrote:
>>> On 12/09/2015 12:47 AM, Aravinda wrote:
>>>> Hi,
>>>>
>>>> Sharing a draft design for GFID to Path conversion. (Directory GFID
>>>> to Path is very easy in DHT v.1; this design may not work in case of
>>>> DHT 2.0.)
>>>
>>> (current thought) DHT2 would extend the manner in which name,pGFID is
>>> stored for files to directories as well. So reverse path walking would
>>> leverage the same mechanism as explained below.
>>>
>>> Of course, as this would involve MDS hopping, the intention would be
>>> to *not* use this in IO critical paths, and rather use this in the
>>> tool set that needs reverse path walks to provide information to 
>>> admins.
>>>
>>>>
>>>> Performance and Storage space impact yet to be analyzed.
>>>>
>>>> Storing the required information
>>>> --------------------------------
>>>> Metadata information related to Parent GFID and Basename will reside
>>>> with the file. PGFID and hash of Basename will become part of Xattr
>>>> Key name and Basename will be saved as Value.
>>>>
>>>>      Xattr Key = meta.<PGFID>.<HASH(BASENAME)>
>>>>      Xattr Value = <BASENAME>
>>>
>>> I would think we should keep the xattr name constant, and specialize
>>> the value, instead of encoding data in the xattr name itself. The
>>> issue, of course, is that multiple xattr name:value pairs with a
>>> constant name are not feasible, and this needs some thought.
>> If we use a single xattr for multiple values, then updating one
>> basename requires parsing the existing xattr value before the update
>> (in case of hardlinks).
>> I wrote about other experiments I did to update and read xattrs:
>> http://www.gluster.org/pipermail/gluster-devel/2015-December/047380.html
>
> Agree and understood, I am more thinking how we will enumerate all 
> such xattrs, when we just know the name. We possibly would do 
> listxattr in that case, would that be right?
To create/update the xattr, no search is required. For example, to
create d1/d2/f1:

pgfid = get_gfid(d1/d2)
xattr_name = "meta." + pgfid + "." + HASH(f1)
value = "f1"
setxattr(d1/d2/f1, xattr_name, value)
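
The key construction above can be sketched in Python. This is only an
illustration: the thread does not fix HASH(), so CRC32 from zlib is used
here as an assumed stand-in for "some non-crypto hash", and the helper
names basename_hash() and xattr_name() are made up for this example.

```python
import zlib

XATTR_PREFIX = "meta."  # prefix used in this thread


def basename_hash(basename: str) -> str:
    """Non-cryptographic hash of the basename (CRC32 is an assumption)."""
    return format(zlib.crc32(basename.encode("utf-8")), "08x")


def xattr_name(pgfid: str, basename: str) -> str:
    """Build meta.<PGFID>.<HASH(BASENAME)> for one link of a file."""
    return XATTR_PREFIX + pgfid + "." + basename_hash(basename)


pgfid = "3c789e71-24b0-4723-92a2-7eb3c14b4114"  # GFID of the parent directory
name = xattr_name(pgfid, "f1")
value = b"f1"  # the xattr value is the basename itself
print(name, value)
```

Because the key depends only on the parent GFID and the basename, it can
be recomputed at create or rename time without listing existing xattrs.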

In case of Rename(d1/d2/f1 => d1/d3/f3),
pgfid_old = get_gfid(d1/d2)
pgfid_new = get_gfid(d1/d3)
xattr_name_old = "meta." + pgfid_old + "." + HASH(f1)
xattr_name_new = "meta." + pgfid_new + "." + HASH(f3)
value_new = "f3"
removexattr(d1/d2/f1, xattr_name_old)
setxattr(d1/d3/f3, xattr_name_new, value_new)
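
A minimal sketch of the same bookkeeping for create, hardlink and rename
follows. The xattrs of one inode are modelled as a plain dict so that it
runs anywhere; the function names, the dict stand-in and the CRC32 hash
are assumptions for illustration, not part of the proposed design.

```python
import zlib


def xattr_name(pgfid, basename):
    # meta.<PGFID>.<HASH(BASENAME)>; CRC32 stands in for the unspecified hash
    return "meta.%s.%08x" % (pgfid, zlib.crc32(basename.encode("utf-8")))


def record_link(xattrs, pgfid, basename):
    """CREATE/MKNOD/LINK/SYMLINK: add one xattr for the new name."""
    xattrs[xattr_name(pgfid, basename)] = basename.encode("utf-8")


def record_unlink(xattrs, pgfid, basename):
    """UNLINK of one name of a file that still has other links."""
    xattrs.pop(xattr_name(pgfid, basename), None)


def record_rename(xattrs, old_pgfid, old_name, new_pgfid, new_name):
    """RENAME: drop the old (PGFID, basename) key, add the new one."""
    record_unlink(xattrs, old_pgfid, old_name)
    record_link(xattrs, new_pgfid, new_name)


# xattrs of a single inode, modelled as a dict (hypothetical stand-in)
D2 = "3c789e71-24b0-4723-92a2-7eb3c14b4114"
D3 = "6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed"
inode_xattrs = {}
record_link(inode_xattrs, D2, "f1")              # create d1/d2/f1
record_link(inode_xattrs, D3, "h1")              # hardlink d1/d3/h1
record_rename(inode_xattrs, D2, "f1", D3, "f3")  # rename d1/d2/f1 => d1/d3/f3
print(sorted(inode_xattrs.values()))
```

Each operation touches only the key of the link being changed, so the
entry for h1 is never read or rewritten during the rename of f1.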

Example script to populate the xattrs:
https://gist.github.com/aravindavk/5307489f68cbcfb37d3d

Each xattr can be handled independently (thread safe), since its
key/value does not depend on the basename of any other link.

To read xattr and convert to path(Python example, 
https://gist.github.com/aravindavk/d1d0ca9c874b7d3d8d86)

import os

paths = []
for xattr_name in os.listxattr(PATH):
    # only the meta.<PGFID>.<HASH> xattrs carry a basename
    if xattr_name.startswith("meta."):
        paths.append(os.getxattr(PATH, xattr_name))
print(paths)
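
To also recover the parent GFID, the key itself has to be split. A small
sketch, assuming the key format meta.<PGFID>.<HASH> from this thread;
the sample xattr dump and its hash parts are made up, and on a brick the
dict would be filled from os.listxattr()/os.getxattr() instead.

```python
def links_from_xattrs(xattrs):
    """Return (pgfid, basename) for every meta.<PGFID>.<HASH> key."""
    links = []
    for key, value in xattrs.items():
        if not key.startswith("meta."):
            continue  # skip unrelated xattrs
        # A GFID contains no '.', so the key splits into exactly three parts
        _, pgfid, _hash = key.split(".")
        links.append((pgfid, value.decode("utf-8")))
    return sorted(links)


# Hypothetical xattr dump of one file with two hardlinks; hash parts are made up
sample = {
    "meta.3c789e71-24b0-4723-92a2-7eb3c14b4114.0a1b2c3d": b"f1",
    "meta.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed.4e5f6a7b": b"h1",
    "trusted.gfid": b"\x0a\xa9",
}
for pgfid, basename in links_from_xattrs(sample):
    print(pgfid, basename)
```

The hash part of the key is not needed on the read side; it only keeps
keys unique when several links live under the same parent.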

>>>
>>>>
>>>> Non-crypto hash is suitable for this purpose.
>>>> Number of Xattrs on a file = Number of Links
>>>>
>>>> Converting GFID to Path
>>>> -----------------------
>>>> Example GFID: 78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038
>>>
>>> Here is where we get into a bit of a problem, if a file has links.
>>> Which path to follow would be a dilemma. We could return all paths,
>>> but tools like glusterfind or backup related, would prefer a single
>>> file. One of the thoughts is, if we could feed a pGFID:GFID pair as
>>> input, this still does not solve a file having links within the same
>>> pGFID.
>>>
>>> Anyway, something to note or consider.
>>>
>>>>
>>>> 1. List all xattrs of GFID file in the brick backend.
>>>> ($BRICK_ROOT/.glusterfs/78/e8/78e8bce0-a8c9-4e67-9ffb-c4c4c7eff038)
>>>> 2. If Xattr Key starts with “meta”, split it to get the parent GFID
>>>> and collect the xattr value (basename).
>>>> 3. Convert Parent GFID to path using recursive readlink until the
>>>> brick root is reached.
>>>
>>> This is the part which should/would change with DHT2 in my opinion.
>>> Sort of repeating step (2) here instead of a readlink.
>>>
>>>> 4. Join Converted parent dir path and xattr value(basename)
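
Steps 3 and 4 could look roughly like the sketch below. It assumes the
DHT v.1 backend layout in which a directory's entry under .glusterfs is
a symlink of the form ../../<pp>/<pp>/<parent gfid>/<basename> and the
brick root has the fixed GFID ending in ...0001. The readlink function
is passed in so a fake symlink table can stand in for os.readlink; all
function names here are hypothetical.

```python
import posixpath

ROOT_GFID = "00000000-0000-0000-0000-000000000001"  # assumed root GFID


def backend_path(gfid):
    """Path of a GFID entry relative to the brick root."""
    return posixpath.join(".glusterfs", gfid[0:2], gfid[2:4], gfid)


def dir_gfid_to_path(gfid, readlink):
    """Walk parent symlinks until the root GFID is reached."""
    parts = []
    while gfid != ROOT_GFID:
        # target looks like ../../<pp>/<pp>/<parent gfid>/<basename>
        target = readlink(backend_path(gfid))
        parent_dir, basename = posixpath.split(target)
        parts.append(basename)
        gfid = posixpath.basename(parent_dir)
    return "/".join(reversed(parts))


def file_paths(links, readlink):
    """Join each parent directory path with the recorded basename."""
    return sorted(
        posixpath.join(dir_gfid_to_path(pgfid, readlink), basename)
        for pgfid, basename in links
    )


# Fake symlink table standing in for os.readlink on a brick (made-up layout)
D1 = "3c789e71-24b0-4723-92a2-7eb3c14b4114"
D2 = "6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed"
symlinks = {
    backend_path(D1): "../../00/00/" + ROOT_GFID + "/dir1",
    backend_path(D2): "../../3c/78/" + D1 + "/dir2",
}
links = [(D1, "f1"), (D2, "h1")]  # (parent GFID, basename) from step 2
print(file_paths(links, symlinks.__getitem__))
```

With DHT2 the readlink step would presumably be replaced by repeating
the xattr lookup on the parent, as noted above; the join in step 4
stays the same.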
>>>>
>>>> Recording
>>>> ---------
>>>> MKNOD/CREATE/LINK/SYMLINK: Add new Xattr(PGFID, BN)
>>>
>>> Most of these operations as they exist today are not atomic, i.e. we
>>> create the file and then add the xattrs and then possibly hardlink the
>>> GFID, so by the time the GFID makes its presence, the file is all
>>> ready and (maybe) hence consistent.
>>>
>>> The other way to look at this is that we get the GFID representation
>>> ready, and then hard link the name into the name tree. Alternately we
>>> could leverage O_TMPFILE to create the file, encode all its inode
>>> information and then bring it to life in the namespace. This is
>>> orthogonal to this design, but brings in the need to be consistent on
>>> failures.
>>>
>>> Either way, if a failure occurs midway, we have no way to recover the
>>> information for the inode and set it right. Thoughts?
>>>
>>>> RENAME: Remove old xattr(PGFID1, BN1), Add new xattr(PGFID2, BN2)
>>>> UNLINK: If Link count > 1 then Remove xattr(PGFID, BN)
>>>>
>>>> Heal on Lookup
>>>> --------------
>>>> Healing on lookup can be enabled if required. By default we can
>>>> disable this option, since it may have performance implications
>>>> during reads.
>>>>
>>>> Enabling the logging
>>>> ---------------------
>>>> This can be enabled using Volume set option. Option name TBD.
>>>>
>>>> Rebuild Index
>>>> -------------
>>>> Offline activity, crawls the backend filesystem and builds all the
>>>> required xattrs.
>>>
>>> Frequency of the rebuild? I would assume this would be run when the
>>> option is enabled, and later almost never, unless we want to recover
>>> from some inconsistency in the data (how to detect the same would be
>>> an open question).
>>>
>>> Also I think once this option is enabled, we should prevent disabling
>>> the same (or at least till the packages are downgraded), as this would
>>> be a hinge that multiple other features may depend on, and so we
>>> consider this an on-disk change that is made once, and later
>>> maintained for the volume, rather than turn on/off.
>>>
>>> Which means the initial index rebuild would be a volume version
>>> conversion from current to this representation and may need additional
>>> thoughts on how we maintain volume versions.
>>>
>>>>
>>>> Comments and Suggestions Welcome.
>>>>
>>>> regards
>>>> Aravinda
>>>>
>>>> On 11/25/2015 10:08 AM, Aravinda wrote:
>>>>>
>>>>> regards
>>>>> Aravinda
>>>>>
>>>>> On 11/24/2015 11:25 PM, Shyam wrote:
>>>>>> There seem to be other interested consumers in gluster for the same
>>>>>> information, and I guess we need a good base design to address this
>>>>>> on-disk change, so that it can be leveraged in the various use cases
>>>>>> appropriately.
>>>>>>
>>>>>> Request a few folks to list out how they would use this feature and
>>>>>> also what performance characteristics they expect around the same.
>>>>>>
>>>>>> - gluster find class of utilities
>>>>>> - change log processors
>>>>>> - swift on file
>>>>>> - inotify support on gluster
>>>>>> - Others?
>>>>> Debugging utilities for users/admins(Show path for GFIDs displayed in
>>>>> log files)
>>>>> Retrigger Sync in Geo-replication(Geo-rep reports failed GFIDs in
>>>>> logs, we can retrigger sync if path is known instead of GFID)
>>>>>>
>>>>>> [3] is an attempt in XFS to do the same; possibly there is a more
>>>>>> recent thread around the same that discusses later approaches.
>>>>>>
>>>>>> [4] slide 13 onwards talks about how cephfs does this. (see cephfs
>>>>>> inode backtraces)
>>>>>>
>>>>>> Aravinda, could you put up a design for the same, and how and where
>>>>>> this information is added etc. It would help to review it from the
>>>>>> perspective of other xlators (like existing DHT).
>>>>>>
>>>>>> Shyam
>>>>>> [3] http://oss.sgi.com/archives/xfs/2014-01/msg00224.html
>>>>>> [4]
>>>>>> http://events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf 
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 10/27/2015 10:02 AM, Shyam wrote:
>>>>>>> Aravinda, List,
>>>>>>>
>>>>>>> The topic is interesting and also relevant in the case of DHT2,
>>>>>>> where we lose the hierarchy on a single brick (unlike the older
>>>>>>> DHT), and so some of the thoughts here are along the same lines as
>>>>>>> what we are debating w.r.t DHT2 as well.
>>>>>>>
>>>>>>> Here is another option that extends the current thought, that I
>>>>>>> would like to put forward, that is pretty much inspired from the
>>>>>>> Linux kernel NFS implementation (based on my current understanding
>>>>>>> of the same) [1] [2].
>>>>>>>
>>>>>>> If gluster server/brick processes handed out handles (which are
>>>>>>> currently just the GFID (or inode #) of the file) that encode
>>>>>>> pGFID/GFID, then on any handle based operation, we get the
>>>>>>> pGFID/GFID for the object being operated on. This solves the first
>>>>>>> part of the problem where we are encoding the pGFID in the xattr,
>>>>>>> and here we not only do that but further hand out the handle with
>>>>>>> that relationship.
>>>>>>>
>>>>>>> It also helps when an object is renamed and we still allow the
>>>>>>> older handle to be used for operations. Not a bad thing in some
>>>>>>> cases, and possibly not the best thing to do in some other cases
>>>>>>> (say access).
>>>>>>>
>>>>>>> To further this knowledge back to a name, what you propose can be
>>>>>>> stored on the object itself. Thus giving us a short dentry tree
>>>>>>> creation ability of pGFID->name(GFID).
>>>>>>>
>>>>>>> This of course changes the gluster RPC wire protocol, as we need to
>>>>>>> encode/send pGFID as well in some cases (or could be done by adding
>>>>>>> this to the xdata payload).
>>>>>>>
>>>>>>> Shyam
>>>>>>>
>>>>>>> [1] http://nfs.sourceforge.net/#faq_c7
>>>>>>> [2]
>>>>>>> https://www.kernel.org/doc/Documentation/filesystems/nfs/Exporting
>>>>>>>
>>>>>>> On 10/27/2015 03:07 AM, Aravinda wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have a volume option called "build-pgfid:on" to enable
>>>>>>>> recording parent gfid in file xattr. This simplifies the GFID to
>>>>>>>> Path conversion. Is it possible to save base name also in xattr
>>>>>>>> along with PGFID? It helps in converting GFID to Path easily
>>>>>>>> without doing a crawl.
>>>>>>>>
>>>>>>>> Example structure,
>>>>>>>>
>>>>>>>> dir1 (3c789e71-24b0-4723-92a2-7eb3c14b4114)
>>>>>>>>      - f1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>>>>      - f2 (f1e7ad00-6500-4284-b21c-d02766ecc336)
>>>>>>>> dir2 (6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed)
>>>>>>>>      - h1 (0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>>>>
>>>>>>>> Where file f1 and h1 are hardlinks. Note the same GFID.
>>>>>>>>
>>>>>>>> Backend,
>>>>>>>>
>>>>>>>> .glusterfs
>>>>>>>>       - 3c/78/3c789e71-24b0-4723-92a2-7eb3c14b4114
>>>>>>>>       - 0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
>>>>>>>>       - f1/e7/f1e7ad00-6500-4284-b21c-d02766ecc336
>>>>>>>>       - 6c/3b/6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed
>>>>>>>>
>>>>>>>> Since f1 and h1 are hardlinks across directories, file xattr will
>>>>>>>> have two parent GFIDs. Xattr dump will be,
>>>>>>>>
>>>>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=1
>>>>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=1
>>>>>>>>
>>>>>>>> The value is the number of hardlinks under that parent GFID.
>>>>>>>>
>>>>>>>> If we know GFID of a file, to get path,
>>>>>>>> 1. Identify which brick has that file using pathinfo xattr.
>>>>>>>> 2. Get all parent GFIDs(using listxattr on backend gfid path
>>>>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c)
>>>>>>>> 3. Crawl those directories to find files with same inode as
>>>>>>>> .glusterfs/0a/a9/0aa94a0a-62aa-4afc-9d59-eb68ad39f78c
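
Step 3 (the crawl) could be sketched as below: scan only the recorded
parent directories and compare inode numbers against the .glusterfs
hardlink. The throwaway directory layout and the name find_links() are
assumptions made for this example; on a brick the parent directories
would come from the trusted.pgfid.* xattrs, and a POSIX filesystem with
hardlink support is assumed.

```python
import os
import tempfile


def find_links(gfid_file, parent_dirs, nlinks):
    """Scan parent_dirs for entries sharing gfid_file's inode."""
    target = os.stat(gfid_file)
    found = []
    for parent in parent_dirs:
        with os.scandir(parent) as entries:
            for entry in entries:
                st = entry.stat(follow_symlinks=False)
                if (st.st_ino, st.st_dev) == (target.st_ino, target.st_dev):
                    found.append(entry.path)
                    if len(found) == nlinks:
                        return found  # all recorded links found, stop early
    return found


# Throwaway layout mimicking dir1/f1, dir2/h1 and the .glusterfs hardlink
with tempfile.TemporaryDirectory() as brick:
    for d in ("dir1", "dir2", "gfid"):
        os.mkdir(os.path.join(brick, d))
    f1 = os.path.join(brick, "dir1", "f1")
    open(f1, "w").close()
    open(os.path.join(brick, "dir1", "f2"), "w").close()
    gfid_file = os.path.join(brick, "gfid", "0aa94a0a")
    os.link(f1, gfid_file)
    os.link(f1, os.path.join(brick, "dir2", "h1"))
    dirs = [os.path.join(brick, "dir1"), os.path.join(brick, "dir2")]
    result = [os.path.relpath(p, brick) for p in find_links(gfid_file, dirs, 2)]
    print(sorted(result))
```

The cost is one stat per directory entry, which is why a parent with
many files makes this approach expensive.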
>>>>>>>>
>>>>>>>> Updating PGFID to be done when,
>>>>>>>> 1. CREATE/MKNOD - Add xattr
>>>>>>>> 2. RENAME - If moved to different directory, Update PGFID
>>>>>>>> 3. UNLINK - If number of links is more than 1, reduce the number
>>>>>>>> of links; remove respective parent PGFID
>>>>>>>> 4. LINK - Add PGFID if link to different directory, Increment
>>>>>>>> count
>>>>>>>>
>>>>>>>> Advantages:
>>>>>>>> 1. Crawling is limited to a few directories instead of a full file
>>>>>>>> system crawl.
>>>>>>>> 2. Break early during crawl when the search reaches the number of
>>>>>>>> hardlinks given by the xattr value.
>>>>>>>>
>>>>>>>> Disadvantages:
>>>>>>>> 1. Crawling is expensive if a directory has a lot of files.
>>>>>>>> 2. PGFID must be updated on CREATE/MKNOD/RENAME/UNLINK/LINK.
>>>>>>>> 3. This method of conversion will not work if file is deleted.
>>>>>>>>
>>>>>>>> We can improve performance of GFID to Path conversion if we record
>>>>>>>> Basename also in file xattr.
>>>>>>>>
>>>>>>>> trusted.pgfid.3c789e71-24b0-4723-92a2-7eb3c14b4114=f1
>>>>>>>> trusted.pgfid.6c3bf2ea-9b52-4bda-a1db-01f3ed5e3fed=h1
>>>>>>>>
>>>>>>>> Note: Multiple base names are delimited by a zero byte.
>>>>>>>>
>>>>>>>> Additional overhead compared to storing only PGFID:
>>>>>>>> 1. Space
>>>>>>>> 2. Number of xattrs will grow as number of hardlinks
>>>>>>>> 3. Max size issue for xattr value?
>>>>>>>> 4. Xattr update is needed even when renamed within the same
>>>>>>>> directory.
>>>>>>>> 5. Updating value of xattr involves parsing in case of multiple
>>>>>>>> hardlinks.
>>>>>>>>
>>>>>>>> Are there any performance issues except during initial indexing?
>>>>>>>> (Assume pgfid and basenames are populated by a separate script.)
>>>>>>>>
>>>>>>>> Comments and Suggestions Welcome.
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Gluster-devel mailing list
>>>>>>> Gluster-devel at gluster.org
>>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>
>>>>
>>