[Gluster-devel] GFID2 - Proposal to add extra byte to existing GFID

Mon May 15 12:15:55 UTC 2017

On Tue, Apr 11, 2017 at 2:59 PM, Amar Tumballi <amarts at gmail.com> wrote:

> Comments inline.
>
> On Mon, Dec 19, 2016 at 1:47 PM, Xavier Hernandez <xhernandez at datalab.es>
> wrote:
>
>> On 12/19/2016 07:57 AM, Aravinda wrote:
>>
>>>
>>> regards
>>> Aravinda
>>>
>>> On 12/16/2016 05:47 PM, Xavier Hernandez wrote:
>>>
>>>> On 12/16/2016 08:31 AM, Aravinda wrote:
>>>>
>>>>> Proposal to add one more byte to GFID to store "Type" information.
>>>>> Extra byte will represent type(directory: 00, file: 01, Symlink: 02
>>>>> etc)
>>>>>
>>>>> For example, if a directory GFID is
>>>>> f4f18c02-0360-4cdc-8c00-0164e49a7afd
>>>>> then, GFID2 will be 00f4f18c02-0360-4cdc-8c00-0164e49a7afd.
>>>>>
>>>>> Changes to Backend store
>>>>> ------------------------
>>>>> Existing: .glusterfs/gfid[0:2]/gfid[2:4]/gfid
>>>>> Proposed: .glusterfs/gfid2[0:2]/gfid2[2:4]/gfid2[4:6]/gfid2
>>>>>
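The two backend layouts above can be sketched as plain string slicing. This is only an illustration of the proposal as written: the function names and the `TYPE_PREFIX` table are made up for this example, and the type codes are the ones listed in the proposal (directory 00, file 01, symlink 02).

```python
# Hypothetical helpers; not taken from the gluster sources.
TYPE_PREFIX = {"directory": "00", "file": "01", "symlink": "02"}


def gfid_path(gfid: str) -> str:
    """Existing layout: .glusterfs/gfid[0:2]/gfid[2:4]/gfid"""
    return f".glusterfs/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


def gfid2_path(gfid: str, kind: str) -> str:
    """Proposed layout: .glusterfs/gfid2[0:2]/gfid2[2:4]/gfid2[4:6]/gfid2"""
    gfid2 = TYPE_PREFIX[kind] + gfid  # prepend the extra type byte as two hex digits
    return f".glusterfs/{gfid2[0:2]}/{gfid2[2:4]}/{gfid2[4:6]}/{gfid2}"


example = "f4f18c02-0360-4cdc-8c00-0164e49a7afd"  # GFID used in the proposal
print(gfid_path(example))
print(gfid2_path(example, "directory"))
```

With this layout every directory GFID2 ends up under `.glusterfs/00/`, which is what makes the per-type crawl possible.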
>>>>> Advantages:
>>>>> -----------
>>>>> - Automatic grouping in .glusterfs directory based on file Type.
>>>>> - Easy identification of Type by looking at GFID in logs/status output
>>>>>   etc.
>>>>>
>>>>
> The two points above are good enough reasons to bump up the priority of
> the feature.
>
>
>> - Crawling(Quota/AFR): List of directories can be easily fetched by
>>>>>   crawling `.glusterfs/gfid2[0:2]/` directory. This enables easy
>>>>>   parallel Crawling.
>>>>>
>>>>
> With the current design, we still have to do a distributed readdir() to
> get all the entries in the directory. This layout change, along with the
> proposed DHT2/EHT/DHT2+ layout (the name doesn't matter to me here), where
> directory entries would be created in just one place, should improve
> overall performance.
>
>
>> - Quota - Marker: Marker translator can mark xtime of current file and
>>>>>   parent directory. No need to update xtime xattr of all directories
>>>>>   till root.
>>>>> - Geo-replication: - Crawl can be multithreaded during initial sync.
>>>>>   With marker changes above it will be more effective in crawling.
>>>>>
>>>>>
>
>> Please add any more advantages.
>>>>>
>>>>> Disadvantages:
>>>>> --------------
>>>>> Functionality is not changed with the above change except the length
>>>>> of the ID. I can't think of any disadvantages except the code changes
>>>>> to accommodate this change. Let me know if I missed anything here.
>>>>>
>>>>
>>>> One disadvantage is that 17 bytes is a very ugly number for
>>>> structures. Compilers will add padding that will make any structure
>>>> containing a GFID noticeably bigger. This will also cause trouble in
>>>> all binary formats where a GFID is used, making them incompatible. One
>>>> clear case of this is the XDR encoding of the gluster protocol.
>>>> Currently a GFID is defined this way in many places:
>>>>
>>>>         opaque gfid[16]
>>>>
>>>> This seems to make it quite complex to allow a mix of gluster versions
>>>> in the same cluster (for example in the middle of an upgrade).
>>>>
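The padding concern can be checked with a small sketch. The record below is hypothetical (a GFID followed by one 64-bit field, not a real gluster structure); `ctypes` is used only because it applies the platform's C alignment rules. The XDR helper assumes the rule from RFC 4506 that fixed-length opaque data is padded to a multiple of four bytes.

```python
import ctypes


class Rec16(ctypes.Structure):
    # Hypothetical record: current 16-byte GFID followed by a 64-bit field.
    _fields_ = [("gfid", ctypes.c_uint8 * 16), ("ino", ctypes.c_uint64)]


class Rec17(ctypes.Structure):
    # Same record with the proposed 17-byte GFID.
    _fields_ = [("gfid", ctypes.c_uint8 * 17), ("ino", ctypes.c_uint64)]


def xdr_opaque_size(n: int) -> int:
    """Bytes on the wire for XDR `opaque x[n]` (padded to a multiple of 4)."""
    return (n + 3) // 4 * 4


print("struct sizes:", ctypes.sizeof(Rec16), ctypes.sizeof(Rec17))
print("XDR sizes:   ", xdr_opaque_size(16), xdr_opaque_size(17))
```

The one extra byte costs more than one byte in both cases, because the following field and the XDR stream have to be realigned.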
>>>
> Totally agree with Xavier here. Not in support of adding one more byte.
>
>
>>
>>>> What about this alternative approach:
>>>>
>>>> Based on the RFC4122 [1] that describes the format of a UUID, we can
>>>> define a new structure for new GFID's using the same length.
>>>>
>>>> Currently all GFID's are generated using the "random" method. This
>>>> means that all GFID have this structure:
>>>>
>>>>         xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
>>>>
>>>> Where N can be 8, 9, A or B, and M is 4.
>>>>
>>>> There are some special GFID's that have a M=0 and N=0, for example the
>>>> root GFID.
>>>>
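The M and N positions described above can be inspected with Python's standard `uuid` module. This is a small sketch; `m_and_n` is a name invented here, and the root GFID value is assumed to be the all-zero UUID ending in 1.

```python
import uuid


def m_and_n(gfid: uuid.UUID) -> tuple:
    """Return the M and N hex digits of xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx."""
    text = str(gfid)
    return text[14], text[19]


random_gfid = uuid.uuid4()   # RFC 4122 "random" (version 4) UUID
root_gfid = uuid.UUID(int=1)  # assumed root GFID: 00000000-0000-0000-0000-000000000001

print(random_gfid, m_and_n(random_gfid))  # M is '4', N is one of 8, 9, a, b
print(root_gfid, m_and_n(root_gfid))      # M and N are both '0'
```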
>>>> What I propose is to use a new variant of GFID, for example E or F
>>>> (officially marked as reserved for future definition) or even 0 to 7.
>>>> We could use M as an internal version for the GFID structure (defined
>>>> by ourselves when needed). Then we could use the first 4 or 8 bits of
>>>> each GFID as you propose, without needing to extend the current GFID
>>>> length or risking collisions with existing GFID's.
>>>>
>>>> If we are concerned about the collision probability (quite small but
>>>> still bigger than the current version) because we lose some random
>>>> bits, we could use N = 0..7 and leave M random. This way we get 5 more
>>>> random bits, from which we could use 4 to represent the inode type.
>>>>
>>>> I think this way everything will work smoothly with older versions
>>>> with minimal effort.
>>>>
>>>> What do you think ?
>>>>
>>> That is a really nice suggestion.
>>>
>>> To get the crawling advantages mentioned above, we need to make the
>>> backend store .glusterfs/N/gfid[0:2]/gfid[2:4]/gfid
>>>
>>
>> That's one possibility. Since N will be 4 bits at most, it won't collide
>> with currently existing subdirectories that represent 8 bits. Or we could
>> use M. It all depends on the exact interpretation we give to each field.
>>
>> One suggestion I would make is to define it in a way that we use the
>> minimal amount of bits to represent what we need now but leave space for
>> future extensions. For example creating a "reserved" value for the field.
>>
>>
While discussing this with Aravinda, we realized that if we only change the
UUID generation logic, we don't need to worry about version
incompatibility.

Also, I have a question: what are the chances of UUID collision if we take
just 3 bits from the first byte?

000 - Unspecified (can be anything).
001 - Directory
010 - Regular File
011 - Special files (symlink, Block and Char devices, socket files etc).
{100 - 111} - Reserved.
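
A rough way to answer the collision question is the birthday-bound approximation. The sketch below assumes that a version 4 UUID has 122 random bits today, that fixing 3 bits leaves 119, and that all GFIDs are of the same type (the worst case). The volume size is a made-up figure.

```python
import math


def collision_probability(n_ids: int, random_bits: int) -> float:
    """Birthday-bound approximation: 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = n_ids * (n_ids - 1) / (2 * 2 ** random_bits)
    return -math.expm1(-exponent)  # expm1 keeps precision for tiny values


N_IDS = 10 ** 12  # hypothetical number of GFIDs in one volume

p_now = collision_probability(N_IDS, 122)    # current random GFIDs
p_typed = collision_probability(N_IDS, 119)  # 3 bits reserved for the type
print(f"current: {p_now:.3e}  typed: {p_typed:.3e}  ratio: {p_typed / p_now:.2f}")
```

Under these assumptions each reserved bit doubles the collision probability, so three bits make it about eight times larger, while the absolute value stays very small.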

As a side effect, it reduces the number of directories created as metadata
inside the .glusterfs directory (to 50% of the current load).
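
A minimal sketch of such a generator, assuming the 3 type bits are the top three bits of the first byte (the mail does not say which bits) and that the rest of the UUID stays a normal version 4 UUID. The names `TYPE_BITS`, `new_typed_gfid` and `gfid_kind` are invented for this example.

```python
import uuid

# Type codes from the list above; {100 - 111} stay reserved.
TYPE_BITS = {"unspecified": 0b000, "directory": 0b001,
             "regular": 0b010, "special": 0b011}
KIND_BY_BITS = {bits: kind for kind, bits in TYPE_BITS.items()}


def new_typed_gfid(kind: str) -> uuid.UUID:
    """Random UUID whose first byte carries the type in its top 3 bits."""
    raw = bytearray(uuid.uuid4().bytes)
    raw[0] = (TYPE_BITS[kind] << 5) | (raw[0] & 0x1F)  # keep 5 random bits
    return uuid.UUID(bytes=bytes(raw))


def gfid_kind(gfid: uuid.UUID) -> str:
    """Decode the type from the top 3 bits of the first byte."""
    return KIND_BY_BITS.get(gfid.bytes[0] >> 5, "reserved")


# First-level directories (.glusterfs/gfid[0:2]) reachable with these codes.
first_level = {(bits << 5) | low for bits in TYPE_BITS.values() for low in range(32)}

gfid = new_typed_gfid("directory")
print(gfid, gfid_kind(gfid), len(first_level))
```

With only codes 000 to 011 in use, the first byte stays in the range 00 to 7f, so 128 of the 256 possible first-level directories are used, which matches the 50% figure.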

-Amar

> Proposal:
>>
>> Use N = 00xx for special GFID's, like NULL GFID, or the ones currently
>> used in some places. All these will also have M = 0. All other values of M
>> will be reserved for future extensions.
>>
>> Also reserve all other values of N (01xx) for future extensions.
>>
>> This gives a lot of space to represent many things in the future if
>> necessary, while keeping current usage compatible with it.
>>
>> For this particular case we could use N = 0000 and define M as (this is a
>> mapping of the posix S_IFxxx values):
>>
>> M = 0000 Current special GFID's
>> M = 0001 Fifo (S_IFIFO)
>> M = 0010 Character Device (S_IFCHR)
>> M = 0100 Directory (S_IFDIR)
>> M = 0110 Block Device (S_IFBLK)
>> M = 1000 Regular File (S_IFREG)
>> M = 1010 Symbolic Link (S_IFLNK)
>> M = 1100 Socket (S_IFSOCK)
>>
>> M = xx11 \
>> M = x1x1  | Reserved for future extensions
>> M = 1xx1  |
>> M = 111x /
>>
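The M values in the table above are the top four bits of the POSIX file type field, so they can be derived from `st_mode` directly. A small sketch using Python's `stat` module follows; `m_from_mode` is a name made up here, and the shift by 12 assumes the usual Linux encoding of the S_IFxxx constants.

```python
import stat


def m_from_mode(mode: int) -> int:
    """Map an st_mode value to the 4-bit M field (file type bits >> 12)."""
    return stat.S_IFMT(mode) >> 12


for name in ("S_IFIFO", "S_IFCHR", "S_IFDIR", "S_IFBLK",
             "S_IFREG", "S_IFLNK", "S_IFSOCK"):
    print(f"M = {m_from_mode(getattr(stat, name)):04b}  {name}")
```

Permission bits are masked out by `stat.S_IFMT`, so a value taken straight from `os.stat()` can be passed in.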
>> If we use our own mapping instead of using the same values as the S_IFxxx
>> macros, we can get a more compact representation if needed.
>>
>> In this case the directory structure could be
>> .glusterfs/M/gfid[0:2]/gfid[2:4]/gfid. And use M = 0 to put all current
>> existing gfid's, or we could leave existing gfid's in their current
>> location.
>>
>> Or we could even have .glusterfs/NM/gfid[0:2]/gfid[2:4]/gfid. This would
>> probably be compatible even with future extensions.
>>
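Both directory layouts can be sketched by reading M and N from the GFID string. Naming the top-level directory with the hex digits of M (or N followed by M) is an assumption, since the mail does not fix the naming. The example GFID is made up: it is the one from the original proposal with N set to 0, which happens to give M = 4 (directory).

```python
def path_by_m(gfid: str) -> str:
    """.glusterfs/M/gfid[0:2]/gfid[2:4]/gfid, with M as one hex digit."""
    m = gfid[14]  # first digit of the third group
    return f".glusterfs/{m}/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


def path_by_nm(gfid: str) -> str:
    """.glusterfs/NM/gfid[0:2]/gfid[2:4]/gfid, with N and M as hex digits."""
    m, n = gfid[14], gfid[19]  # N is the first digit of the fourth group
    return f".glusterfs/{n}{m}/{gfid[0:2]}/{gfid[2:4]}/{gfid}"


example = "f4f18c02-0360-4cdc-0c00-0164e49a7afd"  # hypothetical GFID, N = 0, M = 4
print(path_by_m(example))
print(path_by_nm(example))
```

A one-character directory name cannot clash with the existing two-character `gfid[0:2]` directories, while the two-character NM form could, so the NM variant would need a naming scheme that keeps the two apart.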
>>
> I would go with only 'M' being considered for the current layout, keeping
> N for future developments. Even though we are not considering 'N'
> internally, we can keep the directory name as '00MM' (zero zero M M), so
> that the backend layout stays compatible if we consider N later.
>
> One major thing is we need a solid plan for migration from current layout
> to newer layout.
>
> Regards,
> Amar
>
>
>> Xavi
>>
>>
>>
>>
>>>> Xavi
>>>>
>>>> [1] https://www.ietf.org/rfc/rfc4122.txt
>>>>
>>>>
>>>>> Changes:
>>>>> ---------
>>>>> - Code changes to accommodate 17 bytes GFID instead of 16 bytes(Read
>>>>>   and Write)
>>>>> - Migration Tool to upgrade GFIDs in Volume/Cluster
>>>>>
>>>>> Let me know your thoughts.
>>>>>
>>>>>
>>>>
>>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>

-- 
Amar Tumballi (amarts)