[Gluster-users] Shard storage suggestions

Krutika Dhananjay kdhananj at redhat.com
Tue Jul 19 07:23:29 UTC 2016


Please find my response inline:


On Mon, Jul 18, 2016 at 4:03 PM, Gandalf Corvotempesta
<gandalf.corvotempesta at gmail.com> wrote:

> 2016-07-18 12:25 GMT+02:00 Krutika Dhananjay <kdhananj at redhat.com>:
> > Hi,
> >
> > The suggestion you gave was in fact considered at the time of writing
> > the shard translator.
> > Here are some of the considerations for sticking with a single
> > directory, as opposed to a two-tier classification of shards based on
> > the initial chars of the uuid string:
> > i) Even for a 4TB disk with the smallest possible shard size of 4MB,
> > there will only be a max of 1048576 entries under /.shard in the worst
> > case - a number far less than the max number of inodes supported by
> > most backend file systems.
>
> That is with just one single file.
> What about thousands of huge sharded files? In a petabyte-scale cluster,
> having thousands of huge files should be considered normal.
>
> > iii) Resolving shards from the original file name as given by the
> > application to the corresponding shard within a single directory
> > (/.shard in the existing case) would mean looking up the parent dir
> > /.shard first, followed by a lookup on the actual shard that is to be
> > operated on. But having a two-tier sub-directory structure means that
> > we not only have to resolve (or look up) /.shard first, but also the
> > directories '/.shard/d2', '/.shard/d2/18', and
> > '/.shard/d2/18/d218cd1c-4bd9-40d7-9810-86b3f7932509' before finally
> > looking up the shard, which is a lot of network operations.
> > Yes, these are all one-time operations and the results can be cached
> > in the inode table, but still, on account of having to have dynamic
> > gfids (as opposed to just /.shard, which has a fixed gfid -
> > be318638-e8a0-4c6d-977d-7a937aa84806), it is no longer trivial to
> > resolve the name of a shard to its gfid, or the parent name to the
> > parent gfid, _even_ in memory.
>
> What about just one single level?
>
> /.shard/d218cd1c-4bd9-40d7-9810-86b3f7932509/d218cd1c-4bd9-40d7-9810-86b3f7932509.1
> ?
>
> You have the GFID, thus there is no need to crawl multiple levels,
> just direct access to the proper path.
>
> With this solution, you have 1,048,576 entries for a single 4TB sharded
> file with a 4MB shard size.
> With the current implementation, you have 1,048,576 for each sharded
> file. If I have 100 4TB files, I'll end up
> with 1,048,576 * 100 = 104,857,600 files in a single directory.
>
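
Just so we are comparing the same two things, here is a rough sketch of
what a single shard's path would look like under each scheme. The gfid and
block number below are only illustrative; this is a model of the naming
convention being discussed, not code taken from the shard translator.

    # Rough model of the two naming schemes discussed in this thread.
    base_gfid = "d218cd1c-4bd9-40d7-9810-86b3f7932509"  # gfid of the original file
    block_num = 1                                        # first shard beyond the base file

    # Current layout: all shards of all files sit directly under /.shard
    flat_path = "/.shard/{0}.{1}".format(base_gfid, block_num)

    # Proposed layout: one sub-directory per sharded file, named after its gfid
    nested_path = "/.shard/{0}/{0}.{1}".format(base_gfid, block_num)

    print(flat_path)    # /.shard/<gfid>.1
    print(nested_path)  # /.shard/<gfid>/<gfid>.1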

No. Note that all files and directories under /.shard are just normal
files and directories as far as the file system is concerned; it is only
the shard translator that interprets this layout in a special way. What
this means is that every file and directory at and under /.shard needs
its own unique gfid. In other words, for the layout you suggested:

/.shard/d218cd1c-4bd9-40d7-9810-86b3f7932509/d218cd1c-4bd9-40d7-9810-86b3f7932509.1

the sub-directory /.shard/d218cd1c-4bd9-40d7-9810-86b3f7932509/ cannot
have the gfid d218cd1c-4bd9-40d7-9810-86b3f7932509, because that gfid is
already assigned to the original file whose gfid the rest of the shards
are named after. So the sub-directory would be assigned a new gfid of its
own, as would the file d218cd1c-4bd9-40d7-9810-86b3f7932509.1. Does that
make sense? You can actually create a file, do enough writes that it gets
sharded, then look under the /.shard directory and see how each block
file (<gfid>.N where N=1,2,3...) gets a new gfid. That should give you a
clearer picture of how inodes and gfids work in gluster.
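
If it helps, here is a tiny model of that point. It assumes nothing more
than that every entry the brick file system sees needs its own gfid; the
uuids are generated on the spot, purely for illustration.

    import uuid

    base_gfid = uuid.uuid4()   # gfid of the original (sharded) file

    # In the layout you proposed, the per-file sub-directory cannot reuse
    # base_gfid (that already identifies the original file), so it gets a
    # fresh gfid, and so does every shard file created inside it.
    subdir_gfid = uuid.uuid4()
    shard_gfids = {"{0}.{1}".format(base_gfid, n): uuid.uuid4() for n in range(1, 4)}

    print("original file :", base_gfid)
    print("sub-directory :", subdir_gfid)
    for name, gfid in sorted(shard_gfids.items()):
        print(name, "->", gfid)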

I agree with the math you did above. It's just that I am not particularly
sure to what extent this new classification approach would actually
improve performance, especially when weighed against the price we pay in
the number of lookups needed to initially resolve shards under the new
scheme.
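
For reference, a back-of-the-envelope sketch of both the entry counts and
the lookup trade-off. The lookup counts simply count path components below
the volume root and ignore caching, so treat them as an assumption for the
sake of comparison, not a measurement.

    # Entry counts from the math above: a 4TB file sharded at 4MB.
    file_size        = 4 * 2**40            # 4TB
    shard_block_size = 4 * 2**20            # 4MB
    shards_per_file  = file_size // shard_block_size
    print(shards_per_file)                  # 1048576

    num_files = 100
    print(num_files * shards_per_file)      # 104857600 entries under a flat /.shard

    # Cold-cache lookups needed to resolve one shard, one per path component:
    layouts = {
        "current flat layout":     ["/.shard", "<gfid>.N"],
        "one-level proposal":      ["/.shard", "<gfid>", "<gfid>.N"],
        "two-tier classification": ["/.shard", "d2", "18", "<gfid>", "<gfid>.N"],
    }
    for name, components in layouts.items():
        print(name, "->", len(components), "lookups")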

-Krutika



>
> > Are you unhappy with the performance? What's your typical VM image size,
> > shard block size and the capacity of individual bricks?
>
> No, I'm just thinking about this optimization.
>