[Gluster-users] small files and cluster/stripe
Craig Carl
craig at gluster.com
Fri May 14 23:20:56 UTC 2010
Jeff -
I've paraphrased Tejas's response here -
1. There is no way to know how big a file will be until the fclose() is received.
2. What would we do about files that change sizes across the cutoff line?
3. We could perhaps add a size parameter to the rebalance/defrag scripts we have.
Would a process that redistributed the file on some sort of a schedule work?
Craig
--
Craig Carl
Gluster, Inc.
Cell - (408) 829-9953 (California, USA)
Gtalk - craig.carl at gmail.com
----- Original Message -----
From: "Jeff Anderson-Lee" <jonah at eecs.berkeley.edu>
To: "Craig Carl" <craig at gluster.com>
Cc: gluster-users at gluster.org
Sent: Thursday, May 13, 2010 6:39:31 PM GMT -08:00 US/Canada Pacific
Subject: Re: [Gluster-users] small files and cluster/stripe
On 5/13/2010 6:24 PM, Craig Carl wrote:
Jeff -
Thanks for your email. I think I've got a grasp of your environment now and I understand the problem. If we create a "/gluster/small_files" and a "/gluster/large_files", your users are unlikely to respect the distinction, plus it is a management nightmare, right?
If you have time I'd like your help writing a feature request that would implement what you need. Something like -
Gluster should provide the option of distributing files based on size to different volumes.
This distribution should be transparent to users.
This distribution only needs to happen the first time a file is written.
The Gluster administrator should have the ability to provide a file size range for each volume.
The different volumes could be different types; mirror, stripe, mirror & distribute, etc.
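To make the size-range idea concrete, here is a minimal sketch of the lookup such a feature would need. It is illustration only and not existing Gluster code: the 1 MiB cutoff, the SIZE_RANGES table, and the volume_for_size() helper are assumptions made up for the example, and the volume names are borrowed from the directory names mentioned above.

```python
# Hypothetical administrator-supplied table: (min_bytes, max_bytes, volume).
# max_bytes of None means "no upper limit". The cutoff value is an assumption.
ONE_MIB = 1024 * 1024
SIZE_RANGES = [
    (0, ONE_MIB, "small_files"),      # e.g. a mirror + distribute volume
    (ONE_MIB, None, "large_files"),   # e.g. a stripe volume
]


def volume_for_size(size, ranges=SIZE_RANGES):
    """Return the name of the volume whose size range contains `size`."""
    for low, high, volume in ranges:
        if size >= low and (high is None or size < high):
            return volume
    raise ValueError("no volume configured for size %d" % size)


print(volume_for_size(4096))           # a 4 KiB file
print(volume_for_size(10 * ONE_MIB))   # a 10 MiB file
```

The lookup itself is trivial; the hard part is that it needs the final file size as input.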
What have I missed?
Craig
That would be one solution. I would target another that I suspect is probably simpler:
Gluster should provide the option of pseudo-randomizing the distribution of file stripes across volumes, so that all small files do not end up on the same subvolume of a cluster/stripe.
This distribution should be transparent to users.
This distribution only needs to happen the first time a file is written and may be based on the file name hash (a la cluster/distribute).
The net behavior could be such that small files (smaller than the stripe block size) would have the same data distribution pattern as they would have with cluster/distribute, while larger files (larger than the stripe block size) would have their upper blocks distributed round-robin from that starting place.
Given that the code already exists for distributing files based on name hash in cluster/distribute, I think this could be an easier feature to add.
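A rough sketch of the placement described above, for illustration only. It is not Gluster code: stripe_layout() is a hypothetical helper, and CRC32 merely stands in for whatever hash cluster/distribute actually uses. The first block goes to a subvolume chosen by the file name hash, and any further blocks continue round-robin from there.

```python
import zlib


def stripe_layout(filename, file_size, block_size, num_subvolumes):
    """Return the subvolume index that holds each stripe block of a file."""
    # Assumption: CRC32 of the name is a stand-in for the real name hash.
    start = zlib.crc32(filename.encode("utf-8")) % num_subvolumes
    # Ceiling division; an empty file still gets one block on its start subvolume.
    num_blocks = max(1, -(-file_size // block_size))
    # Blocks after the first continue round-robin from the hashed start.
    return [(start + i) % num_subvolumes for i in range(num_blocks)]


BLOCK = 128 * 1024  # assumed stripe block size for the example

# A small file occupies a single block on its hashed subvolume.
print(stripe_layout("notes.txt", 2000, BLOCK, 4))
# A large file starts on its hashed subvolume and wraps around the others.
print(stripe_layout("dataset.bin", 6 * BLOCK, BLOCK, 4))
```

With this scheme, files smaller than one block spread across subvolumes according to their names rather than all landing on the first subvolume of the stripe.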
Jeff