[Gluster-users] Small files

Matan Safriel dev.matan at gmail.com
Fri Jan 30 16:05:48 UTC 2015


Thanks Liam,

Indeed, I started this journey from that blog post. I learned a lot from
your last reply, and it basically confirms my impression that gluster has
the potential to be better at small files. However, I'm quite reluctant to
delve into testing it, as the status of some blocking (?) development items,
mentioned by John on this thread, remains unknown to me.

I will use replication at the application level until I find a file system
with reliable semantics around mass replication of small files. It will be
slower, but predictable, and it won't require much back-and-forth on user
groups for every step....

Also, I am looking at a cloud setup, and have less freedom in choosing
communication speeds and bandwidth between nodes than you'd get in your
own datacenter (I think).

Matan


On Fri, Jan 30, 2015 at 4:28 AM, Liam Slusser <lslusser at gmail.com> wrote:

> Matan -
>
> We replicate to two nodes.  But since a zfs send | zfs recv communicates
> one-way, I'd think you could do as many as you want.  It just might take a
> little bit longer - although you should be able to run multiple at a time
> as long as you have enough bandwidth over the network.  Ours are connected
> via a dedicated 10-gigabit network and see around 4-5 Gbit/sec on a large
> commit.  How long the replication job takes depends on how much has changed
> between the two snapshots.
>
> Even though the seek time with an SSD is quick, you'll still get far
> greater throughput with sequential reads/writes than with small random
> accesses.
>
> You can test it yourself.  Create a directory with 100 64MB files and
> another directory with 64,000 100KB files.  Now copy each from one place to
> another and see for yourself which is faster.  Sequential reading always
> wins.  And this is true with both Gluster and HDFS.
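
A minimal Python sketch of that experiment, scaled down so it finishes
quickly. The function names and the default sizes are illustrative
assumptions and not from the thread; to reproduce the sizes described above,
pass large=(100, 64 * 1024 * 1024) and small=(64000, 100 * 1024), and point
`root` at the filesystem under test. Page caching can mask the difference on
a local disk, so treat the timings as indicative only.

```python
import os
import shutil
import tempfile
import time


def make_files(directory, count, size_bytes):
    """Create `count` files of `size_bytes` random bytes each in `directory`."""
    os.makedirs(directory, exist_ok=True)
    for i in range(count):
        with open(os.path.join(directory, f"file_{i:06d}.bin"), "wb") as f:
            f.write(os.urandom(size_bytes))


def timed_copy(src, dst):
    """Copy the tree `src` to `dst` and return the elapsed seconds."""
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start


def compare(root, large=(4, 1024 * 1024), small=(1024, 4 * 1024)):
    """Build two trees of equal total size under `root` and time a copy of each.

    `large` and `small` are (file_count, bytes_per_file) pairs.
    The defaults total 4 MiB per tree (an assumption chosen for speed).
    """
    results = {}
    for name, (count, size) in (("large", large), ("small", small)):
        src = os.path.join(root, name + "_src")
        make_files(src, count, size)
        results[name] = timed_copy(src, os.path.join(root, name + "_dst"))
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for name, seconds in compare(root).items():
            print(f"{name}: {seconds:.3f} s")
```
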
>
> In HDFS, small files exacerbate the problem because you need to contact
> the NameNode to get the block information and then contact the DataNode to
> get the block.  Think of it like this: reading 1000 64KB files in HDFS
> means 1000 requests to the NameNode and 1000 requests to the DataNodes,
> while reading one 64MB file is one trip to the NameNode and one trip to the
> DataNode to get the same amount of data.
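
The request counts above can be written down as a small Python sketch. This
is a deliberately simplified model (one NameNode lookup per file, one
DataNode fetch per block), and the function name is made up for
illustration; a real HDFS client has more going on than this.

```python
import math


def hdfs_read_requests(file_count, file_size_bytes,
                       block_size_bytes=64 * 1024 * 1024):
    """Estimate round trips under a simplified model (an assumption):
    one NameNode lookup per file, one DataNode fetch per block."""
    blocks_per_file = max(1, math.ceil(file_size_bytes / block_size_bytes))
    return {
        "namenode": file_count,
        "datanode": file_count * blocks_per_file,
    }


# 1000 files of 64KB versus one 64MB file: about the same amount of data.
many_small = hdfs_read_requests(1000, 64 * 1024)
one_large = hdfs_read_requests(1, 64 * 1024 * 1024)
print(many_small)  # {'namenode': 1000, 'datanode': 1000}
print(one_large)   # {'namenode': 1, 'datanode': 1}
```
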
>
> You can read more about this issue here:
> http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>
> thanks,
> liam
>
> On Thu, Jan 29, 2015 at 12:30 PM, Matan Safriel <dev.matan at gmail.com>
> wrote:
>
>> Hi Liam,
>>
>> Thanks for the comprehensive reply (!)
>> How many nodes do you safely replicate to with ZFS?
>> By the way, I don't think seek time is much of a concern with SSDs, so it
>> does seem that glusterfs is much better for the small files scenario than
>> HDFS, which as you say is very different in key aspects. I couldn't quite
>> follow why rebalancing is slow, or slower than in the case of HDFS,
>> unless you just meant that HDFS works at a large block level and no more.
>>
>> Perhaps you'd care to comment ;)
>>
>> Matan
>>
>> On Thu, Jan 29, 2015 at 9:15 PM, Liam Slusser <lslusser at gmail.com> wrote:
>>
>>> Matan - I'll do my best to take a shot at answering this...
>>>
>>> They're completely different technologies.  HDFS is not POSIX compliant
>>> and is not a "mountable" filesystem, while Gluster is.
>>>
>>> In HDFS land, every file, directory and block is represented as an
>>> object in the namenode’s memory, each of which occupies about 150 bytes.
>>> So 10 million files, each using one block, would eat up about 3 gigs of
>>> memory.  Furthermore, HDFS was designed for streaming large files - the
>>> default blocksize in HDFS is 64MB.
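
The memory figure can be checked with a few lines of Python. This is a
back-of-the-envelope sketch: the 150 bytes per object is the rule of thumb
quoted above, and one block per file is an assumption that holds for files
smaller than the block size. The function name is made up for illustration.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted above, not a measurement


def namenode_memory_bytes(files, blocks_per_file=1, directories=0):
    """Rough namenode heap estimate: one object per file, block and directory."""
    objects = files + files * blocks_per_file + directories
    return objects * BYTES_PER_OBJECT


# 10 million small files, one block each: 20 million objects.
estimate = namenode_memory_bytes(10_000_000)
print(estimate / 1e9, "GB")  # 3.0 GB
```
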
>>>
>>> Gluster doesn't have a central namenode, so having millions of files
>>> doesn't put a tax on it in the same way.  But, again, small files cause
>>> lots of small seeks to handle the replication tasks/checks, which
>>> generally isn't very efficient.  So don't expect blazing performance...
>>> Rebalancing and rebuilding Gluster bricks can be extremely painful since
>>> Gluster isn't a block-level filesystem - so it has to read each file
>>> one at a time.
>>>
>>> If you want to use HDFS and don't need a mountable filesystem, have a
>>> look at HBase.
>>>
>>> We tackled the small files problem by using a different technology.  I
>>> have an image store of 120 million+ small-file images.  I needed a
>>> "mountable" filesystem which was POSIX compliant and ended up doing a ZFS
>>> setup - using the built-in replication to create a few identical copies on
>>> different servers for both load balancing and reliability.  So we update
>>> one server and then have a few read-only copies serving the data.  Changes
>>> get replicated, at a block level, every few minutes.
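
A rough Python sketch of what one such replication cycle might look like.
It only builds the command strings and does not run them. The dataset,
snapshot and host names are hypothetical, and sending over ssh is an
assumption; the thread does not say how the stream is transported.

```python
import shlex


def snapshot_command(dataset, snapshot):
    """Command that takes a new snapshot, e.g. tank/images@t1."""
    return shlex.join(["zfs", "snapshot", f"{dataset}@{snapshot}"])


def replication_pipelines(dataset, prev_snapshot, new_snapshot, replica_hosts):
    """One `zfs send -i ... | ssh host zfs recv ...` pipeline per replica.

    The incremental stream (-i) carries only what changed between the two
    snapshots, so each replica must already hold `prev_snapshot`.
    """
    send = shlex.join([
        "zfs", "send", "-i",
        f"{dataset}@{prev_snapshot}",
        f"{dataset}@{new_snapshot}",
    ])
    return [
        f"{send} | " + shlex.join(["ssh", host, "zfs", "recv", dataset])
        for host in replica_hosts
    ]


# Hypothetical names, for illustration only.
print(snapshot_command("tank/images", "t1"))
for pipeline in replication_pipelines("tank/images", "t0", "t1",
                                      ["replica1", "replica2"]):
    print(pipeline)
```

Since the send side is one-way, the same pattern extends to more replicas by
adding hosts to the list, at the cost of more network bandwidth.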
>>>
>>> thanks,
>>> liam
>>>
>>>
>>> On Thu, Jan 29, 2015 at 4:29 AM, Matan Safriel <dev.matan at gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>>
>>>>> Is glusterfs much better than hdfs for the many small files scenario?
>>>>>
>>>>> Thanks,
>>>>> Matan
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>>
>>>
>>>
>>
>