[Gluster-users] GlusterFS vs Ceph and Hadoop

Sun Dec 29 00:50:36 UTC 2013

On 12/28/2013 08:06 PM, Knut Moe wrote:
> Are there benefits in configuration, scaling, management etc?

Hi,
A year ago I was in the same situation and had to compare these three cluster
file systems. I'm no great expert on this topic, but here's a rough summary of what I know and why we chose GlusterFS over Hadoop or Ceph.

There are major differences between Hadoop/Ceph and GlusterFS.
The first is that GlusterFS distributes whole files and works on top of
existing filesystems. It is completely transparent to the system and
applications.
This adds overhead (at the cost of speed), but gives you more options in
case of errors and for long-term preservation.

Most other storage cluster systems, on the other hand, split each file
into a certain number of blocks and distribute those.
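
The difference between distributing whole files and distributing blocks can
be sketched in a few lines of Python. This is only a toy model to show the
idea, not how GlusterFS or any other system actually places data: the function
names, the md5-modulo placement and the round-robin block layout are all my
own assumptions for illustration.

```python
import hashlib

def place_whole_file(name, bricks):
    """Toy model: pick ONE brick for the entire file by hashing its name."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(bricks)
    return bricks[index]

def place_blocks(size, block_size, nodes):
    """Toy model: cut a file of `size` bytes into fixed-size blocks and
    hand them out round-robin, so one file ends up on several nodes."""
    n_blocks = (size + block_size - 1) // block_size  # round up
    return [nodes[i % len(nodes)] for i in range(n_blocks)]

if __name__ == "__main__":
    bricks = ["brick1", "brick2", "brick3"]  # hypothetical names
    # The whole file lives on a single brick, as a regular file.
    print(place_whole_file("tape_0001.mkv", bricks))
    # The same file as 64 MiB blocks is spread over all nodes.
    print(place_blocks(200 * 2**20, 64 * 2**20, bricks))
```

In the first case you can still find the complete file on the underlying
filesystem of one machine. In the second case no single machine holds a
usable copy without the metadata that says where the blocks are.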

Hadoop is not just some filesystem you can mount (as I understood from
reading Apache's docs): You have to talk to it using the Hadoop API. As
I understood it, you'd have to write your applications to use Hadoop,
instead of just having it as a transparent filesystem beneath. Please
correct me if I'm wrong.
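
To show what "transparent" means in practice: with a mounted volume, an
application needs nothing beyond ordinary file calls. A minimal sketch,
assuming the volume is mounted at a hypothetical path such as /mnt/gluster
(the function names are mine, and nothing here is specific to GlusterFS):

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def archive_file(src, mount_point):
    """Copy src into mount_point with plain file calls and verify the copy.

    mount_point is just a directory as far as this code is concerned;
    whether a cluster filesystem sits behind it is invisible here.
    """
    dst = Path(mount_point) / Path(src).name
    shutil.copy2(src, dst)
    if sha256_of(src) != sha256_of(dst):
        raise IOError(f"checksum mismatch for {dst}")
    return dst
```

Calling archive_file("video.mkv", "/mnt/gluster") would look the same for a
local disk, an NFS share or a mounted cluster volume. That is the point: the
application does not have to be written against a storage-specific API.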

This makes a big difference in the effort involved and in the use cases
each system suits.

What's also special about GlusterFS is that it's really (really) easy
and quick to set up a basic configuration, and it is available in most
distro repositories. There are things you can tweak, but you don't have
to. Any regular Linux admin can understand it quickly.
Just yesterday, I set up Gluster on my Raspberry Pi home fileserver :)

There are many more details about how Hadoop and Ceph "tick"
compared to GlusterFS (e.g. Hadoop's "NameNode"), but I think there are
others here who can explain that far better than I can.

> I know Hadoop is used by the likes of Yahoo and Facebook. I would be
> interested in information on any large (known) users of GlusterFS.

When I was looking for systems for large storage clusters for long-term media 
archiving, I initially thought I'd go for Hadoop, because "the big ones are 
using it". I also listened to a presentation about Ceph, so I could compare them.

In the end, we chose GlusterFS, because for digital archiving we needed a scalable storage cluster that we have the best chance of maintaining over the years, rather than one that minimizes downtime to seconds. We are currently building storage for the national A/V archive with 2x >300 TiB.

I've already done initial tests with GlusterFS on one node, but it's too soon to really speak of "experience" on our side.
We've also only used Gluster in "distribute" mode, so I have no experience with Gluster replication at all.

Regards,
Peter B.