[Gluster-users] very bad performance on small files

Fri Jan 14 23:26:53 UTC 2011

On 14 Jan 2011, at 23:12, Joe Landman wrote:

> If most of your file access times are dominated by latency (e.g. small, seeky loads), and you are going over a gigabit connection, yeah, your performance is going to crater on any cluster file system.
> 
> Local latency to traverse the storage stack is on the order of 10's of microseconds.  Physical latency of the disk medium is on the order of 10's of microseconds for RAMdisk, 100's of microseconds for flash/ssd, and 1000's of microseconds (i.e. milliseconds) for spinning rust.
> 
> Now take 1 million small file writes.  Say 1024 bytes.  These million writes have to traverse the storage stack in the kernel to get to disk.
> 
> Now add in a network latency event on the order of 1000's of microseconds for the remote storage stack and network stack to respond.
> 
> I haven't measured it yet in a methodical manner, but I wouldn't be surprised to see IOP rates within a factor of 2 of the bare metal for a sufficiently fast network such as Infiniband, and within a factor of 4 or 5 for a slow network like Gigabit.
> 
> Our own experience has been generally that you are IOP constrained because of the stack you have to traverse.  If you add more latency into this stack, you have more to traverse, and therefore you have to wait longer, which magnifies the time taken by small, seeky IO ops (stat, small writes, random ops).

Sure, and all that applies equally to both NFS and gluster, yet in Max's example NFS was ~50x faster than gluster for an identical small-file workload. So what's gluster doing over and above what NFS is doing that's taking so long, given that network and disk factors are equal? I'd buy a factor of 2 for replication, but not 50.
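
To put rough numbers on Joe's latency argument, here is a back-of-envelope sketch. It assumes every operation is fully serialized (no caching, batching or pipelining), and the specific values are just representative picks from the orders of magnitude he quotes, not measurements. The names are mine.

```python
# Representative latencies in microseconds, picked from the orders of
# magnitude quoted above. These are assumptions, not measurements.
STACK_US = 10                                        # local storage stack traversal
MEDIA_US = {"ramdisk": 10, "ssd": 100, "disk": 1000}  # physical medium
NETWORK_US = 1000                                    # remote stack + network response


def serialized_iops(media, remote):
    """Ops/sec when each op must finish before the next one starts."""
    per_op_us = STACK_US + MEDIA_US[media] + (NETWORK_US if remote else 0)
    return 1_000_000 / per_op_us


def seconds_for(n_ops, media, remote):
    """Wall-clock seconds for n_ops serialized operations."""
    return n_ops / serialized_iops(media, remote)


for media in MEDIA_US:
    local = serialized_iops(media, remote=False)
    remote = serialized_iops(media, remote=True)
    print(f"{media:8s} local {local:8.0f} IOPS  remote {remote:6.0f} IOPS  "
          f"slowdown {local / remote:5.1f}x")

# One million small writes to a remote SSD under this model:
print(f"1M writes, remote ssd: {seconds_for(1_000_000, 'ssd', remote=True):.0f} s")
```

Under this model the network term is the same for any network file system, so on its own it can't account for a gap between NFS and gluster.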

In case you missed what I'm on about, it was these stats that Max posted:

> Here are the results per command:
> dd if=/dev/zero of=M/tmp bs=1M count=16384
>     69.2 MB/sec (native), 69.2 MB/sec (FUSE), 52 MB/sec (NFS)
> dd if=/dev/zero of=M/tmp bs=1K count=163840000
>     88.1 MB/sec (native), 1.1 MB/sec (FUSE), 52.4 MB/sec (NFS)
> time tar cf - M | pv > /dev/null
>     15.8 MB/sec (native), 3.48 MB/sec (FUSE), 254 Kb/sec (NFS)
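
For reference, this is roughly how I get the ~50x figure from the two dd rows, plus the per-write time each throughput implies at bs=1K. It's a sketch: I'm assuming dd's "MB" means 10^6 bytes, and I've left the tar row out because its units are mixed.

```python
# Throughput figures from Max's post, in MB/sec (dd rows only).
RESULTS = {
    "dd bs=1M": {"native": 69.2, "fuse": 69.2, "nfs": 52.0},
    "dd bs=1K": {"native": 88.1, "fuse": 1.1, "nfs": 52.4},
}


def ratio(test, fast, slow):
    """How many times faster `fast` was than `slow` for one test."""
    return RESULTS[test][fast] / RESULTS[test][slow]


def implied_us_per_write(mb_per_sec, block_bytes=1024, mb=1_000_000):
    """Average microseconds per write implied by a throughput figure.

    Assumes 1 MB = 10^6 bytes, which may not match how the numbers
    in the post were reported.
    """
    writes_per_sec = mb_per_sec * mb / block_bytes
    return 1_000_000 / writes_per_sec


print(f"NFS vs FUSE at bs=1K: {ratio('dd bs=1K', 'nfs', 'fuse'):.1f}x")
for backend, rate in RESULTS["dd bs=1K"].items():
    print(f"{backend:6s} {implied_us_per_write(rate):7.1f} us per 1K write")
```

The FUSE figure works out to a little under a millisecond per 1K write, which is about one network response per write in Joe's terms, while the NFS figure is far below that. My guess is that NFS isn't paying a round trip for every write, but I haven't measured it.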

In my case I'm running 30k IOPS SSDs over gigabit. At the moment my problem (running 3.0.6) isn't performance but reliability - files are occasionally reported as 'vanished' by front-end apps (like rsync) even though they are present on both backing stores; there are no errors in the gluster logs, and self-heal doesn't help.

Marcus