[Gluster-users] Very slow Samba Directory Listing when many files or sub-directories.

Wed Feb 26 06:19:22 UTC 2014

Hello Jeff,

First of all, thank you for your work here.  I appreciate anyone dissecting a performance issue.  I do have a few thoughts, and a few asks of you, if you don't mind.

> 
> I have a problem with very slow Windows Explorer browsing
> when there are a large number of directories/files.
> In this case, the top level folder has almost 6000 directories,
> admittedly large, but it works almost instantaneously when a
> Windows Server share was being used.
> Migrating to a Samba/GlusterFS share, there is almost a 20
> second delay while the explorer window populates the list.
> This leaves a bad impression on the storage performance. The
> systems are otherwise idle.
> To isolate the cause, I've eliminated everything, from
> networking, Windows, and have narrowed in on GlusterFS
> being the sole cause of most of the directory lag.
> I was optimistic on using the GlusterFS VFS libgfapi instead
> of FUSE with Samba, and it does help performance
> dramatically in some cases, but it does not help (and
> sometimes hurts) when compared to the CIFS FUSE mount
> for directory listings.

This should be investigated further.

> 
> NFS for directory listings, and small I/O's seems to be
> better, but I cannot use NFS, as I need to use CIFS for
> Windows clients, need ACL's, Active Directory, etc.

Understood.

> 
> Directory listing of 6000 empty directories ('stat-prefetch'
> is on):
> 
> Directory listing the ext4 mount directly is almost
> instantaneous of course.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> >/dev/null
> real 0m41.235s (Throw away first time for ext4 FS cache population?)
> # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> >/dev/null
> real 0m0.110s
> # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> >/dev/null
> real 0m0.109s

The cache population time matches what I'd expect, for hitting this data cold.

> Directory listing the NFS mount is also very fast.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> real 0m44.352s (Throw away first time for ext4 FS cache population?)
> # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> real 0m0.471s
> # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> real 0m0.114s

Note that the last measurement (0.114s) is within a few milliseconds of the ext4 times (0.109s).  That looks "right" to me.

Now, I'd be interested in what happens if you wait ~30m and try it again with the caches warm.  (ls -l the local directory on the brick, to warm the cache.)

That should expire the NFS cache, and give an idea of how fast things are with the protocol overhead pulled out.

My guess: ~4s.  See below for why.
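
A minimal sketch of that retest, in case it saves some typing.  It assumes the brick and the NFS mount are both reachable from the same host (as in your transcripts) and that GNU date is available for sub-second timestamps.  The function names are mine, not part of any tool.

```shell
# Print the wall-clock seconds taken by `ls -l DIR` (output discarded).
# Assumes GNU date, which supports %N (nanoseconds).
time_listing() {
    tl_start=$(date +%s.%N)
    ls -l "$1" >/dev/null
    tl_end=$(date +%s.%N)
    awk -v s="$tl_start" -v e="$tl_end" 'BEGIN { printf "%.3f\n", e - s }'
}

# Wait for the client-side cache to expire, warm the brick's cache by
# listing it locally, then time one listing through the mount.
# The wait defaults to 1800 seconds (the ~30m suggested above).
retest_after_expiry() {
    ra_brick=$1
    ra_mount=$2
    ra_wait=${3:-1800}
    sleep "$ra_wait"
    ls -l "$ra_brick" >/dev/null
    time_listing "$ra_mount"
}
```

With your paths that would be: retest_after_expiry /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs /mnt/nas-cbs-0005/cifs_share/manydirs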

> Directory listing the CIFS FUSE mount is so slow, almost 16
> seconds!
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> real 0m56.573s (Throw away first time for ext4 FS cache population?)
> # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> real 0m16.101s
> # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> real 0m15.986s

> Directory listing the CIFS VFS libgfapi mount is about twice
> as fast as FUSE, but still slow at 8 seconds.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> real 0m48.839s (Throw away first time for ext4 FS cache population?)
> # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> real 0m8.197s
> # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> real 0m8.450s

Looking at the numbers, it looks like the server is being consulted over the network on every listing, and the data pulled back across each time.

So let's do some quick math:

FUSE:

56s for the initial read.
16s for each read thereafter.
---
40s of cache population time.

VFS Module:

48s for the initial read.
8s for each read thereafter.
---
40s of cache population time.

The fact that the cache population time drops out as a constant (~40s in both cases) tells me that the 16s and 8s are the real per-listing cost: the client is likely re-reading the data over the network each time, and not caching it.
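
The same subtraction with the unrounded figures from your stat-prefetch=on runs, as a small sketch (the helper name is made up for illustration).  The differences all land at roughly 40-41s, which is about what the local ext4 listing costs cold.

```shell
# Cold-cache time minus warm-cache time, in seconds (3 decimals).
cold_minus_warm() {
    awk -v c="$1" -v w="$2" 'BEGIN { printf "%.3f\n", c - w }'
}

# First (cold) and second (warm) timings quoted above.
cold_minus_warm 41.235 0.110    # ext4 brick, listed locally
cold_minus_warm 56.573 16.101   # CIFS over the FUSE mount
cold_minus_warm 48.839 8.197    # CIFS over VFS/libgfapi
```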

That should be controllable via mount parameters in mount.cifs.  Now, that doesn't mean that Samba taking 8s to do the actual work, and NFS taking (by my guesstimate) 4s, is actually good.  But it certainly puts the performance in another light.
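
As a starting point for experimenting with those mount parameters, here is a sketch that only prints a mount command rather than running it.  The one option it sets, actimeo (attribute cache timeout in seconds), is documented in the mount.cifs man page; whether raising it actually changes your steady-state listing time is untested on my side, and the share, mount point and value are whatever you pass in.

```shell
# Print (do not run) a mount.cifs command line with an explicit attribute
# cache timeout. `actimeo` is a mount.cifs option, in seconds.
# Arguments: SHARE MOUNTPOINT ACTIMEO [EXTRA_OPTIONS]
cifs_mount_cmd() {
    cm_share=$1           # e.g. //server/share (placeholder)
    cm_mnt=$2
    cm_actimeo=$3
    cm_extra=${4:+,$4}    # optional, e.g. credentials=FILE
    printf 'mount -t cifs %s %s -o actimeo=%s%s\n' \
        "$cm_share" "$cm_mnt" "$cm_actimeo" "$cm_extra"
}
```

Remounting with, say, a 1-second and then a 60-second timeout and repeating the warm ls -l timings would show whether the 16s and 8s are caching-related.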

> ####################
> 
> Retesting directory list with Gluster default settings,
> including 'stat-prefetch' off, due to:
> 
> Bug 1004327 - New files are not inheriting ACL from parent directory
> unless "stat-prefetch" is off for the respective gluster
> volume
> https://bugzilla.redhat.com/show_bug.cgi?id=1004327
> 
> # gluster volume info nas-cbs-0005
> 
> Volume Name: nas-cbs-0005
> Type: Distribute
> Volume ID: 5068e9a5-d60f-439c-b319-befbf9a73a50
> Status: Started
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: 192.168.5.181:/exports/nas-segment-0004/nas-cbs-0005
> Options Reconfigured:
> performance.stat-prefetch: off
> server.allow-insecure: on
> nfs.rpc-auth-allow: *
> nfs.disable: off
> nfs.addr-namelookup: off
> 
> Directory listing of 6000 empty directories ('stat-prefetch'
> is off):
> 
> Accessing the ext4 mount directly is almost instantaneous of
> course.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> >/dev/null
> real 0m39.483s (Throw away first time for ext4 FS cache population?)
> # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> >/dev/null
> real 0m0.136s
> # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> >/dev/null
> real 0m0.109s
> 
> Accessing the NFS mount is also very fast.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> real 0m43.819s (Throw away first time for ext4 FS cache population?)
> # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> real 0m0.342s
> # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> real 0m0.200s
> 
> Accessing the CIFS FUSE mount is slow, almost 14 seconds!
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> real 0m55.759s (Throw away first time for ext4 FS cache population?)
> # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> real 0m13.458s
> # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> real 0m13.665s
> 
> Accessing the CIFS VFS libgfapi mount is now about twice as
> slow as FUSE, at almost 26 seconds due to 'stat-prefetch'
> being off!
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> real 1m2.821s (Throw away first time for ext4 FS cache population?)
> # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> real 0m25.563s
> # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> real 0m26.949s

This data is all showing the same behaviors I described above.  I expect the NFS client to win.

The likely reason for the difference in the performance of FUSE vs. libgfapi here is: Caching.

> ####################

> 4KB Writes:
> 
> NFS very small block writes are very slow at about 4 MB/sec.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> time to transfer data was 20.450521 secs, 4.10 MB/sec
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> time to transfer data was 19.669923 secs, 4.26 MB/sec
> 
> CIFS FUSE very small block writes are faster, at about 11
> MB/sec.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> of=/mnt/nas-cbs-0005-cifs/testfile count=20k
> time to transfer data was 7.247578 secs, 11.57 MB/sec
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> of=/mnt/nas-cbs-0005-cifs/testfile count=20k
> time to transfer data was 7.422002 secs, 11.30 MB/sec
> 
> CIFS VFS libgfapi very small block writes are twice as fast
> as CIFS FUSE, at about 22 MB/sec.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> time to transfer data was 3.766179 secs, 22.27 MB/sec
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> time to transfer data was 3.761176 secs, 22.30 MB/sec

I'm betting if you look at the back-end during the NFS transfer, it is at 100%.
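
One way to check that, sketched under the assumption that sysstat's iostat is installed on the brick server and that %util is the last column of its extended device report (true of the versions I know, but worth confirming on yours).  The filter name is made up.

```shell
# Read `iostat -dx` style output on stdin and print each device whose
# last column (assumed to be %util) is at or above THRESHOLD (default 95).
busy_devices() {
    awk -v t="${1:-95}" '
        $1 ~ /^Device/ { in_table = 1; next }   # header row starts a report
        NF == 0        { in_table = 0; next }   # blank line ends it
        in_table && $NF + 0 >= t { print $1, $NF }
    '
}
```

Run it on the brick server while the sgp_dd write test is going, for example: iostat -dx 1 20 | busy_devices 95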

For Samba, it is likely that SMB, being a chattier protocol, is what is holding it back.

> 4KB Reads:
> 
> NFS very small block reads are very fast at about 346
> MB/sec.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> time to transfer data was 0.244960 secs, 342.45 MB/sec
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> time to transfer data was 0.240472 secs, 348.84 MB/sec
> 
> CIFS FUSE very small block reads are less than half as fast
> as NFS, at about 143 MB/sec.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> if=/mnt/nas-cbs-0005-cifs/testfile count=20k
> time to transfer data was 0.606534 secs, 138.30 MB/sec
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> if=/mnt/nas-cbs-0005-cifs/testfile count=20k
> time to transfer data was 0.576185 secs, 145.59 MB/sec
> 
> CIFS VFS libgfapi very small block reads a slight bit slower
> than CIFS FUSE, at about 137 MB/sec.
> 
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> time to transfer data was 0.611328 secs, 137.22 MB/sec
> # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> time to transfer data was 0.615834 secs, 136.22 MB/sec

I'm not totally shocked by Samba's bad performance here.

It matches what I've seen in other scenarios.

There may be things that can be done to speed things up.  But I think the issue is probably Samba+CIFS client.  (And I truly expect it is Samba, from my experience with Windows.)

I'd encourage you to also do a raw test of Samba+CIFS client without Gluster in the mix.  It would help confirm some of these thoughts.
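
For that raw test, a sketch of a share definition that points Samba at a plain local directory, with no Gluster underneath.  "path" and "read only" are standard smb.conf parameters; the share name and directory are placeholders you would choose, and the helper itself is only for illustration.

```shell
# Print an smb.conf share stanza for a plain local directory, so the same
# listing test can be run through Samba without Gluster in the path.
# Arguments: SHARE_NAME DIRECTORY
plain_share_stanza() {
    printf '[%s]\n' "$1"
    printf '    path = %s\n' "$2"
    printf '    read only = no\n'
}
```

Append the output to smb.conf on the test server, reload Samba, create the same 6000 directories under that path, mount the share with mount.cifs and repeat the ls -l timings.  Putting the directory on the same ext4 filesystem as the brick keeps the comparison fair.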

I'd also encourage you to open a gluster BZ, so we don't lose track of this data.  Alas, I know we can't act on it immediately.

Also, any reproduction and setup scripts you are using would be GREAT to put in that BZ.  It will really help us reproduce the issue.

Thanks,

-Ira / ira@(redhat.com|samba.org)