[Gluster-users] Very slow Samba Directory Listing when many files or sub-directories.

Jeff Byers jbyers.sfly at gmail.com
Wed Feb 26 18:34:51 UTC 2014


Hi Ira, Vivek,

Thanks, I will open a BZ sometime today.

Ira, my comments and answers to your questions are
inline below.

Note that everything is done on the same box, so the
networking is all virtual, through the 'lo' device (quick
sanity check below).
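
A quick way to confirm that nothing leaves the box is to
compare the loopback interface counters before and after a
listing. This is just a sanity check, not part of the
measurements:

# ip -s link show lo    # note the RX/TX packet counters
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
# ip -s link show lo    # counters increase; the physical NICs stay flat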

On Tue, Feb 25, 2014 at 10:19 PM, Ira Cooper <ira at redhat.com> wrote:

> Hello Jeff,
>
> First of all, thank you for your work here.  I appreciate anyone dissecting
> a performance issue.  I do have a few thoughts, and a few asks of you, if
> you don't mind.
>
>
> >
> > I have a problem with very slow Windows Explorer browsing
> > when there are a large number of directories/files.
> > In this case, the top-level folder has almost 6000 directories,
> > admittedly large, but it worked almost instantaneously when a
> > Windows Server share was being used.
> > After migrating to a Samba/GlusterFS share, there is an almost
> > 20-second delay while the Explorer window populates the list.
> > This leaves a bad impression of the storage performance. The
> > systems are otherwise idle.
> > To isolate the cause, I've eliminated everything else
> > (networking, Windows) and have narrowed in on GlusterFS as
> > the cause of most of the directory lag.
> > I was optimistic about using the GlusterFS VFS libgfapi instead
> > of FUSE with Samba, and it does help performance
> > dramatically in some cases, but it does not help (and
> > sometimes hurts) compared to the CIFS FUSE mount
> > for directory listings.
>
> This should be investigated further.
>
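
Agreed. For reference, the two Samba shares being compared
differ only in how they reach the volume: one exports a
glusterfs FUSE mount via a plain 'path', the other goes
through libgfapi with the vfs_glusterfs module. A rough
sketch (share names and the FUSE mount path here are
placeholders, not the exact smb.conf, which I'll attach to
the BZ):

[fuse-share]
    # Samba on top of a glusterfs FUSE mount
    path = /mnt/nas-cbs-0005-fuse
    read only = no

[vfs-share]
    # Samba talking to the volume directly via libgfapi
    vfs objects = glusterfs
    glusterfs:volume = nas-cbs-0005
    glusterfs:volfile_server = localhost
    path = /
    read only = no
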
> >
> > NFS seems to be better for directory listings and small
> > I/Os, but I cannot use NFS, as I need CIFS for Windows
> > clients, ACLs, Active Directory, etc.
>
> Understood.
>
>
> >
> > Directory listing of 6000 empty directories ('stat-prefetch'
> > is on):
> >
> > Directory listing the ext4 mount directly is almost
> > instantaneous of course.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> > >/dev/null
> > real 0m41.235s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> > >/dev/null
> > real 0m0.110s
> > # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> > >/dev/null
> > real 0m0.109s
>
> The cache population time matches what I'd expect, for hitting this data
> cold.
>
> > Directory listing the NFS mount is also very fast.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> > real 0m44.352s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> > real 0m0.471s
> > # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> > real 0m0.114s
>
> Note that the last measurement is within a small amount of the ext4 times.
>  That looks "right" to me.
>
> Now, I'd be interested in what happens if you wait ~30m and try it again
> with the caches warm.  (ls -l the local directory on the brick, to warm the
> cache.)
>
Not sure if I did what you were asking here:
I repeated the test with a 30-minute wait and did not see
the initial long cache-population time recur, even after
waiting more than an hour. I assume you wanted to see this
for the NFS mount of the GlusterFS volume, correct?

# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m43.903s
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m0.407s
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m0.289s
# date
Wed Feb 26 07:17:53 PST 2014
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null^C
# date
Wed Feb 26 07:52:16 PST 2014
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
# date
Wed Feb 26 09:10:28 PST 2014
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m1.018s
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m0.116s

A similar experiment I did was to drop_caches, warm the FS cache up by
an "ls -l" on the brick, then time the warmed "ls -l" on the NFS mount:

# date
Wed Feb 26 10:29:15 PST 2014
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m41.899s
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m2.400s
# time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
real    0m0.115s

>
> That should expire the NFS cache, and give an idea of how fast things are
> with the protocol overhead pulled out.
>
> My guess: ~4s.  See below for why.
>
> > Directory listing the CIFS FUSE mount is so slow, almost 16
> > seconds!
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> > real 0m56.573s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> > real 0m16.101s
> > # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> > real 0m15.986s
>
> > Directory listing the CIFS VFS libgfapi mount is about twice
> > as fast as FUSE, but still slow at 8 seconds.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> > real 0m48.839s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> > real 0m8.197s
> > # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> > real 0m8.450s
>
> Looking at the numbers, it looks like the network is being consulted, and
> the data pulled back across.
>
> So let's do some quick math:
>
> FUSE:
>
> 56s for the initial read.
> 16s for each read there after.
> ---
> 40s of cache population time.
>
> VFS Module:
>
> 48s for the initial read.
> 8s for each read there after.
> ---
> 40s of cache population time.
>
> The fact that the cache population time drops out as a constant tells me
> that it is, in fact, likely re-reading the data over the network rather
> than caching it.
>
> That should be controllable via mount parameters in mount.cifs.  Now, that
> doesn't mean that Samba taking 8s to do the actual work, and NFS taking (my
> guesstimate) 4s, is actually good.  But it certainly puts the performance
> in another light.
>
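
If the re-reads are coming from the CIFS client rather than
Samba, the mount.cifs attribute-caching options would be the
knob to experiment with on my side. An untested sketch (the
share name is a placeholder; actimeo= is the attribute cache
timeout in seconds):

# mount -t cifs //10.10.200.181/nas-cbs-0005 /mnt/nas-cbs-0005-cifs \
    -o username=localadmin,password=localadmin,actimeo=60
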
> > ####################
> >
> > Retesting directory list with Gluster default settings,
> > including 'stat-prefetch' off, due to:
> >
> > Bug 1004327 - New files are not inheriting ACL from parent directory
> > unless "stat-prefetch" is off for the respective gluster
> > volume
> > https://bugzilla.redhat.com/show_bug.cgi?id=1004327
> >
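
(For reference, turning it off is just the standard volume-set
one-liner, matching the 'Options Reconfigured' output below:)

# gluster volume set nas-cbs-0005 performance.stat-prefetch off
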
> > # gluster volume info nas-cbs-0005
> >
> > Volume Name: nas-cbs-0005
> > Type: Distribute
> > Volume ID: 5068e9a5-d60f-439c-b319-befbf9a73a50
> > Status: Started
> > Number of Bricks: 1
> > Transport-type: tcp
> > Bricks:
> > Brick1: 192.168.5.181:/exports/nas-segment-0004/nas-cbs-0005
> > Options Reconfigured:
> > performance.stat-prefetch: off
> > server.allow-insecure: on
> > nfs.rpc-auth-allow: *
> > nfs.disable: off
> > nfs.addr-namelookup: off
> >
> > Directory listing of 6000 empty directories ('stat-prefetch'
> > is off):
> >
> > Accessing the ext4 mount directly is almost instantaneous of
> > course.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> > >/dev/null
> > real 0m39.483s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> > >/dev/null
> > real 0m0.136s
> > # time ls -l /exports/nas-segment-0004/nas-cbs-0005/cifs_share/manydirs/
> > >/dev/null
> > real 0m0.109s
> >
> > Accessing the NFS mount is also very fast.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> > real 0m43.819s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> > real 0m0.342s
> > # time ls -l /mnt/nas-cbs-0005/cifs_share/manydirs/ >/dev/null
> > real 0m0.200s
> >
> > Accessing the CIFS FUSE mount is slow, almost 14 seconds!
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> > real 0m55.759s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> > real 0m13.458s
> > # time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
> > real 0m13.665s
> >
> > Accessing the CIFS VFS libgfapi mount is now about twice as
> > slow as FUSE, at almost 26 seconds due to 'stat-prefetch'
> > being off!
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> > real 1m2.821s (Throw away first time for ext4 FS cache population?)
> > # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> > real 0m25.563s
> > # time ls -l /mnt/nas-cbs-0005-cifs-vfs/cifs_share/manydirs/ >/dev/null
> > real 0m26.949s
>
> This data is all showing the same behaviors I described above.  I expect
> the NFS client to win.
>
> The likely reason for the difference in the performance of FUSE vs.
> libgfapi here is: Caching.
>
> > ####################
>
> > 4KB Writes:
> >
> > NFS very small block writes are very slow at about 4 MB/sec.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> > of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> > time to transfer data was 20.450521 secs, 4.10 MB/sec
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> > of=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> > time to transfer data was 19.669923 secs, 4.26 MB/sec
> >
> > CIFS FUSE very small block writes are faster, at about 11
> > MB/sec.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> > of=/mnt/nas-cbs-0005-cifs/testfile count=20k
> > time to transfer data was 7.247578 secs, 11.57 MB/sec
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> > of=/mnt/nas-cbs-0005-cifs/testfile count=20k
> > time to transfer data was 7.422002 secs, 11.30 MB/sec
> >
> > CIFS VFS libgfapi very small block writes are twice as fast
> > as CIFS FUSE, at about 22 MB/sec.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> > of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> > time to transfer data was 3.766179 secs, 22.27 MB/sec
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync if=/dev/zero
> > of=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> > time to transfer data was 3.761176 secs, 22.30 MB/sec
>
> I'm betting if you look at the back-end during the NFS transfer, it is at
> 100%.
>
> For Samba, it is likely SMB and a chattier protocol holding it back.
>
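
Watching the brick's backing device with iostat while the NFS
sgp_dd runs should show whether it is pegged. For example, in
a second terminal (the interesting column is %util for the
device under /exports/nas-segment-0004):

# iostat -xm 1
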
> > 4KB Reads:
> >
> > NFS very small block reads are very fast at about 346
> > MB/sec.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> > if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> > time to transfer data was 0.244960 secs, 342.45 MB/sec
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> > if=/mnt/nas-cbs-0005/cifs_share/testfile count=20k
> > time to transfer data was 0.240472 secs, 348.84 MB/sec
> >
> > CIFS FUSE very small block reads are less than half as fast
> > as NFS, at about 143 MB/sec.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> > if=/mnt/nas-cbs-0005-cifs/testfile count=20k
> > time to transfer data was 0.606534 secs, 138.30 MB/sec
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> > if=/mnt/nas-cbs-0005-cifs/testfile count=20k
> > time to transfer data was 0.576185 secs, 145.59 MB/sec
> >
> > CIFS VFS libgfapi very small block reads are slightly slower
> > than CIFS FUSE, at about 137 MB/sec.
> >
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> > if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> > time to transfer data was 0.611328 secs, 137.22 MB/sec
> > # sync;sync; echo '3' > /proc/sys/vm/drop_caches
> > # sgp_dd time=1 thr=4 bs=4k bpt=1 iflag=dsync oflag=dsync of=/dev/null
> > if=/mnt/nas-cbs-0005-cifs-vfs/cifs_share/testfile count=20k
> > time to transfer data was 0.615834 secs, 136.22 MB/sec
>
>
> I'm not totally shocked by the bad Samba performance here.
>
> It matches what I've seen in other scenarios.
>
> There may be things that can be done to speed things up.  But I think the
> issue is probably Samba+CIFS client.  (And I truly expect it is Samba, from
> my experience with Windows.)
>
> I'd encourage you to also do a raw test of Samba+CIFS client without
> Gluster in the mix.  It would help confirm some of these thoughts.
>

The first test I did, which unfortunately didn't get
reported, was to make sure that this was not just a
Samba/CIFS issue. To do this, I shared the storage
brick/segment directly over Samba, bypassing GlusterFS
entirely, and mounted it via CIFS.
Note that there is neither the long cache-population time on
the first run nor the very long delays, although there is a
consistent ~1.8 second delay that must be attributed to
Samba/CIFS itself:

[nas-cbs-0005-seg]
    path = /exports/nas-segment-0004/nas-cbs-0005
    admin users = "localadmin"
    valid users = "localadmin"
    invalid users =
    read list =
    write list = "localadmin"
    guest ok = yes
    read only = no
    hide unreadable = yes
    hide dot files = yes
    available = yes

# mount |grep seg
//10.10.200.181/nas-cbs-0005-seg on /mnt/nas-cbs-0005-seg type cifs
(rw,username=localadmin,password=localadmin)
# sync;sync; echo '3' > /proc/sys/vm/drop_caches
# time ls -l /mnt/nas-cbs-0005-seg/cifs_share/manyfiles/ >/dev/null
real    0m1.745s
# time ls -l /mnt/nas-cbs-0005-seg/cifs_share/manyfiles/ >/dev/null
real    0m1.819s
# time ls -l /mnt/nas-cbs-0005-seg/cifs_share/manyfiles/ >/dev/null
real    0m1.781s
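
To help pin down where GlusterFS itself spends the time on
the slow listings, I can also capture a volume profile while
one of them runs and attach it to the BZ. Roughly:

# gluster volume profile nas-cbs-0005 start
# time ls -l /mnt/nas-cbs-0005-cifs/manydirs/ >/dev/null
# gluster volume profile nas-cbs-0005 info   # per-FOP latencies (LOOKUP/READDIRP) are the interesting part
# gluster volume profile nas-cbs-0005 stop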

>
> I'd also encourage you to open a gluster BZ, so we don't lose track of
> this data.  Alas, I know we can't act on it immediately.
>
> Also any replication and setup scripts you are using, would be GREAT to
> put in that BZ.  It will really help us reproduce the issue.
>
> Thanks,
>
> -Ira / ira@(redhat.com|samba.org)
>
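
Will do. For the record, the volume setup reduces to roughly
the following (reconstructed from the 'gluster volume info'
output above; the actual provisioning scripts will go into
the BZ):

# gluster volume create nas-cbs-0005 \
    192.168.5.181:/exports/nas-segment-0004/nas-cbs-0005
# gluster volume set nas-cbs-0005 server.allow-insecure on
# gluster volume set nas-cbs-0005 nfs.rpc-auth-allow '*'
# gluster volume set nas-cbs-0005 nfs.addr-namelookup off
# gluster volume set nas-cbs-0005 nfs.disable off
# gluster volume set nas-cbs-0005 performance.stat-prefetch off
# gluster volume start nas-cbs-0005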



-- 
~ Jeff Byers ~