[Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

Fernando Frediani (Qube) fernando.frediani at qubenet.net
Fri Jun 8 17:57:21 UTC 2012


Thanks for sharing that Brian,

I wonder whether the problem we see when trying to power up VMware ESXi VMs has the same cause.

Fernando

-----Original Message-----
From: Brian Candler [mailto:B.Candler at pobox.com] 
Sent: 08 June 2012 17:47
To: Pranith Kumar Karampuri
Cc: olav johansen; gluster-users at gluster.org; Fernando Frediani (Qube)
Subject: Re: [Gluster-users] Performance optimization tips Gluster 3.3? (small files / directory listings)

On Thu, Jun 07, 2012 at 02:36:26PM +0100, Brian Candler wrote:
> I'm interested in understanding this, especially the split-brain 
> scenarios (better to understand them *before* you're stuck in a 
> problem :-)
> 
> BTW I'm in the process of building a 2-node 3.3 test cluster right now.

FYI, I have got KVM working with a glusterfs 3.3.0 replicated volume as the image store.

There are two nodes, both running as glusterfs storage and as KVM hosts.
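
For completeness, the volume itself is a plain two-brick replica, set up roughly like this (the storage2 brick path matches the logs further down; the storage1 path and the localhost mount are illustrative):

    # one-time setup, run on dev-storage1
    gluster peer probe dev-storage2
    gluster volume create safe replica 2 \
        dev-storage1:/disk/storage1/safe dev-storage2:/disk/storage2/safe
    gluster volume start safe

    # on both hosts: mount the volume where the VM images will live
    mkdir -p /gluster/safe
    mount -t glusterfs localhost:/safe /gluster/safe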

I built a 10.04 Ubuntu image using vmbuilder, stored on the replicated glusterfs volume:

    vmbuilder kvm ubuntu --hostname lucidtest --mem 512 --debug --rootsize 20480 --dest /gluster/safe/images/lucidtest

I was able to fire it up (virsh start lucidtest), ssh into it, and then live-migrate it to another host:

    brian at dev-storage1:~$ virsh migrate --live lucidtest qemu+ssh://dev-storage2/system
    brian at dev-storage2's password: 

    brian at dev-storage1:~$ virsh list
     Id Name                 State
    ----------------------------------

    brian at dev-storage1:~$ 

And I live-migrated it back again, all without the ssh session being interrupted.
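
The return trip is essentially the same command run from the other side, pointed back at dev-storage1:

    brian at dev-storage2:~$ virsh migrate --live lucidtest qemu+ssh://dev-storage1/system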

I then rebooted the second storage server. While it was rebooting I did some work in the VM which grew its image. When the second storage server came back, it resynchronised the image immediately and automatically. Here are the relevant entries from /var/log/glusterfs/glustershd.log on the first (non-rebooted) machine:

    [2012-06-08 17:08:40.817893] E [socket.c:1715:socket_connect_finish] 0-safe-client-1: connection to 10.0.1.2:24009 failed (Connection timed out)
    [2012-06-08 17:09:10.698272] I [client-handshake.c:1636:select_server_supported_programs] 0-safe-client-1: Using Program GlusterFS 3.3.0, Num (1298437), Version (330)
    [2012-06-08 17:09:10.700197] I [client-handshake.c:1433:client_setvolume_cbk] 0-safe-client-1: Connected to 10.0.1.2:24009, attached to remote volume '/disk/storage2/safe'.
    [2012-06-08 17:09:10.700234] I [client-handshake.c:1445:client_setvolume_cbk] 0-safe-client-1: Server and Client lk-version numbers are not same, reopening the fds
    [2012-06-08 17:09:10.701901] I [client-handshake.c:453:client_set_lk_version_cbk] 0-safe-client-1: Server lk version = 1
    [2012-06-08 17:09:14.699571] I [afr-common.c:1189:afr_detect_self_heal_by_iatt] 0-safe-replicate-0: size differs for <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff> 
    [2012-06-08 17:09:14.699616] I [afr-common.c:1340:afr_launch_self_heal] 0-safe-replicate-0: background  data self-heal triggered. path: <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff>, reason: lookup detected pending operations
    [2012-06-08 17:09:18.230855] I [afr-self-heal-algorithm.c:122:sh_loop_driver_done] 0-safe-replicate-0: diff self-heal on <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff>: completed. (19 blocks of 3299 were different (0.58%))
    [2012-06-08 17:09:18.232520] I [afr-self-heal-common.c:2159:afr_self_heal_completion_cbk] 0-safe-replicate-0: background  data self-heal completed on <gfid:1f080b06-46f1-468e-b21a-12bf4a7c81ff>
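
The same information is available from the CLI without grepping logs; a quick sketch, using the 'safe' volume name from above:

    # files that still need healing on the 'safe' volume
    gluster volume heal safe info

    # files the self-heal daemon has healed recently
    gluster volume heal safe info healed

    # trigger a heal by hand instead of waiting for the daemon
    gluster volume heal safe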

So at first glance this is extremely impressive. It's also very new and shiny, and I wonder how many edge cases remain to be debugged in live use, but I can't deny that it's very neat indeed!

Performance-wise:

(1) On the storage/VM host, which has the replicated volume mounted via FUSE:

root at dev-storage1:~# dd if=/dev/zero of=/gluster/safe/test.zeros bs=1024k count=500
500+0 records in
500+0 records out
524288000 bytes (524 MB) copied, 2.7086 s, 194 MB/s

(The bricks are on a 12-disk md RAID10 array with the far-2 layout, and there's probably scope for some performance tweaking here.)
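
For the record, the array on each node was built along these lines (device names and the filesystem here are illustrative, not the exact ones in use):

    # 12-disk Linux md RAID10 with the "far 2" layout
    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=12 /dev/sd[b-m]
    # format and mount it as the brick filesystem (XFS is an assumption)
    mkfs -t xfs /dev/md0
    mount /dev/md0 /disk/storage1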

(2) However, from within the VM guest, performance was very poor (2.2 MB/s).

I tried my usual tuning options:

        <driver name='qemu' type='qcow2' io='native' cache='none'/>
        ...
        <target dev='vda' bus='virtio'/>
        <!-- delete <address type='drive' controller='0' bus='0' unit='0'/> -->

but glusterfs objected to the cache='none' option (possibly because this opens the image file with O_DIRECT?):

    # virsh start lucidtest
    error: Failed to start domain lucidtest
    error: internal error process exited while connecting to monitor: char device redirected to /dev/pts/0
    kvm: -drive file=/gluster/safe/images/lucidtest/tmpaJqTD9.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none,aio=native: could not open disk image /gluster/safe/images/lucidtest/tmpaJqTD9.qcow2: Invalid argument
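
A quick way to test the O_DIRECT theory would be a direct write straight onto the FUSE mount (just a sketch, with a made-up filename):

    # if the FUSE mount rejects O_DIRECT, this should also fail with "Invalid argument"
    dd if=/dev/zero of=/gluster/safe/odirect-test bs=1M count=16 oflag=direct
    rm -f /gluster/safe/odirect-test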

With cache='none' dropped, the VM boots with io='native' and bus='virtio', but performance is still very poor:

    ubuntu at lucidtest:~$ dd if=/dev/zero of=/var/tmp/test.zeros bs=1024k count=100
    100+0 records in
    100+0 records out
    104857600 bytes (105 MB) copied, 17.4095 s, 6.0 MB/s

This will need some further work.
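
One idea to try (a sketch only, not something I've run yet) is ruling out qcow2 overhead on top of FUSE by converting the image to raw and pointing the domain at that:

    # convert the qcow2 image to a raw file on the gluster volume
    # (the .raw filename is made up for illustration)
    qemu-img convert -f qcow2 -O raw \
        /gluster/safe/images/lucidtest/tmpaJqTD9.qcow2 \
        /gluster/safe/images/lucidtest/lucidtest.raw
    # then change the libvirt disk <driver> type and <source> to point at the raw file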

The guest is lucid (10.04) only because for some reason I cannot get a 12.04 image built with vmbuilder to work (it spins at 100% CPU). This is not related to glusterfs and is something I need to debug separately. Maybe a 12.04 guest will also run better.

Anyway, just thought it was worth a mention. Keep up the good work guys!

Regards,

Brian.


