[Gluster-users] Self healing of 3.3.0 cause our 2 bricks replicated cluster freeze (client read/write timeout)

Thu Nov 29 08:12:04 UTC 2012

Our 2 bricks are Ubuntu 12.04 VMs running on 2 physical boxes. The
internal network connection between these 2 physical boxes is a direct
cable connection on Broadcom NetXtreme II BCM5709 1000Base-T ports.
A couple of months ago, when we set things up this way, we measured the
TCP speed between the 2 VMs over this direct cable connection at close
to 100MB/s.

It never occurred to me that limited network bandwidth could be the
reason behind this. Our data size on gluster is 100G at the moment,
most of it user-uploaded photos (<1MB each), and our daily
increment is between 500M and 1G. For a few hours of lost sync, I
didn't expect glusterfs to use enough network bandwidth to
saturate a 1Gbps connection. Although we didn't monitor network
bandwidth usage previously, I think we need to now, per your advice.
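
As a rough sanity check on these numbers, here is a back-of-envelope
sketch in Python. The figures are the ones quoted in this mail; the
function name is made up for illustration, and it assumes the link
sustains about 100MB/s with no other traffic competing for it.

```python
def transfer_seconds(size_mb: float, throughput_mb_s: float) -> float:
    """Time to move size_mb megabytes at a sustained throughput_mb_s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return size_mb / throughput_mb_s


# Assumption: the ~100MB/s we measured earlier is still achievable.
LINK_MB_S = 100.0
# Upper end of our daily increment (1G).
DAILY_DELTA_MB = 1024.0
# Total data currently on the volume (100G).
FULL_DATASET_MB = 100 * 1024.0

delta_s = transfer_seconds(DAILY_DELTA_MB, LINK_MB_S)
full_s = transfer_seconds(FULL_DATASET_MB, LINK_MB_S)

print(f"one day's increment: {delta_s:.1f} s")
print(f"full dataset:        {full_s / 60:.1f} min")
```

So copying only a day's worth of new files should take seconds, while
anything that walks or re-copies the whole 100G would keep the link
busy for a quarter of an hour or more, and likely longer with many
small files. I don't know which of these the self-heal actually does
in our case; that is what the monitoring should tell us.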

On Thu, Nov 29, 2012 at 2:48 PM, Bryan Whitehead <driver at megahappy.net> wrote:
> when you mount xfs, also use the inode64 option. That will help with xfs
> performance.
>
> My offhand guess is you are likely running into limited network bandwidth
> for the 2 bricks to sync. As the network gets flooded nfs response gets
> poor. Make sure you are getting full-duplex connections - or upgrade your
> network to 10G or (even better) Infiniband.
>
>
> On Mon, Nov 26, 2012 at 1:46 AM, ZHANG Cheng <czhang.oss at gmail.com> wrote:
>>
>> Early this morning our 2-brick replicated cluster had an outage. The
>> disk space on one of the brick servers (brick02) was used up. By the
>> time we responded to the disk full alert, the issue had already lasted
>> for a few hours. We reclaimed some disk space and rebooted the brick02
>> server, expecting that once it came back it would start self-healing.
>>
>> It did start self-healing, but after just a couple of minutes, access
>> to the gluster filesystem froze. Tons of "nfs: server brick not
>> responding, still trying" messages popped up in dmesg. The load average
>> on the app server went up to around 200 from the usual 0.10. We had to
>> shut down the brick02 server, or stop the gluster server process on it,
>> to get the gluster cluster working again.
>>
>> How could we deal with this issue? Thanks in advance.
>>
>> Our gluster setup follows the official doc.
>>
>> gluster> volume info
>>
>> Volume Name: staticvol
>> Type: Replicate
>> Volume ID: fdcbf635-5faf-45d6-ab4e-be97c74d7715
>> Status: Started
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: brick01:/exports/static
>> Brick2: brick02:/exports/static
>>
>> The underlying filesystem is xfs (on an LVM volume), mounted as:
>> /dev/mapper/vg_node-brick on /exports/static type xfs
>> (rw,noatime,nodiratime,nobarrier,logbufs=8)
>>
>> The brick servers don't act as gluster clients.
>>
>> Our app servers are the gluster clients, mounting via nfs:
>> brick:/staticvol on /mnt/gfs-static type nfs
>> (rw,noatime,nodiratime,vers=3,rsize=8192,wsize=8192,addr=10.10.10.51)
>>
>> brick is a DNS round-robin record for brick01 and brick02.