[Gluster-users] Not real confident in 3.3
Sean Fulton
sean at gcnpublishing.com
Sun Jun 17 12:17:30 UTC 2012
It's a replicated volume, but only one client, running a single process, was
writing to the cluster, so I don't understand how you could get a split
brain. The other issue is that while making a tar of the static files on the
replicated volume, I kept getting "file changed as we read it" errors from
tar. This was content I had copied *to* the cluster, and only one client node
was acting on it at a time, so there was no chance anyone or anything was
updating the files. And this error was coming up every 6 to 10 files.
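For what it's worth, below is roughly the check I've been running by hand (a
quick sketch only; /mnt/gluster is a placeholder for my actual mount point).
It stats and checksums every file twice back-to-back over the NFS mount,
which is essentially the same test tar is doing when it complains that a
file changed as we read it:

#!/usr/bin/env python
# Sketch: flag files whose size/mtime/contents differ between two
# back-to-back reads over the gluster NFS mount. Paths are placeholders.
import hashlib
import os
import sys

MOUNT = "/mnt/gluster"   # placeholder for the NFS-mounted volume

def snapshot(path):
    """Return (size, mtime, sha1 of contents) for one file."""
    st = os.stat(path)
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return (st.st_size, st.st_mtime, h.hexdigest())

def main():
    changed = 0
    for dirpath, _dirs, files in os.walk(MOUNT):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                first, second = snapshot(path), snapshot(path)
            except OSError as e:
                sys.stderr.write("error reading %s: %s\n" % (path, e))
                continue
            if first != second:
                changed += 1
                print("changed between reads: %s" % path)
    print("%d file(s) looked different on the second read" % changed)

if __name__ == "__main__":
    main()

With a single writer and nothing else touching the volume, I would expect
that to report zero changed files.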
All three nodes were part of a Linux-HA NFS cluster that worked
flawlessly for weeks, so I feel pretty confident it's not the environment.
I understand the hang could be unrelated, but the two things above cause me
concern. Previously, when I worked with 3.2.6, I had a lot of problems with
split brains, "No end-point connected" errors, etc., so I gave up on
Gluster. Seeing the same sort of thing now, in a test environment, makes me
wonder: what could cause this in a closed dev env?
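If it helps narrow things down, this is the kind of check I can run to see
whether the data on the two replicas has actually diverged for a given file
(a rough sketch only; the hostnames and brick path are placeholders, not my
real layout, and it assumes ssh access to both server nodes):

#!/usr/bin/env python
# Sketch: compare the replica copies of one file on both server bricks.
# SERVERS and BRICK are placeholders; adjust to the real volume layout.
import subprocess
import sys

SERVERS = ["server1", "server2"]   # the two replica nodes (placeholders)
BRICK = "/export/brick1"           # brick directory on each node (placeholder)

def brick_md5(host, relpath):
    """md5sum of one file's copy on one brick, run over ssh."""
    out = subprocess.check_output(
        ["ssh", host, "md5sum", "%s/%s" % (BRICK, relpath)])
    return out.decode().split()[0]

def main():
    relpath = sys.argv[1]          # file path relative to the volume root
    sums = dict((host, brick_md5(host, relpath)) for host in SERVERS)
    for host, digest in sums.items():
        print("%s  %s" % (host, digest))
    if len(set(sums.values())) > 1:
        print("replicas DIFFER for %s" % relpath)
    else:
        print("replicas match for %s" % relpath)

if __name__ == "__main__":
    main()

(If the contents match but gluster still flags split-brain, the disagreement
would presumably be in the replication metadata rather than the data itself.)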
sean
On 06/17/2012 03:42 AM, Brian Candler wrote:
> On Sat, Jun 16, 2012 at 04:47:51PM -0400, Sean Fulton wrote:
>> 1) The split-brain message is strange because there are only two
>> server nodes and one client node, which has mounted the volume via NFS
>> on a floating IP. This was done to guarantee that only one node gets
>> written to at any point in time, so there is zero chance that two
>> nodes were updated simultaneously.
> Are you using a distributed volume, or a replicated volume? Writes to a
> replicated volume go to both nodes.
>
>> [586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
>> [586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [586898.273295] flush-0:45 D ffff8806037592d0 0 633954 20 0x00000000
>> [586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
>> [586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
>> [586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
>> [586898.273326] Call Trace:
>> [586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
>> [586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
>> [586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20
> Are you using XFS by any chance?
>
> I started with XFS, because that was what the gluster docs recommend, but
> eventually gave up on it. I can replicate that sort of kernel lockup on a
> 24-disk MD array within a short space of time - without gluster, just by
> throwing four bonnie++ processes at it.
>
> The same tests run with either ext4 or btrfs do not hang, at least not
> during two days of continuous testing.
>
> Of course, any kernel problem cannot be the fault of glusterfs, since
> glusterfs runs entirely in userland.
>
> Regards,
>
> Brian.
>
--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203