[Gluster-users] Not real confident in 3.3
Sean Fulton
sean at gcnpublishing.com
Sat Jun 16 21:12:19 UTC 2012
Let me reiterate: I really, really want to see Gluster work for our
environment. I am hopeful this is something I did or something that can
be easily fixed.
Yes, there was an error on the client:
[586898.273283] INFO: task flush-0:45:633954 blocked for more than 120 seconds.
[586898.273290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[586898.273295] flush-0:45 D ffff8806037592d0 0 633954 2 0 0x00000000
[586898.273304] ffff88000d1ebbe0 0000000000000046 ffff88000d1ebd6c 0000000000000000
[586898.273312] ffff88000d1ebce0 ffffffff81054444 ffff88000d1ebc80 ffff88000d1ebbf0
[586898.273319] ffff8806050ac5f8 ffff880603759888 ffff88000d1ebfd8 ffff88000d1ebfd8
[586898.273326] Call Trace:
[586898.273335] [<ffffffff81054444>] ? find_busiest_group+0x244/0xb20
[586898.273343] [<ffffffff811ab050>] ? inode_wait+0x0/0x20
[586898.273349] [<ffffffff811ab05e>] inode_wait+0xe/0x20
[586898.273357] [<ffffffff814e752f>] __wait_on_bit+0x5f/0x90
[586898.273365] [<ffffffff811bbd6c>] ? writeback_sb_inodes+0x13c/0x210
[586898.273370] [<ffffffff811bab28>] inode_wait_for_writeback+0x98/0xc0
[586898.273377] [<ffffffff81095550>] ? wake_bit_function+0x0/0x50
[586898.273382] [<ffffffff811bc1f8>] wb_writeback+0x218/0x420
[586898.273389] [<ffffffff814e637e>] ? thread_return+0x4e/0x7d0
[586898.273394] [<ffffffff811bc5a9>] wb_do_writeback+0x1a9/0x250
[586898.273402] [<ffffffff8107e2e0>] ? process_timeout+0x0/0x10
[586898.273407] [<ffffffff811bc6b3>] bdi_writeback_task+0x63/0x1b0
[586898.273412] [<ffffffff810953e7>] ? bit_waitqueue+0x17/0xc0
[586898.273419] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273424] [<ffffffff8114cf06>] bdi_start_fn+0x86/0x100
[586898.273429] [<ffffffff8114ce80>] ? bdi_start_fn+0x0/0x100
[586898.273434] [<ffffffff81094f36>] kthread+0x96/0xa0
[586898.273440] [<ffffffff8100c20a>] child_rip+0xa/0x20
[586898.273445] [<ffffffff81094ea0>] ? kthread+0x0/0xa0
[586898.273450] [<ffffffff8100c200>] ? child_rip+0x0/0x20
[root at server-10 ~]#
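(If it helps: rather than waiting for the 120-second hung-task watchdog,
the same blocked-task traces can be dumped on demand via sysrq, assuming
sysrq is enabled on the client. Something like:

    # enable all sysrq functions, then dump every task stuck in D state
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger
    dmesg | tail -50

I'll capture one of those the next time it wedges.)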
Here are the file sizes. 'secure' was big, but the client was hung for quite a long time:
-rw------- 1 root root 0 Dec 20 10:17 boot.log
-rw------- 1 root utmp 281079168 Jun 15 21:53 btmp
-rw------- 1 root root 337661 Jun 16 16:36 cron
-rw-r--r-- 1 root root 0 Jun 9 18:33 dmesg
-rw-r--r-- 1 root root 0 Jun 9 16:19 dmesg.old
-rw-r--r-- 1 root root 98585 Dec 21 14:32 dracut.log
drwxr-xr-x 5 root root 4096 Dec 21 16:53 glusterfs
drwx------ 2 root root 4096 Mar 1 16:11 httpd
-rw-r--r-- 1 root root 146000 Jun 16 13:36 lastlog
drwxr-xr-x 2 root root 4096 Dec 20 10:35 mail
-rw------- 1 root root 1072902 Jun 9 18:33 maillog
-rw------- 1 root root 50638 Jun 16 12:13 messages
drwxr-xr-x 2 root root 4096 Dec 30 16:14 nginx
drwx------ 3 root root 4096 Dec 20 10:35 samba
-rw------- 1 root root 222214339 Jun 16 13:37 secure
-rw------- 1 root root 0 Sep 13 2011 spooler
-rw------- 1 root root 0 Sep 13 2011 tallylog
-rw-rw-r-- 1 root utmp 114432 Jun 16 13:37 wtmp
-rw------- 1 root root 7015 Jun 16 12:13 yum.log
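If I can get it to hang again I'll grab the call pools you asked for.
Since the volume is named pub2, I assume the exact commands are:

    gluster volume status pub2 callpool
    gluster volume status pub2 nfs callpool

I'll also watch the file on both bricks to see whether the write is
actually making progress (the /export/pub2 brick path below is just a
placeholder, substitute the real one):

    watch -n 5 'ls -l /export/pub2/test/log/secure'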
On 06/16/2012 05:04 PM, Anand Avati wrote:
> Was there anything in dmesg on the servers? If you are able to
> reproduce the hang, can you get the output of 'gluster volume status
> <name> callpool' and 'gluster volume status <name> nfs callpool' ?
>
> How big is the 'log/secure' file? Is it so large that the client was
> just busy writing it for a very long time? Are there any signs of
> disconnections or ping timeouts in the logs?
>
> Avati
>
> On Sat, Jun 16, 2012 at 10:48 AM, Sean Fulton <sean at gcnpublishing.com> wrote:
>
> I do not mean to be argumentative, but I have to admit a little
> frustration with Gluster. I know an enormous amount of effort has
> gone into this product, and I just can't believe that with all the
> effort behind it and so many people using it, it could be so fragile.
>
> So here goes. Perhaps someone here can point to the error of my
> ways. I really want this to work because it would be ideal for our
> environment, but ...
>
> Please note that all of the nodes below are OpenVZ nodes with
> nfs/nfsd/fuse modules loaded on the hosts.
>
> After spending months trying to get 3.2.5 and 3.2.6 working in a
> production environment, I gave up on Gluster and went with a
> Linux-HA/NFS cluster which just works. The problems I had with
> gluster were strange lock-ups, split brains, and too many
> instances where the whole cluster was off-line until I reloaded
> the data.
>
> So with the release of 3.3, I decided to give it another try. I
> created one replicated volume on my two NFS servers.
>
> I then mounted the volume on a client as follows:
> 10.10.10.7:/pub2 /pub2 nfs
> rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3,defaults 0 0
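>
> (For anyone who wants to reproduce this without editing fstab, the
> equivalent one-off mount should be something like:
>
>     mount -t nfs -o rw,noacl,noatime,nodiratime,soft,proto=tcp,vers=3 \
>         10.10.10.7:/pub2 /pub2
> )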
>
> I threw some data at it (find / -mount -print | cpio -pvdum
> /pub2/test)
>
> Within 10 seconds it locked up solid. No error messages on any of
> the servers, the client was unresponsive, and load on the client
> was 15+. I restarted glusterd on both of my NFS servers, and the
> client remained locked. Finally I killed the cpio process on the
> client. When I started another cpio, it ran further than before,
> but now the logs on my NFS/Gluster server say:
>
> [2012-06-16 13:37:35.242754] I
> [afr-self-heal-common.c:1318:afr_sh_missing_entries_lookup_done]
> 0-pub2-replicate-0: No sources for dir of
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure, in missing
> entry self-heal, continuing with the rest of the self-heals
> [2012-06-16 13:37:35.243315] I
> [afr-self-heal-common.c:994:afr_sh_missing_entries_done]
> 0-pub2-replicate-0: split brain found, aborting selfheal of
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
> [2012-06-16 13:37:35.243350] E
> [afr-self-heal-common.c:2156:afr_self_heal_completion_cbk]
> 0-pub2-replicate-0: background data gfid self-heal failed on
> <gfid:4a787ad7-ab91-46ef-9b31-715e49f5f818>/log/secure
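>
> (For reference, 3.3 is supposed to be able to list the entries it
> considers split-brained, assuming the volume name is pub2:
>
>     gluster volume heal pub2 info split-brain
> )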
>
>
> This still seems to be an INCREDIBLY fragile system. Why would it
> lock solid while copying a large file? Why no errors in the logs?
>
> Am I the only one seeing this kind of behavior?
>
> sean
>
> --
> Sean Fulton
> GCN Publishing, Inc.
> Internet Design, Development and Consulting For Today's Media
> Companies
> http://www.gcnpublishing.com
> (203) 665-6211, x203
>
--
Sean Fulton
GCN Publishing, Inc.
Internet Design, Development and Consulting For Today's Media Companies
http://www.gcnpublishing.com
(203) 665-6211, x203