[Gluster-users] Brick crashes
Albert Zhang
albertwzhang at gmail.com
Sat Jun 9 05:56:37 UTC 2012
Looks like some process hung there due to memory issues in the kernel; the error messages from the very beginning would be helpful.
Sent from my iPhone
On 2012-6-9, at 8:26 AM, Ling Ho <ling at slac.stanford.edu> wrote:
> Hi Anand,
>
> ulimit -l running as root is 64.
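(Note: 'ulimit -l' reports the locked-memory limit in kilobytes, so 64 means each process may pin only 64 KB of RAM, which matters for RDMA memory registration. A quick way to confirm what limit the running brick processes actually inherited, assuming <pid> is the pid of one of the glusterfsd processes, is:

# grep -i 'locked memory' /proc/<pid>/limits
Max locked memory         65536                65536                bytes

The second and third columns are the soft and hard limits in bytes; the output shown is only illustrative.)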
>
>
> This dmesg output is from the second system.
>
> I don't see anything new on the first system other than what was there when the system booted.
> Do you want to see the whole dmesg output? Where should I post it? There are 1600 lines.
>
> ...
> ling
>
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000000 0 8880 1 0x00000080
> ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
> 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81141768>] sys_munmap+0x48/0x80
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000000 0 8880 1 0x00000080
> ffff880614b75e48 0000000000000086 0000000000000000 ffff88010ed65d80
> 000000000000038b 000000000000038b ffff880614b75ee8 ffffffff814ef8f5
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff814ef8f5>] ? page_fault+0x25/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81141768>] sys_munmap+0x48/0x80
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000009 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000009 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff88062d638338
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88061406f740
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000003 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfsd:9471 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 0000000000000004 0 9471 1 0x00000080
> ffff8801077c3740 0000000000000082 0000000000000000 ffff8801077c36b8
> ffffffff8127f138 0000000000000000 0000000000000000 ffff8801077c36d8
> ffff8806146f4638 ffff8801077c3fd8 000000000000f4e8 ffff8806146f4638
> Call Trace:
> [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
> [<ffffffff8127f138>] ? swiotlb_dma_mapping_error+0x18/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffffa019607a>] ? ixgbe_xmit_frame_ring+0x93a/0xfc0 [ixgbe]
> [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
> [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff814ee6f4>] ? down_read+0x24/0x30
> [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
> [<ffffffff81430c38>] ? dev_queue_xmit+0x178/0x6b0
> [<ffffffff8146809c>] ? ip_finish_output+0x13c/0x310
> [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
> [<ffffffff814ef8f5>] page_fault+0x25/0x30
> [<ffffffff81275a6d>] ? copy_user_generic_string+0x2d/0x40
> [<ffffffff81425655>] ? memcpy_toiovec+0x55/0x80
> [<ffffffff81426070>] skb_copy_datagram_iovec+0x60/0x2c0
> [<ffffffff8141ceac>] ? lock_sock_nested+0xac/0xc0
> [<ffffffff814ef5cb>] ? _spin_unlock_bh+0x1b/0x20
> [<ffffffff814722d5>] tcp_recvmsg+0xca5/0xe90
> [<ffffffff814925ea>] inet_recvmsg+0x5a/0x90
> [<ffffffff8141bff1>] sock_aio_read+0x181/0x190
> [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
> [<ffffffff8100988e>] ? __switch_to+0x26e/0x320
> [<ffffffff8141be70>] ? sock_aio_read+0x0/0x190
> [<ffffffff8117614b>] do_sync_readv_writev+0xfb/0x140
> [<ffffffff81090a90>] ? autoremove_wake_function+0x0/0x40
> [<ffffffff8120c1e6>] ? security_file_permission+0x16/0x20
> [<ffffffff811771df>] do_readv_writev+0xcf/0x1f0
> [<ffffffff811b9b50>] ? sys_epoll_wait+0xa0/0x300
> [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
> [<ffffffff81177513>] vfs_readv+0x43/0x60
> [<ffffffff81177641>] sys_readv+0x51/0xb0
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfsd:9545 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 0000000000000006 0 9545 1 0x00000080
> ffff880c24a7bcf8 0000000000000082 0000000000000000 ffffffff8107c0a0
> ffff88066a0a7580 ffff880c30460000 0000000000000000 0000000000000000
> ffff88066a0a7b38 ffff880c24a7bfd8 000000000000f4e8 ffff88066a0a7b38
> Call Trace:
> [<ffffffff8107c0a0>] ? process_timeout+0x0/0x10
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff8127f18c>] ? is_swiotlb_buffer+0x3c/0x50
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffffa0211b96>] ib_umem_release+0x76/0x110 [ib_core]
> [<ffffffffa0230d52>] mlx4_ib_dereg_mr+0x32/0x50 [mlx4_ib]
> [<ffffffffa020cd85>] ib_dereg_mr+0x35/0x50 [ib_core]
> [<ffffffffa041bc5b>] ib_uverbs_dereg_mr+0x7b/0xf0 [ib_uverbs]
> [<ffffffffa04194ef>] ib_uverbs_write+0xbf/0xe0 [ib_uverbs]
> [<ffffffff8117646d>] ? rw_verify_area+0x5d/0xc0
> [<ffffffff81176588>] vfs_write+0xb8/0x1a0
> [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
> [<ffffffff81176f91>] sys_write+0x51/0x90
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfsd:9546 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 0000000000000004 0 9546 1 0x00000080
> ffff880c0634bcf0 0000000000000082 ffff880c0634bcb8 ffff880c0634bcb4
> 0000000000015f80 ffff88063fc24b00 ffff880655495f80 0000000000000400
> ffff880c2dccc5f8 ffff880c0634bfd8 000000000000f4e8 ffff880c2dccc5f8
> Call Trace:
> [<ffffffff810566a3>] ? perf_event_task_sched_out+0x33/0x80
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff810097cc>] ? __switch_to+0x1ac/0x320
> [<ffffffff814ef1f6>] rwsem_down_read_failed+0x26/0x30
> [<ffffffff814ecb0e>] ? thread_return+0x4e/0x760
> [<ffffffff81276e84>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff814ee6f4>] ? down_read+0x24/0x30
> [<ffffffff81042bc7>] __do_page_fault+0x187/0x480
> [<ffffffffa0419e16>] ? ib_uverbs_event_read+0x1d6/0x240 [ib_uverbs]
> [<ffffffff814f253e>] do_page_fault+0x3e/0xa0
> [<ffffffff814ef8f5>] page_fault+0x25/0x30
> INFO: task glusterfsd:9553 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfsd D 000000000000000e 0 9553 1 0x00000080
> ffff8806e131dd98 0000000000000082 0000000000000000 ffff8806e131dd64
> ffff8806e131dd48 ffffffffa026dfb6 ffff8806e131dd28 ffffffff00000000
> ffff880c2f41c678 ffff8806e131dfd8 000000000000f4e8 ffff880c2f41c678
> Call Trace:
> [<ffffffffa026dfb6>] ? xfs_attr_get+0xb6/0xc0 [xfs]
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81136009>] sys_madvise+0x329/0x760
> [<ffffffff81195740>] ? mntput_no_expire+0x30/0x110
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> INFO: task glusterfs:8880 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> glusterfs D 0000000000000003 0 8880 1 0x00000080
> ffff880614b75e08 0000000000000086 0000000000000000 ffff880630ab1ab8
> ffff880c30ef88c0 ffffffff8120d34f ffff880614b75d98 ffff88062df10480
> ffff88062bc4ba78 ffff880614b75fd8 000000000000f4e8 ffff88062bc4ba78
> Call Trace:
> [<ffffffff8120d34f>] ? security_inode_permission+0x1f/0x30
> [<ffffffff814ef065>] rwsem_down_failed_common+0x95/0x1d0
> [<ffffffff814ef1c3>] rwsem_down_write_failed+0x23/0x30
> [<ffffffff81276eb3>] call_rwsem_down_write_failed+0x13/0x20
> [<ffffffff814ee6c2>] ? down_write+0x32/0x40
> [<ffffffff81131ddc>] sys_mmap_pgoff+0x5c/0x2d0
> [<ffffffff81010469>] sys_mmap+0x29/0x30
> [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
>
>
> On 06/08/2012 05:18 PM, Anand Avati wrote:
>>
>> Those are 4.x GB. Can you post dmesg output as well? Also, what's 'ulimit -l' on your system?
>>
>> On Fri, Jun 8, 2012 at 4:41 PM, Ling Ho <ling at slac.stanford.edu> wrote:
>>
>> This is the core file from the crash just now
>>
>> [root@psanaoss213 /]# ls -al core*
>> -rw------- 1 root root 4073594880 Jun 8 15:05 core.22682
>>
>> From yesterday:
>> [root@psanaoss214 /]# ls -al core*
>> -rw------- 1 root root 4362727424 Jun 8 00:58 core.13483
>> -rw------- 1 root root 4624773120 Jun 8 03:21 core.8792
>>
>>
>>
>> On 06/08/2012 04:34 PM, Anand Avati wrote:
>>>
>>> Is it possible the system was running low on memory? I see you have 48GB, but a memory registration failure would typically mean the system limit on the number of pinnable pages in RAM was hit. Can you tell us the size of your core dump files after the crash?
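(The pinnable-pages limit referred to here is the per-process locked-memory limit shown by 'ulimit -l'; RDMA memory registration pins pages, so a 64 KB limit is quickly exhausted. A minimal sketch of raising it, assuming the standard pam_limits setup and a glusterd started from an init script; the values are illustrative, not a verified recommendation:

# /etc/security/limits.conf (applies to login sessions via pam_limits)
root    soft    memlock    unlimited
root    hard    memlock    unlimited

# daemons started from init do not go through PAM, so the init script
# may also need, before glusterd is launched:
ulimit -l unlimited

Brick processes have to be restarted to pick up a new limit.)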
>>>
>>> Avati
>>>
>>> On Fri, Jun 8, 2012 at 4:22 PM, Ling Ho <ling at slac.stanford.edu> wrote:
>>> Hello,
>>>
>>> I have a brick that crashed twice today, and a different brick that crashed just a while ago.
>>>
>>> This is what I see in one of the brick logs:
>>>
>>> patchset: git://git.gluster.com/glusterfs.git
>>> patchset: git://git.gluster.com/glusterfs.git
>>> signal received: 6
>>> signal received: 6
>>> time of crash: 2012-06-08 15:05:11
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> fdatasync 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 3.2.6
>>> /lib64/libc.so.6[0x34bc032900]
>>> /lib64/libc.so.6(gsignal+0x35)[0x34bc032885]
>>> /lib64/libc.so.6(abort+0x175)[0x34bc034065]
>>> /lib64/libc.so.6[0x34bc06f977]
>>> /lib64/libc.so.6[0x34bc075296]
>>> /opt/glusterfs/3.2.6/lib64/libglusterfs.so.0(__gf_free+0x44)[0x7f1740ba25e4]
>>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_destroy+0x47)[0x7f1740956967]
>>> /opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_unref+0x62)[0x7f1740956a32]
>>> /opt/glusterfs/3.2.6/lib64/glusterfs/3.2.6/rpc-transport/rdma.so(+0xc135)[0x7f173ca27135]
>>> /lib64/libpthread.so.0[0x34bc8077f1]
>>> /lib64/libc.so.6(clone+0x6d)[0x34bc0e5ccd]
>>> ---------
>>>
>>> And somewhere before these, there is also
>>> [2012-06-08 15:05:07.512604] E [rdma.c:198:rdma_new_post] 0-rpc-transport/rdma: memory registration failed
>>>
>>> I have 48GB of memory on the system:
>>>
>>> # free
>>> total used free shared buffers cached
>>> Mem: 49416716 34496648 14920068 0 31692 28209612
>>> -/+ buffers/cache: 6255344 43161372
>>> Swap: 4194296 1740 4192556
>>>
>>> # uname -a
>>> Linux psanaoss213 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> The server gluster version is 3.2.6-1. I have both rdma clients and tcp clients over a 10Gb/s network.
>>>
>>> Any suggestions on what I should look for?
>>>
>>> Is there a way to just restart the brick, and not glusterd on the server? I have 8 bricks on the server.
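(One possible approach, sketched here with assumptions and not verified on 3.2.6: each brick is served by its own glusterfsd process whose command line contains the brick path, so that single process can be located and stopped without touching the other bricks, e.g.

# ps ax | grep glusterfsd | grep '/brick3'     # '/brick3' is just an example path
# kill <pid-of-that-glusterfsd>

On recent GlusterFS releases 'gluster volume start ana12 force' respawns any brick process that is not running without disturbing the others; whether the 'force' option exists in 3.2.x would need checking. Restarting glusterd on that server should also respawn a dead brick while leaving the running bricks alone.)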
>>>
>>> Thanks,
>>> ...
>>> ling
>>>
>>>
>>> Here's the volume info:
>>>
>>> # gluster volume info
>>>
>>> Volume Name: ana12
>>> Type: Distribute
>>> Status: Started
>>> Number of Bricks: 40
>>> Transport-type: tcp,rdma
>>> Bricks:
>>> Brick1: psanaoss214:/brick1
>>> Brick2: psanaoss214:/brick2
>>> Brick3: psanaoss214:/brick3
>>> Brick4: psanaoss214:/brick4
>>> Brick5: psanaoss214:/brick5
>>> Brick6: psanaoss214:/brick6
>>> Brick7: psanaoss214:/brick7
>>> Brick8: psanaoss214:/brick8
>>> Brick9: psanaoss211:/brick1
>>> Brick10: psanaoss211:/brick2
>>> Brick11: psanaoss211:/brick3
>>> Brick12: psanaoss211:/brick4
>>> Brick13: psanaoss211:/brick5
>>> Brick14: psanaoss211:/brick6
>>> Brick15: psanaoss211:/brick7
>>> Brick16: psanaoss211:/brick8
>>> Brick17: psanaoss212:/brick1
>>> Brick18: psanaoss212:/brick2
>>> Brick19: psanaoss212:/brick3
>>> Brick20: psanaoss212:/brick4
>>> Brick21: psanaoss212:/brick5
>>> Brick22: psanaoss212:/brick6
>>> Brick23: psanaoss212:/brick7
>>> Brick24: psanaoss212:/brick8
>>> Brick25: psanaoss213:/brick1
>>> Brick26: psanaoss213:/brick2
>>> Brick27: psanaoss213:/brick3
>>> Brick28: psanaoss213:/brick4
>>> Brick29: psanaoss213:/brick5
>>> Brick30: psanaoss213:/brick6
>>> Brick31: psanaoss213:/brick7
>>> Brick32: psanaoss213:/brick8
>>> Brick33: psanaoss215:/brick1
>>> Brick34: psanaoss215:/brick2
>>> Brick35: psanaoss215:/brick4
>>> Brick36: psanaoss215:/brick5
>>> Brick37: psanaoss215:/brick7
>>> Brick38: psanaoss215:/brick8
>>> Brick39: psanaoss215:/brick3
>>> Brick40: psanaoss215:/brick6
>>> Options Reconfigured:
>>> performance.io-thread-count: 16
>>> performance.write-behind-window-size: 16MB
>>> performance.cache-size: 1GB
>>> nfs.disable: on
>>> performance.cache-refresh-timeout: 1
>>> network.ping-timeout: 42
>>> performance.cache-max-file-size: 1PB
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users
>>>
>>
>>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://gluster.org/cgi-bin/mailman/listinfo/gluster-users