[Gluster-users] Brick crashes

Sat Jun 9 05:58:01 UTC 2012

Hi Ling Ho,
    It looks like you are using rdma; could you confirm?
I suspect a memory leak. Could you help me confirm whether that is the case?
Please post the output of the following:
1) When you start the brick, run 'kill -USR1 <pid-of-brick>'. This will save a file /tmp/glusterdump.<pid-of-brick>
2) mv /tmp/glusterdump.<pid-of-brick> /tmp/glusterdump.<pid-of-brick>.pre
3) Run the brick for a while and watch 'top -p <pid-of-brick>' to see whether the 'RES' field is increasing. After it has increased by 1G or so,
run 'kill -USR1 <pid-of-brick>' once more. Then attach both /tmp/glusterdump.<pid-of-brick> and /tmp/glusterdump.<pid-of-brick>.pre
to this mail.
Please also let us know which operations are performed on the system, so that we can re-create this case in our test labs.
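
The three steps above could be scripted roughly as follows. This is only a sketch, not something taken from the GlusterFS sources: the function names, the 5-second pause after the first signal and the polling interval are assumptions made for illustration. It reads VmRSS from /proc/<pid>/status, which corresponds to the RES column in top, and it has to run as a user that is allowed to signal the brick process.

```python
import os
import signal
import time


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value (in kB) found in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("no VmRSS line found")


def read_rss_kb(pid):
    """Read the current resident set size of a process, in kB."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read())


def capture_dumps(pid, growth_kb=1024 * 1024, poll_seconds=60):
    """Take one dump now and a second one after RES has grown by growth_kb."""
    dump = f"/tmp/glusterdump.{pid}"
    baseline_kb = read_rss_kb(pid)

    os.kill(pid, signal.SIGUSR1)      # step 1: ask the brick to write a dump
    time.sleep(5)                     # assumption: time for the file to be written
    os.rename(dump, dump + ".pre")    # step 2: keep it as the "before" dump

    # step 3: wait until RES has grown by about 1G, then dump again
    while read_rss_kb(pid) - baseline_kb < growth_kb:
        time.sleep(poll_seconds)
    os.kill(pid, signal.SIGUSR1)
    return dump + ".pre", dump
```

Calling capture_dumps(<pid-of-brick>) right after the brick starts would return the two file names to attach once the growth has been observed.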

Pranith.

----- Original Message -----
From: "Ling Ho" <ling at slac.stanford.edu>
To: "Anand Avati" <anand.avati at gmail.com>
Cc: Gluster-users at gluster.org
Sent: Saturday, June 9, 2012 5:11:12 AM
Subject: Re: [Gluster-users] Brick crashes

This is the core file from the crash just now 

[root@psanaoss213 /]# ls -al core*
-rw------- 1 root root 4073594880 Jun 8 15:05 core.22682 

From yesterday:
[root@psanaoss214 /]# ls -al core*
-rw------- 1 root root 4362727424 Jun 8 00:58 core.13483 
-rw------- 1 root root 4624773120 Jun 8 03:21 core.8792 

On 06/08/2012 04:34 PM, Anand Avati wrote: 

Is it possible the system was running low on memory? I see you have 48GB, but memory registration failure typically would be because the system limit on the number of pinnable pages in RAM was hit. Can you tell us the size of your core dump files after the crash? 
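
One limit of this kind that can be checked quickly is the locked-memory limit (RLIMIT_MEMLOCK) of the running brick process. Treating that as the relevant limit is an assumption here, since the message above does not name a specific one. A minimal sketch that reads it from /proc/<pid>/limits, with hypothetical function names:

```python
def parse_max_locked_memory(limits_text):
    """Return (soft, hard) 'Max locked memory' limits from /proc/<pid>/limits text.

    Each value is an int in bytes, or None when the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max locked memory"):
            # Fields: "Max", "locked", "memory", <soft>, <hard>, "bytes"
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max locked memory' line found")


def brick_memlock_limit(pid):
    """Read the locked-memory limits of a running process."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_locked_memory(f.read())
```

A small soft limit on the brick process would be consistent with registration failing even though plenty of RAM is free, but that would still need to be confirmed against the actual crash.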

Avati 

On Fri, Jun 8, 2012 at 4:22 PM, Ling Ho < ling at slac.stanford.edu > wrote: 

Hello, 

I have a brick that crashed twice today, and a different brick that crashed just a while ago.

This is what I see in one of the brick logs: 

patchset: git://git.gluster.com/glusterfs.git
patchset: git://git.gluster.com/glusterfs.git
signal received: 6 
signal received: 6 
time of crash: 2012-06-08 15:05:11 
configuration details: 
argp 1 
backtrace 1 
dlfcn 1 
fdatasync 1 
libpthread 1 
llistxattr 1 
setfsid 1 
spinlock 1 
epoll.h 1 
xattr.h 1 
st_atim.tv_nsec 1 
package-string: glusterfs 3.2.6 
/lib64/libc.so.6[0x34bc032900] 
/lib64/libc.so.6(gsignal+0x35)[0x34bc032885] 
/lib64/libc.so.6(abort+0x175)[0x34bc034065] 
/lib64/libc.so.6[0x34bc06f977] 
/lib64/libc.so.6[0x34bc075296] 
/opt/glusterfs/3.2.6/lib64/libglusterfs.so.0(__gf_free+0x44)[0x7f1740ba25e4] 
/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_destroy+0x47)[0x7f1740956967] 
/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_transport_unref+0x62)[0x7f1740956a32] 
/opt/glusterfs/3.2.6/lib64/glusterfs/3.2.6/rpc-transport/rdma.so(+0xc135)[0x7f173ca27135] 
/lib64/libpthread.so.0[0x34bc8077f1] 
/lib64/libc.so.6(clone+0x6d)[0x34bc0e5ccd] 
--------- 

And somewhere before these, there is also 
[2012-06-08 15:05:07.512604] E [rdma.c:198:rdma_new_post] 0-rpc-transport/rdma: memory registration failed 

I have 48GB of memory on the system: 

# free 
             total       used       free     shared    buffers     cached
Mem:      49416716   34496648   14920068          0      31692   28209612
-/+ buffers/cache:    6255344   43161372
Swap:      4194296       1740    4192556
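
For reading the free output above: the "-/+ buffers/cache" row is the Mem row with buffers and page cache moved from "used" to "free". A small sketch of that arithmetic (the function name is made up for illustration), using the figures shown:

```python
def buffers_cache_adjusted(used_kb, free_kb, buffers_kb, cached_kb):
    """Return (used, free) in kB with buffers and page cache counted as free."""
    reclaimable_kb = buffers_kb + cached_kb
    return used_kb - reclaimable_kb, free_kb + reclaimable_kb


# Values from the Mem row above, all in kB.
used_kb, free_kb = buffers_cache_adjusted(34496648, 14920068, 31692, 28209612)
print(used_kb, free_kb)  # 6255344 43161372
```

So only about 6 GB is held by processes and roughly 43 GB could be reclaimed, which suggests the machine as a whole was not short of memory at the time this was run.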

# uname -a 
Linux psanaoss213 2.6.32-220.7.1.el6.x86_64 #1 SMP Fri Feb 10 15:22:22 EST 2012 x86_64 x86_64 x86_64 GNU/Linux 

The server gluster version is 3.2.6-1. I have both rdma clients and tcp clients over a 10Gb/s network.

Any suggestions on what I should look for?

Is there a way to just restart the brick, and not glusterd on the server? I have 8 bricks on the server. 

Thanks, 
... 
ling 

Here's the volume info: 

# gluster volume info 

Volume Name: ana12 
Type: Distribute 
Status: Started 
Number of Bricks: 40 
Transport-type: tcp,rdma 
Bricks: 
Brick1: psanaoss214:/brick1 
Brick2: psanaoss214:/brick2 
Brick3: psanaoss214:/brick3 
Brick4: psanaoss214:/brick4 
Brick5: psanaoss214:/brick5 
Brick6: psanaoss214:/brick6 
Brick7: psanaoss214:/brick7 
Brick8: psanaoss214:/brick8 
Brick9: psanaoss211:/brick1 
Brick10: psanaoss211:/brick2 
Brick11: psanaoss211:/brick3 
Brick12: psanaoss211:/brick4 
Brick13: psanaoss211:/brick5 
Brick14: psanaoss211:/brick6 
Brick15: psanaoss211:/brick7 
Brick16: psanaoss211:/brick8 
Brick17: psanaoss212:/brick1 
Brick18: psanaoss212:/brick2 
Brick19: psanaoss212:/brick3 
Brick20: psanaoss212:/brick4 
Brick21: psanaoss212:/brick5 
Brick22: psanaoss212:/brick6 
Brick23: psanaoss212:/brick7 
Brick24: psanaoss212:/brick8 
Brick25: psanaoss213:/brick1 
Brick26: psanaoss213:/brick2 
Brick27: psanaoss213:/brick3 
Brick28: psanaoss213:/brick4 
Brick29: psanaoss213:/brick5 
Brick30: psanaoss213:/brick6 
Brick31: psanaoss213:/brick7 
Brick32: psanaoss213:/brick8 
Brick33: psanaoss215:/brick1 
Brick34: psanaoss215:/brick2 
Brick35: psanaoss215:/brick4 
Brick36: psanaoss215:/brick5 
Brick37: psanaoss215:/brick7 
Brick38: psanaoss215:/brick8 
Brick39: psanaoss215:/brick3 
Brick40: psanaoss215:/brick6 
Options Reconfigured: 
performance.io-thread-count: 16 
performance.write-behind-window-size: 16MB 
performance.cache-size: 1GB 
nfs.disable: on 
performance.cache-refresh-timeout: 1 
network.ping-timeout: 42 
performance.cache-max-file-size: 1PB 

_______________________________________________ 
Gluster-users mailing list 
Gluster-users at gluster.org 
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users 
