[Gluster-devel] Crash - 2.0.git-2009.06.16

Shehjar Tikoo shehjart at gluster.com
Sun Jul 5 07:53:18 UTC 2009


Hi

Firstly, a fix for this crash is under review.
See http://patches.gluster.com/patch/672/

Secondly, I saw in the logs you provided that the number of
outstanding/pending requests on a single thread was more than 64.
This could be due to a large number of concurrent meta-data
operations, a large number of files being open at the same time,
or both.

I suggest increasing the number of io-threads to 8 at both the
client and the server, so that the large number of pending requests
is spread over more threads. It might result in better performance.

-Shehjar

NovA wrote:
> Hi everybody!
> 
> 
> 
> Recently I've migrated our small 24-node HPC cluster from glusterFS
> 1.3.8 unify to 2.0 distribute. Performance seems to have increased
> a lot. Thanks for your work!
> 
> I use the following translators. On servers:
> posix->locks->iothreads->protocol/server; on clients:
> protocol/client->distribute->iothreads->write-behind. The io-threads
> translator uses 4 threads, with NO autoscaling.
> 
> 
> 
> Unfortunately, after the upgrade I've got new issues. First, I've
> noticed very high memory usage. GlusterFS on the head node now eats
> 737 MB of RES memory and doesn't return it. The memory usage grew
> during the migration process, driven by the command "cd
> ${namespace_export} && find . | (cd ${distribute_mount} && xargs -d
> '\n' stat -c '%n')". Note that the provided
> migrate-unify-to-distribute.sh script (with its "execute_on"
> function) doesn't work...
> 
> 
> 
> The second problem is more important. A client on one of the nodes
> crashed today with the following backtrace:
> 
> ------
> Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
> #1  0x00002b8039bedc0c in malloc () from /lib64/libc.so.6
> #2  0x00002b8039548732 in fop_writev_stub (frame=<value optimized out>,
>     fn=0x2b803ab6c160 <iot_writev_wrapper>, fd=0x2aaab001e8a0, vector=0x2aaab0071d50,
>     count=<value optimized out>, off=105432, iobref=0x2aaab0082d60) at common-utils.h:166
> #3  0x00002b803ab6ec00 in iot_writev (frame=0x4, this=0x6150c0, fd=0x2aaab0082711,
>     vector=0x2aaab0083060, count=3, offset=105432, iobref=0x2aaab0082d60) at io-threads.c:1212
> #4  0x00002b803ad7a3de in wb_sync (frame=0x2aaab0034c40, file=0x2aaaac007280,
>     winds=0x7fff717a5450) at write-behind.c:445
> #5  0x00002b803ad7a4ff in wb_do_ops (frame=0x2aaab0034c40, file=0x2aaaac007280,
>     winds=0x7fff717a5450, unwinds=<value optimized out>, other_requests=0x7fff717a5430)
>     at write-behind.c:1579
> #6  0x00002b803ad7a617 in wb_process_queue (frame=0x2aaab0034c40, file=0x2aaaac007280,
>     flush_all=0 '\0') at write-behind.c:1624
> #7  0x00002b803ad7dd81 in wb_sync_cbk (frame=0x2aaab0034c40, cookie=<value optimized out>,
>     this=<value optimized out>, op_ret=19, op_errno=0, stbuf=<value optimized out>)
>     at write-behind.c:338
> #8  0x00002b803ab6a1e0 in iot_writev_cbk (frame=0x2aaab00309d0, cookie=<value optimized out>,
>     this=<value optimized out>, op_ret=19, op_errno=0, stbuf=0x7fff717a5590) at io-threads.c:1186
> #9  0x00002b803a953aae in dht_writev_cbk (frame=0x63e3e0, cookie=<value optimized out>,
>     this=<value optimized out>, op_ret=19, op_errno=0, stbuf=0x7fff717a5590) at dht-common.c:1797
> #10 0x00002b803a7406e9 in client_write_cbk (frame=0x648a80, hdr=<value optimized out>,
>     hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4363
> #11 0x00002b803a72c83a in protocol_client_pollin (this=0x60ec70, trans=0x61a380)
>     at client-protocol.c:6230
> #12 0x00002b803a7370bc in notify (this=0x4, event=<value optimized out>, data=0x61a380)
>     at client-protocol.c:6274
> #13 0x00002b8039533183 in xlator_notify (xl=0x60ec70, event=2, data=0x61a380) at xlator.c:820
> #14 0x00002aaaaaaaff0b in socket_event_handler (fd=<value optimized out>, idx=4,
>     data=0x61a380, poll_in=1, poll_out=0, poll_err=0) at socket.c:813
> #15 0x00002b803954b2aa in event_dispatch_epoll (event_pool=0x6094f0) at event.c:804
> #16 0x0000000000403f34 in main (argc=6, argv=0x7fff717a64f8) at glusterfsd.c:1223
> ----------
> 
> 
> 
> Later, glusterFS crashed again with a different backtrace:
> 
> ----------
> Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
> Program terminated with signal 6, Aborted.
> #0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
> #1  0x00002ae6dfcd60e0 in abort () from /lib64/libc.so.6
> #2  0x00002ae6dfd0cfbb in ?? () from /lib64/libc.so.6
> #3  0x00002ae6dfd1221d in ?? () from /lib64/libc.so.6
> #4  0x00002ae6dfd13f76 in free () from /lib64/libc.so.6
> #5  0x00002ae6df673efd in mem_put (pool=0x631a90, ptr=0x2aaaac0bc520) at mem-pool.c:191
> #6  0x00002ae6e0c992ce in iot_dequeue_ordered (worker=0x631a20) at io-threads.c:2407
> #7  0x00002ae6e0c99326 in iot_worker_ordered (arg=<value optimized out>) at io-threads.c:2421
> #8  0x00002ae6dfa8e020 in start_thread () from /lib64/libpthread.so.0
> #9  0x00002ae6dfd68f8d in clone () from /lib64/libc.so.6
> #10 0x0000000000000000 in ?? ()
> ----------
> 
> 
> 
> Hope these backtraces help to find the issue...
> 
> 
> 
> Best regards,
> 
> Andrey
> 
> 
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at nongnu.org
> http://lists.nongnu.org/mailman/listinfo/gluster-devel
> 





