Re: [Gluster-devel] Crash - 2.0.git-2009.06.16

NovA av.nova at gmail.com
Thu Jun 25 20:06:02 UTC 2009


Hi!

June 25 2009 21:28 Shehjar Tikoo (shehjart at gluster.com) wrote:
> Thanks. I'd also need your server and client volfiles and logs.
I'll send them tomorrow from the office.

> What application were you using when this crash took place?
That node was running an in-house CFD solver, which constantly writes a
log file (with fflush after each line) and, once an hour, solution files.
The crash took place after more than 48 hours of running, when another
user tried to submit his job via the Sun Grid Engine (SGE) scheduler.
SGE simply failed to create the file for stdout redirection, and the job
didn't start. It was definitely not a case of high I/O load or anything
like that.

The second crash was on the same node, and the same user caused it. :)
This time the node was idle; the user submitted his job and it started.
His script creates many soft links and compiles something. According to
the user, the script finished normally. But the next time I checked the
node, the GlusterFS client was down... So again I don't see any specific
usage pattern here.

> What version of GlusterFS is this? Is it a recent git checkout?
Yes, it is. Pulled on 2009.06.16.

Best wishes,
  Andrey

PS: Any comments about memory usage?

>>
>> Recently I migrated our small 24-node HPC cluster from GlusterFS
>> 1.3.8 unify to 2.0 distribute. Performance really seems to have
>> increased a lot. Thanks for your work!
>>
>> I use the following translators. On the servers:
>> posix->locks->iothreads->protocol/server; on the clients:
>> protocol/client->distribute->iothreads->write-behind. The io-threads
>> translator uses 4 threads, with NO autoscaling.
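
For reference, here is a minimal sketch of what those two stacks look
like as volfiles. The volume names, paths, and option lines below are
illustrative assumptions, not our actual configuration (I'll send the
real volfiles and logs as mentioned above):

# --- server (sketch): posix -> locks -> io-threads -> protocol/server
volume posix
  type storage/posix
  option directory /data/export        # placeholder export path
end-volume

volume locks
  type features/locks
  subvolumes posix
end-volume

volume brick
  type performance/io-threads
  option thread-count 4                # 4 threads, no autoscaling
  subvolumes locks
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow *       # open to all; tighten as needed
  subvolumes brick
end-volume

# --- client (sketch): protocol/client -> distribute -> io-threads -> write-behind
volume node01
  type protocol/client
  option transport-type tcp
  option remote-host node01            # placeholder host name
  option remote-subvolume brick
end-volume

# ... one protocol/client volume per server node ...

volume dht
  type cluster/distribute
  subvolumes node01                    # list all client volumes here
end-volume

volume iothreads
  type performance/io-threads
  option thread-count 4                # 4 threads, no autoscaling
  subvolumes dht
end-volume

volume writebehind
  type performance/write-behind
  subvolumes iothreads
end-volume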
>>
>>
>>
>> Unfortunately, after the upgrade I've run into new issues. First, I've
>> noticed very high memory usage: GlusterFS on the head node now eats
>> 737 MB of resident (RES) memory and doesn't give it back. The memory
>> usage grew during the migration, which I performed with the command
>> "cd ${namespace_export} && find . | (cd ${distribute_mount} && xargs -d
>> '\n' stat -c '%n')". Note that the provided
>> migrate-unify-to-distribute.sh script (with the "execute_on" function)
>> doesn't work...
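
Restated as a small script, that migration command just walks the old
unify namespace export and stats every path through the new distribute
mount, so that each file gets looked up on the distribute volume. The
paths below are placeholders for the actual values of the two variables:

#!/bin/sh
# Placeholder paths; substitute the real namespace export and the
# distribute mount point.
namespace_export=/path/to/old/namespace/export
distribute_mount=/path/to/new/distribute/mount

# List every path from the old namespace, then stat the same relative
# path on the distribute mount so it gets looked up there.
cd "${namespace_export}" || exit 1
find . | (cd "${distribute_mount}" && xargs -d '\n' stat -c '%n')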
>>
>>
>>
>> The second problem is more important. A client on one of the nodes
>> crashed today with the following backtrace:
>>
>> ------
>> Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x00002b8039bec860 in ?? () from /lib64/libc.so.6
>> #1  0x00002b8039bedc0c in malloc () from /lib64/libc.so.6
>> #2  0x00002b8039548732 in fop_writev_stub (frame=<value optimized out>,
>>     fn=0x2b803ab6c160 <iot_writev_wrapper>, fd=0x2aaab001e8a0, vector=0x2aaab0071d50,
>>     count=<value optimized out>, off=105432, iobref=0x2aaab0082d60) at common-utils.h:166
>> #3  0x00002b803ab6ec00 in iot_writev (frame=0x4, this=0x6150c0, fd=0x2aaab0082711,
>>     vector=0x2aaab0083060, count=3, offset=105432, iobref=0x2aaab0082d60)
>>     at io-threads.c:1212
>> #4  0x00002b803ad7a3de in wb_sync (frame=0x2aaab0034c40, file=0x2aaaac007280,
>>     winds=0x7fff717a5450) at write-behind.c:445
>> #5  0x00002b803ad7a4ff in wb_do_ops (frame=0x2aaab0034c40, file=0x2aaaac007280,
>>     winds=0x7fff717a5450, unwinds=<value optimized out>, other_requests=0x7fff717a5430)
>>     at write-behind.c:1579
>> #6  0x00002b803ad7a617 in wb_process_queue (frame=0x2aaab0034c40, file=0x2aaaac007280,
>>     flush_all=0 '\0') at write-behind.c:1624
>> #7  0x00002b803ad7dd81 in wb_sync_cbk (frame=0x2aaab0034c40,
>>     cookie=<value optimized out>, this=<value optimized out>, op_ret=19, op_errno=0,
>>     stbuf=<value optimized out>) at write-behind.c:338
>> #8  0x00002b803ab6a1e0 in iot_writev_cbk (frame=0x2aaab00309d0,
>>     cookie=<value optimized out>, this=<value optimized out>, op_ret=19, op_errno=0,
>>     stbuf=0x7fff717a5590) at io-threads.c:1186
>> #9  0x00002b803a953aae in dht_writev_cbk (frame=0x63e3e0, cookie=<value optimized out>,
>>     this=<value optimized out>, op_ret=19, op_errno=0, stbuf=0x7fff717a5590)
>>     at dht-common.c:1797
>> #10 0x00002b803a7406e9 in client_write_cbk (frame=0x648a80, hdr=<value optimized out>,
>>     hdrlen=<value optimized out>, iobuf=<value optimized out>) at client-protocol.c:4363
>> #11 0x00002b803a72c83a in protocol_client_pollin (this=0x60ec70, trans=0x61a380)
>>     at client-protocol.c:6230
>> #12 0x00002b803a7370bc in notify (this=0x4, event=<value optimized out>, data=0x61a380)
>>     at client-protocol.c:6274
>> #13 0x00002b8039533183 in xlator_notify (xl=0x60ec70, event=2, data=0x61a380)
>>     at xlator.c:820
>> #14 0x00002aaaaaaaff0b in socket_event_handler (fd=<value optimized out>, idx=4,
>>     data=0x61a380, poll_in=1, poll_out=0, poll_err=0) at socket.c:813
>> #15 0x00002b803954b2aa in event_dispatch_epoll (event_pool=0x6094f0) at event.c:804
>> #16 0x0000000000403f34 in main (argc=6, argv=0x7fff717a64f8) at glusterfsd.c:1223
>> ----------
>>
>>
>>
>> Later, GlusterFS crashed again with a different backtrace:
>>
>> ----------
>> Core was generated by `glusterfs -f /etc/glusterfs/client.vol -l /var/log/glusterfs/client.log /home'.
>> Program terminated with signal 6, Aborted.
>> #0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
>> (gdb) bt
>> #0  0x00002ae6dfcd4b45 in raise () from /lib64/libc.so.6
>> #1  0x00002ae6dfcd60e0 in abort () from /lib64/libc.so.6
>> #2  0x00002ae6dfd0cfbb in ?? () from /lib64/libc.so.6
>> #3  0x00002ae6dfd1221d in ?? () from /lib64/libc.so.6
>> #4  0x00002ae6dfd13f76 in free () from /lib64/libc.so.6
>> #5  0x00002ae6df673efd in mem_put (pool=0x631a90, ptr=0x2aaaac0bc520) at mem-pool.c:191
>> #6  0x00002ae6e0c992ce in iot_dequeue_ordered (worker=0x631a20) at io-threads.c:2407
>> #7  0x00002ae6e0c99326 in iot_worker_ordered (arg=<value optimized out>)
>>     at io-threads.c:2421
>> #8  0x00002ae6dfa8e020 in start_thread () from /lib64/libpthread.so.0
>> #9  0x00002ae6dfd68f8d in clone () from /lib64/libc.so.6
>> #10 0x0000000000000000 in ?? ()
>> ----------
>>
>>
>>
>> Hope these backtraces help in finding the issue...




