[Gluster-devel] Crash on HA config when restoring a server

Wed Aug 8 19:39:02 UTC 2007

OK, on further investigation, it looks like the previous errors were
related to a namespace problem left over from an earlier test of mine.
The namespace might not have matched the shares in some way. I've put
the backtraces inline below, next to the logs referenced before, in
case you still want to look into this and make it fail more gracefully
or report the problem in the logs.

I was able to crash it again doing the same test with a fresh
namespace and share with some sample files thrown in:

Last bit of glusterfsd log here: http://glusterfs.pastebin.com/m3d99c08f

GDB backtrace:
#0  0x00117a8f in afr_lookup_cbk (frame=0xb7e034d0, cookie=0xb7e02cf8, this=0x8e7bee8, op_ret=-1, op_errno=107, inode=0xb7e020d8, buf=0x0) at afr.c:314
#1  0x002a94ff in client_lookup_cbk (frame=0xb7e02cf8, args=0xb7e04dd0) at client-protocol.c:4003
#2  0x002aa1c8 in notify (this=0x8e7a550, event=3, data=0x8e7d988) at client-protocol.c:4296
#3  0x005b9510 in transport_notify (this=0x8e7d988, event=17) at transport.c:154
#4  0x005b9cd9 in sys_epoll_iteration (ctx=0xbfebb820) at epoll.c:53
#5  0x005b97fd in poll_iteration (ctx=0xbfebb820) at transport.c:300
#6  0x08049496 in main (argc=7, argv=0xbfebb8f4) at glusterfsd.c:310
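
For anyone reading the backtraces in this message, the numeric op_errno
values can be decoded with a few lines of Python. This is only a minimal
sketch: it assumes it is run on a Linux host like the servers here (errno
numbers are platform-specific), and describe_errno is a hypothetical helper
name, not anything from GlusterFS.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value.

    describe_errno is a hypothetical helper for reading the backtraces;
    it is not part of GlusterFS. Numeric values are platform-specific.
    """
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 107 and 17 are the op_errno values that appear in the backtraces.
    for code in (107, 17):
        name, message = describe_errno(code)
        print(f"op_errno={code}: {name} ({message})")
```

On Linux, 107 is ENOTCONN, which matches the "Transport endpoint is not
connected" error that ls reports on the clients.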

On Wednesday 08 August 2007 11:14, Anand Avati wrote:
> Kevan,
> is it possible to get a backtrace from the coredump using gdb?
> thanks
> avati
>
> 2007/8/8, Kevan Benson <kbenson at a-1networks.com>:
> > When running a HA config with 2 servers and 2 clients, I can consistently
> > crash the active server after failing the other.  This is on TLA version
> > patched to 440.
> >
> > System configs at http://glusterfs.pastebin.com/m52564c56
> > Server A: 172.16.1.81
> > Server B: 172.16.1.82
> > Client A: 172.16.1.85
> > Client B: 172.16.1.86
> > Note: Client transport-timeout (on clients and servers) was set to 10 in
> > the first two crashes, and set to 30 on Client A and B in the last one
> > (servers still had it set to 10).
> >
> > For the first crash, I fail server B (ifdown eth1), and then try to ls
> > the mount point (time ls -l /mnt/glusterfs) from both clients.  I
> > generally get a "ls: /mnt/glusterfs/: Transport endpoint is not
> > connected" error once or twice, and then the active server's (A)
> > glusterfsd will either start responding or crash (about 50% chance).
> > In this case, I had restored network connectivity to server B and run
> > a few more ls's from the clients.
> >
> > The glusterfsd.log (including backtrace) is at
> > http://glusterfs.pastebin.com/m15d7f914
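
The client side of the test described above (repeated, timed ls -l runs
against the mount point) could be scripted roughly as follows. This is a
sketch under assumptions: timed_ls is a hypothetical helper, the attempt
count, pause and timeout are arbitrary, and failing the server (ifdown eth1)
is still done by hand on the server itself.

```python
import subprocess
import time


def timed_ls(path, timeout=60.0):
    """Run 'ls -l path' once; return (exit_code, stderr_text, elapsed_seconds).

    exit_code is None if the command did not finish within `timeout`.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["ls", "-l", path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
            timeout=timeout,
        )
        code, err = result.returncode, result.stderr.strip()
    except subprocess.TimeoutExpired:
        code, err = None, "timed out"
    return code, err, time.monotonic() - start


if __name__ == "__main__":
    # /mnt/glusterfs is the mount point used in the report; the attempt
    # count and the one-second pause are arbitrary choices for this sketch.
    for attempt in range(1, 6):
        code, err, elapsed = timed_ls("/mnt/glusterfs")
        print(f"attempt {attempt}: exit={code} time={elapsed:.1f}s {err}")
        time.sleep(1)
```

Running it on both clients while server B is down should show the same
mix of errors and successful listings as the manual ls runs, with timings
attached to each attempt.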

GDB backtrace:
#0  0x00125fef in afr_readdir (frame=0x94d3878, this=0x945ebc8, size=0, offset=0, fd=0x94d35f0) at afr.c:3190
#1  0x00136fae in unify_sh_opendir_cbk (frame=0x9464648, cookie=0x945edf8, this=0x945f358, op_ret=Variable "op_ret" is not available.
) at unify-self-heal.c:378
#2  0x00125535 in afr_opendir_cbk (frame=0x94d3c18, cookie=0x950bdd8, this=0x945ebc8, op_ret=7, op_errno=17, fd=0x94d35f0) at afr.c:2907
#3  0x00219e7c in default_opendir_cbk (frame=0x0, cookie=0x950b8d0, this=0x945d550, op_ret=7, op_errno=17, fd=0x94d35f0) at defaults.c:837
#4  0x0039564a in posix_opendir (frame=0x950b8d0, this=0x945d470, loc=0xbfe4e6d0, fd=0x94d35f0) at posix.c:153
#5  0x00219f09 in default_opendir (frame=0x950bdd8, this=0x945d550, loc=0xbfe4e6d0, fd=0x94d35f0) at defaults.c:849
#6  0x0012579b in afr_opendir (frame=0x94d3c18, this=0x945ebc8, loc=0xbfe4e6d0, fd=0x94d35f0) at afr.c:2937
#7  0x001373c2 in gf_unify_self_heal (frame=0x9464648, this=0x945f358, local=0x94d3658) at unify-self-heal.c:463
#8  0x0012e0b8 in unify_lookup_cbk (frame=0x9464648, cookie=0x1, this=0x945f358, op_ret=0, op_errno=0, inode=0x94d4150, buf=0x94d3f68) at unify.c:307
#9  0x0011dea8 in afr_lookup_cbk (frame=0x94d2e70, cookie=0x94d2d80, this=0x945ebc8, op_ret=-1, op_errno=107, inode=0x94d4150, buf=0x0) at afr.c:415
#10 0x001184ff in client_lookup_cbk (frame=0x94d2d80, args=0x94d4128) at client-protocol.c:4003
#11 0x001191c8 in notify (this=0x945e518, event=3, data=0x94624c0) at client-protocol.c:4296
#12 0x0021c510 in transport_notify (this=0x94624c0, event=17) at transport.c:154
#13 0x0021ccd9 in sys_epoll_iteration (ctx=0xbfe4ea10) at epoll.c:53
#14 0x0021c7fd in poll_iteration (ctx=0xbfe4ea10) at transport.c:300
#15 0x08049496 in main (argc=7, argv=0xbfe4eae4) at glusterfsd.c:310

> >
> > Upon restarting glusterfs on server A and restoring the network
> > connection to server B, I initiated the above ls from the clients and
> > crashed server A's glusterfsd again.  Glusterfsd on server B was never
> > restarted; it had only been failed by the loss of connectivity.
> >
> > The glusterfsd.log (including backtrace) for THIS crash is at
> > http://glusterfs.pastebin.com/m28ee8e5a

GDB backtrace:
#0  0x00a0dfef in afr_readdir (frame=0x8d9b958, this=0x8d5dbc8, size=0, offset=0, fd=0x8d9b250) at afr.c:3190
#1  0x00987fae in unify_sh_opendir_cbk (frame=0x8d63510, cookie=0x8d5ddf8, this=0x8d5e358, op_ret=Variable "op_ret" is not available.
) at unify-self-heal.c:378
#2  0x00a0d535 in afr_opendir_cbk (frame=0x8d9b538, cookie=0x8d9b668, this=0x8d5dbc8, op_ret=9, op_errno=17, fd=0x8d9b250) at afr.c:2907
#3  0x009abe7c in default_opendir_cbk (frame=0x0, cookie=0x8d9b6a0, this=0x8d5c550, op_ret=9, op_errno=17, fd=0x8d9b250) at defaults.c:837
#4  0x0023e64a in posix_opendir (frame=0x8d9b6a0, this=0x8d5c470, loc=0xbfe69200, fd=0x8d9b250) at posix.c:153
#5  0x009abf09 in default_opendir (frame=0x8d9b668, this=0x8d5c550, loc=0xbfe69200, fd=0x8d9b250) at defaults.c:849
#6  0x00a0d79b in afr_opendir (frame=0x8d9b538, this=0x8d5dbc8, loc=0xbfe69200, fd=0x8d9b250) at afr.c:2937
#7  0x009883c2 in gf_unify_self_heal (frame=0x8d63510, this=0x8d5e358, local=0x8d9aaf0) at unify-self-heal.c:463
#8  0x0097f0b8 in unify_lookup_cbk (frame=0x8d63510, cookie=0x1, this=0x8d5e358, op_ret=0, op_errno=0, inode=0x8d9aa88, buf=0x8d9b0b8) at unify.c:307
#9  0x00a05ea8 in afr_lookup_cbk (frame=0x8d9ad40, cookie=0x8d9b180, this=0x8d5dbc8, op_ret=-1, op_errno=107, inode=0x8d9aa88, buf=0x0) at afr.c:415
#10 0x002b14ff in client_lookup_cbk (frame=0x8d9b180, args=0x8d9af70) at client-protocol.c:4003
#11 0x002abdb5 in client_protocol_xfer (frame=0x8d9b180, this=0x8d5d518, type=GF_OP_TYPE_FOP_REQUEST, op=GF_FOP_LOOKUP, request=0x8d636c8) at client-protocol.c:347
#12 0x002ae2d5 in client_lookup (frame=0x8d9b180, this=0x8d5d518, loc=0xbfe696f0) at client-protocol.c:2034
#13 0x00a062ab in afr_lookup (frame=0x8d9ad40, this=0x8d5dbc8, loc=0xbfe696f0) at afr.c:533
#14 0x0097f434 in unify_lookup (frame=0x8d63510, this=0x8d5e358, loc=0xbfe696f0) at unify.c:376
#15 0x00954a16 in server_lookup (frame=0x8d9a9a0, bound_xl=0x8d5e358, params=0x8d63660) at server-protocol.c:2366
#16 0x0095a7d7 in notify (this=0x8d5e4c0, event=2, data=0x8d63268) at server-protocol.c:5523
#17 0x009ae510 in transport_notify (this=0x8d63268, event=1) at transport.c:154
#18 0x009aecd9 in sys_epoll_iteration (ctx=0xbfe698d0) at epoll.c:53
#19 0x009ae7fd in poll_iteration (ctx=0xbfe698d0) at transport.c:300
#20 0x08049496 in main (argc=7, argv=0xbfe699a4) at glusterfsd.c:310

> > Here's a crash from doing an ls with one server failed, after restarting
> > one of the servers a few times.
> >
> > The glusterfsd.log (including backtrace):
> > http://glusterfs.pastebin.com/m2ee6c471

GDB backtrace:
#0  0x00169fef in afr_readdir (frame=0x95fb0e8, this=0x954de50, size=0, offset=0, fd=0x95c2678) at afr.c:3190
#1  0x00121fae in unify_sh_opendir_cbk (frame=0x95c29b8, cookie=0x954e300, this=0x954e358, op_ret=Variable "op_ret" is not available.
) at unify-self-heal.c:378
#2  0x00169535 in afr_opendir_cbk (frame=0x95c2120, cookie=0x95facb8, this=0x954de50, op_ret=12, op_errno=17, fd=0x95c2678) at afr.c:2907
#3  0x00f81e7c in default_opendir_cbk (frame=0x0, cookie=0x95facf0, this=0x954d3b8, op_ret=12, op_errno=17, fd=0x95c2678) at defaults.c:837
#4  0x0011264a in posix_opendir (frame=0x95facf0, this=0x954ce98, loc=0xbfe66c90, fd=0x95c2678) at posix.c:153
#5  0x00f81f09 in default_opendir (frame=0x95facb8, this=0x954d3b8, loc=0xbfe66c90, fd=0x95c2678) at defaults.c:849
#6  0x0016979b in afr_opendir (frame=0x95c2120, this=0x954de50, loc=0xbfe66c90, fd=0x95c2678) at afr.c:2937
#7  0x001223c2 in gf_unify_self_heal (frame=0x95c29b8, this=0x954e358, local=0x95fa448) at unify-self-heal.c:463
#8  0x001190b8 in unify_lookup_cbk (frame=0x95c29b8, cookie=0x0, this=0x954e358, op_ret=0, op_errno=0, inode=0x95c3110, buf=0x95c2a58) at unify.c:307
#9  0x00161ea8 in afr_lookup_cbk (frame=0x958abb0, cookie=0x95c2b88, this=0x954de50, op_ret=-1, op_errno=107, inode=0x95c3110, buf=0x0) at afr.c:415
#10 0x00c344ff in client_lookup_cbk (frame=0x95c2b88, args=0x95c34c8) at client-protocol.c:4003
#11 0x00c351c8 in notify (this=0x954d610, event=3, data=0x954f990) at client-protocol.c:4296
#12 0x00f84510 in transport_notify (this=0x954f990, event=17) at transport.c:154
#13 0x00f84cd9 in sys_epoll_iteration (ctx=0xbfe66fd0) at epoll.c:53
#14 0x00f847fd in poll_iteration (ctx=0xbfe66fd0) at transport.c:300
#15 0x08049496 in main (argc=7, argv=0xbfe670a4) at glusterfsd.c:310
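
The three afr_readdir backtraces above match in frames #0 through #10 by
function and file:line. A comparison like that can be automated with a
small parser. This is a minimal sketch, assuming gdb's default frame format
as pasted in this message; parse_backtrace and common_prefix are
hypothetical helper names.

```python
import re

# Start of a gdb frame, e.g. "#0  0x00125fef in afr_readdir (" or "#6  main (".
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\w+) \(")
# Trailing source location, e.g. ") at afr.c:3190".
LOCATION = re.compile(r"\) at ([^\s:]+):(\d+)\s*$")


def parse_backtrace(text):
    """Return a list of (function, file, line) tuples, innermost frame first."""
    # gdb can wrap a frame across lines (e.g. after 'is not available.'),
    # so glue continuation lines back onto the frame they belong to.
    frames = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            frames.append(line)
        elif frames:
            frames[-1] += " " + line
    parsed = []
    for frame in frames:
        start = FRAME_START.match(frame)
        if start is None:
            continue
        where = LOCATION.search(frame)
        file_name = where.group(1) if where else None
        line_no = int(where.group(2)) if where else None
        parsed.append((start.group(2), file_name, line_no))
    return parsed


def common_prefix(traces):
    """Return the innermost frames shared by every parsed trace."""
    shared = []
    for frames in zip(*traces):
        if all(f == frames[0] for f in frames):
            shared.append(frames[0])
        else:
            break
    return shared
```

Feeding each pasted backtrace to parse_backtrace and the results to
common_prefix gives the shared innermost call path, which ignores the
per-run addresses and pointer arguments.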

> > All logs shown are from the crashing server, Server A.  I can just as
> > easily crash server B by failing A.  Let me know if you need more logs
> > from other hosts and I'll re-run whichever scenarios you like.
> >
> > --
> > - Kevan Benson
> > - A-1 Networks

-- 
- Kevan Benson
- A-1 Networks