[Gluster-users] Gluster 3.7.13 NFS Crash
Krutika Dhananjay
kdhananj at redhat.com
Tue Aug 9 05:32:44 UTC 2016
Well, I'm not entirely sure it is a setup-related issue. If you have the
steps to recreate the issue, along with the relevant information about the
volume configuration, logs, core, version, etc., then it would be good to
track this through a bug report.
-Krutika
On Mon, Aug 8, 2016 at 8:56 PM, Mahdi Adnan <mahdi.adnan at outlook.com> wrote:
> Thank you very much for all the efforts.
> I have deployed a new cluster with 3 servers and used nfs-ganesha instead
> of the native NFS. So far it's working fine. I also tried to reproduce
> this issue in a test environment, but I had no luck and it just worked as
> it should.
> Do you think I should file a bug report, or is it maybe an issue with my
> setup only?
>
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: kdhananj at redhat.com
> Date: Mon, 8 Aug 2016 16:33:19 +0530
>
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
> To: mahdi.adnan at outlook.com
> CC: gluster-users at gluster.org
>
> Hi,
>
> Sorry, I haven't had a chance to look into this issue in the last week.
> Do you mind raising a bug upstream with all the relevant information?
> I'll take a look sometime this week.
>
> -Krutika
>
> On Fri, Aug 5, 2016 at 11:58 AM, Mahdi Adnan <mahdi.adnan at outlook.com>
> wrote:
>
> Hi,
>
> Yes, I got some messages about an already-existing file name during the
> rename process while the VMs were online.
>
> Here's the output:
> (gdb) frame 2
> #2 0x00007f195deb1787 in shard_common_inode_write_do
> (frame=0x7f19699f1164, this=0x7f195802ac10) at shard.c:3716
> 3716 anon_fd = fd_anonymous (local->inode_list[i]);
> (gdb) p local->fop
> $1 = GF_FOP_WRITE
> (gdb)
>
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: kdhananj at redhat.com
> Date: Fri, 5 Aug 2016 10:48:36 +0530
>
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
> To: mahdi.adnan at outlook.com
> CC: gluster-users at gluster.org
>
> Also, could you print local->fop please?
>
> -Krutika
>
> On Fri, Aug 5, 2016 at 10:46 AM, Krutika Dhananjay <kdhananj at redhat.com>
> wrote:
>
> Were the images being renamed (specifically to a pathname that already
> exists) while they were being written to?
>
> -Krutika
>
> On Thu, Aug 4, 2016 at 1:14 PM, Mahdi Adnan <mahdi.adnan at outlook.com>
> wrote:
>
> Hi,
>
> Kindly check the following link for the logs from all 7 bricks:
>
> https://db.tt/YP5qTGXk
>
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: kdhananj at redhat.com
> Date: Thu, 4 Aug 2016 13:00:43 +0530
>
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
> To: mahdi.adnan at outlook.com
> CC: gluster-users at gluster.org
>
> Could you also attach the brick logs please?
>
> -Krutika
>
> On Thu, Aug 4, 2016 at 12:48 PM, Mahdi Adnan <mahdi.adnan at outlook.com>
> wrote:
>
> I appreciate your help.
>
> (gdb) frame 2
> #2 0x00007f195deb1787 in shard_common_inode_write_do
> (frame=0x7f19699f1164, this=0x7f195802ac10) at shard.c:3716
> 3716 anon_fd = fd_anonymous (local->inode_list[i]);
> (gdb) p local->inode_list[0]
> $4 = (inode_t *) 0x7f195c532b18
> (gdb) p local->inode_list[1]
> $5 = (inode_t *) 0x0
> (gdb)
>
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: kdhananj at redhat.com
> Date: Thu, 4 Aug 2016 12:43:10 +0530
>
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
> To: mahdi.adnan at outlook.com
> CC: gluster-users at gluster.org
>
> OK.
> Could you also print the values of the following variables from the
> original core:
> i. i
> ii. local->inode_list[0]
> iii. local->inode_list[1]
>
> -Krutika
>
> On Wed, Aug 3, 2016 at 9:01 PM, Mahdi Adnan <mahdi.adnan at outlook.com>
> wrote:
>
> Hi,
>
> Unfortunately no, but I can set up a test bench and see if it produces
> the same results.
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: kdhananj at redhat.com
> Date: Wed, 3 Aug 2016 20:59:50 +0530
>
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
> To: mahdi.adnan at outlook.com
> CC: gluster-users at gluster.org
>
> Do you have a test case that consistently recreates this problem?
>
> -Krutika
>
> On Wed, Aug 3, 2016 at 8:32 PM, Mahdi Adnan <mahdi.adnan at outlook.com>
> wrote:
>
> Hi,
>
> So I have updated to 3.7.14 and I still have the same issue with NFS.
> Based on what I have provided so far from the logs and dumps, do you
> think it's an NFS issue? Should I switch to nfs-ganesha?
> The problem is that the current setup is used in a production
> environment, and switching the mount point of 50+ VMs from the native NFS
> to nfs-ganesha will not be smooth or without downtime, so I would really
> appreciate your thoughts on this matter.
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: mahdi.adnan at outlook.com
> To: kdhananj at redhat.com
> Date: Tue, 2 Aug 2016 08:44:16 +0300
>
> CC: gluster-users at gluster.org
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
>
> Hi,
>
> The NFS server just crashed again; here is the latest bt:
>
> (gdb) bt
> #0 0x00007f0b71a9f210 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1 0x00007f0b72c6fcd5 in fd_anonymous (inode=0x0) at fd.c:804
> #2 0x00007f0b64ca5787 in shard_common_inode_write_do
> (frame=0x7f0b707c062c, this=0x7f0b6002ac10) at shard.c:3716
> #3 0x00007f0b64ca5a53 in shard_common_inode_write_post_lookup_shards_handler
> (frame=<optimized out>, this=<optimized out>) at shard.c:3769
> #4 0x00007f0b64c9eff5 in shard_common_lookup_shards_cbk
> (frame=0x7f0b707c062c, cookie=<optimized out>, this=0x7f0b6002ac10,
> op_ret=0,
> op_errno=<optimized out>, inode=<optimized out>, buf=0x7f0b51407640,
> xdata=0x7f0b72f57648, postparent=0x7f0b514076b0) at shard.c:1601
> #5 0x00007f0b64efe141 in dht_lookup_cbk (frame=0x7f0b7075fcdc,
> cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0,
> inode=0x7f0b5f1d1f58,
> stbuf=0x7f0b51407640, xattr=0x7f0b72f57648, postparent=0x7f0b514076b0)
> at dht-common.c:2174
> #6 0x00007f0b651871f3 in afr_lookup_done (frame=frame at entry=0x7f0b7079a4c8,
> this=this at entry=0x7f0b60023ba0) at afr-common.c:1825
> #7 0x00007f0b65187b84 in afr_lookup_metadata_heal_check (frame=frame at entry
> =0x7f0b7079a4c8, this=0x7f0b60023ba0, this at entry=0xca0bd88259f5a800)
> at afr-common.c:2068
> #8 0x00007f0b6518834f in afr_lookup_entry_heal (frame=frame at entry
> =0x7f0b7079a4c8, this=0xca0bd88259f5a800, this at entry=0x7f0b60023ba0) at
> afr-common.c:2157
> #9 0x00007f0b6518867d in afr_lookup_cbk (frame=0x7f0b7079a4c8,
> cookie=<optimized out>, this=0x7f0b60023ba0, op_ret=<optimized out>,
> op_errno=<optimized out>, inode=<optimized out>, buf=0x7f0b564e9940,
> xdata=0x7f0b72f708c8, postparent=0x7f0b564e99b0) at afr-common.c:2205
> #10 0x00007f0b653d6e42 in client3_3_lookup_cbk (req=<optimized out>,
> iov=<optimized out>, count=<optimized out>, myframe=0x7f0b7076354c)
> at client-rpc-fops.c:2981
> #11 0x00007f0b72a00a30 in rpc_clnt_handle_reply (clnt=clnt at entry
> =0x7f0b603393c0, pollin=pollin at entry=0x7f0b50c1c2d0) at rpc-clnt.c:764
> #12 0x00007f0b72a00cef in rpc_clnt_notify (trans=<optimized out>,
> mydata=0x7f0b603393f0, event=<optimized out>, data=0x7f0b50c1c2d0) at
> rpc-clnt.c:925
> #13 0x00007f0b729fc7c3 in rpc_transport_notify (this=this at entry
> =0x7f0b60349040, event=event at entry=RPC_TRANSPORT_MSG_RECEIVED,
> data=data at entry=0x7f0b50c1c2d0)
> at rpc-transport.c:546
> #14 0x00007f0b678c39a4 in socket_event_poll_in (this=this at entry
> =0x7f0b60349040) at socket.c:2353
> #15 0x00007f0b678c65e4 in socket_event_handler (fd=fd at entry=29,
> idx=idx at entry=17, data=0x7f0b60349040, poll_in=1, poll_out=0, poll_err=0)
> at socket.c:2466
> #16 0x00007f0b72ca0f7a in event_dispatch_epoll_handler
> (event=0x7f0b564e9e80, event_pool=0x7f0b7349bf20) at event-epoll.c:575
> #17 event_dispatch_epoll_worker (data=0x7f0b60152d40) at event-epoll.c:678
> #18 0x00007f0b71a9adc5 in start_thread () from /lib64/libpthread.so.0
> #19 0x00007f0b713dfced in clone () from /lib64/libc.so.6
>
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
> ------------------------------
> From: mahdi.adnan at outlook.com
> To: kdhananj at redhat.com
> Date: Mon, 1 Aug 2016 16:31:50 +0300
> CC: gluster-users at gluster.org
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
>
> Many thanks,
>
> Here are the results:
>
>
> (gdb) p cur_block
> $15 = 4088
> (gdb) p last_block
> $16 = 4088
> (gdb) p local->first_block
> $17 = 4087
> (gdb) p odirect
> $18 = _gf_false
> (gdb) p fd->flags
> $19 = 2
> (gdb) p local->call_count
> $20 = 2
>
>
> If you need more core dumps, I have several files I can upload.
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: kdhananj at redhat.com
> Date: Mon, 1 Aug 2016 18:39:27 +0530
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
> To: mahdi.adnan at outlook.com
> CC: gluster-users at gluster.org
>
> Sorry, I didn't make myself clear. The reason I asked YOU to do it is
> that I tried it on my system and I'm not getting the backtrace (it's all
> question marks).
>
> Attach the core to gdb.
> At the gdb prompt, go to frame 2 by typing
> (gdb) f 2
>
> There, for each of the variables I asked you to get the values of, type
> p followed by the variable name.
> For instance, to get the value of the variable 'odirect', do this:
>
> (gdb) p odirect
>
> and gdb will print its value for you in response.
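[Editor's note] The same inspection can be run non-interactively with gdb's batch mode; the binary and core paths below are placeholders to be replaced with the actual ones:

```shell
# Non-interactive variant of the steps above (paths are placeholders).
gdb /usr/sbin/glusterfs /path/to/core \
    -batch \
    -ex 'frame 2' \
    -ex 'p cur_block' \
    -ex 'p last_block' \
    -ex 'p local->first_block' \
    -ex 'p odirect' \
    -ex 'p fd->flags' \
    -ex 'p local->call_count'
```

An all-question-marks backtrace usually means the matching debug symbols are missing; on CentOS, installing the glusterfs debuginfo packages for the exact running version typically resolves it.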
>
> -Krutika
>
> On Mon, Aug 1, 2016 at 4:55 PM, Mahdi Adnan <mahdi.adnan at outlook.com>
> wrote:
>
> Hi,
>
> How do I get the values of the variables below? I can't get them from
> gdb.
>
>
> --
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
>
> ------------------------------
> From: kdhananj at redhat.com
> Date: Mon, 1 Aug 2016 15:51:38 +0530
> Subject: Re: [Gluster-users] Gluster 3.7.13 NFS Crash
> To: mahdi.adnan at outlook.com
> CC: gluster-users at gluster.org
>
>
> Could you also print and share the values of the following variables from
> the backtrace please:
>
> i. cur_block
> ii. last_block
> iii. local->first_block
> iv. odirect
> v. fd->flags
> vi. local->call_count
>
> -Krutika
>
> On Sat, Jul 30, 2016 at 5:04 PM, Mahdi Adnan <mahdi.adnan at outlook.com>
> wrote:
>
> Hi,
>
> I would really appreciate it if someone could help me fix my NFS crash;
> it's happening a lot and causing many issues for my VMs.
> The problem is that every few hours the native NFS server crashes and the
> volume becomes unavailable from the affected node unless I restart
> glusterd.
> The volume is used by VMware ESXi as a datastore for its VMs, with the
> following options:
>
>
> OS: CentOS 7.2
> Gluster: 3.7.13
>
> Volume Name: vlm01
> Type: Distributed-Replicate
> Volume ID: eacd8248-dca3-4530-9aed-7714a5a114f2
> Status: Started
> Number of Bricks: 7 x 3 = 21
> Transport-type: tcp
> Bricks:
> Brick1: gfs01:/bricks/b01/vlm01
> Brick2: gfs02:/bricks/b01/vlm01
> Brick3: gfs03:/bricks/b01/vlm01
> Brick4: gfs01:/bricks/b02/vlm01
> Brick5: gfs02:/bricks/b02/vlm01
> Brick6: gfs03:/bricks/b02/vlm01
> Brick7: gfs01:/bricks/b03/vlm01
> Brick8: gfs02:/bricks/b03/vlm01
> Brick9: gfs03:/bricks/b03/vlm01
> Brick10: gfs01:/bricks/b04/vlm01
> Brick11: gfs02:/bricks/b04/vlm01
> Brick12: gfs03:/bricks/b04/vlm01
> Brick13: gfs01:/bricks/b05/vlm01
> Brick14: gfs02:/bricks/b05/vlm01
> Brick15: gfs03:/bricks/b05/vlm01
> Brick16: gfs01:/bricks/b06/vlm01
> Brick17: gfs02:/bricks/b06/vlm01
> Brick18: gfs03:/bricks/b06/vlm01
> Brick19: gfs01:/bricks/b07/vlm01
> Brick20: gfs02:/bricks/b07/vlm01
> Brick21: gfs03:/bricks/b07/vlm01
> Options Reconfigured:
> performance.readdir-ahead: off
> performance.quick-read: off
> performance.read-ahead: off
> performance.io-cache: off
> performance.stat-prefetch: off
> cluster.eager-lock: enable
> network.remote-dio: enable
> cluster.quorum-type: auto
> cluster.server-quorum-type: server
> performance.strict-write-ordering: on
> performance.write-behind: off
> cluster.data-self-heal-algorithm: full
> cluster.self-heal-window-size: 128
> features.shard-block-size: 16MB
> features.shard: on
> auth.allow: 192.168.221.50,192.168.221.51,192.168.221.52,192.168.221.56,192.168.208.130,192.168.208.131,192.168.208.132,192.168.208.89,192.168.208.85,192.168.208.208.86
> network.ping-timeout: 10
>
>
> latest bt;
>
>
> (gdb) bt
> #0 0x00007f196acab210 in pthread_spin_lock () from /lib64/libpthread.so.0
> #1 0x00007f196be7bcd5 in fd_anonymous (inode=0x0) at fd.c:804
> #2 0x00007f195deb1787 in shard_common_inode_write_do
> (frame=0x7f19699f1164, this=0x7f195802ac10) at shard.c:3716
> #3 0x00007f195deb1a53 in shard_common_inode_write_post_lookup_shards_handler
> (frame=<optimized out>, this=<optimized out>) at shard.c:3769
> #4 0x00007f195deaaff5 in shard_common_lookup_shards_cbk
> (frame=0x7f19699f1164, cookie=<optimized out>, this=0x7f195802ac10,
> op_ret=0,
> op_errno=<optimized out>, inode=<optimized out>, buf=0x7f194970bc40,
> xdata=0x7f196c15451c, postparent=0x7f194970bcb0) at shard.c:1601
> #5 0x00007f195e10a141 in dht_lookup_cbk (frame=0x7f196998e7d4,
> cookie=<optimized out>, this=<optimized out>, op_ret=0, op_errno=0,
> inode=0x7f195c532b18,
> stbuf=0x7f194970bc40, xattr=0x7f196c15451c, postparent=0x7f194970bcb0)
> at dht-common.c:2174
> #6 0x00007f195e3931f3 in afr_lookup_done (frame=frame at entry=0x7f196997f8a4,
> this=this at entry=0x7f1958022a20) at afr-common.c:1825
> #7 0x00007f195e393b84 in afr_lookup_metadata_heal_check (frame=frame at entry
> =0x7f196997f8a4, this=0x7f1958022a20, this at entry=0xe3a929e0b67fa500)
> at afr-common.c:2068
> #8 0x00007f195e39434f in afr_lookup_entry_heal (frame=frame at entry
> =0x7f196997f8a4, this=0xe3a929e0b67fa500, this at entry=0x7f1958022a20) at
> afr-common.c:2157
> #9 0x00007f195e39467d in afr_lookup_cbk (frame=0x7f196997f8a4,
> cookie=<optimized out>, this=0x7f1958022a20, op_ret=<optimized out>,
> op_errno=<optimized out>, inode=<optimized out>, buf=0x7f195effa940,
> xdata=0x7f196c1853b0, postparent=0x7f195effa9b0) at afr-common.c:2205
> #10 0x00007f195e5e2e42 in client3_3_lookup_cbk (req=<optimized out>,
> iov=<optimized out>, count=<optimized out>, myframe=0x7f196999952c)
> at client-rpc-fops.c:2981
> #11 0x00007f196bc0ca30 in rpc_clnt_handle_reply (clnt=clnt at entry
> =0x7f19583adaf0, pollin=pollin at entry=0x7f195907f930) at rpc-clnt.c:764
> #12 0x00007f196bc0ccef in rpc_clnt_notify (trans=<optimized out>,
> mydata=0x7f19583adb20, event=<optimized out>, data=0x7f195907f930) at
> rpc-clnt.c:925
> #13 0x00007f196bc087c3 in rpc_transport_notify (this=this at entry
> =0x7f19583bd770, event=event at entry=RPC_TRANSPORT_MSG_RECEIVED,
> data=data at entry=0x7f195907f930)
> at rpc-transport.c:546
> #14 0x00007f1960acf9a4 in socket_event_poll_in (this=this at entry
> =0x7f19583bd770) at socket.c:2353
> #15 0x00007f1960ad25e4 in socket_event_handler (fd=fd at entry=25,
> idx=idx at entry=14, data=0x7f19583bd770, poll_in=1, poll_out=0, poll_err=0)
> at socket.c:2466
> #16 0x00007f196beacf7a in event_dispatch_epoll_handler
> (event=0x7f195effae80, event_pool=0x7f196dbf5f20) at event-epoll.c:575
> #17 event_dispatch_epoll_worker (data=0x7f196dc41e10) at event-epoll.c:678
> #18 0x00007f196aca6dc5 in start_thread () from /lib64/libpthread.so.0
> #19 0x00007f196a5ebced in clone () from /lib64/libc.so.6
>
>
>
>
> The NFS logs and the core dump can be found at the Dropbox link below:
> https://db.tt/rZrC9d7f
>
> Thanks in advance.
>
> Respectfully
> *Mahdi A. Mahdi*
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-users
>
>
>
>