From atumball at redhat.com Fri Feb 1 03:24:40 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Fri, 1 Feb 2019 08:54:40 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Hi Artem, Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a clone of other bugs where recent discussions happened), and marked it as a blocker for glusterfs-5.4 release. We already have fixes for log flooding - https://review.gluster.org/22128, and are the process of identifying and fixing the issue seen with crash. Can you please tell if the crashes happened as soon as upgrade ? or was there any particular pattern you observed before the crash. -Amar On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii wrote: > Within 24 hours after updating from rock solid 4.1 to 5.3, I already got a > crash which others have mentioned in > https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to unmount, > kill gluster, and remount: > > > [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fcccafcd329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] > [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fcccafcd329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] > [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fcccafcd329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] > [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fcccafcd329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] > The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] > 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-3" > repeated 5 times between [2019-01-31 09:37:54.751905] and [2019-01-31 > 09:38:03.958061] > The message "E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch > handler" repeated 72 times between [2019-01-31 09:37:53.746741] and > [2019-01-31 09:38:04.696993] > pending frames: > frame : type(1) op(READ) > frame : type(1) op(OPEN) > frame : type(0) op(0) > patchset: git://git.gluster.org/glusterfs.git > signal received: 6 > time of crash: > 2019-01-31 09:38:04 > configuration details: > argp 1 > backtrace 1 > dlfcn 1 > libpthread 1 > llistxattr 1 > setfsid 1 > spinlock 1 > epoll.h 1 > xattr.h 1 > st_atim.tv_nsec 1 > package-string: glusterfs 
5.3 > /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] > /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] > /lib64/libc.so.6(+0x36160)[0x7fccd622d160] > /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] > /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] > /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] > /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] > /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] > > /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] > /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] > /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] > /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] > /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] > /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] > /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] > --------- > > Do the pending patches fix the crash or only the repeated warnings? I'm > running glusterfs on OpenSUSE 15.0 installed via > http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, > not too sure how to make it core dump. > > If it's not fixed by the patches above, has anyone already opened a ticket > for the crashes that I can join and monitor? This is going to create a > massive problem for us since production systems are crashing. > > Thanks. > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa > wrote: > >> >> >> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii >> wrote: >> >>> Also, not sure if related or not, but I got a ton of these "Failed to >>> dispatch handler" in my logs as well. Many people have been commenting >>> about this issue here >>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>> >> >> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>> >> >>> ==> mnt-SITE_data1.log <== >>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fd966fcd329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>> ==> mnt-SITE_data3.log <== >>>> The message "E [MSGID: 101191] >>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>> [2019-01-30 20:38:20.015593] >>>> The message "I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>> ==> mnt-SITE_data1.log <== >>>> The message "I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>> The message "E [MSGID: 101191] >>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>> [2019-01-30 20:38:20.546355] >>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>> selecting local read_child SITE_data1-client-0 >>>> ==> mnt-SITE_data3.log <== >>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>> selecting local read_child SITE_data3-client-0 >>>> ==> mnt-SITE_data1.log <== >>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>> handler >>> >>> >>> I'm hoping raising the issue here on the mailing list may bring some >>> additional eyeballs and get them both fixed. >>> >>> Thanks. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii >>> wrote: >>> >>>> I found a similar issue here: >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a comment >>>> from 3 days ago from someone else with 5.3 who started seeing the spam. >>>> >>>> Here's the command that repeats over and over: >>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fd966fcd329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>> >>> >> +Milind Changire Can you check why this message is >> logged and send a fix? >> >> >>>> Is there any fix for this issue? >>>> >>>> Thanks. 
>>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From spisla80 at gmail.com Fri Feb 1 15:13:58 2019 From: spisla80 at gmail.com (David Spisla) Date: Fri, 1 Feb 2019 16:13:58 +0100 Subject: [Gluster-users] Corrupted File readable via FUSE? Message-ID: Hello Gluster Community, I have got a 4 Node Cluster with a Replica 4 Volume, so each node has a brick with a copy of a file. Now I tried out the bitrot functionality and corrupted the copy on the brick of node1. After this I ran an on-demand scrub and the file was correctly marked as corrupted. Now I try to read that file from FUSE on node1 (with the corrupt copy): $ cat file1.txt cat: file1.txt: Transport endpoint is not connected FUSE log says: *[2019-02-01 15:02:19.191984] E [MSGID: 114031] [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote operation failed. Path: /data/file1.txt (b432c1d6-ece2-42f2-8749-b11e058c4be3) [Input/output error]* [2019-02-01 15:02:19.192269] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fc642471329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fc642682af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fc64a78d218] ) 0-dict: dict is NULL [Invalid argument] [2019-02-01 15:02:19.192714] E [MSGID: 108009] [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to open /data/file1.txt on subvolume archive1-client-0 [Input/output error] *[2019-02-01 15:02:19.193009] W [fuse-bridge.c:2371:fuse_readv_cbk] 0-glusterfs-fuse: 147733: READ => -1 gfid=b432c1d6-ece2-42f2-8749-b11e058c4be3 fd=0x7fc60408bbb8 (Transport endpoint is not connected)* [2019-02-01 15:02:19.193653] W [MSGID: 114028] [client-lk.c:347:delete_granted_locks_owner] 0-archive1-client-0: fdctx not valid [Invalid argument] And from FUSE on node2 (with the healthy copy): $ cat file1.txt file1 It seems that node1 wants to read the file from its own brick, but the copy there is broken. Node2 reads the file from its own brick, which holds a healthy copy, so reading the file succeeds. But I am wondering, because sometimes reading the file from node1 with the broken copy succeeds. What is the expected behaviour here? Is it possible to read files with a corrupted copy from any client? Regards David Spisla -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From archon810 at gmail.com Fri Feb 1 17:03:33 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Fri, 1 Feb 2019 09:03:33 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Hi, The first (and so far only) crash happened at 2am the next day after we upgraded, on only one of four servers and only to one of two mounts. I have no idea what caused it, but yeah, we do have a pretty busy site ( apkmirror.com), and it caused a disruption for any uploads or downloads from that server until I woke up and fixed the mount. I wish I could be more helpful but all I have is that stack trace. I'm glad it's a blocker and will hopefully be resolved soon. On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < atumball at redhat.com> wrote: > Hi Artem, > > Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a > clone of other bugs where recent discussions happened), and marked it as a > blocker for glusterfs-5.4 release. > > We already have fixes for log flooding - https://review.gluster.org/22128, > and are the process of identifying and fixing the issue seen with crash. > > Can you please tell if the crashes happened as soon as upgrade ? or was > there any particular pattern you observed before the crash. > > -Amar > > > On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii > wrote: > >> Within 24 hours after updating from rock solid 4.1 to 5.3, I already got >> a crash which others have mentioned in >> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to unmount, >> kill gluster, and remount: >> >> >> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fcccafcd329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fcccafcd329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fcccafcd329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fcccafcd329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >> The message "I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >> selecting local read_child SITE_data1-client-3" repeated 5 times between >> [2019-01-31 09:37:54.751905] and 
[2019-01-31 09:38:03.958061] >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >> [2019-01-31 09:38:04.696993] >> pending frames: >> frame : type(1) op(READ) >> frame : type(1) op(OPEN) >> frame : type(0) op(0) >> patchset: git://git.gluster.org/glusterfs.git >> signal received: 6 >> time of crash: >> 2019-01-31 09:38:04 >> configuration details: >> argp 1 >> backtrace 1 >> dlfcn 1 >> libpthread 1 >> llistxattr 1 >> setfsid 1 >> spinlock 1 >> epoll.h 1 >> xattr.h 1 >> st_atim.tv_nsec 1 >> package-string: glusterfs 5.3 >> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >> >> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >> --------- >> >> Do the pending patches fix the crash or only the repeated warnings? I'm >> running glusterfs on OpenSUSE 15.0 installed via >> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >> not too sure how to make it core dump. >> >> If it's not fixed by the patches above, has anyone already opened a >> ticket for the crashes that I can join and monitor? This is going to create >> a massive problem for us since production systems are crashing. >> >> Thanks. >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii >>> wrote: >>> >>>> Also, not sure if related or not, but I got a ton of these "Failed to >>>> dispatch handler" in my logs as well. Many people have been commenting >>>> about this issue here >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>> >>> >>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>> >>> >>>> ==> mnt-SITE_data1.log <== >>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fd966fcd329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>> ==> mnt-SITE_data3.log <== >>>>> The message "E [MSGID: 101191] >>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>> [2019-01-30 20:38:20.015593] >>>>> The message "I [MSGID: 108031] >>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>> ==> mnt-SITE_data1.log <== >>>>> The message "I [MSGID: 108031] >>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>> The message "E [MSGID: 101191] >>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>> [2019-01-30 20:38:20.546355] >>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>> selecting local read_child SITE_data1-client-0 >>>>> ==> mnt-SITE_data3.log <== >>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>> selecting local read_child SITE_data3-client-0 >>>>> ==> mnt-SITE_data1.log <== >>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>> handler >>>> >>>> >>>> I'm hoping raising the issue here on the mailing list may bring some >>>> additional eyeballs and get them both fixed. >>>> >>>> Thanks. >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>> archon810 at gmail.com> wrote: >>>> >>>>> I found a similar issue here: >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>> spam. >>>>> >>>>> Here's the command that repeats over and over: >>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fd966fcd329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>> >>>> >>> +Milind Changire Can you check why this message >>> is logged and send a fix? >>> >>> >>>>> Is there any fix for this issue? >>>>> >>>>> Thanks. 
>>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Amar Tumballi (amarts) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cory at securecloudsolutions.com Fri Feb 1 20:01:27 2019 From: cory at securecloudsolutions.com (Cory Sanders) Date: Fri, 1 Feb 2019 20:01:27 +0000 Subject: [Gluster-users] please remove me from the list Message-ID: <866ea5d1ecfa46bc95abbb3747b504c8@winhexbeus17.winus.mail> Thjanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedro at pmc.digital Fri Feb 1 15:35:54 2019 From: pedro at pmc.digital (Pedro Costa) Date: Fri, 1 Feb 2019 15:35:54 +0000 Subject: [Gluster-users] Help analise statedumps Message-ID: Hi, I have a 3x replicated cluster running 4.1.7 on ubuntu 16.04.5, all 3 replicas are also clients hosting a Node.js/Nginx web server. The current configuration is as such: Volume Name: gvol1 Type: Replicate Volume ID: XXXXXX Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: vm000000:/srv/brick1/gvol1 Brick2: vm000001:/srv/brick1/gvol1 Brick3: vm000002:/srv/brick1/gvol1 Options Reconfigured: cluster.self-heal-readdir-size: 2KB cluster.self-heal-window-size: 2 cluster.background-self-heal-count: 20 network.ping-timeout: 5 disperse.eager-lock: off performance.parallel-readdir: on performance.readdir-ahead: on performance.rda-cache-limit: 128MB performance.cache-refresh-timeout: 10 performance.nl-cache-timeout: 600 performance.nl-cache: on cluster.nufa: on performance.enable-least-priority: off server.outstanding-rpc-limit: 128 performance.strict-o-direct: on cluster.shd-max-threads: 12 client.event-threads: 4 cluster.lookup-optimize: on network.inode-lru-limit: 90000 performance.md-cache-timeout: 600 performance.cache-invalidation: on performance.cache-samba-metadata: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on storage.fips-mode-rchecksum: on transport.address-family: inet nfs.disable: on performance.client-io-threads: on features.utime: on storage.ctime: on server.event-threads: 4 performance.cache-size: 256MB performance.read-ahead: on cluster.readdir-optimize: on cluster.strict-readdir: on performance.io-thread-count: 8 server.allow-insecure: on cluster.read-hash-mode: 0 cluster.lookup-unhashed: auto cluster.choose-local: on I believe there's a memory leak somewhere, it just keeps going up until it hangs one or more nodes taking the whole cluster down sometimes. I have taken 2 statedumps on one of the nodes, one where the memory is too high and another just after a reboot with the app running and the volume fully healed. 
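For reference, a minimal sketch of how statedumps like these are typically captured (the volume name gvol1 is taken from the config above; the pgrep pattern and the /var/run/gluster dump location are assumptions and may differ per setup):

  # On a server node: dump the state of the brick processes for the volume.
  # Dumps usually end up under /var/run/gluster/.
  gluster volume statedump gvol1

  # For the FUSE client process itself, send SIGUSR1 to trigger a statedump;
  # the PID lookup below is only illustrative.
  kill -USR1 "$(pgrep -f 'glusterfs.*gvol1' | head -n 1)"

Comparing a high-memory dump against a fresh one usually shows which allocation pools keep growing.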
https://pmcdigital.sharepoint.com/:u:/g/EYDsNqTf1UdEuE6B0ZNVPfIBf_I-AbaqHotB1lJOnxLlTg?e=boYP09 (high memory) https://pmcdigital.sharepoint.com/:u:/g/EWZBsnET2xBHl6OxO52RCfIBvQ0uIDQ1GKJZ1GrnviyMhg?e=wI3yaY (after reboot) Any help would be greatly appreciated, Kindest Regards, Pedro Maia Costa Senior Developer, pmc.digital -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From archon810 at gmail.com Sat Feb 2 20:14:30 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Sat, 2 Feb 2019 12:14:30 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: The fuse crash happened again yesterday, to another volume. Are there any mount options that could help mitigate this? In the meantime, I set up a monit (https://mmonit.com/monit/) task to watch and restart the mount, which works and recovers the mount point within a minute. Not ideal, but a temporary workaround. By the way, the way to reproduce this "Transport endpoint is not connected" condition for testing purposes is to kill -9 the right "glusterfs --process-name fuse" process. monit check: check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 start program = "/bin/mount /mnt/glusterfs_data1" stop program = "/bin/umount /mnt/glusterfs_data1" if space usage > 90% for 5 times within 15 cycles then alert else if succeeded for 10 cycles then alert stack trace: [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 26 times between [2019-02-01 23:21:20.857333] and [2019-02-01 23:21:56.164427] The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-3" repeated 27 times between [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] pending frames: frame : type(1) op(LOOKUP) frame : type(0) op(0) patchset: git://git.gluster.org/glusterfs.git signal received: 6 time of crash: 2019-02-01 23:22:03 configuration details: argp 1 backtrace 1 dlfcn 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 5.3 /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] 
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii wrote: > Hi, > > The first (and so far only) crash happened at 2am the next day after we > upgraded, on only one of four servers and only to one of two mounts. > > I have no idea what caused it, but yeah, we do have a pretty busy site ( > apkmirror.com), and it caused a disruption for any uploads or downloads > from that server until I woke up and fixed the mount. > > I wish I could be more helpful but all I have is that stack trace. > > I'm glad it's a blocker and will hopefully be resolved soon. > > On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < > atumball at redhat.com> wrote: > >> Hi Artem, >> >> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a >> clone of other bugs where recent discussions happened), and marked it as a >> blocker for glusterfs-5.4 release. >> >> We already have fixes for log flooding - https://review.gluster.org/22128, >> and are the process of identifying and fixing the issue seen with crash. >> >> Can you please tell if the crashes happened as soon as upgrade ? or was >> there any particular pattern you observed before the crash. >> >> -Amar >> >> >> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii >> wrote: >> >>> Within 24 hours after updating from rock solid 4.1 to 5.3, I already got >>> a crash which others have mentioned in >>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to unmount, >>> kill gluster, and remount: >>> >>> >>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7fcccafcd329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7fcccafcd329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7fcccafcd329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7fcccafcd329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>> The message "I [MSGID: 108031] >>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>> [2019-01-31 09:37:54.751905] and [2019-01-31 
09:38:03.958061] >>> The message "E [MSGID: 101191] >>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>> [2019-01-31 09:38:04.696993] >>> pending frames: >>> frame : type(1) op(READ) >>> frame : type(1) op(OPEN) >>> frame : type(0) op(0) >>> patchset: git://git.gluster.org/glusterfs.git >>> signal received: 6 >>> time of crash: >>> 2019-01-31 09:38:04 >>> configuration details: >>> argp 1 >>> backtrace 1 >>> dlfcn 1 >>> libpthread 1 >>> llistxattr 1 >>> setfsid 1 >>> spinlock 1 >>> epoll.h 1 >>> xattr.h 1 >>> st_atim.tv_nsec 1 >>> package-string: glusterfs 5.3 >>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>> >>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>> >>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>> --------- >>> >>> Do the pending patches fix the crash or only the repeated warnings? I'm >>> running glusterfs on OpenSUSE 15.0 installed via >>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>> not too sure how to make it core dump. >>> >>> If it's not fixed by the patches above, has anyone already opened a >>> ticket for the crashes that I can join and monitor? This is going to create >>> a massive problem for us since production systems are crashing. >>> >>> Thanks. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>> rgowdapp at redhat.com> wrote: >>> >>>> >>>> >>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii >>>> wrote: >>>> >>>>> Also, not sure if related or not, but I got a ton of these "Failed to >>>>> dispatch handler" in my logs as well. Many people have been commenting >>>>> about this issue here >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>> >>>> >>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>>> >>>> >>>>> ==> mnt-SITE_data1.log <== >>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fd966fcd329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>> ==> mnt-SITE_data3.log <== >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>> [2019-01-30 20:38:20.015593] >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>> ==> mnt-SITE_data1.log <== >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>> [2019-01-30 20:38:20.546355] >>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>> selecting local read_child SITE_data1-client-0 >>>>>> ==> mnt-SITE_data3.log <== >>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>> selecting local read_child SITE_data3-client-0 >>>>>> ==> mnt-SITE_data1.log <== >>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>> handler >>>>> >>>>> >>>>> I'm hoping raising the issue here on the mailing list may bring some >>>>> additional eyeballs and get them both fixed. >>>>> >>>>> Thanks. >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> I found a similar issue here: >>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>> spam. >>>>>> >>>>>> Here's the command that repeats over and over: >>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fd966fcd329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>> >>>>> >>>> +Milind Changire Can you check why this message >>>> is logged and send a fix? >>>> >>>> >>>>>> Is there any fix for this issue? >>>>>> >>>>>> Thanks. 
>>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> -- >> Amar Tumballi (amarts) >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kashif.alig at gmail.com Sun Feb 3 22:19:22 2019 From: kashif.alig at gmail.com (mohammad kashif) Date: Sun, 3 Feb 2019 22:19:22 +0000 Subject: [Gluster-users] gluster remove-brick Message-ID: Hi I have a pure distributed gluster volume with nine nodes and trying to remove one node, I ran gluster volume remove-brick atlasglust nodename:/glusteratlas/brick007/gv0 start It completed but with around 17000 failures Node Rebalanced-files size scanned failures skipped status run time in h:m:s --------- ----------- ----------- ----------- ----------- ----------- ------------ -------------- nodename 4185858 27.5TB 6746030 17488 0 completed 405:15:34 I can see that there is still 1.5 TB of data on the node which I was trying to remove. I am not sure what to do now? Should I run remove-brick command again so the files which has been failed can be tried again? or should I run commit first and then try to remove node again? Please advise as I don't want to remove files. Thanks Kashif -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Mon Feb 4 05:08:36 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Mon, 4 Feb 2019 10:38:36 +0530 Subject: [Gluster-users] gluster remove-brick In-Reply-To: References: Message-ID: Hi, The status shows quite a few failures. Please check the rebalance logs to see why that happened. We can decide what to do based on the errors. Once you run a commit, the brick will no longer be part of the volume and you will not be able to access those files via the client. Do you have sufficient space on the remaining bricks for the files on the removed brick? Regards, Nithya On Mon, 4 Feb 2019 at 03:50, mohammad kashif wrote: > Hi > > I have a pure distributed gluster volume with nine nodes and trying to > remove one node, I ran > gluster volume remove-brick atlasglust nodename:/glusteratlas/brick007/gv0 > start > > It completed but with around 17000 failures > > Node Rebalanced-files size scanned failures > skipped status run time in h:m:s > --------- ----------- ----------- > ----------- ----------- ----------- ------------ > -------------- > nodename 4185858 27.5TB 6746030 > 17488 0 completed 405:15:34 > > I can see that there is still 1.5 TB of data on the node which I was > trying to remove. > > I am not sure what to do now? Should I run remove-brick command again so > the files which has been failed can be tried again? > > or should I run commit first and then try to remove node again? > > Please advise as I don't want to remove files. > > Thanks > > Kashif > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From srakonde at redhat.com Mon Feb 4 06:10:23 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Mon, 4 Feb 2019 11:40:23 +0530 Subject: [Gluster-users] Help analise statedumps In-Reply-To: References: Message-ID: Hi, Can you please specify which process has leak? Have you took the statedump of the same process which has leak? Thanks, Sanju On Sat, Feb 2, 2019 at 3:15 PM Pedro Costa wrote: > Hi, > > > > I have a 3x replicated cluster running 4.1.7 on ubuntu 16.04.5, all 3 > replicas are also clients hosting a Node.js/Nginx web server. > > > > The current configuration is as such: > > > > Volume Name: gvol1 > > Type: Replicate > > Volume ID: XXXXXX > > Status: Started > > Snapshot Count: 0 > > Number of Bricks: 1 x 3 = 3 > > Transport-type: tcp > > Bricks: > > Brick1: vm000000:/srv/brick1/gvol1 > > Brick2: vm000001:/srv/brick1/gvol1 > > Brick3: vm000002:/srv/brick1/gvol1 > > Options Reconfigured: > > cluster.self-heal-readdir-size: 2KB > > cluster.self-heal-window-size: 2 > > cluster.background-self-heal-count: 20 > > network.ping-timeout: 5 > > disperse.eager-lock: off > > performance.parallel-readdir: on > > performance.readdir-ahead: on > > performance.rda-cache-limit: 128MB > > performance.cache-refresh-timeout: 10 > > performance.nl-cache-timeout: 600 > > performance.nl-cache: on > > cluster.nufa: on > > performance.enable-least-priority: off > > server.outstanding-rpc-limit: 128 > > performance.strict-o-direct: on > > cluster.shd-max-threads: 12 > > client.event-threads: 4 > > cluster.lookup-optimize: on > > network.inode-lru-limit: 90000 > > performance.md-cache-timeout: 600 > > performance.cache-invalidation: on > > performance.cache-samba-metadata: on > > performance.stat-prefetch: on > > features.cache-invalidation-timeout: 600 > > features.cache-invalidation: on > > storage.fips-mode-rchecksum: on > > transport.address-family: inet > > nfs.disable: on > > performance.client-io-threads: on > > features.utime: on > > storage.ctime: on > > server.event-threads: 4 > > performance.cache-size: 256MB > > performance.read-ahead: on > > cluster.readdir-optimize: on > > cluster.strict-readdir: on > > performance.io-thread-count: 8 > > server.allow-insecure: on > > cluster.read-hash-mode: 0 > > cluster.lookup-unhashed: auto > > cluster.choose-local: on > > > > I believe there?s a memory leak somewhere, it just keeps going up until it > hangs one or more nodes taking the whole cluster down sometimes. > > > > I have taken 2 statedumps on one of the nodes, one where the memory is too > high and another just after a reboot with the app running and the volume > fully healed. > > > > > https://pmcdigital.sharepoint.com/:u:/g/EYDsNqTf1UdEuE6B0ZNVPfIBf_I-AbaqHotB1lJOnxLlTg?e=boYP09 > (high memory) > > > > > https://pmcdigital.sharepoint.com/:u:/g/EWZBsnET2xBHl6OxO52RCfIBvQ0uIDQ1GKJZ1GrnviyMhg?e=wI3yaY > (after reboot) > > > > Any help would be greatly appreciated, > > > > Kindest Regards, > > > > > *Pedro Maia Costa **Senior Developer, pmc.digital* > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Mon Feb 4 08:57:20 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Mon, 4 Feb 2019 14:27:20 +0530 Subject: [Gluster-users] Corrupted File readable via FUSE? 
In-Reply-To: References: Message-ID: Hi David, I guess https://review.gluster.org/#/c/glusterfs/+/21996/ helps to fix the issue. I will leave it to Raghavendra Bhat to reconfirm. Regards, Amar On Fri, Feb 1, 2019 at 8:45 PM David Spisla wrote: > Hello Gluster Community, > I have got a 4 Node Cluster with a Replica 4 Volume, so each node has a > brick with a copy of a file. Now I tried out the bitrot functionality and > corrupt the copy on the brick of node1. After this I scrub ondemand and the > file is marked correctly as corrupted. > > No I try to read that file from FUSE on node1 (with corrupt copy): > $ cat file1.txt > cat: file1.txt: Transport endpoint is not connected > FUSE log says: > > *[2019-02-01 15:02:19.191984] E [MSGID: 114031] > [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote > operation failed. Path: /data/file1.txt > (b432c1d6-ece2-42f2-8749-b11e058c4be3) [Input/output error]* > [2019-02-01 15:02:19.192269] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fc642471329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fc642682af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fc64a78d218] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-01 15:02:19.192714] E [MSGID: 108009] > [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to > open /data/file1.txt on subvolume archive1-client-0 [Input/output error] > *[2019-02-01 15:02:19.193009] W [fuse-bridge.c:2371:fuse_readv_cbk] > 0-glusterfs-fuse: 147733: READ => -1 > gfid=b432c1d6-ece2-42f2-8749-b11e058c4be3 fd=0x7fc60408bbb8 (Transport > endpoint is not connected)* > [2019-02-01 15:02:19.193653] W [MSGID: 114028] > [client-lk.c:347:delete_granted_locks_owner] 0-archive1-client-0: fdctx not > valid [Invalid argument] > > And from FUSE on node2 (with heal copy): > $ cat file1.txt > file1 > > It seems to be that node1 wants to get the file from its own brick, but > the copy there is broken. Node2 gets the file from its own brick with a > heal copy, so reading the file succeed. > But I am wondering myself because sometimes reading the file from node1 > with the broken copy succeed > > What is the expected behaviour here? Is it possibly to read files with a > corrupted copy from any client access? > > Regards > David Spisla > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From armin.weisser at iternity.com Sun Feb 3 07:50:19 2019 From: armin.weisser at iternity.com (=?utf-8?B?QXJtaW4gV2Vpw59lcg==?=) Date: Sun, 3 Feb 2019 07:50:19 +0000 Subject: [Gluster-users] BOF Session - FOSDEM Today Message-ID: <0FE829C6-A236-4922-821D-7D8E7802D4F8@iternity.com> To all the Gluster folks @ FOSDEM: there will be a great BOF Session today. Where: Room H.3242 When: 11:30 am What: Gluster Performance - Status Quo and Best Practices Hope to see you there! Cheers, Armin -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Mon Feb 4 10:02:42 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Mon, 4 Feb 2019 15:32:42 +0530 Subject: [Gluster-users] Memory management, OOM kills and glusterfs Message-ID: All, Me, Csaba and Manoj are presenting our experiences with using FUSE as an interface for Glusterfs at Vault'19 [1]. 
One of the areas Glusterfs has faced difficulties is with memory management. One of the reasons for high memory consumption has been the amount of memory consumed by glusterfs mount process to maintain the inodes looked up by the kernel. Though we've a solution [2] for this problem, things would've been much easier and effective if Glusterfs was in kernel space (for the case of memory management). In kernel space, the memory consumed by inodes would be accounted for kernel's inode cache and hence kernel memory management would manage the inodes more effectively and intelligently. However, being in userspace there is no way to account the memory consumed for an inode in user space and hence only a very small part of the memory gets accounted (the inode maintained by fuse kernel module). The objective of this mail is to collect more cases/user issues/bugs such as these so that we can present them as evidence. I am currently aware of a tracker issue [3] which covers the issue I mentioned above. Also, if you are aware of any other memory management issues, we are interested in them. [1] https://www.usenix.org/conference/vault19/presentation/pillai [2] https://review.gluster.org/#/c/glusterfs/+/19778/ [3] https://bugzilla.redhat.com/show_bug.cgi?id=1647277 -------------- next part -------------- An HTML attachment was scrubbed... URL: From spisla80 at gmail.com Mon Feb 4 11:01:58 2019 From: spisla80 at gmail.com (David Spisla) Date: Mon, 4 Feb 2019 12:01:58 +0100 Subject: [Gluster-users] Corrupted File readable via FUSE? In-Reply-To: References: Message-ID: Hello Amar, sounds good. Until now this patch is only merged into master. I think it should be part of the next v5.x patch release! Regards David Am Mo., 4. Feb. 2019 um 09:58 Uhr schrieb Amar Tumballi Suryanarayan < atumball at redhat.com>: > Hi David, > > I guess https://review.gluster.org/#/c/glusterfs/+/21996/ helps to fix > the issue. I will leave it to Raghavendra Bhat to reconfirm. > > Regards, > Amar > > On Fri, Feb 1, 2019 at 8:45 PM David Spisla wrote: > >> Hello Gluster Community, >> I have got a 4 Node Cluster with a Replica 4 Volume, so each node has a >> brick with a copy of a file. Now I tried out the bitrot functionality and >> corrupt the copy on the brick of node1. After this I scrub ondemand and the >> file is marked correctly as corrupted. >> >> No I try to read that file from FUSE on node1 (with corrupt copy): >> $ cat file1.txt >> cat: file1.txt: Transport endpoint is not connected >> FUSE log says: >> >> *[2019-02-01 15:02:19.191984] E [MSGID: 114031] >> [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote >> operation failed. 
Path: /data/file1.txt >> (b432c1d6-ece2-42f2-8749-b11e058c4be3) [Input/output error]* >> [2019-02-01 15:02:19.192269] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fc642471329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fc642682af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fc64a78d218] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-01 15:02:19.192714] E [MSGID: 108009] >> [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to >> open /data/file1.txt on subvolume archive1-client-0 [Input/output error] >> *[2019-02-01 15:02:19.193009] W [fuse-bridge.c:2371:fuse_readv_cbk] >> 0-glusterfs-fuse: 147733: READ => -1 >> gfid=b432c1d6-ece2-42f2-8749-b11e058c4be3 fd=0x7fc60408bbb8 (Transport >> endpoint is not connected)* >> [2019-02-01 15:02:19.193653] W [MSGID: 114028] >> [client-lk.c:347:delete_granted_locks_owner] 0-archive1-client-0: fdctx not >> valid [Invalid argument] >> >> And from FUSE on node2 (with heal copy): >> $ cat file1.txt >> file1 >> >> It seems to be that node1 wants to get the file from its own brick, but >> the copy there is broken. Node2 gets the file from its own brick with a >> heal copy, so reading the file succeed. >> But I am wondering myself because sometimes reading the file from node1 >> with the broken copy succeed >> >> What is the expected behaviour here? Is it possibly to read files with a >> corrupted copy from any client access? >> >> Regards >> David Spisla >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Amar Tumballi (amarts) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kashif.alig at gmail.com Mon Feb 4 11:09:01 2019 From: kashif.alig at gmail.com (mohammad kashif) Date: Mon, 4 Feb 2019 11:09:01 +0000 Subject: [Gluster-users] gluster remove-brick In-Reply-To: References: Message-ID: Hi Nithya Thanks for replying so quickly. It is very much appreciated. There are lots if " [No space left on device] " errors which I can not understand as there are much space on all of the nodes. A little bit of background will be useful in this case. I had cluster of seven nodes of varying capacity(73, 73, 73, 46, 46, 46,46 TB) . The cluster was almost 90% full so every node has almost 8 to 15 TB free space. I added two new nodes with 100TB each and ran fix-layout which completed successfully. After that I started remove-brick operation. I don't think that any point , any of the nodes were 100% full. Looking at my ganglia graph, there is minimum 5TB always available at every node. I was keeping an eye on remove-brick status and for very long time there was no failures and then at some point these 17000 failures appeared and it stayed like that. Thanks Kashif Let me explain a little bit of background. On Mon, Feb 4, 2019 at 5:09 AM Nithya Balachandran wrote: > Hi, > > The status shows quite a few failures. Please check the rebalance logs to > see why that happened. We can decide what to do based on the errors. > Once you run a commit, the brick will no longer be part of the volume and > you will not be able to access those files via the client. > Do you have sufficient space on the remaining bricks for the files on the > removed brick? 
> > Regards, > Nithya > > On Mon, 4 Feb 2019 at 03:50, mohammad kashif > wrote: > >> Hi >> >> I have a pure distributed gluster volume with nine nodes and trying to >> remove one node, I ran >> gluster volume remove-brick atlasglust >> nodename:/glusteratlas/brick007/gv0 start >> >> It completed but with around 17000 failures >> >> Node Rebalanced-files size scanned failures >> skipped status run time in h:m:s >> --------- ----------- ----------- >> ----------- ----------- ----------- ------------ >> -------------- >> nodename 4185858 27.5TB 6746030 >> 17488 0 completed 405:15:34 >> >> I can see that there is still 1.5 TB of data on the node which I was >> trying to remove. >> >> I am not sure what to do now? Should I run remove-brick command again so >> the files which has been failed can be tried again? >> >> or should I run commit first and then try to remove node again? >> >> Please advise as I don't want to remove files. >> >> Thanks >> >> Kashif >> >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedro at pmc.digital Mon Feb 4 11:27:38 2019 From: pedro at pmc.digital (Pedro Costa) Date: Mon, 4 Feb 2019 11:27:38 +0000 Subject: [Gluster-users] Help analise statedumps In-Reply-To: References: Message-ID: Hi Sanju, If it helps, here?s also a statedump (taken just now) since the reboot?s: https://pmcdigital.sharepoint.com/:u:/g/EbsT2RZsuc5BsRrf7F-fw-4BocyeogW-WvEike_sg8CpZg?e=a7nTqS Many thanks, P. From: Pedro Costa Sent: 04 February 2019 10:12 To: 'Sanju Rakonde' Cc: gluster-users Subject: RE: [Gluster-users] Help analise statedumps Hi Sanju, The process was `glusterfs`, yes I took the statedump for the same process (different PID since it was rebooted). Cheers, P. From: Sanju Rakonde Sent: 04 February 2019 06:10 To: Pedro Costa Cc: gluster-users Subject: Re: [Gluster-users] Help analise statedumps Hi, Can you please specify which process has leak? Have you took the statedump of the same process which has leak? Thanks, Sanju On Sat, Feb 2, 2019 at 3:15 PM Pedro Costa > wrote: Hi, I have a 3x replicated cluster running 4.1.7 on ubuntu 16.04.5, all 3 replicas are also clients hosting a Node.js/Nginx web server. 
The current configuration is as such: Volume Name: gvol1 Type: Replicate Volume ID: XXXXXX Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: vm000000:/srv/brick1/gvol1 Brick2: vm000001:/srv/brick1/gvol1 Brick3: vm000002:/srv/brick1/gvol1 Options Reconfigured: cluster.self-heal-readdir-size: 2KB cluster.self-heal-window-size: 2 cluster.background-self-heal-count: 20 network.ping-timeout: 5 disperse.eager-lock: off performance.parallel-readdir: on performance.readdir-ahead: on performance.rda-cache-limit: 128MB performance.cache-refresh-timeout: 10 performance.nl-cache-timeout: 600 performance.nl-cache: on cluster.nufa: on performance.enable-least-priority: off server.outstanding-rpc-limit: 128 performance.strict-o-direct: on cluster.shd-max-threads: 12 client.event-threads: 4 cluster.lookup-optimize: on network.inode-lru-limit: 90000 performance.md-cache-timeout: 600 performance.cache-invalidation: on performance.cache-samba-metadata: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on storage.fips-mode-rchecksum: on transport.address-family: inet nfs.disable: on performance.client-io-threads: on features.utime: on storage.ctime: on server.event-threads: 4 performance.cache-size: 256MB performance.read-ahead: on cluster.readdir-optimize: on cluster.strict-readdir: on performance.io-thread-count: 8 server.allow-insecure: on cluster.read-hash-mode: 0 cluster.lookup-unhashed: auto cluster.choose-local: on I believe there?s a memory leak somewhere, it just keeps going up until it hangs one or more nodes taking the whole cluster down sometimes. I have taken 2 statedumps on one of the nodes, one where the memory is too high and another just after a reboot with the app running and the volume fully healed. https://pmcdigital.sharepoint.com/:u:/g/EYDsNqTf1UdEuE6B0ZNVPfIBf_I-AbaqHotB1lJOnxLlTg?e=boYP09 (high memory) https://pmcdigital.sharepoint.com/:u:/g/EWZBsnET2xBHl6OxO52RCfIBvQ0uIDQ1GKJZ1GrnviyMhg?e=wI3yaY (after reboot) Any help would be greatly appreciated, Kindest Regards, Pedro Maia Costa Senior Developer, pmc.digital _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Mon Feb 4 11:37:17 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Mon, 4 Feb 2019 17:07:17 +0530 Subject: [Gluster-users] gluster remove-brick In-Reply-To: References: Message-ID: Hi, On Mon, 4 Feb 2019 at 16:39, mohammad kashif wrote: > Hi Nithya > > Thanks for replying so quickly. It is very much appreciated. > > There are lots if " [No space left on device] " errors which I can not > understand as there are much space on all of the nodes. > This means that Gluster could not find sufficient space for the file. Would you be willing to share your rebalance log file? Please provide the following information: - The gluster version - The gluster volume info for the volume - How full are the individual bricks for the volume? > A little bit of background will be useful in this case. I had cluster of > seven nodes of varying capacity(73, 73, 73, 46, 46, 46,46 TB) . The > cluster was almost 90% full so every node has almost 8 to 15 TB free > space. I added two new nodes with 100TB each and ran fix-layout which > completed successfully. > > After that I started remove-brick operation. 
I don't think that any point > , any of the nodes were 100% full. Looking at my ganglia graph, there is > minimum 5TB always available at every node. > > I was keeping an eye on remove-brick status and for very long time there > was no failures and then at some point these 17000 failures appeared and it > stayed like that. > > Thanks > > Kashif > > > > > > Let me explain a little bit of background. > > > On Mon, Feb 4, 2019 at 5:09 AM Nithya Balachandran > wrote: > >> Hi, >> >> The status shows quite a few failures. Please check the rebalance logs to >> see why that happened. We can decide what to do based on the errors. >> Once you run a commit, the brick will no longer be part of the volume and >> you will not be able to access those files via the client. >> Do you have sufficient space on the remaining bricks for the files on the >> removed brick? >> >> Regards, >> Nithya >> >> On Mon, 4 Feb 2019 at 03:50, mohammad kashif >> wrote: >> >>> Hi >>> >>> I have a pure distributed gluster volume with nine nodes and trying to >>> remove one node, I ran >>> gluster volume remove-brick atlasglust >>> nodename:/glusteratlas/brick007/gv0 start >>> >>> It completed but with around 17000 failures >>> >>> Node Rebalanced-files size scanned failures >>> skipped status run time in h:m:s >>> --------- ----------- ----------- >>> ----------- ----------- ----------- ------------ >>> -------------- >>> nodename 4185858 27.5TB 6746030 >>> 17488 0 completed 405:15:34 >>> >>> I can see that there is still 1.5 TB of data on the node which I was >>> trying to remove. >>> >>> I am not sure what to do now? Should I run remove-brick command again >>> so the files which has been failed can be tried again? >>> >>> or should I run commit first and then try to remove node again? >>> >>> Please advise as I don't want to remove files. >>> >>> Thanks >>> >>> Kashif >>> >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From kashif.alig at gmail.com Mon Feb 4 13:23:03 2019 From: kashif.alig at gmail.com (mohammad kashif) Date: Mon, 4 Feb 2019 13:23:03 +0000 Subject: [Gluster-users] gluster remove-brick In-Reply-To: References: Message-ID: Hi Nithya I tried attching the logs but it was tool big. So I have put it on one drive accessible by everyone https://drive.google.com/drive/folders/1744WcOfrqe_e3lRPxLpQ-CBuXHp_o44T?usp=sharing I am attaching rebalance-logs which is for the period when I ran fix-layout after adding new disk and then started remove-disk option. 
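(For anyone triaging the same symptom: on a default install the per-volume rebalance/remove-brick log on each node is normally /var/log/glusterfs/<volname>-rebalance.log, so the failed migrations can usually be pulled out with something like

  grep -iE "no space left on device|migrate" /var/log/glusterfs/atlasglust-rebalance.log | grep -i fail | head -n 40

The log path and exact message wording here are assumptions and may differ between releases.)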
All of the nodes have atleast 8 TB disk available /dev/sdb 73T 65T 8.0T 90% /glusteratlas/brick001 /dev/sdb 73T 65T 8.0T 90% /glusteratlas/brick002 /dev/sdb 73T 65T 8.0T 90% /glusteratlas/brick003 /dev/sdb 73T 65T 8.0T 90% /glusteratlas/brick004 /dev/sdb 73T 65T 8.0T 90% /glusteratlas/brick005 /dev/sdb 80T 67T 14T 83% /glusteratlas/brick006 /dev/sdb 37T 1.6T 35T 5% /glusteratlas/brick007 /dev/sdb 89T 15T 75T 17% /glusteratlas/brick008 /dev/sdb 89T 14T 76T 16% /glusteratlas/brick009 brick007 is the one I am removing gluster volume info Volume Name: atlasglust Type: Distribute Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b Status: Started Snapshot Count: 0 Number of Bricks: 9 Transport-type: tcp Bricks: Brick1: pplxgluster01**:/glusteratlas/brick001/gv0 Brick2: pplxgluster02.**:/glusteratlas/brick002/gv0 Brick3: pplxgluster03.**:/glusteratlas/brick003/gv0 Brick4: pplxgluster04.**:/glusteratlas/brick004/gv0 Brick5: pplxgluster05.**:/glusteratlas/brick005/gv0 Brick6: pplxgluster06.**:/glusteratlas/brick006/gv0 Brick7: pplxgluster07.**:/glusteratlas/brick007/gv0 Brick8: pplxgluster08.**:/glusteratlas/brick008/gv0 Brick9: pplxgluster09.**:/glusteratlas/brick009/gv0 Options Reconfigured: nfs.disable: on performance.readdir-ahead: on transport.address-family: inet auth.allow: *** features.cache-invalidation: on features.cache-invalidation-timeout: 600 performance.stat-prefetch: on performance.md-cache-timeout: 600 performance.parallel-readdir: off performance.cache-size: 1GB performance.client-io-threads: on cluster.lookup-optimize: on client.event-threads: 4 server.event-threads: 4 performance.cache-invalidation: on diagnostics.brick-log-level: WARNING diagnostics.client-log-level: WARNING Thanks On Mon, Feb 4, 2019 at 11:37 AM Nithya Balachandran wrote: > Hi, > > > On Mon, 4 Feb 2019 at 16:39, mohammad kashif > wrote: > >> Hi Nithya >> >> Thanks for replying so quickly. It is very much appreciated. >> >> There are lots if " [No space left on device] " errors which I can not >> understand as there are much space on all of the nodes. >> > > This means that Gluster could not find sufficient space for the file. > Would you be willing to share your rebalance log file? > Please provide the following information: > > - The gluster version > - The gluster volume info for the volume > - How full are the individual bricks for the volume? > > > >> A little bit of background will be useful in this case. I had cluster of >> seven nodes of varying capacity(73, 73, 73, 46, 46, 46,46 TB) . The >> cluster was almost 90% full so every node has almost 8 to 15 TB free >> space. I added two new nodes with 100TB each and ran fix-layout which >> completed successfully. >> >> After that I started remove-brick operation. I don't think that any >> point , any of the nodes were 100% full. Looking at my ganglia graph, there >> is minimum 5TB always available at every node. >> >> I was keeping an eye on remove-brick status and for very long time there >> was no failures and then at some point these 17000 failures appeared and it >> stayed like that. >> >> Thanks >> >> Kashif >> >> >> >> >> >> Let me explain a little bit of background. >> >> >> On Mon, Feb 4, 2019 at 5:09 AM Nithya Balachandran >> wrote: >> >>> Hi, >>> >>> The status shows quite a few failures. Please check the rebalance logs >>> to see why that happened. We can decide what to do based on the errors. >>> Once you run a commit, the brick will no longer be part of the volume >>> and you will not be able to access those files via the client. 
>>> Do you have sufficient space on the remaining bricks for the files on >>> the removed brick? >>> >>> Regards, >>> Nithya >>> >>> On Mon, 4 Feb 2019 at 03:50, mohammad kashif >>> wrote: >>> >>>> Hi >>>> >>>> I have a pure distributed gluster volume with nine nodes and trying to >>>> remove one node, I ran >>>> gluster volume remove-brick atlasglust >>>> nodename:/glusteratlas/brick007/gv0 start >>>> >>>> It completed but with around 17000 failures >>>> >>>> Node Rebalanced-files size scanned failures >>>> skipped status run time in h:m:s >>>> --------- ----------- >>>> ----------- ----------- ----------- ----------- >>>> ------------ -------------- >>>> nodename 4185858 27.5TB 6746030 >>>> 17488 0 completed 405:15:34 >>>> >>>> I can see that there is still 1.5 TB of data on the node which I was >>>> trying to remove. >>>> >>>> I am not sure what to do now? Should I run remove-brick command again >>>> so the files which has been failed can be tried again? >>>> >>>> or should I run commit first and then try to remove node again? >>>> >>>> Please advise as I don't want to remove files. >>>> >>>> Thanks >>>> >>>> Kashif >>>> >>>> >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From dieter.molketin at deutsche-telefon.de Mon Feb 4 14:48:29 2019 From: dieter.molketin at deutsche-telefon.de (Dieter Molketin) Date: Mon, 4 Feb 2019 15:48:29 +0100 Subject: [Gluster-users] 0-epoll: Failed to dispatch handler Message-ID: After upgrade from glusterfs 3.12 to version 5.3 I see following error message in all logfiles multiple times: [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler Also a fresh installation with glusterfs 5.3 produces this error message over and over again. What does the message mean, is something wrong with my installation or can I ignore it Can anyone help me with this issue? Thanks Dieter -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Mon Feb 4 17:20:38 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Mon, 4 Feb 2019 22:50:38 +0530 Subject: [Gluster-users] 0-epoll: Failed to dispatch handler In-Reply-To: References: Message-ID: On Mon, Feb 4, 2019 at 8:18 PM Dieter Molketin < dieter.molketin at deutsche-telefon.de> wrote: > After upgrade from glusterfs 3.12 to version 5.3 I see following error > message in all logfiles multiple times: > > [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to > dispatch handler > > Also a fresh installation with glusterfs 5.3 produces this error message > over and over again. > A patch has been merged to fix this issue at review.gluster.org/r/I9375be98cc52cb969085333f3c7229a91207d1bd > What does the message mean, is something wrong with my installation or can > I ignore it > This message is benign and can be ignored. > Can anyone help me with this issue? > > Thanks > > Dieter > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From archon810 at gmail.com Mon Feb 4 23:59:16 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Mon, 4 Feb 2019 15:59:16 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: The fuse crash happened two more times, but this time monit helped recover within 1 minute, so it's a great workaround for now. What's odd is that the crashes are only happening on one of 4 servers, and I don't know why. Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii wrote: > The fuse crash happened again yesterday, to another volume. Are there any > mount options that could help mitigate this? > > In the meantime, I set up a monit (https://mmonit.com/monit/) task to > watch and restart the mount, which works and recovers the mount point > within a minute. Not ideal, but a temporary workaround. > > By the way, the way to reproduce this "Transport endpoint is not > connected" condition for testing purposes is to kill -9 the right > "glusterfs --process-name fuse" process. > > > monit check: > check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 > start program = "/bin/mount /mnt/glusterfs_data1" > stop program = "/bin/umount /mnt/glusterfs_data1" > if space usage > 90% for 5 times within 15 cycles > then alert else if succeeded for 10 cycles then alert > > > stack trace: > [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fa0249e4329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7fa0249e4329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] > The message "E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler" repeated 26 times between [2019-02-01 23:21:20.857333] and > [2019-02-01 23:21:56.164427] > The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] > 0-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-3" > repeated 27 times between [2019-02-01 23:21:11.142467] and [2019-02-01 > 23:22:03.474036] > pending frames: > frame : type(1) op(LOOKUP) > frame : type(0) op(0) > patchset: git://git.gluster.org/glusterfs.git > signal received: 6 > time of crash: > 2019-02-01 23:22:03 > configuration details: > argp 1 > backtrace 1 > dlfcn 1 > libpthread 1 > llistxattr 1 > setfsid 1 > spinlock 1 > epoll.h 1 > xattr.h 1 > st_atim.tv_nsec 1 > package-string: glusterfs 5.3 > /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] > /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] > /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] > /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] > 
/lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] > /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] > /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] > /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] > > /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] > /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] > /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] > /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] > /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] > /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] > /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii > wrote: > >> Hi, >> >> The first (and so far only) crash happened at 2am the next day after we >> upgraded, on only one of four servers and only to one of two mounts. >> >> I have no idea what caused it, but yeah, we do have a pretty busy site ( >> apkmirror.com), and it caused a disruption for any uploads or downloads >> from that server until I woke up and fixed the mount. >> >> I wish I could be more helpful but all I have is that stack trace. >> >> I'm glad it's a blocker and will hopefully be resolved soon. >> >> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >> atumball at redhat.com> wrote: >> >>> Hi Artem, >>> >>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a >>> clone of other bugs where recent discussions happened), and marked it as a >>> blocker for glusterfs-5.4 release. >>> >>> We already have fixes for log flooding - >>> https://review.gluster.org/22128, and are the process of identifying >>> and fixing the issue seen with crash. >>> >>> Can you please tell if the crashes happened as soon as upgrade ? or was >>> there any particular pattern you observed before the crash. 
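(A side note on the kill -9 reproduction mentioned earlier in this thread: it is safer to target the client process of one specific mount rather than every fuse client on the box. A rough sketch, with the mount name taken from the monit example above and otherwise purely illustrative:

  ps axo pid,args | grep '[g]lusterfs' | grep glusterfs_data1
  kill -9 <PID>   # test only, to exercise the monit-driven remount

)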
>>> >>> -Amar >>> >>> >>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii >>> wrote: >>> >>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I already >>>> got a crash which others have mentioned in >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>> unmount, kill gluster, and remount: >>>> >>>> >>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fcccafcd329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fcccafcd329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fcccafcd329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fcccafcd329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>> The message "I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>> The message "E [MSGID: 101191] >>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>> [2019-01-31 09:38:04.696993] >>>> pending frames: >>>> frame : type(1) op(READ) >>>> frame : type(1) op(OPEN) >>>> frame : type(0) op(0) >>>> patchset: git://git.gluster.org/glusterfs.git >>>> signal received: 6 >>>> time of crash: >>>> 2019-01-31 09:38:04 >>>> configuration details: >>>> argp 1 >>>> backtrace 1 >>>> dlfcn 1 >>>> libpthread 1 >>>> llistxattr 1 >>>> setfsid 1 >>>> spinlock 1 >>>> epoll.h 1 >>>> xattr.h 1 >>>> st_atim.tv_nsec 1 >>>> package-string: glusterfs 5.3 >>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>> >>>> 
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>> --------- >>>> >>>> Do the pending patches fix the crash or only the repeated warnings? I'm >>>> running glusterfs on OpenSUSE 15.0 installed via >>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>> not too sure how to make it core dump. >>>> >>>> If it's not fixed by the patches above, has anyone already opened a >>>> ticket for the crashes that I can join and monitor? This is going to create >>>> a massive problem for us since production systems are crashing. >>>> >>>> Thanks. >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>> rgowdapp at redhat.com> wrote: >>>> >>>>> >>>>> >>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> Also, not sure if related or not, but I got a ton of these "Failed to >>>>>> dispatch handler" in my logs as well. Many people have been commenting >>>>>> about this issue here >>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>> >>>>> >>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. >>>>> >>>>> >>>>>> ==> mnt-SITE_data1.log <== >>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fd966fcd329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>> ==> mnt-SITE_data3.log <== >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>> [2019-01-30 20:38:20.015593] >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>> ==> mnt-SITE_data1.log <== >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>> [2019-01-30 20:38:20.546355] >>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>> ==> mnt-SITE_data3.log <== >>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>> ==> mnt-SITE_data1.log <== >>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>> 
[event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>> handler >>>>>> >>>>>> >>>>>> I'm hoping raising the issue here on the mailing list may bring some >>>>>> additional eyeballs and get them both fixed. >>>>>> >>>>>> Thanks. >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> I found a similar issue here: >>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>> spam. >>>>>>> >>>>>>> Here's the command that repeats over and over: >>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fd966fcd329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>> >>>>>> >>>>> +Milind Changire Can you check why this message >>>>> is logged and send a fix? >>>>> >>>>> >>>>>>> Is there any fix for this issue? >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >>> >>> -- >>> Amar Tumballi (amarts) >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mauro.tridici at cmcc.it Mon Feb 4 10:25:46 2019 From: mauro.tridici at cmcc.it (Mauro Tridici) Date: Mon, 4 Feb 2019 11:25:46 +0100 Subject: [Gluster-users] RSYNC files renaming issue and timeout errors In-Reply-To: References: <96B07283-D8AB-4F06-909D-E00424625528@cmcc.it> <42758A0E-8BE9-497D-BDE3-55D7DC633BC7@cmcc.it> <6A8CF4A4-98EA-48C3-A059-D60D1B2721C7@cmcc.it> <2CF49168-9C1B-4931-8C34-8157262A137A@cmcc.it> <7A151CC9-A0AE-4A45-B450-A4063D216D9E@cmcc.it> <32D53ECE-3F49-4415-A6EE-241B351AC2BA@cmcc.it> <4685A75B-5978-4338-9C9F-4A02FB40B9BC@cmcc.it> Message-ID: <1E019B0D-5943-4949-8E2B-63EA5DF9DB8A@cmcc.it> Hi All, our users are experiencing the following two problems using our gluster storage based on a 12x(4+2) distributed dispersed volume: 1) sometimes, during RSYNC copies executions, they noticed this error message: rsync: rename "/tier2/CSP/sp1/ftp/lweprsn/.cmcc_CMCC-CM2-v20160423_hindcast_S2005040100_atmos_day_surface_lweprsn_n1-n40.sha256.rD38kX" -> "cmcc_CMCC-CM2-v20160423_hindcast_S2005040100_atmos_day_surface_lweprsn_n1-n40.sha256": Permission denied (13) rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6] The files seem to be correctly copied but rsync returns this kind of error and stop the sync process. 
2) if the number of the total rsync sessions against the gluster volume grows, it seems that gluster volume starts suffering some particular workload, directories tree navigation becomes very slow and a lot of timeout errors appear on 192.168.0.54 gluster server (named ?s04") I just checked switch, cables, ports and so on, but everything seems to be ok. I also executed a lot of checks using iperf tool and the network seems to be ok. I dont? understand why I see only timeout errors related to s04 host (192.168.0.54). It is up and running? It seems that rsync copy processes started by the gluster volume users cause some noise (on gluster client log file) during temporary files creation. [2019-01-27 02:36:53.488942] E [socket.c:2376:socket_connect_finish] 0-tier2-client-57: connection to 192.168.0.54:49159 failed (Timeout della connessione); disconnecting socke t [2019-01-27 02:53:48.607832] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved _frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d 410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-57: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-01-2 7 02:53:10.568717 (xid=0x4aeefc8) [2019-01-27 02:55:01.631912] E [socket.c:2376:socket_connect_finish] 0-tier2-client-54: connection to 192.168.0.54:49158 failed (Timeout della connessione); disconnecting socke t [2019-01-27 02:56:05.643880] E [socket.c:2376:socket_connect_finish] 0-tier2-client-54: connection to 192.168.0.54:49158 failed (Timeout della connessione); disconnecting socke t [2019-01-27 02:57:09.653961] E [socket.c:2376:socket_connect_finish] 0-tier2-client-54: connection to 192.168.0.54:49158 failed (Timeout della connessione); disconnecting socke t [2019-01-27 03:50:56.951088] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved _frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d 410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-54: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2019-01 -27 03:39:48.938568 (xid=0x49e54c1) [2019-01-27 03:50:56.951111] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-54: remote operation failed [Transport endpoint is not connected] [2019-01-27 03:50:56.951471] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved _frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d 410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-54: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-01-2 7 03:35:31.328387 (xid=0x49e54bf) [2019-01-27 03:50:56.958357] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved _frames_unwind+0x1e5)[0x3d0d410935] (--> 
/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d 410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-54: forced unwinding frame type(GlusterFS 3.3) op(IPC(47)) called at 2019-01-27 03:37:49.652707 (xid=0x49e54c0) [2019-01-27 03:50:56.966845] E [MSGID: 114031] [client-rpc-fops.c:435:client3_3_open_cbk] 0-tier2-client-57: remote operation failed. Path: /CSP/sp1/SPS3/sps_201205/.rh2m_6hour ly_sps_201205_017.2012-06.nc.gz.DHNucn (5d0575fb-21c3-45b0-97bf-46d13f3c9262) [Il file esiste] [2019-01-27 03:50:56.967086] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4) [2019-01-27 03:50:56.967224] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4) [2019-01-27 03:50:56.992129] E [MSGID: 114031] [client-rpc-fops.c:435:client3_3_open_cbk] 0-tier2-client-57: remote operation failed. Path: /CSP/sp1/SPS3/sps_201205/.rh2m_6hour ly_sps_201205_017.2012-06.nc.gz.DHNucn (5d0575fb-21c3-45b0-97bf-46d13f3c9262) [Il file esiste] [2019-01-27 03:50:56.995954] E [MSGID: 114031] [client-rpc-fops.c:435:client3_3_open_cbk] 0-tier2-client-57: remote operation failed. Path: /CSP/sp1/SPS3/sps_201205/.rh2m_6hour ly_sps_201205_017.2012-06.nc.gz.DHNucn (5d0575fb-21c3-45b0-97bf-46d13f3c9262) [Il file esiste] [2019-01-27 03:50:56.996126] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4) [2019-01-27 03:50:56.996751] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4) [2019-01-27 03:50:57.001301] E [MSGID: 114031] [client-rpc-fops.c:435:client3_3_open_cbk] 0-tier2-client-57: remote operation failed. Path: /CSP/sp1/SPS3/sps_201205/.rh2m_6hour ly_sps_201205_017.2012-06.nc.gz.DHNucn (5d0575fb-21c3-45b0-97bf-46d13f3c9262) [Il file esiste] [2019-01-27 03:50:57.006375] E [MSGID: 122034] [ec-common.c:613:ec_child_select] 0-tier2-disperse-9: Insufficient available children for this request (have 0, need 4) [2019-01-27 04:02:09.269929] E [socket.c:2376:socket_connect_finish] 0-tier2-client-57: connection to 192.168.0.54:49159 failed (Timeout della connessione); disconnecting socke t I focused my attention on the errors occurred on gluster client during 25th of January 2019. So, I collected the following log files: - tier2.old.log.gz, (the client log file covering the period) - tier2.today.log.gz (the updated client log file reporting similar error related to ?renaming? and ?timeout? issues) - memory_usage.pdf, (weekly statistics for memory usage on 192.168.0.54 host from NAGIOS) - network_usage.pdf, (weekly statistics for network usage on 192.168.0.54 host from NAGIOS) - cpu_usage.pdf, (weekly statistics for network usage on 192.168.0.54 host from NAGIOS) - gluster-mnt6-brick.log-20190127, brick 6 (on host 192.168.0.54) log file - gluster-mnt7-brick.log-20190127, brick 7 (on host 192.168.0.54) log file I detected the involved bricks greping on ?gluster volume status? 
command output: [root at s04 bricks]# gluster volume status|grep 49158 Brick s02-stg:/gluster/mnt6/brick 49158 0 Y 4147 Brick s03-stg:/gluster/mnt6/brick 49158 0 Y 4272 Brick s01-stg:/gluster/mnt7/brick 49158 0 Y 3324 Brick s04-stg:/gluster/mnt7/brick 49158 0 Y 3787 Brick s05-stg:/gluster/mnt7/brick 49158 0 Y 3131 Brick s06-stg:/gluster/mnt7/brick 49158 0 Y 3254 On gluster server s04 (192.168.0.54), in messages log file, I can see a lot of errors like the following one and nothig more Jan 25 14:16:22 s04 kernel: traps: check_vol_utili[142140] general protection ip:7f8fcec7866d sp:7ffee3a86a30 error:0 in libglusterfs.so.0.0.1[7f8fcec25000+f7000] Jan 25 14:16:22 s04 abrt-hook-ccpp: Process 142140 (python2.7) of user 0 killed by SIGSEGV - dumping core Jan 25 14:16:22 s04 abrt-server: Generating core_backtrace Jan 25 14:16:22 s04 abrt-server: Error: Unable to open './coredump': No such file or directory Jan 25 14:17:00 s04 abrt-server: Duplicate: UUID Jan 25 14:17:00 s04 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-11:18:20-4471 Jan 25 14:17:00 s04 abrt-server: Deleting problem directory ccpp-2019-01-25-14:16:22-142140 (dup of ccpp-2018-09-25-11:18:20-4471) Jan 25 14:17:00 s04 dbus[1877]: [system] Activating service name='org.freedesktop.problems' (using servicehelper) Jan 25 14:17:00 s04 dbus[1877]: [system] Successfully activated service 'org.freedesktop.problems' Jan 25 14:17:00 s04 abrt-server: Generating core_backtrace Jan 25 14:17:00 s04 abrt-server: Error: Unable to open './coredump': No such file or directory Jan 25 14:17:00 s04 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-11:18:20-4471' via uReport: Event 'report_uReport' exited with 1 Jan 25 14:17:01 s04 systemd: Created slice User Slice of root. Jan 25 14:17:01 s04 systemd: Starting User Slice of root. Jan 25 14:17:01 s04 systemd: Started Session 140059 of user root. Jan 25 14:17:01 s04 systemd: Starting Session 140059 of user root. Jan 25 14:17:01 s04 systemd: Removed slice User Slice of root. Jan 25 14:17:01 s04 systemd: Stopping User Slice of root. Jan 25 14:18:01 s04 systemd: Created slice User Slice of root. Jan 25 14:18:01 s04 systemd: Starting User Slice of root. Jan 25 14:18:01 s04 systemd: Started Session 140060 of user root. Jan 25 14:18:01 s04 systemd: Starting Session 140060 of user root. Jan 25 14:18:01 s04 systemd: Removed slice User Slice of root. Jan 25 14:18:01 s04 systemd: Stopping User Slice of root. Jan 25 14:18:35 s04 kernel: traps: check_vol_utili[142501] general protection ip:7f119fd7066d sp:7ffc399aa6f0 error:0 in libglusterfs.so.0.0.1[7f119fd1d000+f7000] Jan 25 14:18:35 s04 abrt-hook-ccpp: Process 142501 (python2.7) of user 0 killed by SIGSEGV - dumping core Jan 25 14:18:35 s04 abrt-server: Generating core_backtrace Jan 25 14:18:35 s04 abrt-server: Error: Unable to open './coredump': No such file or directory No errors in /var/log/messages log file related to the client machine. Anyone of us could help me to solve this issue? Thank you very much in advance, Mauro -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: tier2.old.log.gz Type: application/x-gzip Size: 636711 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: tier2.today.log.gz Type: application/x-gzip Size: 325016 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: cpu_usage.pdf Type: application/pdf Size: 1011067 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: memory_usage.pdf Type: application/pdf Size: 65668 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: network_usage.pdf Type: application/pdf Size: 67425 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: gluster-mnt6-brick.log-20190127.gz Type: application/x-gzip Size: 497687 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: gluster-mnt7-brick.log-20190127.gz Type: application/x-gzip Size: 526789 bytes Desc: not available URL: -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdeepugd at gmail.com Tue Feb 5 09:26:53 2019 From: sdeepugd at gmail.com (deepu srinivasan) Date: Tue, 5 Feb 2019 14:56:53 +0530 Subject: [Gluster-users] Getting timedout error while rebalancing Message-ID: HI everyone. I am getting "Error : Request timed out " while doing rebalance . I have aded new bricks to my replicated volume.i.e. First it was 1x3 volume and added three more bricks to make it distributed-replicated volume(2x3) . What should i do for the timeout error ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Tue Feb 5 10:53:30 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Tue, 5 Feb 2019 16:23:30 +0530 Subject: [Gluster-users] Getting timedout error while rebalancing In-Reply-To: References: Message-ID: Hi, Please provide the exact step at which you are seeing the error. It would be ideal if you could copy-paste the command and the error. Regards, Nithya On Tue, 5 Feb 2019 at 15:24, deepu srinivasan wrote: > HI everyone. I am getting "Error : Request timed out " while doing > rebalance . I have aded new bricks to my replicated volume.i.e. First it > was 1x3 volume and added three more bricks to make it > distributed-replicated volume(2x3) . What should i do for the timeout error > ? > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Tue Feb 5 15:12:46 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Tue, 5 Feb 2019 20:42:46 +0530 Subject: [Gluster-users] Getting timedout error while rebalancing In-Reply-To: References: Message-ID: On Tue, 5 Feb 2019 at 17:26, deepu srinivasan wrote: > HI Nithya > We have a test gluster setup.We are testing the rebalancing option of > gluster. So we started the volume which have 1x3 brick with some data on it > . 
> command : gluster volume create test-volume replica 3 > 192.168.xxx.xx1:/home/data/repl 192.168.xxx.xx2:/home/data/repl > 192.168.xxx.xx3:/home/data/repl. > > Now we tried to expand the cluster storage by adding three more bricks. > command : gluster volume add-brick test-volume 192.168.xxx.xx4:/home/data/repl > 192.168.xxx.xx5:/home/data/repl 192.168.xxx.xx6:/home/data/repl > > So after the brick addition we tried to rebalance the layout and the data. > command : gluster volume rebalance test-volume fix-layout start. > The command exited with status "Error : Request timed out". > This sounds like an error in the cli or glusterd. Can you send the glusterd.log from the node on which you ran the command? regards, Nithya > > After the failure of the command, we tried to view the status of the > command and it is something like this : > > Node Rebalanced-files size > scanned failures skipped status run time in > h:m:s > > --------- ----------- ----------- ----------- > ----------- ----------- ------------ -------------- > > localhost 41 41.0MB > 8200 0 0 completed > 0:00:09 > > 192.168.xxx.xx4 79 79.0MB > 8231 0 0 completed > 0:00:12 > > 192.168.xxx.xx6 58 58.0MB > 8281 0 0 completed > 0:00:10 > > 192.168.xxx.xx2 136 136.0MB > 8566 0 136 completed > 0:00:07 > > 192.168.xxx.xx4 129 129.0MB > 8566 0 129 completed > 0:00:07 > > 192.168.xxx.xx6 201 201.0MB > 8566 0 201 completed > 0:00:08 > > Is the rebalancing option working fine? Why did gluster throw the error > saying that "Error : Request timed out"? > .On Tue, Feb 5, 2019 at 4:23 PM Nithya Balachandran > wrote: > >> Hi, >> Please provide the exact step at which you are seeing the error. It would >> be ideal if you could copy-paste the command and the error. >> >> Regards, >> Nithya >> >> >> >> On Tue, 5 Feb 2019 at 15:24, deepu srinivasan wrote: >> >>> HI everyone. I am getting "Error : Request timed out " while doing >>> rebalance . I have aded new bricks to my replicated volume.i.e. First it >>> was 1x3 volume and added three more bricks to make it >>> distributed-replicated volume(2x3) . What should i do for the timeout error >>> ? >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From rabhat at redhat.com Tue Feb 5 20:05:48 2019 From: rabhat at redhat.com (FNU Raghavendra Manjunath) Date: Tue, 5 Feb 2019 15:05:48 -0500 Subject: [Gluster-users] Corrupted File readable via FUSE? In-Reply-To: References: Message-ID: Hi David, Do you have any bricks down? Can you please share the output of the following commands and also the logs of the server and the client nodes? 1) gluster volume info 2) gluster volume status 3) gluster volume bitrot scrub status Few more questions 1) How many copies of the file were corrupted? (All? Or Just one?) 2 things I am trying to understand A) IIUC, if only one copy is corrupted, then the replication module from the gluster client should serve the data from the remaining good copy B) If all the copies were corrupted (or say more than quorum copies were corrupted which means 2 in case of 3 way replication) then there will be an error to the application. But the error to be reported should 'Input/Output Error'. Not 'Transport endpoint not connected' 'Transport endpoint not connected' error usually comes when a brick where the operation is being directed to is not connected to the client. 
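(For completeness, the information being requested above maps to the following commands, assuming the volume is named archive1 as in the quoted client log:

  gluster volume info archive1
  gluster volume status archive1
  gluster volume bitrot archive1 scrub status

The scrub status output should show, per node, how many objects were scrubbed and which ones were flagged as corrupted, which answers the "how many copies were corrupted" question directly.)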
Regards, Raghavendra On Mon, Feb 4, 2019 at 6:02 AM David Spisla wrote: > Hello Amar, > sounds good. Until now this patch is only merged into master. I think it > should be part of the next v5.x patch release! > > Regards > David > > Am Mo., 4. Feb. 2019 um 09:58 Uhr schrieb Amar Tumballi Suryanarayan < > atumball at redhat.com>: > >> Hi David, >> >> I guess https://review.gluster.org/#/c/glusterfs/+/21996/ helps to fix >> the issue. I will leave it to Raghavendra Bhat to reconfirm. >> >> Regards, >> Amar >> >> On Fri, Feb 1, 2019 at 8:45 PM David Spisla wrote: >> >>> Hello Gluster Community, >>> I have got a 4 Node Cluster with a Replica 4 Volume, so each node has a >>> brick with a copy of a file. Now I tried out the bitrot functionality and >>> corrupt the copy on the brick of node1. After this I scrub ondemand and the >>> file is marked correctly as corrupted. >>> >>> No I try to read that file from FUSE on node1 (with corrupt copy): >>> $ cat file1.txt >>> cat: file1.txt: Transport endpoint is not connected >>> FUSE log says: >>> >>> *[2019-02-01 15:02:19.191984] E [MSGID: 114031] >>> [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote >>> operation failed. Path: /data/file1.txt >>> (b432c1d6-ece2-42f2-8749-b11e058c4be3) [Input/output error]* >>> [2019-02-01 15:02:19.192269] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7fc642471329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7fc642682af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7fc64a78d218] ) 0-dict: dict is NULL [Invalid argument] >>> [2019-02-01 15:02:19.192714] E [MSGID: 108009] >>> [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to >>> open /data/file1.txt on subvolume archive1-client-0 [Input/output error] >>> *[2019-02-01 15:02:19.193009] W [fuse-bridge.c:2371:fuse_readv_cbk] >>> 0-glusterfs-fuse: 147733: READ => -1 >>> gfid=b432c1d6-ece2-42f2-8749-b11e058c4be3 fd=0x7fc60408bbb8 (Transport >>> endpoint is not connected)* >>> [2019-02-01 15:02:19.193653] W [MSGID: 114028] >>> [client-lk.c:347:delete_granted_locks_owner] 0-archive1-client-0: fdctx not >>> valid [Invalid argument] >>> >>> And from FUSE on node2 (with heal copy): >>> $ cat file1.txt >>> file1 >>> >>> It seems to be that node1 wants to get the file from its own brick, but >>> the copy there is broken. Node2 gets the file from its own brick with a >>> heal copy, so reading the file succeed. >>> But I am wondering myself because sometimes reading the file from node1 >>> with the broken copy succeed >>> >>> What is the expected behaviour here? Is it possibly to read files with a >>> corrupted copy from any client access? >>> >>> Regards >>> David Spisla >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> -- >> Amar Tumballi (amarts) >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedro at pmc.digital Mon Feb 4 10:11:53 2019 From: pedro at pmc.digital (Pedro Costa) Date: Mon, 4 Feb 2019 10:11:53 +0000 Subject: [Gluster-users] Help analise statedumps In-Reply-To: References: Message-ID: Hi Sanju, The process was `glusterfs`, yes I took the statedump for the same process (different PID since it was rebooted). Cheers, P. 
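(For reference, statedumps of a fuse client such as this one are normally produced by sending SIGUSR1 to the glusterfs mount process; the dump is written to the directory the CLI reports. A minimal sketch, with the PID as a placeholder:

  gluster --print-statedumpdir
  kill -USR1 <PID of the glusterfs fuse process>

Brick-side dumps can instead be requested with "gluster volume statedump gvol1". Comparing two dumps of the same PID taken a few hours apart makes a steadily growing allocation pool much easier to spot.)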
From: Sanju Rakonde Sent: 04 February 2019 06:10 To: Pedro Costa Cc: gluster-users Subject: Re: [Gluster-users] Help analise statedumps Hi, Can you please specify which process has leak? Have you took the statedump of the same process which has leak? Thanks, Sanju On Sat, Feb 2, 2019 at 3:15 PM Pedro Costa > wrote: Hi, I have a 3x replicated cluster running 4.1.7 on ubuntu 16.04.5, all 3 replicas are also clients hosting a Node.js/Nginx web server. The current configuration is as such: Volume Name: gvol1 Type: Replicate Volume ID: XXXXXX Status: Started Snapshot Count: 0 Number of Bricks: 1 x 3 = 3 Transport-type: tcp Bricks: Brick1: vm000000:/srv/brick1/gvol1 Brick2: vm000001:/srv/brick1/gvol1 Brick3: vm000002:/srv/brick1/gvol1 Options Reconfigured: cluster.self-heal-readdir-size: 2KB cluster.self-heal-window-size: 2 cluster.background-self-heal-count: 20 network.ping-timeout: 5 disperse.eager-lock: off performance.parallel-readdir: on performance.readdir-ahead: on performance.rda-cache-limit: 128MB performance.cache-refresh-timeout: 10 performance.nl-cache-timeout: 600 performance.nl-cache: on cluster.nufa: on performance.enable-least-priority: off server.outstanding-rpc-limit: 128 performance.strict-o-direct: on cluster.shd-max-threads: 12 client.event-threads: 4 cluster.lookup-optimize: on network.inode-lru-limit: 90000 performance.md-cache-timeout: 600 performance.cache-invalidation: on performance.cache-samba-metadata: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on storage.fips-mode-rchecksum: on transport.address-family: inet nfs.disable: on performance.client-io-threads: on features.utime: on storage.ctime: on server.event-threads: 4 performance.cache-size: 256MB performance.read-ahead: on cluster.readdir-optimize: on cluster.strict-readdir: on performance.io-thread-count: 8 server.allow-insecure: on cluster.read-hash-mode: 0 cluster.lookup-unhashed: auto cluster.choose-local: on I believe there?s a memory leak somewhere, it just keeps going up until it hangs one or more nodes taking the whole cluster down sometimes. I have taken 2 statedumps on one of the nodes, one where the memory is too high and another just after a reboot with the app running and the volume fully healed. https://pmcdigital.sharepoint.com/:u:/g/EYDsNqTf1UdEuE6B0ZNVPfIBf_I-AbaqHotB1lJOnxLlTg?e=boYP09 (high memory) https://pmcdigital.sharepoint.com/:u:/g/EWZBsnET2xBHl6OxO52RCfIBvQ0uIDQ1GKJZ1GrnviyMhg?e=wI3yaY (after reboot) Any help would be greatly appreciated, Kindest Regards, Pedro Maia Costa Senior Developer, pmc.digital _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Wed Feb 6 08:19:55 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Wed, 6 Feb 2019 13:49:55 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Hi Artem, Do you still see the crashes with 5.3? 
If yes, please try mount the volume using the mount option lru-limit=0 and see if that helps. We are looking into the crashes and will update when have a fix. Also, please provide the gluster volume info for the volume in question. regards, Nithya On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii wrote: > The fuse crash happened two more times, but this time monit helped recover > within 1 minute, so it's a great workaround for now. > > What's odd is that the crashes are only happening on one of 4 servers, and > I don't know why. > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii > wrote: > >> The fuse crash happened again yesterday, to another volume. Are there any >> mount options that could help mitigate this? >> >> In the meantime, I set up a monit (https://mmonit.com/monit/) task to >> watch and restart the mount, which works and recovers the mount point >> within a minute. Not ideal, but a temporary workaround. >> >> By the way, the way to reproduce this "Transport endpoint is not >> connected" condition for testing purposes is to kill -9 the right >> "glusterfs --process-name fuse" process. >> >> >> monit check: >> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >> start program = "/bin/mount /mnt/glusterfs_data1" >> stop program = "/bin/umount /mnt/glusterfs_data1" >> if space usage > 90% for 5 times within 15 cycles >> then alert else if succeeded for 10 cycles then alert >> >> >> stack trace: >> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fa0249e4329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7fa0249e4329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >> [2019-02-01 23:21:56.164427] >> The message "I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >> selecting local read_child SITE_data3-client-3" repeated 27 times between >> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >> pending frames: >> frame : type(1) op(LOOKUP) >> frame : type(0) op(0) >> patchset: git://git.gluster.org/glusterfs.git >> signal received: 6 >> time of crash: >> 2019-02-01 23:22:03 >> configuration details: >> argp 1 >> backtrace 1 >> dlfcn 1 >> libpthread 1 >> llistxattr 1 >> setfsid 1 >> spinlock 1 >> epoll.h 1 >> xattr.h 1 >> st_atim.tv_nsec 1 >> package-string: glusterfs 5.3 >> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >> 
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >> >> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii >> wrote: >> >>> Hi, >>> >>> The first (and so far only) crash happened at 2am the next day after we >>> upgraded, on only one of four servers and only to one of two mounts. >>> >>> I have no idea what caused it, but yeah, we do have a pretty busy site ( >>> apkmirror.com), and it caused a disruption for any uploads or downloads >>> from that server until I woke up and fixed the mount. >>> >>> I wish I could be more helpful but all I have is that stack trace. >>> >>> I'm glad it's a blocker and will hopefully be resolved soon. >>> >>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>> atumball at redhat.com> wrote: >>> >>>> Hi Artem, >>>> >>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a >>>> clone of other bugs where recent discussions happened), and marked it as a >>>> blocker for glusterfs-5.4 release. >>>> >>>> We already have fixes for log flooding - >>>> https://review.gluster.org/22128, and are the process of identifying >>>> and fixing the issue seen with crash. >>>> >>>> Can you please tell if the crashes happened as soon as upgrade ? or was >>>> there any particular pattern you observed before the crash. 
>>>> >>>> -Amar >>>> >>>> >>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>> archon810 at gmail.com> wrote: >>>> >>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I already >>>>> got a crash which others have mentioned in >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>> unmount, kill gluster, and remount: >>>>> >>>>> >>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fcccafcd329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fcccafcd329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fcccafcd329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fcccafcd329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>> The message "I [MSGID: 108031] >>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>> The message "E [MSGID: 101191] >>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>> [2019-01-31 09:38:04.696993] >>>>> pending frames: >>>>> frame : type(1) op(READ) >>>>> frame : type(1) op(OPEN) >>>>> frame : type(0) op(0) >>>>> patchset: git://git.gluster.org/glusterfs.git >>>>> signal received: 6 >>>>> time of crash: >>>>> 2019-01-31 09:38:04 >>>>> configuration details: >>>>> argp 1 >>>>> backtrace 1 >>>>> dlfcn 1 >>>>> libpthread 1 >>>>> llistxattr 1 >>>>> setfsid 1 >>>>> spinlock 1 >>>>> epoll.h 1 >>>>> xattr.h 1 >>>>> st_atim.tv_nsec 1 >>>>> package-string: glusterfs 5.3 >>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>> 
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>> >>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>> --------- >>>>> >>>>> Do the pending patches fix the crash or only the repeated warnings? >>>>> I'm running glusterfs on OpenSUSE 15.0 installed via >>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>> not too sure how to make it core dump. >>>>> >>>>> If it's not fixed by the patches above, has anyone already opened a >>>>> ticket for the crashes that I can join and monitor? This is going to create >>>>> a massive problem for us since production systems are crashing. >>>>> >>>>> Thanks. >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>> rgowdapp at redhat.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> Also, not sure if related or not, but I got a ton of these "Failed >>>>>>> to dispatch handler" in my logs as well. Many people have been commenting >>>>>>> about this issue here >>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>> >>>>>> >>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. >>>>>> >>>>>> >>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fd966fcd329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>> The message "E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>> The message "I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>> The message "I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>> The message "E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 
2-SITE_data3-replicate-0: >>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>> handler >>>>>>> >>>>>>> >>>>>>> I'm hoping raising the issue here on the mailing list may bring some >>>>>>> additional eyeballs and get them both fixed. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> I found a similar issue here: >>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>> spam. >>>>>>>> >>>>>>>> Here's the command that repeats over and over: >>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fd966fcd329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>> >>>>>>> >>>>>> +Milind Changire Can you check why this >>>>>> message is logged and send a fix? >>>>>> >>>>>> >>>>>>>> Is there any fix for this issue? >>>>>>>> >>>>>>>> Thanks. >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>> >>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> >>>> >>>> -- >>>> Amar Tumballi (amarts) >>>> >>> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Wed Feb 6 08:25:44 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Wed, 6 Feb 2019 13:55:44 +0530 Subject: [Gluster-users] gluster 5.3: transport endpoint gets disconnected - Assertion failed: GF_MEM_TRAILER_MAGIC In-Reply-To: References: Message-ID: Hi, The client logs indicates that the mount process has crashed. Please try mounting the volume with the volume option lru-limit=0 and see if it still crashes. Thanks, Nithya On Thu, 24 Jan 2019 at 12:47, Hu Bert wrote: > Good morning, > > we currently transfer some data to a new glusterfs volume; to check > the throughput of the new volume/setup while the transfer is running i > decided to create some files on one of the gluster servers with dd in > loop: > > while true; do dd if=/dev/urandom of=/shared/private/1G.file bs=1M > count=1024; rm /shared/private/1G.file; done > > /shared/private is the mount point of the glusterfs volume. The dd > should run for about an hour. 
But now it happened twice that during > this loop the transport endpoint gets disconnected: > > dd: failed to open '/shared/private/1G.file': Transport endpoint is > not connected > rm: cannot remove '/shared/private/1G.file': Transport endpoint is not > connected > > In the /var/log/glusterfs/shared-private.log i see: > > [2019-01-24 07:03:28.938745] W [MSGID: 108001] > [afr-transaction.c:1062:afr_handle_quorum] 0-persistent-replicate-0: > 7212652e-c437-426c-a0a9-a47f5972fffe: Failing WRITE as quorum i > s not met [Transport endpoint is not connected] > [2019-01-24 07:03:28.939280] E [mem-pool.c:331:__gf_free] > > (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be8c) > [0x7eff84248e8c] -->/usr/lib/x86_64-lin > ux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be18) > [0x7eff84248e18] > -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0xf6) > [0x7eff8a9485a6] ) 0-: Assertion failed: > GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) > [----snip----] > > The whole output can be found here: https://pastebin.com/qTMmFxx0 > gluster volume info here: https://pastebin.com/ENTWZ7j3 > > After umount + mount the transport endpoint is connected again - until > the next disconnect. A /core file gets generated. Maybe someone wants > to have a look at this file? > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From amukherj at redhat.com Wed Feb 6 08:31:36 2019 From: amukherj at redhat.com (Atin Mukherjee) Date: Wed, 6 Feb 2019 14:01:36 +0530 Subject: [Gluster-users] Getting timedout error while rebalancing In-Reply-To: References: Message-ID: On Tue, Feb 5, 2019 at 8:43 PM Nithya Balachandran wrote: > > > On Tue, 5 Feb 2019 at 17:26, deepu srinivasan wrote: > >> HI Nithya >> We have a test gluster setup.We are testing the rebalancing option of >> gluster. So we started the volume which have 1x3 brick with some data on it >> . >> command : gluster volume create test-volume replica 3 >> 192.168.xxx.xx1:/home/data/repl 192.168.xxx.xx2:/home/data/repl >> 192.168.xxx.xx3:/home/data/repl. >> >> Now we tried to expand the cluster storage by adding three more bricks. >> command : gluster volume add-brick test-volume 192.168.xxx.xx4:/home/data/repl >> 192.168.xxx.xx5:/home/data/repl 192.168.xxx.xx6:/home/data/repl >> >> So after the brick addition we tried to rebalance the layout and the data. >> command : gluster volume rebalance test-volume fix-layout start. >> The command exited with status "Error : Request timed out". >> > > This sounds like an error in the cli or glusterd. Can you send the > glusterd.log from the node on which you ran the command? > It seems to me that glusterd took more than 120 seconds to process the command and hence cli timed out. We can confirm the same by checking the status of the rebalance below which indicates rebalance did kick in and eventually completed. We need to understand why did it take such longer, so please pass on the cli and glusterd log from all the nodes as Nithya requested for. 
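For reference, a rough sketch of one way to gather those logs from every peer, assuming the default log location /var/log/glusterfs, ssh access to each node, and reusing the placeholder addresses already used in this thread:

    for host in 192.168.xxx.xx1 192.168.xxx.xx2 192.168.xxx.xx3 \
                192.168.xxx.xx4 192.168.xxx.xx5 192.168.xxx.xx6; do
        # stream the glusterd and CLI logs from each peer into a per-node tarball
        ssh "$host" 'tar czf - /var/log/glusterfs/glusterd.log /var/log/glusterfs/cli.log' \
            > "gluster-logs-$host.tar.gz"
    done

Note that cli.log is only present on nodes where the gluster CLI has actually been run, so tar may warn about it on some peers while still packing glusterd.log.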
> regards, > Nithya > >> >> After the failure of the command, we tried to view the status of the >> command and it is something like this : >> >> Node Rebalanced-files size >> scanned failures skipped status run time >> in h:m:s >> >> --------- ----------- ----------- >> ----------- ----------- ----------- ------------ >> -------------- >> >> localhost 41 41.0MB >> 8200 0 0 completed >> 0:00:09 >> >> 192.168.xxx.xx4 79 79.0MB >> 8231 0 0 completed >> 0:00:12 >> >> 192.168.xxx.xx6 58 58.0MB >> 8281 0 0 completed >> 0:00:10 >> >> 192.168.xxx.xx2 136 136.0MB >> 8566 0 136 completed >> 0:00:07 >> >> 192.168.xxx.xx4 129 129.0MB >> 8566 0 129 completed >> 0:00:07 >> >> 192.168.xxx.xx6 201 201.0MB >> 8566 0 201 completed >> 0:00:08 >> >> Is the rebalancing option working fine? Why did gluster throw the error >> saying that "Error : Request timed out"? >> .On Tue, Feb 5, 2019 at 4:23 PM Nithya Balachandran >> wrote: >> >>> Hi, >>> Please provide the exact step at which you are seeing the error. It >>> would be ideal if you could copy-paste the command and the error. >>> >>> Regards, >>> Nithya >>> >>> >>> >>> On Tue, 5 Feb 2019 at 15:24, deepu srinivasan >>> wrote: >>> >>>> HI everyone. I am getting "Error : Request timed out " while doing >>>> rebalance . I have aded new bricks to my replicated volume.i.e. First it >>>> was 1x3 volume and added three more bricks to make it >>>> distributed-replicated volume(2x3) . What should i do for the timeout error >>>> ? >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Wed Feb 6 08:47:41 2019 From: revirii at googlemail.com (Hu Bert) Date: Wed, 6 Feb 2019 09:47:41 +0100 Subject: [Gluster-users] usage of harddisks: each hdd a brick? raid? In-Reply-To: References: <22faba73-4e60-9a55-75bf-e52ce59858b3@ya.ru> <258abbde-5a3a-0df2-988a-cb4d1b8b5347@ya.ru> Message-ID: Hey there, just a little update... This week we switched from our 3 "old" gluster servers to 3 new ones, and with that we threw some hardware at the problem... old: 3 servers, each has 4 * 10 TB disks; each disk is used as a brick -> 4 x 3 = 12 distribute-replicate new: 3 servers, each has 10 * 10 TB disks; we built 2 RAID10 (6 disks and 4 disks), each RAID10 is a brick -> we split our data into 2 volumes, 1 x 3 = 3 replicate; as filesystem we use XFS (instead of ext4) with mount options inode64,noatime,nodiratime now. What we've seen so far: the volumes are independent - if one volume is under load, the other one isn't affected by that. Throughput, latency etc. seems to be better now. Of course you waste a lot of disk space when using RAID10 and replicate setup: 100TB per server (so 300TB in total) result in ~50TB volume size, but during the last year we had problems due to hard disk errors and the resulting brick restore (reset-brick) which took very long. Was a hard time... :-/ So our conclusion was: as the heal can be really painful, take very long and influence performance very badly -> try to avoid heals by not having to do "big" heals at all. 
That's why we chose a RAID10: under normal circumstances (a disk failing from time to time) there may be a RAID resync, but that may be faster and cause fewer performance issues than having to restore a complete brick. Or, more general: if you have big, slow disk and quite high I/O -> think about not using single disks as bricks. If you have the hardware (and the money), think about using RAID1 or RAID10. The smaller and/or faster the disks are (e.g. you have a lot of 1TB SSD/NVMe), using them as bricks might work better as (in case of disk failure) the heal should be much faster. No information about RAID5/6 possible, wasn't taken into consideration... just my 2 ?cents from (still) a gluster amateur :-) Best regards, Hubert Am Di., 22. Jan. 2019 um 07:11 Uhr schrieb Amar Tumballi Suryanarayan : > > > > On Thu, Jan 10, 2019 at 1:56 PM Hu Bert wrote: >> >> Hi, >> >> > > We ara also using 10TB disks, heal takes 7-8 days. >> > > You can play with "cluster.shd-max-threads" setting. It is default 1 I >> > > think. I am using it with 4. >> > > Below you can find more info: >> > > https://access.redhat.com/solutions/882233 >> > cluster.shd-max-threads: 8 >> > cluster.shd-wait-qlength: 10000 >> >> Our setup: >> cluster.shd-max-threads: 2 >> cluster.shd-wait-qlength: 10000 >> >> > >> Volume Name: shared >> > >> Type: Distributed-Replicate >> > A, you have distributed-replicated volume, but I choose only replicated >> > (for beginning simplicity :) >> > May be replicated volume are healing faster? >> >> Well, maybe our setup with 3 servers and 4 disks=bricks == 12 bricks, >> resulting in a distributed-replicate volume (all /dev/sd{a,b,c,d} >> identical) , isn't optimal? And it would be better to create a >> replicate 3 volume with only 1 (big) brick per server (with 4 disks: >> either a logical volume or sw/hw raid)? >> >> But it would be interesting to know if a replicate volume is healing >> faster than a distributed-replicate volume - even if there was only 1 >> faulty brick. >> > > We don't have any data point to agree to this, but it may be true. Specially, as the crawling when DHT (ie, distribute) is involved can get little slower, which means, the healing would get slower too. > > We are trying to experiment few performance enhancement patches (like https://review.gluster.org/20636), would be great to see how things work with newer base. Will keep the list updated about performance numbers once we have some more data on them. > > -Amar > >> >> >> Thx >> Hubert >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> > > > -- > Amar Tumballi (amarts) From revirii at googlemail.com Wed Feb 6 09:04:22 2019 From: revirii at googlemail.com (Hu Bert) Date: Wed, 6 Feb 2019 10:04:22 +0100 Subject: [Gluster-users] gluster 5.3: transport endpoint gets disconnected - Assertion failed: GF_MEM_TRAILER_MAGIC In-Reply-To: References: Message-ID: Hi there, just curious - from man mount.glusterfs: lru-limit=N Set fuse module's limit for number of inodes kept in LRU list to N [default: 0] This seems to be the default already? Set it explicitly? Regards, Hubert Am Mi., 6. Feb. 2019 um 09:26 Uhr schrieb Nithya Balachandran : > > Hi, > > The client logs indicates that the mount process has crashed. > Please try mounting the volume with the volume option lru-limit=0 and see if it still crashes. 
> > Thanks, > Nithya > > On Thu, 24 Jan 2019 at 12:47, Hu Bert wrote: >> >> Good morning, >> >> we currently transfer some data to a new glusterfs volume; to check >> the throughput of the new volume/setup while the transfer is running i >> decided to create some files on one of the gluster servers with dd in >> loop: >> >> while true; do dd if=/dev/urandom of=/shared/private/1G.file bs=1M >> count=1024; rm /shared/private/1G.file; done >> >> /shared/private is the mount point of the glusterfs volume. The dd >> should run for about an hour. But now it happened twice that during >> this loop the transport endpoint gets disconnected: >> >> dd: failed to open '/shared/private/1G.file': Transport endpoint is >> not connected >> rm: cannot remove '/shared/private/1G.file': Transport endpoint is not connected >> >> In the /var/log/glusterfs/shared-private.log i see: >> >> [2019-01-24 07:03:28.938745] W [MSGID: 108001] >> [afr-transaction.c:1062:afr_handle_quorum] 0-persistent-replicate-0: >> 7212652e-c437-426c-a0a9-a47f5972fffe: Failing WRITE as quorum i >> s not met [Transport endpoint is not connected] >> [2019-01-24 07:03:28.939280] E [mem-pool.c:331:__gf_free] >> (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be8c) >> [0x7eff84248e8c] -->/usr/lib/x86_64-lin >> ux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be18) >> [0x7eff84248e18] >> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0xf6) >> [0x7eff8a9485a6] ) 0-: Assertion failed: >> GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) >> [----snip----] >> >> The whole output can be found here: https://pastebin.com/qTMmFxx0 >> gluster volume info here: https://pastebin.com/ENTWZ7j3 >> >> After umount + mount the transport endpoint is connected again - until >> the next disconnect. A /core file gets generated. Maybe someone wants >> to have a look at this file? >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users From nbalacha at redhat.com Wed Feb 6 09:17:08 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Wed, 6 Feb 2019 14:47:08 +0530 Subject: [Gluster-users] gluster 5.3: transport endpoint gets disconnected - Assertion failed: GF_MEM_TRAILER_MAGIC In-Reply-To: References: Message-ID: On Wed, 6 Feb 2019 at 14:34, Hu Bert wrote: > Hi there, > > just curious - from man mount.glusterfs: > > lru-limit=N > Set fuse module's limit for number of inodes kept in LRU > list to N [default: 0] > Sorry, that is a bug in the man page and we will fix that. The current default is 131072: { .key = {"lru-limit"}, .type = GF_OPTION_TYPE_INT, .default_value = "131072", .min = 0, .description = "makes glusterfs invalidate kernel inodes after " "reaching this limit (0 means 'unlimited')", }, > > This seems to be the default already? Set it explicitly? > > Regards, > Hubert > > Am Mi., 6. Feb. 2019 um 09:26 Uhr schrieb Nithya Balachandran > : > > > > Hi, > > > > The client logs indicates that the mount process has crashed. > > Please try mounting the volume with the volume option lru-limit=0 and > see if it still crashes. 
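For anyone who does want to set it explicitly, a minimal sketch of such a mount, with server1 and myvol as made-up placeholder names (0 meaning "unlimited", per the option description above):

    # one-off mount with the kernel inode limit disabled
    mount -t glusterfs -o lru-limit=0 server1:/myvol /mnt/myvol

    # or the equivalent /etc/fstab entry
    server1:/myvol  /mnt/myvol  glusterfs  defaults,_netdev,lru-limit=0  0 0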
> > > > Thanks, > > Nithya > > > > On Thu, 24 Jan 2019 at 12:47, Hu Bert wrote: > >> > >> Good morning, > >> > >> we currently transfer some data to a new glusterfs volume; to check > >> the throughput of the new volume/setup while the transfer is running i > >> decided to create some files on one of the gluster servers with dd in > >> loop: > >> > >> while true; do dd if=/dev/urandom of=/shared/private/1G.file bs=1M > >> count=1024; rm /shared/private/1G.file; done > >> > >> /shared/private is the mount point of the glusterfs volume. The dd > >> should run for about an hour. But now it happened twice that during > >> this loop the transport endpoint gets disconnected: > >> > >> dd: failed to open '/shared/private/1G.file': Transport endpoint is > >> not connected > >> rm: cannot remove '/shared/private/1G.file': Transport endpoint is not > connected > >> > >> In the /var/log/glusterfs/shared-private.log i see: > >> > >> [2019-01-24 07:03:28.938745] W [MSGID: 108001] > >> [afr-transaction.c:1062:afr_handle_quorum] 0-persistent-replicate-0: > >> 7212652e-c437-426c-a0a9-a47f5972fffe: Failing WRITE as quorum i > >> s not met [Transport endpoint is not connected] > >> [2019-01-24 07:03:28.939280] E [mem-pool.c:331:__gf_free] > >> > (-->/usr/lib/x86_64-linux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be8c) > >> [0x7eff84248e8c] -->/usr/lib/x86_64-lin > >> ux-gnu/glusterfs/5.3/xlator/cluster/replicate.so(+0x5be18) > >> [0x7eff84248e18] > >> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(__gf_free+0xf6) > >> [0x7eff8a9485a6] ) 0-: Assertion failed: > >> GF_MEM_TRAILER_MAGIC == *(uint32_t *)((char *)free_ptr + header->size) > >> [----snip----] > >> > >> The whole output can be found here: https://pastebin.com/qTMmFxx0 > >> gluster volume info here: https://pastebin.com/ENTWZ7j3 > >> > >> After umount + mount the transport endpoint is connected again - until > >> the next disconnect. A /core file gets generated. Maybe someone wants > >> to have a look at this file? > >> _______________________________________________ > >> Gluster-users mailing list > >> Gluster-users at gluster.org > >> https://lists.gluster.org/mailman/listinfo/gluster-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hawk at tbi.univie.ac.at Wed Feb 6 13:38:57 2019 From: hawk at tbi.univie.ac.at (Richard Neuboeck) Date: Wed, 6 Feb 2019 14:38:57 +0100 Subject: [Gluster-users] gluster client 4.1 memory leak Message-ID: Hi Gluster-Group, I've stumbled upon a memory leak in the gluster client 4.1. It manifests itself the same way the last one [1] did in 3.12. Memory consumption of the glusterfs process climbs until the system is out of memory and the process gets killed. Excerpt from the system log: rnel: Out of memory: Kill process 77419 (glusterfs) score 505 or sacrifice child rnel: Killed process 77419 (glusterfs) total-vm:71549476kB, anon-rss:70730944kB, file-rss:196kB, shmem-rss:0kB I didn't find a bug report for this problem for version 4.1 on Bugzilla and I'm unsure if I should open a bug report there or somewhere else. I'm running gluster 4.1 from the CentOS repo on the client and the server. 
centos-release-gluster41-1.0-3.el7.centos.noarch glusterfs-client-xlators-4.1.7-1.el7.x86_64 glusterfs-4.1.7-1.el7.x86_64 glusterfs-libs-4.1.7-1.el7.x86_64 glusterfs-fuse-4.1.7-1.el7.x86_64 The files are transfered with rsync: Number of files: 15,143,321 (reg: 13,735,846, dir: 1,283,846, link: 123,471, special: 158) Total file size: 1,232,120,420,136 bytes I've created statedumps every three hours. I hope that helps to track down the problem. The dumps are here: www.tbi.univie.ac.at/~hawk/gluster/gluster_dump.tar.xz Is there any thing else I can help with to solve this problem? Cheers Richard [1]: https://lists.gluster.org/pipermail/gluster-users/2018-September/034831.html -- /dev/null -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: OpenPGP digital signature URL: From spisla80 at gmail.com Wed Feb 6 15:36:24 2019 From: spisla80 at gmail.com (David Spisla) Date: Wed, 6 Feb 2019 16:36:24 +0100 Subject: [Gluster-users] Corrupted File readable via FUSE? In-Reply-To: References: Message-ID: Hello Raghavendra, I can not give you the output of the gluster commands because I repaired the system already. But beside of this this errors occurs randomly. I am sure that only one copy of the file was corrupted because it is part of a test and I corrupt one copy of the file manually on brick level and after this I check if it is still readable. During this conversation the error occurs again. Here is the Log from brick of node1: [2019-02-06 14:15:09.524638] E [MSGID: 115070] > [server-rpc-fops_v2.c:1503:server4_open_cbk] 0- > archive1-server: 32: OPEN /data/file1.txt > (23b623cb-7256-4fe6-85b0-1026b1531a86), client: CTX_ > > ID:e3871169-af62-44aa-a990-fa4248283c08-GRAPH_ID:0-PID:31830-HOST:fs-lrunning-c1-n1-PC_NAME:ar > chive1-client-0-RECON_NO:-0, error-xlator: archive1-bitrot-stub > [Input/output error] > [2019-02-06 14:15:09.535587] E [MSGID: 115070] > [server-rpc-fops_v2.c:1503:server4_open_cbk] 0- > archive1-server: 56: OPEN /data/file1.txt > (23b623cb-7256-4fe6-85b0-1026b1531a86), client: CTX_ > > ID:e3871169-af62-44aa-a990-fa4248283c08-GRAPH_ID:0-PID:31830-HOST:fs-lrunning-c1-n1-PC_NAME:ar > chive1-client-0-RECON_NO:-0, error-xlator: archive1-bitrot-stub > [Input/output error] > The message "E [MSGID: 116020] > [bit-rot-stub.c:647:br_stub_check_bad_object] 0-archive1-bitrot > -stub: 23b623cb-7256-4fe6-85b0-1026b1531a86 is a bad object. 
Returning" > repeated 2 times betwe > en [2019-02-06 14:15:09.524599] and [2019-02-06 14:15:09.549409] > [2019-02-06 14:15:09.549427] E [MSGID: 115070] > [server-rpc-fops_v2.c:1503:server4_open_cbk] 0- > archive1-server: 70: OPEN /data/file1.txt > (23b623cb-7256-4fe6-85b0-1026b1531a86), client: CTX_ > > ID:e3871169-af62-44aa-a990-fa4248283c08-GRAPH_ID:0-PID:31830-HOST:fs-lrunning-c1-n1-PC_NAME:ar > chive1-client-0-RECON_NO:-0, error-xlator: archive1-bitrot-stub > [Input/output error] > [2019-02-06 14:15:09.561450] I [MSGID: 115036] > [server.c:469:server_rpc_notify] 0-archive1-ser > ver: disconnecting connection from > CTX_ID:e3871169-af62-44aa-a990-fa4248283c08-GRAPH_ID:0-PID: > 31830-HOST:fs-lrunning-c1-n1-PC_NAME:archive1-client-0-RECON_NO:-0 > [2019-02-06 14:15:09.561568] I [MSGID: 101055] > [client_t.c:435:gf_client_unref] 0-archive1-ser > ver: Shutting down connection > CTX_ID:e3871169-af62-44aa-a990-fa4248283c08-GRAPH_ID:0-PID:31830-HOST:fs-lrunning-c1-n1-PC_NAME:archive1-client-0-RECON_NO:-0 > [2019-02-06 14:15:10.188406] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:10.201029] I [glusterfsd-mgmt.c:2005:mgmt_getspec_cbk] > 0-glusterfs: No change in volfile,continuing > [2019-02-06 14:15:10.514721] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:10.526216] I [glusterfsd-mgmt.c:2005:mgmt_getspec_cbk] > 0-glusterfs: No change in volfile,continuing > The message "E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler" repeated 79 times between [2019-02-06 14:15:09.499105] and > [2019-02-06 14:15:10.682592] > [2019-02-06 14:15:10.684204] E [MSGID: 116020] > [bit-rot-stub.c:647:br_stub_check_bad_object] 0-archive1-bitrot-stub: > 23b623cb-7256-4fe6-85b0-1026b1531a86 is a bad object. Returning > [2019-02-06 14:15:10.684262] E [MSGID: 115070] > [server-rpc-fops_v2.c:1503:server4_open_cbk] 0-archive1-server: 2146148: > OPEN /data/file1.txt (23b623cb-7256-4fe6-85b0-1026b1531a86), client: > CTX_ID:0545b52c-2843-4833-a5fc-b11e062a72d3-GRAPH_ID:0-PID:2458-HOST:fs-lrunning-c1-n1-PC_NAME:archive1-client-0-RECON_NO:-3, > error-xlator: archive1-bitrot-stub [Input/output error] > [2019-02-06 14:15:10.684949] E [MSGID: 116020] > [bit-rot-stub.c:647:br_stub_check_bad_object] 0-archive1-bitrot-stub: > 23b623cb-7256-4fe6-85b0-1026b1531a86 is a bad object. Returning > [2019-02-06 14:15:10.684982] E [MSGID: 115070] > [server-rpc-fops_v2.c:1503:server4_open_cbk] 0-archive1-server: 2146149: > OPEN /data/file1.txt (23b623cb-7256-4fe6-85b0-1026b1531a86), client: > CTX_ID:0545b52c-2843-4833-a5fc-b11e062a72d3-GRAPH_ID:0-PID:2458-HOST:fs-lrunning-c1-n1-PC_NAME:archive1-client-0-RECON_NO:-3, > error-xlator: archive1-bitrot-stub [Input/output error] > [2019-02-06 14:15:10.686566] E [MSGID: 116020] > [bit-rot-stub.c:647:br_stub_check_bad_object] 0-archive1-bitrot-stub: > 23b623cb-7256-4fe6-85b0-1026b1531a86 is a bad object. 
Returning > [2019-02-06 14:15:10.686600] E [MSGID: 115070] > [server-rpc-fops_v2.c:1503:server4_open_cbk] 0-archive1-server: 2146150: > OPEN /data/file1.txt (23b623cb-7256-4fe6-85b0-1026b1531a86), client: > CTX_ID:0545b52c-2843-4833-a5fc-b11e062a72d3-GRAPH_ID:0-PID:2458-HOST:fs-lrunning-c1-n1-PC_NAME:archive1-client-0-RECON_NO:-3, > error-xlator: archive1-bitrot-stub [Input/output error] > [2019-02-06 14:15:11.189361] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:11.207835] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:11.220763] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > > One can see that there is bitrot file on brick of node1. This seems to be correct. Here the Log of the FUSE Mount Node1: [2019-02-06 14:15:10.684387] E [MSGID: 114031] > [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote > operation failed. Path: /data/file1.txt > (23b623cb-7256-4fe6-85b0-1026b1531a86) [Input/output error] > [2019-02-06 14:15:10.684556] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7feff4bde329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7feff4defaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7feffcecf218] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-06 14:15:10.685122] E [MSGID: 114031] > [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote > operation failed. Path: /data/file1.txt > (23b623cb-7256-4fe6-85b0-1026b1531a86) [Input/output error] > [2019-02-06 14:15:10.685127] E [MSGID: 108009] > [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to > open /data/file1.txt on subvolume archive1-client-0 [Input/output error] > [2019-02-06 14:15:10.686207] W [fuse-bridge.c:2371:fuse_readv_cbk] > 0-glusterfs-fuse: 4623583: READ => -1 > gfid=23b623cb-7256-4fe6-85b0-1026b1531a86 fd=0x7fefa8c5d618 (Transport > endpoint is not connected) > [2019-02-06 14:15:10.686306] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7feff4bde329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7feff4defaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7feffcecf218] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-06 14:15:10.686690] E [MSGID: 114031] > [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote > operation failed. Path: /data/file1.txt > (23b623cb-7256-4fe6-85b0-1026b1531a86) [Input/output error] > [2019-02-06 14:15:10.686714] E [MSGID: 108009] > [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to > open /data/file1.txt on subvolume archive1-client-0 [Input/output error] > [2019-02-06 14:15:10.686877] W [fuse-bridge.c:2371:fuse_readv_cbk] > 0-glusterfs-fuse: 4623584: READ => -1 > gfid=23b623cb-7256-4fe6-85b0-1026b1531a86 fd=0x7fefa8c5d618 (Transport > endpoint is not connected) > [2019-02-06 14:15:10.687500] W [MSGID: 114028] > [client-lk.c:347:delete_granted_locks_owner] 0-archive1-client-0: fdctx not > valid [Invalid argument] > > One can see an "Input/output error" because of the corrupted file from brick of node1. At this time the *brick on node 2 was really down* but on Node 3, 4 they were up. So still 2 good copies are reachable. Or not? The Log of the bricks from node 3,4 has no entry for this "file1.txt". It seems to be that the Client Stack did no requests to this bricks. 
Example Log of brick from node 3: [2019-02-06 14:15:09.561650] E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler > [2019-02-06 14:15:10.220218] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:10.236379] I [glusterfsd-mgmt.c:2005:mgmt_getspec_cbk] > 0-glusterfs: No change in volfile,continuing > [2019-02-06 14:15:10.541472] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:10.556125] I [glusterfsd-mgmt.c:2005:mgmt_getspec_cbk] > 0-glusterfs: No change in volfile,continuing > [2019-02-06 14:15:11.248253] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:11.264428] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > [2019-02-06 14:15:11.277016] I [glusterfsd-mgmt.c:58:mgmt_cbk_spec] > 0-mgmt: Volume file changed > > Is there a hidden quorum active? I have a 4-way Replica Volume, so 2 of 4 Copies are good and reachable Regards David Am Di., 5. Feb. 2019 um 21:06 Uhr schrieb FNU Raghavendra Manjunath < rabhat at redhat.com>: > > Hi David, > > Do you have any bricks down? Can you please share the output of the > following commands and also the logs of the server and the client nodes? > > 1) gluster volume info > 2) gluster volume status > 3) gluster volume bitrot scrub status > > Few more questions > > 1) How many copies of the file were corrupted? (All? Or Just one?) > > 2 things I am trying to understand > > A) IIUC, if only one copy is corrupted, then the replication module from > the gluster client should serve the data from the > remaining good copy > B) If all the copies were corrupted (or say more than quorum copies were > corrupted which means 2 in case of 3 way replication) > then there will be an error to the application. But the error to be > reported should 'Input/Output Error'. Not 'Transport endpoint not connected' > 'Transport endpoint not connected' error usually comes when a brick > where the operation is being directed to is not connected to the client. > > > > Regards, > Raghavendra > > On Mon, Feb 4, 2019 at 6:02 AM David Spisla wrote: > >> Hello Amar, >> sounds good. Until now this patch is only merged into master. I think it >> should be part of the next v5.x patch release! >> >> Regards >> David >> >> Am Mo., 4. Feb. 2019 um 09:58 Uhr schrieb Amar Tumballi Suryanarayan < >> atumball at redhat.com>: >> >>> Hi David, >>> >>> I guess https://review.gluster.org/#/c/glusterfs/+/21996/ helps to fix >>> the issue. I will leave it to Raghavendra Bhat to reconfirm. >>> >>> Regards, >>> Amar >>> >>> On Fri, Feb 1, 2019 at 8:45 PM David Spisla wrote: >>> >>>> Hello Gluster Community, >>>> I have got a 4 Node Cluster with a Replica 4 Volume, so each node has a >>>> brick with a copy of a file. Now I tried out the bitrot functionality and >>>> corrupt the copy on the brick of node1. After this I scrub ondemand and the >>>> file is marked correctly as corrupted. >>>> >>>> No I try to read that file from FUSE on node1 (with corrupt copy): >>>> $ cat file1.txt >>>> cat: file1.txt: Transport endpoint is not connected >>>> FUSE log says: >>>> >>>> *[2019-02-01 15:02:19.191984] E [MSGID: 114031] >>>> [client-rpc-fops_v2.c:281:client4_0_open_cbk] 0-archive1-client-0: remote >>>> operation failed. 
Path: /data/file1.txt >>>> (b432c1d6-ece2-42f2-8749-b11e058c4be3) [Input/output error]* >>>> [2019-02-01 15:02:19.192269] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fc642471329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fc642682af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fc64a78d218] ) 0-dict: dict is NULL [Invalid argument] >>>> [2019-02-01 15:02:19.192714] E [MSGID: 108009] >>>> [afr-open.c:220:afr_openfd_fix_open_cbk] 0-archive1-replicate-0: Failed to >>>> open /data/file1.txt on subvolume archive1-client-0 [Input/output error] >>>> *[2019-02-01 15:02:19.193009] W [fuse-bridge.c:2371:fuse_readv_cbk] >>>> 0-glusterfs-fuse: 147733: READ => -1 >>>> gfid=b432c1d6-ece2-42f2-8749-b11e058c4be3 fd=0x7fc60408bbb8 (Transport >>>> endpoint is not connected)* >>>> [2019-02-01 15:02:19.193653] W [MSGID: 114028] >>>> [client-lk.c:347:delete_granted_locks_owner] 0-archive1-client-0: fdctx not >>>> valid [Invalid argument] >>>> >>>> And from FUSE on node2 (with heal copy): >>>> $ cat file1.txt >>>> file1 >>>> >>>> It seems to be that node1 wants to get the file from its own brick, but >>>> the copy there is broken. Node2 gets the file from its own brick with a >>>> heal copy, so reading the file succeed. >>>> But I am wondering myself because sometimes reading the file from node1 >>>> with the broken copy succeed >>>> >>>> What is the expected behaviour here? Is it possibly to read files with >>>> a corrupted copy from any client access? >>>> >>>> Regards >>>> David Spisla >>>> >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >>> >>> -- >>> Amar Tumballi (amarts) >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From archon810 at gmail.com Wed Feb 6 18:48:28 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Wed, 6 Feb 2019 10:48:28 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Hi Nithya, Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing crashes, and no further releases have been made yet. 
volume info: Type: Replicate Volume ID: ****SNIP**** Status: Started Snapshot Count: 0 Number of Bricks: 1 x 4 = 4 Transport-type: tcp Bricks: Brick1: ****SNIP**** Brick2: ****SNIP**** Brick3: ****SNIP**** Brick4: ****SNIP**** Options Reconfigured: cluster.quorum-count: 1 cluster.quorum-type: fixed network.ping-timeout: 5 network.remote-dio: enable performance.rda-cache-limit: 256MB performance.readdir-ahead: on performance.parallel-readdir: on network.inode-lru-limit: 500000 performance.md-cache-timeout: 600 performance.cache-invalidation: on performance.stat-prefetch: on features.cache-invalidation-timeout: 600 features.cache-invalidation: on cluster.readdir-optimize: on performance.io-thread-count: 32 server.event-threads: 4 client.event-threads: 4 performance.read-ahead: off cluster.lookup-optimize: on performance.cache-size: 1GB cluster.self-heal-daemon: enable transport.address-family: inet nfs.disable: on performance.client-io-threads: on cluster.granular-entry-heal: enable cluster.data-self-heal-algorithm: full Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran wrote: > Hi Artem, > > Do you still see the crashes with 5.3? If yes, please try mount the volume > using the mount option lru-limit=0 and see if that helps. We are looking > into the crashes and will update when have a fix. > > Also, please provide the gluster volume info for the volume in question. > > > regards, > Nithya > > On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii > wrote: > >> The fuse crash happened two more times, but this time monit helped >> recover within 1 minute, so it's a great workaround for now. >> >> What's odd is that the crashes are only happening on one of 4 servers, >> and I don't know why. >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii >> wrote: >> >>> The fuse crash happened again yesterday, to another volume. Are there >>> any mount options that could help mitigate this? >>> >>> In the meantime, I set up a monit (https://mmonit.com/monit/) task to >>> watch and restart the mount, which works and recovers the mount point >>> within a minute. Not ideal, but a temporary workaround. >>> >>> By the way, the way to reproduce this "Transport endpoint is not >>> connected" condition for testing purposes is to kill -9 the right >>> "glusterfs --process-name fuse" process. 
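A small sketch of that reproduction step, assuming procps pgrep is available and that it is only run against a throwaway test mount; the PID is a placeholder to fill in from the pgrep output:

    # list fuse client processes with their full command lines; the mount point
    # usually appears in the arguments, which identifies the right one
    pgrep -af 'glusterfs.*process-name fuse'

    # force-kill the client backing the test mount to trigger
    # "Transport endpoint is not connected" on that mount point
    kill -9 <PID>

Killing the client process rather than unmounting is what leaves the mount point in the "Transport endpoint is not connected" state that the monit check is meant to catch and recover from.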
>>> >>> >>> monit check: >>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>> start program = "/bin/mount /mnt/glusterfs_data1" >>> stop program = "/bin/umount /mnt/glusterfs_data1" >>> if space usage > 90% for 5 times within 15 cycles >>> then alert else if succeeded for 10 cycles then alert >>> >>> >>> stack trace: >>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7fa0249e4329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7fa0249e4329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>> The message "E [MSGID: 101191] >>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>> [2019-02-01 23:21:56.164427] >>> The message "I [MSGID: 108031] >>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>> pending frames: >>> frame : type(1) op(LOOKUP) >>> frame : type(0) op(0) >>> patchset: git://git.gluster.org/glusterfs.git >>> signal received: 6 >>> time of crash: >>> 2019-02-01 23:22:03 >>> configuration details: >>> argp 1 >>> backtrace 1 >>> dlfcn 1 >>> libpthread 1 >>> llistxattr 1 >>> setfsid 1 >>> spinlock 1 >>> epoll.h 1 >>> xattr.h 1 >>> st_atim.tv_nsec 1 >>> package-string: glusterfs 5.3 >>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>> >>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>> >>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>> >>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii >>> wrote: >>> >>>> Hi, >>>> >>>> The first (and so far only) crash happened at 2am the next day after we >>>> upgraded, on only one of four servers and only to one of two mounts. 
>>>> >>>> I have no idea what caused it, but yeah, we do have a pretty busy site ( >>>> apkmirror.com), and it caused a disruption for any uploads or >>>> downloads from that server until I woke up and fixed the mount. >>>> >>>> I wish I could be more helpful but all I have is that stack trace. >>>> >>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>> >>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>> atumball at redhat.com> wrote: >>>> >>>>> Hi Artem, >>>>> >>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a >>>>> clone of other bugs where recent discussions happened), and marked it as a >>>>> blocker for glusterfs-5.4 release. >>>>> >>>>> We already have fixes for log flooding - >>>>> https://review.gluster.org/22128, and are the process of identifying >>>>> and fixing the issue seen with crash. >>>>> >>>>> Can you please tell if the crashes happened as soon as upgrade ? or >>>>> was there any particular pattern you observed before the crash. >>>>> >>>>> -Amar >>>>> >>>>> >>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I already >>>>>> got a crash which others have mentioned in >>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>> unmount, kill gluster, and remount: >>>>>> >>>>>> >>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fcccafcd329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fcccafcd329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fcccafcd329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fcccafcd329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>> [2019-01-31 09:38:04.696993] >>>>>> pending frames: >>>>>> frame : type(1) op(READ) >>>>>> frame : type(1) op(OPEN) >>>>>> frame : type(0) op(0) >>>>>> patchset: 
git://git.gluster.org/glusterfs.git >>>>>> signal received: 6 >>>>>> time of crash: >>>>>> 2019-01-31 09:38:04 >>>>>> configuration details: >>>>>> argp 1 >>>>>> backtrace 1 >>>>>> dlfcn 1 >>>>>> libpthread 1 >>>>>> llistxattr 1 >>>>>> setfsid 1 >>>>>> spinlock 1 >>>>>> epoll.h 1 >>>>>> xattr.h 1 >>>>>> st_atim.tv_nsec 1 >>>>>> package-string: glusterfs 5.3 >>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>> --------- >>>>>> >>>>>> Do the pending patches fix the crash or only the repeated warnings? >>>>>> I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>> not too sure how to make it core dump. >>>>>> >>>>>> If it's not fixed by the patches above, has anyone already opened a >>>>>> ticket for the crashes that I can join and monitor? This is going to create >>>>>> a massive problem for us since production systems are crashing. >>>>>> >>>>>> Thanks. >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>> rgowdapp at redhat.com> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Also, not sure if related or not, but I got a ton of these "Failed >>>>>>>> to dispatch handler" in my logs as well. Many people have been commenting >>>>>>>> about this issue here >>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>> >>>>>>> >>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>>>>>> >>>>>>> >>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fd966fcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>> handler >>>>>>>> >>>>>>>> >>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>> >>>>>>>> Thanks. >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> I found a similar issue here: >>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>> spam. >>>>>>>>> >>>>>>>>> Here's the command that repeats over and over: >>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fd966fcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> >>>>>>>> >>>>>>> +Milind Changire Can you check why this >>>>>>> message is logged and send a fix? >>>>>>> >>>>>>> >>>>>>>>> Is there any fix for this issue? 
>>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Gluster-users mailing list >>>>>>>> Gluster-users at gluster.org >>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>> >>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> >>>>> >>>>> -- >>>>> Amar Tumballi (amarts) >>>>> >>>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdeepugd at gmail.com Tue Feb 5 11:56:09 2019 From: sdeepugd at gmail.com (deepu srinivasan) Date: Tue, 5 Feb 2019 17:26:09 +0530 Subject: [Gluster-users] Getting timedout error while rebalancing In-Reply-To: References: Message-ID: HI Nithya We have a test gluster setup.We are testing the rebalancing option of gluster. So we started the volume which have 1x3 brick with some data on it . command : gluster volume create test-volume replica 3 192.168.xxx.xx1:/home/data/repl 192.168.xxx.xx2:/home/data/repl 192.168.xxx.xx3:/home/data/repl. Now we tried to expand the cluster storage by adding three more bricks. command : gluster volume add-brick test-volume 192.168.xxx.xx4:/home/data/repl 192.168.xxx.xx5:/home/data/repl 192.168.xxx.xx6:/home/data/repl So after the brick addition we tried to rebalance the layout and the data. command : gluster volume rebalance test-volume fix-layout start. The command exited with status "Error : Request timed out". After the failure of the command, we tried to view the status of the command and it is something like this : Node Rebalanced-files size scanned failures skipped status run time in h:m:s --------- ----------- ----------- ----------- ----------- ----------- ------------ -------------- localhost 41 41.0MB 8200 0 0 completed 0:00:09 192.168.xxx.xx4 79 79.0MB 8231 0 0 completed 0:00:12 192.168.xxx.xx6 58 58.0MB 8281 0 0 completed 0:00:10 192.168.xxx.xx2 136 136.0MB 8566 0 136 completed 0:00:07 192.168.xxx.xx4 129 129.0MB 8566 0 129 completed 0:00:07 192.168.xxx.xx6 201 201.0MB 8566 0 201 completed 0:00:08 Is the rebalancing option working fine? Why did gluster throw the error saying that "Error : Request timed out"? .On Tue, Feb 5, 2019 at 4:23 PM Nithya Balachandran wrote: > Hi, > Please provide the exact step at which you are seeing the error. It would > be ideal if you could copy-paste the command and the error. > > Regards, > Nithya > > > > On Tue, 5 Feb 2019 at 15:24, deepu srinivasan wrote: > >> HI everyone. I am getting "Error : Request timed out " while doing >> rebalance . I have aded new bricks to my replicated volume.i.e. First it >> was 1x3 volume and added three more bricks to make it >> distributed-replicated volume(2x3) . What should i do for the timeout error >> ? >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From sdeepugd at gmail.com Wed Feb 6 13:37:41 2019 From: sdeepugd at gmail.com (deepu srinivasan) Date: Wed, 6 Feb 2019 19:07:41 +0530 Subject: [Gluster-users] Getting timedout error while rebalancing In-Reply-To: References: Message-ID: Please find the glusterd.log file attached. On Wed, Feb 6, 2019 at 2:01 PM Atin Mukherjee wrote: > > > On Tue, Feb 5, 2019 at 8:43 PM Nithya Balachandran > wrote: > >> >> >> On Tue, 5 Feb 2019 at 17:26, deepu srinivasan wrote: >> >>> HI Nithya >>> We have a test gluster setup.We are testing the rebalancing option of >>> gluster. So we started the volume which have 1x3 brick with some data on it >>> . >>> command : gluster volume create test-volume replica 3 >>> 192.168.xxx.xx1:/home/data/repl 192.168.xxx.xx2:/home/data/repl >>> 192.168.xxx.xx3:/home/data/repl. >>> >>> Now we tried to expand the cluster storage by adding three more bricks. >>> command : gluster volume add-brick test-volume 192.168.xxx.xx4:/home/data/repl >>> 192.168.xxx.xx5:/home/data/repl 192.168.xxx.xx6:/home/data/repl >>> >>> So after the brick addition we tried to rebalance the layout and the >>> data. >>> command : gluster volume rebalance test-volume fix-layout start. >>> The command exited with status "Error : Request timed out". >>> >> >> This sounds like an error in the cli or glusterd. Can you send the >> glusterd.log from the node on which you ran the command? >> > > It seems to me that glusterd took more than 120 seconds to process the > command and hence cli timed out. We can confirm the same by checking the > status of the rebalance below which indicates rebalance did kick in and > eventually completed. We need to understand why did it take such longer, so > please pass on the cli and glusterd log from all the nodes as Nithya > requested for. > > >> regards, >> Nithya >> >>> >>> After the failure of the command, we tried to view the status of the >>> command and it is something like this : >>> >>> Node Rebalanced-files size >>> scanned failures skipped status run time >>> in h:m:s >>> >>> --------- ----------- ----------- >>> ----------- ----------- ----------- ------------ >>> -------------- >>> >>> localhost 41 41.0MB >>> 8200 0 0 completed >>> 0:00:09 >>> >>> 192.168.xxx.xx4 79 79.0MB >>> 8231 0 0 completed >>> 0:00:12 >>> >>> 192.168.xxx.xx6 58 58.0MB >>> 8281 0 0 completed >>> 0:00:10 >>> >>> 192.168.xxx.xx2 136 136.0MB >>> 8566 0 136 completed >>> 0:00:07 >>> >>> 192.168.xxx.xx4 129 129.0MB >>> 8566 0 129 completed >>> 0:00:07 >>> >>> 192.168.xxx.xx6 201 201.0MB >>> 8566 0 201 completed >>> 0:00:08 >>> >>> Is the rebalancing option working fine? Why did gluster throw the error >>> saying that "Error : Request timed out"? >>> .On Tue, Feb 5, 2019 at 4:23 PM Nithya Balachandran >>> wrote: >>> >>>> Hi, >>>> Please provide the exact step at which you are seeing the error. It >>>> would be ideal if you could copy-paste the command and the error. >>>> >>>> Regards, >>>> Nithya >>>> >>>> >>>> >>>> On Tue, 5 Feb 2019 at 15:24, deepu srinivasan >>>> wrote: >>>> >>>>> HI everyone. I am getting "Error : Request timed out " while doing >>>>> rebalance . I have aded new bricks to my replicated volume.i.e. First it >>>>> was 1x3 volume and added three more bricks to make it >>>>> distributed-replicated volume(2x3) . What should i do for the timeout error >>>>> ? 
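Since glusterd kept working after the CLI gave up, the "Request timed out" here looks mostly cosmetic: the status table above already shows every node completed. A minimal sketch for double-checking after such a timeout, assuming the volume really is named test-volume and the default log locations of a packaged install:

# gluster volume rebalance test-volume status
    (re-queries the state; it does not restart the rebalance)
# less /var/log/glusterfs/test-volume-rebalance.log
    (per-node rebalance detail)
# less /var/log/glusterfs/cli.log /var/log/glusterfs/glusterd.log
    (the logs Atin and Nithya asked to be collected from each node)

If logging was redirected elsewhere at install time, the paths above will differ.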
>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: glusterd.log Type: application/octet-stream Size: 4051106 bytes Desc: not available URL: From mabi at protonmail.ch Thu Feb 7 07:35:26 2019 From: mabi at protonmail.ch (mabi) Date: Thu, 07 Feb 2019 07:35:26 +0000 Subject: [Gluster-users] quotad error log warnings repeated Message-ID: Hello, I am running a 3 node (with arbiter) GlusterFS 4.1.6 cluster with one replicated volume where I have quotas enabled. Now I checked my quotad.log file on one of the nodes and can see a lot of these warning messages which are repeated a lot: The message "W [MSGID: 101016] [glusterfs3.h:743:dict_to_xdr] 0-dict: key 'trusted.glusterfs.quota.size' is not sent on wire [Invalid argument]" repeated 224 times between [2019-02-07 07:28:15.291923] and [2019-02-07 07:30:02.625004] The message "W [MSGID: 101016] [glusterfs3.h:743:dict_to_xdr] 0-dict: key 'volume-uuid' is not sent on wire [Invalid argument]" repeated 224 times between [2019-02-07 07:28:15.291949] and [2019-02-07 07:30:02.625004] [2019-02-07 07:30:07.747135] W [MSGID: 101016] [glusterfs3.h:743:dict_to_xdr] 0-dict: key 'trusted.glusterfs.quota.size' is not sent on wire [Invalid argument] [2019-02-07 07:30:07.747164] W [MSGID: 101016] [glusterfs3.h:743:dict_to_xdr] 0-dict: key 'volume-uuid' is not sent on wire [Invalid argument] I can re-trigger these warning messages on demand for example by running $ gluster volume quota myvolume list Does anyone know if this is bad? is it a bug? and what can I do about it? Best regards, Mabi From nicolas.schrevel at l3ia.fr Thu Feb 7 12:18:41 2019 From: nicolas.schrevel at l3ia.fr (Nicolas SCHREVEL) Date: Thu, 7 Feb 2019 13:18:41 +0100 Subject: [Gluster-users] Mounting Gluster volume from "old" Ubuntu 14 Message-ID: <67b3c233-d266-00f8-5a10-04d1e5fc5e54@l3ia.fr> Hy, I'm trying to mount a gluster volume from an old Ubuntu 14 server (it's an NFS Server, i want to move data to Gluster volumes). With NFS : root at ubuntu01:~# mount -t nfs -o vers=3 hqn-gluster-01:/hqn-preprod /mnt/gluster-preprod mount.nfs: requested NFS version or transport protocol is not supported With gluster-client : /etc/fstab : hqn-gluster-01:/hqn-preprod /mnt/gluster-preprod??????? glusterfs defaults,_netdev,log-level=WARNING,log-file=/var/log/gluster-preprod.log,backup-volfile-servers=hqn-gluster-02:hqn-gluster-03 0 0 root at ubuntu01:~# glusterfs --version glusterfs 3.4.2 built on Jan 14 2014 18:05:35 Repository revision: git://git.gluster.com/glusterfs.git Copyright (c) 2006-2013 Red Hat, Inc. GlusterFS comes with ABSOLUTELY NO WARRANTY. It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation. 
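The 3.4.2 build above is the stock trusty package; it predates translators such as performance/readdir-ahead that newer server volfiles reference, and the mount log below also shows that fusermount itself is missing. A possible way forward, assuming the Gluster community PPAs still publish glusterfs-client packages for Ubuntu 14.04:

# apt-get install fuse software-properties-common
    (provides fusermount and add-apt-repository)
# add-apt-repository ppa:gluster/glusterfs-3.12
# apt-get update && apt-get install glusterfs-client

If upgrading the client is not an option, disabling the offending option on the servers (gluster volume set hqn-preprod performance.readdir-ahead off) may let an old client parse the volfile, but a 3.4 client talking to a much newer server is not a supported combination, so copying the data from a freshly installed host remains the safer route.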
[2019-02-07 12:06:31.622907] E [mount.c:162:fuse_mount_fusermount] 0-glusterfs-fuse: failed to exec fusermount: No such file or directory [2019-02-07 12:06:31.623356] E [mount.c:298:gf_fuse_mount] 0-glusterfs-fuse: mount of hqn-gluster-01:/hqn-preprod to /mnt/gluster-preprod (default_permissions,backup-volfile-servers=hqn-gluster-02:hqn-gluster-03,allow_other,max_read=131072) failed [2019-02-07 12:06:31.624010] E [glusterfsd.c:1744:daemonize] 0-daemonize: mount failed [2019-02-07 12:06:31.625060] W [xlator.c:185:xlator_dynload] 0-xlator: /usr/lib/x86_64-linux-gnu/glusterfs/3.4.2/xlator/performance/readdir-ahead.so: cannot open shared object file: No such file or directory [2019-02-07 12:06:31.625087] E [graph.y:212:volume_type] 0-parser: Volume 'hqn-preprod-readdir-ahead', line 68: type 'performance/readdir-ahead' is not valid or not found on this machine [2019-02-07 12:06:31.625111] E [graph.y:321:volume_end] 0-parser: "type" not specified for volume hqn-preprod-readdir-ahead [2019-02-07 12:06:31.625141] E [glusterfsd.c:1774:glusterfs_process_volfp] 0-: failed to construct the graph [2019-02-07 12:06:31.625314] W [glusterfsd.c:1002:cleanup_and_exit] (-->/usr/lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_handle_reply+0x90) [0x7f70cc02dc10] (-->/usr/sbin/glusterfs(mgmt_getspec_cbk+0x309) [0x7f70cc6ecbc9] (-->/usr/sbin/glusterfs(glusterfs_process_volfp+0x103) [0x7f70cc6e85e3]))) 0-: received signum (0), shutting down Looks like packages installed on my Ubuntu Server is a bit old. Any idea before i mount NFS / Gluster on another server and copy data ? Thanks, Nicolas -- Nicolas SCHREVEL From mauritslamers at gmail.com Thu Feb 7 12:31:55 2019 From: mauritslamers at gmail.com (Maurits Lamers) Date: Thu, 7 Feb 2019 13:31:55 +0100 Subject: [Gluster-users] glusterfs 4.1.7 + nfs-ganesha 2.7.1 freeze during write Message-ID: <1FBA8430-F957-40B3-8422-2E0D25265B68@gmail.com> Hi all, I am trying to find out more about why a nfs mount through nfs-ganesha of a glusterfs volume freezes. Little bit of a background: The system consists of one glusterfs volume across 5 nodes. Every node runs Ubuntu 16.04, gluster 4.1.7 and nfs-ganesha 2.7.1 The gluster volume is exported using the setup described on the first half of https://docs.gluster.org/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/ The node which freezes is running Nextcloud in a docker setup, where the entire application is stored on a path, which is a nfs-ganesha mount of the glusterfs volume. When I am running a synchronisation operation with this nextcloud instance, at some point the entire system freezes. The only solution is to completely restart the node, Just before this freeze the /var/log/ganesha/ganesha-gfapi.log file contains an error, which seems to result to timeouts after a short while. The node running the nextcloud instance is the only one freezing, the rest of the cluster seems to not be affected. 
2019-02-07 10:11:17.342132] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.345776] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.346079] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.396853] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.397650] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.398036] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.407839] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:24.812606] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(y???? Mz???SL4_@) failed [2019-02-07 10:11:24.819376] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(eTn?EU?H. References: <1FBA8430-F957-40B3-8422-2E0D25265B68@gmail.com> Message-ID: On 2/7/19 6:01 PM, Maurits Lamers wrote: > Hi all, > > I am trying to find out more about why a nfs mount through nfs-ganesha of a glusterfs volume freezes. > > Little bit of a background: > The system consists of one glusterfs volume across 5 nodes. 
Every node runs Ubuntu 16.04, gluster 4.1.7 and nfs-ganesha 2.7.1 > The gluster volume is exported using the setup described on the first half of https://docs.gluster.org/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/ > > The node which freezes is running Nextcloud in a docker setup, where the entire application is stored on a path, which is a nfs-ganesha mount of the glusterfs volume. > When I am running a synchronisation operation with this nextcloud instance, at some point the entire system freezes. The only solution is to completely restart the node, > Just before this freeze the /var/log/ganesha/ganesha-gfapi.log file contains an error, which seems to result to timeouts after a short while. > > The node running the nextcloud instance is the only one freezing, the rest of the cluster seems to not be affected. > > 2019-02-07 10:11:17.342132] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-07 10:11:17.345776] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-07 10:11:17.346079] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-07 10:11:17.396853] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-07 10:11:17.397650] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-07 10:11:17.398036] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] > [2019-02-07 10:11:17.407839] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] There is a patch [1] 
submitted and under review which fixes above error messages. > [2019-02-07 10:11:24.812606] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(y???? > Mz???SL4_@) failed > [2019-02-07 10:11:24.819376] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(eTn?EU?H. [2019-02-07 10:11:24.833299] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(g?L??F??0b??k) failed > [2019-02-07 10:25:01.642509] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-2: server [node1]:49152 has not responded in the last 42 seconds, disconnecting. > [2019-02-07 10:25:01.642805] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: server [node2]:49152 has not responded in the last 42 seconds, disconnecting. > [2019-02-07 10:25:01.642946] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-4: server [node3]:49152 has not responded in the last 42 seconds, disconnecting. > [2019-02-07 10:25:02.643120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-3: server 127.0.1.1:49152 has not responded in the last 42 seconds, disconnecting. > [2019-02-07 10:25:02.643314] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-0: server [node4]:49152 has not responded in the last 42 seconds, disconnecting. > Strange that synctask failed. Could you please turn off features.cache-invalidation volume option and check if the issue still persists. Also next time the system freezes, please check following: 1) if all the brick servers of that volume are up and running. Sometimes if the brick servers are not reachable, it takes a while for outstanding requests to timeout and get application (nfs-ganesha) back to normal state. Please wait for a while and check if the mount becomes accessible. 2) check if nfs-ganesha server is responding to other requests - #showmount -e localhost (on the node server is running) # try mount and I/Os from any other client. 3) and if the ganesha server isnt responding to any client, please try collecting 2-3 cores/stack traces of nfs-ganesha server with brief interval (say 5min) in between using below commands . This is to check if any threads are stuck in deadlock. to collect core: #gcore to collect stack trace: #gstack Thanks, Soumya [1] https://review.gluster.org/#/c/glusterfs/+/22126/ > > The log in /var/log/glusterfs/glusterd.log contains nothing after the default messages at startup. > > The log in /var/log/ganesha/ganesha.log starts warning about its health after the messages above: > > 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT :------------------------------------------------- > 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED > 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT :------------------------------------------------- > 07/02/2019 09:43:28 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE > 07/02/2019 11:26:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. 
enq new: 170321, old: 170320; deq new: 170314, old: 170314 > 07/02/2019 11:28:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 170325, old: 170324; deq new: 170317, old: 170317 > 07/02/2019 11:29:21 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 170327, old: 170326; deq new: 170318, old: 170318 > > Any hints on how to solve this would be greatly appreciated. > > cheers > > Maurits > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > From mauritslamers at gmail.com Thu Feb 7 19:50:32 2019 From: mauritslamers at gmail.com (Maurits Lamers) Date: Thu, 7 Feb 2019 20:50:32 +0100 Subject: [Gluster-users] glusterfs 4.1.7 + nfs-ganesha 2.7.1 freeze during write In-Reply-To: References: <1FBA8430-F957-40B3-8422-2E0D25265B68@gmail.com> Message-ID: Hi, > Op 7 feb. 2019, om 18:51 heeft Soumya Koduri het volgende geschreven: > > On 2/7/19 6:01 PM, Maurits Lamers wrote: >> Hi all, >> I am trying to find out more about why a nfs mount through nfs-ganesha of a glusterfs volume freezes. >> Little bit of a background: >> The system consists of one glusterfs volume across 5 nodes. Every node runs Ubuntu 16.04, gluster 4.1.7 and nfs-ganesha 2.7.1 >> The gluster volume is exported using the setup described on the first half of https://docs.gluster.org/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/ >> The node which freezes is running Nextcloud in a docker setup, where the entire application is stored on a path, which is a nfs-ganesha mount of the glusterfs volume. >> When I am running a synchronisation operation with this nextcloud instance, at some point the entire system freezes. The only solution is to completely restart the node, >> Just before this freeze the /var/log/ganesha/ganesha-gfapi.log file contains an error, which seems to result to timeouts after a short while. >> The node running the nextcloud instance is the only one freezing, the rest of the cluster seems to not be affected. 
>> 2019-02-07 10:11:17.342132] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-07 10:11:17.345776] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-07 10:11:17.346079] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-07 10:11:17.396853] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-07 10:11:17.397650] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-07 10:11:17.398036] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] >> [2019-02-07 10:11:17.407839] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] > > There is a patch [1] submitted and under review which fixes above error messages. > >> [2019-02-07 10:11:24.812606] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(y???? >> Mz???SL4_@) failed >> [2019-02-07 10:11:24.819376] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(eTn?EU?H.> [2019-02-07 10:11:24.833299] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(g?L??F??0b??k) failed >> [2019-02-07 10:25:01.642509] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-2: server [node1]:49152 has not responded in the last 42 seconds, disconnecting. 
>> [2019-02-07 10:25:01.642805] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: server [node2]:49152 has not responded in the last 42 seconds, disconnecting. >> [2019-02-07 10:25:01.642946] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-4: server [node3]:49152 has not responded in the last 42 seconds, disconnecting. >> [2019-02-07 10:25:02.643120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-3: server 127.0.1.1:49152 has not responded in the last 42 seconds, disconnecting. >> [2019-02-07 10:25:02.643314] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-0: server [node4]:49152 has not responded in the last 42 seconds, disconnecting. > > Strange that synctask failed. Could you please turn off features.cache-invalidation volume option and check if the issue still persists. Will try to do this. > > Also next time the system freezes, please check following: > > 1) if all the brick servers of that volume are up and running. Sometimes if the brick servers are not reachable, it takes a while for outstanding requests to timeout and get application (nfs-ganesha) back to normal state. Please wait for a while and check if the mount becomes accessible. All brick servers are up and running. I have had this freeze already a few times, and the rest of the system continues to work, both nfs-ganesha mounts on different nodes as well as glusterfs mounts on those nodes. > > 2) check if nfs-ganesha server is responding to other requests - > #showmount -e localhost (on the node server is running) > # try mount and I/Os from any other client. I don't think it is, but it seems to be that the NFS system itself is simply waiting on the volume to become available again, which it never does. I will try to check though. > > 3) and if the ganesha server isnt responding to any client, please try collecting 2-3 cores/stack traces of nfs-ganesha server with brief interval (say 5min) in between using below commands . This is to check if any threads are stuck in deadlock. > > to collect core: > #gcore > > to collect stack trace: > #gstack > > Thanks, > Soumya And also thanks! cheers Maurits > > [1] https://review.gluster.org/#/c/glusterfs/+/22126/ >> The log in /var/log/glusterfs/glusterd.log contains nothing after the default messages at startup. >> The log in /var/log/ganesha/ganesha.log starts warning about its health after the messages above: >> 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT :------------------------------------------------- >> 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED >> 07/02/2019 09:42:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[main] nfs_start :NFS STARTUP :EVENT :------------------------------------------------- >> 07/02/2019 09:43:28 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE >> 07/02/2019 11:26:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 170321, old: 170320; deq new: 170314, old: 170314 >> 07/02/2019 11:28:17 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. enq new: 170325, old: 170324; deq new: 170317, old: 170317 >> 07/02/2019 11:29:21 : epoch 5c5bef56 : [myhost] : ganesha.nfsd-1384[dbus_heartbeat] nfs_health :DBUS :WARN :Health status is unhealthy. 
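Rolled into concrete commands, the checks suggested above might look like the sketch below; the volume name gv0 is taken from the ping-timer messages, and locating the ganesha.nfsd pid via pgrep is an assumption about the process name:

# gluster volume status gv0
    (are all bricks and their ports up? - check 1)
# gluster volume set gv0 features.cache-invalidation off
    (the upcall-related option suggested for switching off)
# showmount -e localhost
    (is ganesha still answering the MNT protocol? - check 2)
# gcore -o /tmp/ganesha-core $(pgrep -x ganesha.nfsd)
# gstack $(pgrep -x ganesha.nfsd) > /tmp/ganesha-stack.$(date +%s).txt
    (check 3; repeating the gstack capture a few minutes apart shows whether the same threads stay stuck)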
enq new: 170327, old: 170326; deq new: 170318, old: 170318 >> Any hints on how to solve this would be greatly appreciated. >> cheers >> Maurits >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From amye at redhat.com Thu Feb 7 20:53:03 2019 From: amye at redhat.com (Amye Scavarda) Date: Thu, 7 Feb 2019 12:53:03 -0800 Subject: [Gluster-users] KubeCon Shanghai CFP open through 2019-02-22 Message-ID: Deadline for Proposals is February 22, 2019 at 11:59PM PT In 2019, Open Source Summit and KubeCon + CloudNativeCon are combining together into a single event in China. This is a great opportunity to showcase our work, please let me know if you're interested in submitting to either KubeCon or Open Source Summit. https://www.lfasiallc.com/events/kubecon-cloudnativecon-china-2019/oss-cfp/ https://www.lfasiallc.com/events/kubecon-cloudnativecon-china-2019/kccnc-cfp/ -- Amye Scavarda | amye at redhat.com | Gluster Community Lead From archon810 at gmail.com Thu Feb 7 21:28:08 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Thu, 7 Feb 2019 13:28:08 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: I've added the lru-limit=0 parameter to the mounts, and I see it's taken effect correctly: "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse --volfile-server=localhost --volfile-id=/ /mnt/" Let's see if it stops crashing or not. Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii wrote: > Hi Nithya, > > Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing > crashes, and no further releases have been made yet. 
> > volume info: > Type: Replicate > Volume ID: ****SNIP**** > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 4 = 4 > Transport-type: tcp > Bricks: > Brick1: ****SNIP**** > Brick2: ****SNIP**** > Brick3: ****SNIP**** > Brick4: ****SNIP**** > Options Reconfigured: > cluster.quorum-count: 1 > cluster.quorum-type: fixed > network.ping-timeout: 5 > network.remote-dio: enable > performance.rda-cache-limit: 256MB > performance.readdir-ahead: on > performance.parallel-readdir: on > network.inode-lru-limit: 500000 > performance.md-cache-timeout: 600 > performance.cache-invalidation: on > performance.stat-prefetch: on > features.cache-invalidation-timeout: 600 > features.cache-invalidation: on > cluster.readdir-optimize: on > performance.io-thread-count: 32 > server.event-threads: 4 > client.event-threads: 4 > performance.read-ahead: off > cluster.lookup-optimize: on > performance.cache-size: 1GB > cluster.self-heal-daemon: enable > transport.address-family: inet > nfs.disable: on > performance.client-io-threads: on > cluster.granular-entry-heal: enable > cluster.data-self-heal-algorithm: full > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran > wrote: > >> Hi Artem, >> >> Do you still see the crashes with 5.3? If yes, please try mount the >> volume using the mount option lru-limit=0 and see if that helps. We are >> looking into the crashes and will update when have a fix. >> >> Also, please provide the gluster volume info for the volume in question. >> >> >> regards, >> Nithya >> >> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >> wrote: >> >>> The fuse crash happened two more times, but this time monit helped >>> recover within 1 minute, so it's a great workaround for now. >>> >>> What's odd is that the crashes are only happening on one of 4 servers, >>> and I don't know why. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii >>> wrote: >>> >>>> The fuse crash happened again yesterday, to another volume. Are there >>>> any mount options that could help mitigate this? >>>> >>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task to >>>> watch and restart the mount, which works and recovers the mount point >>>> within a minute. Not ideal, but a temporary workaround. >>>> >>>> By the way, the way to reproduce this "Transport endpoint is not >>>> connected" condition for testing purposes is to kill -9 the right >>>> "glusterfs --process-name fuse" process. 
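For rehearsing that failure and recovery by hand, a sketch along the same lines (the mount point matches the monit check below; picking out the client with pgrep is an assumption about how the mount was started):

# pgrep -af 'process-name fuse'
    (lists each fuse client together with the mount point in its command line)
# kill -9 <pid-of-the-client-for-that-mount>
    (after this the mount returns "Transport endpoint is not connected")
# umount /mnt/glusterfs_data1
    (or umount -l if processes still hold files open under it)
# mount /mnt/glusterfs_data1
    (remounts via the existing fstab entry, the same thing monit's start program runs)

This is the manual equivalent of what the monit check below automates.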
>>>> >>>> >>>> monit check: >>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>> if space usage > 90% for 5 times within 15 cycles >>>> then alert else if succeeded for 10 cycles then alert >>>> >>>> >>>> stack trace: >>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fa0249e4329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7fa0249e4329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>> The message "E [MSGID: 101191] >>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>> [2019-02-01 23:21:56.164427] >>>> The message "I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>> pending frames: >>>> frame : type(1) op(LOOKUP) >>>> frame : type(0) op(0) >>>> patchset: git://git.gluster.org/glusterfs.git >>>> signal received: 6 >>>> time of crash: >>>> 2019-02-01 23:22:03 >>>> configuration details: >>>> argp 1 >>>> backtrace 1 >>>> dlfcn 1 >>>> libpthread 1 >>>> llistxattr 1 >>>> setfsid 1 >>>> spinlock 1 >>>> epoll.h 1 >>>> xattr.h 1 >>>> st_atim.tv_nsec 1 >>>> package-string: glusterfs 5.3 >>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>> >>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii >>>> wrote: >>>> >>>>> Hi, >>>>> >>>>> The first (and so far only) crash happened at 2am the next day after >>>>> we upgraded, on only one of four servers and only to one of two mounts. 
>>>>> >>>>> I have no idea what caused it, but yeah, we do have a pretty busy site >>>>> (apkmirror.com), and it caused a disruption for any uploads or >>>>> downloads from that server until I woke up and fixed the mount. >>>>> >>>>> I wish I could be more helpful but all I have is that stack trace. >>>>> >>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>> >>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>> atumball at redhat.com> wrote: >>>>> >>>>>> Hi Artem, >>>>>> >>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a >>>>>> clone of other bugs where recent discussions happened), and marked it as a >>>>>> blocker for glusterfs-5.4 release. >>>>>> >>>>>> We already have fixes for log flooding - >>>>>> https://review.gluster.org/22128, and are the process of identifying >>>>>> and fixing the issue seen with crash. >>>>>> >>>>>> Can you please tell if the crashes happened as soon as upgrade ? or >>>>>> was there any particular pattern you observed before the crash. >>>>>> >>>>>> -Amar >>>>>> >>>>>> >>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I already >>>>>>> got a crash which others have mentioned in >>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>> unmount, kill gluster, and remount: >>>>>>> >>>>>>> >>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fcccafcd329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fcccafcd329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fcccafcd329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fcccafcd329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>> [2019-01-31 09:38:04.696993] >>>>>>> pending frames: >>>>>>> frame : type(1) op(READ) >>>>>>> frame : type(1) op(OPEN) 
>>>>>>> frame : type(0) op(0) >>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>> signal received: 6 >>>>>>> time of crash: >>>>>>> 2019-01-31 09:38:04 >>>>>>> configuration details: >>>>>>> argp 1 >>>>>>> backtrace 1 >>>>>>> dlfcn 1 >>>>>>> libpthread 1 >>>>>>> llistxattr 1 >>>>>>> setfsid 1 >>>>>>> spinlock 1 >>>>>>> epoll.h 1 >>>>>>> xattr.h 1 >>>>>>> st_atim.tv_nsec 1 >>>>>>> package-string: glusterfs 5.3 >>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>> --------- >>>>>>> >>>>>>> Do the pending patches fix the crash or only the repeated warnings? >>>>>>> I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>> not too sure how to make it core dump. >>>>>>> >>>>>>> If it's not fixed by the patches above, has anyone already opened a >>>>>>> ticket for the crashes that I can join and monitor? This is going to create >>>>>>> a massive problem for us since production systems are crashing. >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>> rgowdapp at redhat.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Also, not sure if related or not, but I got a ton of these "Failed >>>>>>>>> to dispatch handler" in my logs as well. Many people have been commenting >>>>>>>>> about this issue here >>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>> >>>>>>>> >>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>>>>>>> >>>>>>>> >>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>> handler >>>>>>>>> >>>>>>>>> >>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> I found a similar issue here: >>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>> spam. 
>>>>>>>>>> >>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> >>>>>>>>> >>>>>>>> +Milind Changire Can you check why this >>>>>>>> message is logged and send a fix? >>>>>>>> >>>>>>>> >>>>>>>>>> Is there any fix for this issue? >>>>>>>>>> >>>>>>>>>> Thanks. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-users mailing list >>>>>>>>> Gluster-users at gluster.org >>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Amar Tumballi (amarts) >>>>>> >>>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From mauritslamers at gmail.com Thu Feb 7 21:50:02 2019 From: mauritslamers at gmail.com (Maurits Lamers) Date: Thu, 7 Feb 2019 22:50:02 +0100 Subject: [Gluster-users] glusterfs 4.1.7 + nfs-ganesha 2.7.1 freeze during write In-Reply-To: References: <1FBA8430-F957-40B3-8422-2E0D25265B68@gmail.com> Message-ID: <22FDC703-87F4-472D-8229-9B26F440FAB1@gmail.com> Hi, > >> [2019-02-07 10:11:24.812606] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(y???? >> Mz???SL4_@) failed >> [2019-02-07 10:11:24.819376] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(eTn?EU?H.> [2019-02-07 10:11:24.833299] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(g?L??F??0b??k) failed >> [2019-02-07 10:25:01.642509] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-2: server [node1]:49152 has not responded in the last 42 seconds, disconnecting. >> [2019-02-07 10:25:01.642805] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: server [node2]:49152 has not responded in the last 42 seconds, disconnecting. >> [2019-02-07 10:25:01.642946] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-4: server [node3]:49152 has not responded in the last 42 seconds, disconnecting. >> [2019-02-07 10:25:02.643120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-3: server 127.0.1.1:49152 has not responded in the last 42 seconds, disconnecting. >> [2019-02-07 10:25:02.643314] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-0: server [node4]:49152 has not responded in the last 42 seconds, disconnecting. > > Strange that synctask failed. Could you please turn off features.cache-invalidation volume option and check if the issue still persists. 
>> Turning the cache invalidation option off seems to have solved the freeze. Still testing, but it looks promising. cheers Maurits -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdeepugd at gmail.com Thu Feb 7 11:01:24 2019 From: sdeepugd at gmail.com (deepu srinivasan) Date: Thu, 7 Feb 2019 16:31:24 +0530 Subject: [Gluster-users] Web Ui for gluster Message-ID: Dear Users Is there a default Web ui provided by gluster for monitoring the nodes and configuring them -------------- next part -------------- An HTML attachment was scrubbed... URL: From sdeepugd at gmail.com Thu Feb 7 11:03:24 2019 From: sdeepugd at gmail.com (deepu srinivasan) Date: Thu, 7 Feb 2019 16:33:24 +0530 Subject: [Gluster-users] Inter switching master slave in Gluster Geo Replication Message-ID: Dear Users, Is there any possibility that the slave node in geo replication could act as master and master as slave ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mauritslamers at gmail.com Thu Feb 7 11:19:54 2019 From: mauritslamers at gmail.com (Maurits Lamers) Date: Thu, 7 Feb 2019 12:19:54 +0100 Subject: [Gluster-users] glusterfs 4.1.7 + nfs-ganesha 2.7.1 freeze during write Message-ID: <5B86B58E-2EBC-4938-B06A-857F93351DBA@gmail.com> Hi all, I am trying to find out more about why a nfs mount through nfs-ganesha of a glusterfs volume freezes. Little bit of a background: The system consists of one glusterfs volume across 5 nodes. Every node runs Ubuntu 16.04, gluster 4.1.7 and nfs-ganesha 2.7.1 The gluster volume is exported using the setup described on the first half of https://docs.gluster.org/en/latest/Administrator%20Guide/NFS-Ganesha%20GlusterFS%20Integration/ The node which freezes is running Nextcloud in a docker setup, where the entire application is stored on a path, which is a nfs-ganesha mount of the glusterfs volume. When I am running a synchronisation operation with this nextcloud instance, at some point the entire system freezes. The only solution is to completely restart the node, Just before this freeze the /var/log/ganesha/ganesha-gfapi.log file contains an error, which seems to result to timeouts after a short while. The node running the nextcloud instance is the only one freezing, the rest of the cluster seems to not be affected. 
2019-02-07 10:11:17.342132] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.345776] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.346079] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.396853] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.397650] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.398036] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:17.407839] W [dict.c:671:dict_ref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/quick-read.so(+0x59b4) [0x7f2f035139b4] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.7/xlator/performance/io-cache.so(+0xa2cd) [0x7f2f037242cd] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_ref+0x50) [0x7f2f0f312370] ) 0-dict: dict is NULL [Invalid argument] [2019-02-07 10:11:24.812606] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(y???? Mz???SL4_@) failed [2019-02-07 10:11:24.819376] E [MSGID: 104055] [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall event_type(1) and gfid(eTn?EU?H./usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Sorry to disappoint, but the crash just happened again, so lru-limit=0 didn't help. Here's the snippet of the crash and the subsequent remount by monit. 
[2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7f4402b99329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7f440b6b5218] ) 0-dict: dict is NULL [In valid argument] The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: selecting local read_child _data1-client-3" repeated 39 times between [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 515 times between [2019-02-08 01:11:17.932515] and [2019-02-08 01:13:09.311554] pending frames: frame : type(1) op(LOOKUP) frame : type(0) op(0) patchset: git://git.gluster.org/glusterfs.git signal received: 6 time of crash: 2019-02-08 01:13:09 configuration details: argp 1 backtrace 1 dlfcn 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 5.3 /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] /lib64/libc.so.6(+0x36160)[0x7f440a887160] /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] --------- [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) [2019-02-08 01:13:35.637830] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-02-08 01:13:35.651405] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2 [2019-02-08 01:13:35.651628] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3 [2019-02-08 01:13:35.651747] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4 [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] 0-_data1-client-0: parent translators are ready, attempting connect on transport [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] 0-_data1-client-1: parent translators are ready, attempting connect on transport [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] 0-_data1-client-2: parent translators are ready, attempting connect on transport [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] 0-_data1-client-3: parent translators are ready, attempting connect on transport [2019-02-08 
01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] 0-_data1-client-0: changing port to 49153 (from 0) Final graph: Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii wrote: > I've added the lru-limit=0 parameter to the mounts, and I see it's taken > effect correctly: > "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse > --volfile-server=localhost --volfile-id=/ /mnt/" > > Let's see if it stops crashing or not. > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii > wrote: > >> Hi Nithya, >> >> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >> crashes, and no further releases have been made yet. >> >> volume info: >> Type: Replicate >> Volume ID: ****SNIP**** >> Status: Started >> Snapshot Count: 0 >> Number of Bricks: 1 x 4 = 4 >> Transport-type: tcp >> Bricks: >> Brick1: ****SNIP**** >> Brick2: ****SNIP**** >> Brick3: ****SNIP**** >> Brick4: ****SNIP**** >> Options Reconfigured: >> cluster.quorum-count: 1 >> cluster.quorum-type: fixed >> network.ping-timeout: 5 >> network.remote-dio: enable >> performance.rda-cache-limit: 256MB >> performance.readdir-ahead: on >> performance.parallel-readdir: on >> network.inode-lru-limit: 500000 >> performance.md-cache-timeout: 600 >> performance.cache-invalidation: on >> performance.stat-prefetch: on >> features.cache-invalidation-timeout: 600 >> features.cache-invalidation: on >> cluster.readdir-optimize: on >> performance.io-thread-count: 32 >> server.event-threads: 4 >> client.event-threads: 4 >> performance.read-ahead: off >> cluster.lookup-optimize: on >> performance.cache-size: 1GB >> cluster.self-heal-daemon: enable >> transport.address-family: inet >> nfs.disable: on >> performance.client-io-threads: on >> cluster.granular-entry-heal: enable >> cluster.data-self-heal-algorithm: full >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran >> wrote: >> >>> Hi Artem, >>> >>> Do you still see the crashes with 5.3? If yes, please try mount the >>> volume using the mount option lru-limit=0 and see if that helps. We are >>> looking into the crashes and will update when have a fix. >>> >>> Also, please provide the gluster volume info for the volume in question. >>> >>> >>> regards, >>> Nithya >>> >>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>> wrote: >>> >>>> The fuse crash happened two more times, but this time monit helped >>>> recover within 1 minute, so it's a great workaround for now. >>>> >>>> What's odd is that the crashes are only happening on one of 4 servers, >>>> and I don't know why. >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii >>>> wrote: >>>> >>>>> The fuse crash happened again yesterday, to another volume. Are there >>>>> any mount options that could help mitigate this? >>>>> >>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task to >>>>> watch and restart the mount, which works and recovers the mount point >>>>> within a minute. Not ideal, but a temporary workaround. 
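For reference, the lru-limit=0 setting discussed above is passed as an ordinary mount option, so the /bin/mount and /bin/umount commands in the monit check quoted below keep working as long as the corresponding fstab entry carries the option. A sketch of what such an entry might look like; the volume and mount-point names are placeholders, since the real ones in this thread are redacted:

# Placeholder names; lru-limit=0 disables the fuse inode LRU limit, as
# suggested earlier in this thread. _netdev is the usual flag for network
# filesystems and is an assumption here, not something taken from the thread.
localhost:/SITE_data1  /mnt/SITE_data1  glusterfs  defaults,_netdev,lru-limit=0  0 0

# Remount to pick up the new option (roughly what the monit task also does):
umount /mnt/SITE_data1 && mount /mnt/SITE_data1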
>>>>> >>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>> connected" condition for testing purposes is to kill -9 the right >>>>> "glusterfs --process-name fuse" process. >>>>> >>>>> >>>>> monit check: >>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>> if space usage > 90% for 5 times within 15 cycles >>>>> then alert else if succeeded for 10 cycles then alert >>>>> >>>>> >>>>> stack trace: >>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fa0249e4329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7fa0249e4329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>> The message "E [MSGID: 101191] >>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>> [2019-02-01 23:21:56.164427] >>>>> The message "I [MSGID: 108031] >>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>> pending frames: >>>>> frame : type(1) op(LOOKUP) >>>>> frame : type(0) op(0) >>>>> patchset: git://git.gluster.org/glusterfs.git >>>>> signal received: 6 >>>>> time of crash: >>>>> 2019-02-01 23:22:03 >>>>> configuration details: >>>>> argp 1 >>>>> backtrace 1 >>>>> dlfcn 1 >>>>> libpthread 1 >>>>> llistxattr 1 >>>>> setfsid 1 >>>>> spinlock 1 >>>>> epoll.h 1 >>>>> xattr.h 1 >>>>> st_atim.tv_nsec 1 >>>>> package-string: glusterfs 5.3 >>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>> >>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii 
>>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii >>>>> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> The first (and so far only) crash happened at 2am the next day after >>>>>> we upgraded, on only one of four servers and only to one of two mounts. >>>>>> >>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>> site (apkmirror.com), and it caused a disruption for any uploads or >>>>>> downloads from that server until I woke up and fixed the mount. >>>>>> >>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>> >>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>> >>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>> atumball at redhat.com> wrote: >>>>>> >>>>>>> Hi Artem, >>>>>>> >>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as >>>>>>> a clone of other bugs where recent discussions happened), and marked it as >>>>>>> a blocker for glusterfs-5.4 release. >>>>>>> >>>>>>> We already have fixes for log flooding - >>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>> identifying and fixing the issue seen with crash. >>>>>>> >>>>>>> Can you please tell if the crashes happened as soon as upgrade ? or >>>>>>> was there any particular pattern you observed before the crash. >>>>>>> >>>>>>> -Amar >>>>>>> >>>>>>> >>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>> already got a crash which others have mentioned in >>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>> unmount, kill gluster, and remount: >>>>>>>> >>>>>>>> >>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fcccafcd329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fcccafcd329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fcccafcd329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fcccafcd329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>> The message "I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>> [2019-01-31 09:37:54.751905] and 
[2019-01-31 09:38:03.958061] >>>>>>>> The message "E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>> pending frames: >>>>>>>> frame : type(1) op(READ) >>>>>>>> frame : type(1) op(OPEN) >>>>>>>> frame : type(0) op(0) >>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>> signal received: 6 >>>>>>>> time of crash: >>>>>>>> 2019-01-31 09:38:04 >>>>>>>> configuration details: >>>>>>>> argp 1 >>>>>>>> backtrace 1 >>>>>>>> dlfcn 1 >>>>>>>> libpthread 1 >>>>>>>> llistxattr 1 >>>>>>>> setfsid 1 >>>>>>>> spinlock 1 >>>>>>>> epoll.h 1 >>>>>>>> xattr.h 1 >>>>>>>> st_atim.tv_nsec 1 >>>>>>>> package-string: glusterfs 5.3 >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>> --------- >>>>>>>> >>>>>>>> Do the pending patches fix the crash or only the repeated warnings? >>>>>>>> I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>> not too sure how to make it core dump. >>>>>>>> >>>>>>>> If it's not fixed by the patches above, has anyone already opened a >>>>>>>> ticket for the crashes that I can join and monitor? This is going to create >>>>>>>> a massive problem for us since production systems are crashing. >>>>>>>> >>>>>>>> Thanks. >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>> commenting about this issue here >>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>> >>>>>>>>> >>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>>>>>>>> >>>>>>>>> >>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>> handler >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>>>> >>>>>>>>>> Thanks. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> I found a similar issue here: >>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>> spam. 
>>>>>>>>>>> >>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> +Milind Changire Can you check why this >>>>>>>>> message is logged and send a fix? >>>>>>>>> >>>>>>>>> >>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Gluster-users mailing list >>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>> Gluster-users mailing list >>>>>>>> Gluster-users at gluster.org >>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Amar Tumballi (amarts) >>>>>>> >>>>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Fri Feb 8 03:05:12 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Fri, 8 Feb 2019 08:35:12 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Thanks Artem. Can you send us the coredump or the bt with symbols from the crash? Regards, Nithya On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii wrote: > Sorry to disappoint, but the crash just happened again, so lru-limit=0 > didn't help. > > Here's the snippet of the crash and the subsequent remount by monit. 
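Regarding the coredump / backtrace request above, a minimal sketch of one way to capture them on the openSUSE 15.0 systems mentioned earlier in the thread; it assumes systemd-coredump is in use and that a glusterfs debuginfo package is installed, neither of which is confirmed here. The quoted crash log continues below.

# Allow core dumps for processes started from this shell
ulimit -c unlimited

# After the next crash, list recent dumps for the fuse client and open the
# most recent one in gdb to get a symbolized backtrace
coredumpctl list /usr/sbin/glusterfs
coredumpctl gdb /usr/sbin/glusterfs
# inside gdb: thread apply all bt full

# Without systemd-coredump, an alternative is a plain core_pattern plus gdb:
#   echo '/var/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
#   gdb /usr/sbin/glusterfs /var/tmp/core.glusterfs.<pid>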
> > > [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7f4402b99329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7f440b6b5218] ) 0-dict: dict is NULL [In > valid argument] > The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] > 0-_data1-replicate-0: selecting local read_child > _data1-client-3" repeated 39 times between [2019-02-08 > 01:11:18.043286] and [2019-02-08 01:13:07.915604] > The message "E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler" repeated 515 times between [2019-02-08 01:11:17.932515] and > [2019-02-08 01:13:09.311554] > pending frames: > frame : type(1) op(LOOKUP) > frame : type(0) op(0) > patchset: git://git.gluster.org/glusterfs.git > signal received: 6 > time of crash: > 2019-02-08 01:13:09 > configuration details: > argp 1 > backtrace 1 > dlfcn 1 > libpthread 1 > llistxattr 1 > setfsid 1 > spinlock 1 > epoll.h 1 > xattr.h 1 > st_atim.tv_nsec 1 > package-string: glusterfs 5.3 > /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] > /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] > /lib64/libc.so.6(+0x36160)[0x7f440a887160] > /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] > /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] > /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] > /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] > /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] > > /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] > /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] > /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] > /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] > /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] > /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] > /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] > --------- > [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] > 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 > (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse > --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) > [2019-02-08 01:13:35.637830] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-02-08 01:13:35.651405] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 2 > [2019-02-08 01:13:35.651628] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 3 > [2019-02-08 01:13:35.651747] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 4 > [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-0: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-1: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-2: parent translators are ready, attempting connect > on transport 
> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-3: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] > 0-_data1-client-0: changing port to 49153 (from 0) > Final graph: > > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii > wrote: > >> I've added the lru-limit=0 parameter to the mounts, and I see it's taken >> effect correctly: >> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >> --volfile-server=localhost --volfile-id=/ /mnt/" >> >> Let's see if it stops crashing or not. >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii >> wrote: >> >>> Hi Nithya, >>> >>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>> crashes, and no further releases have been made yet. >>> >>> volume info: >>> Type: Replicate >>> Volume ID: ****SNIP**** >>> Status: Started >>> Snapshot Count: 0 >>> Number of Bricks: 1 x 4 = 4 >>> Transport-type: tcp >>> Bricks: >>> Brick1: ****SNIP**** >>> Brick2: ****SNIP**** >>> Brick3: ****SNIP**** >>> Brick4: ****SNIP**** >>> Options Reconfigured: >>> cluster.quorum-count: 1 >>> cluster.quorum-type: fixed >>> network.ping-timeout: 5 >>> network.remote-dio: enable >>> performance.rda-cache-limit: 256MB >>> performance.readdir-ahead: on >>> performance.parallel-readdir: on >>> network.inode-lru-limit: 500000 >>> performance.md-cache-timeout: 600 >>> performance.cache-invalidation: on >>> performance.stat-prefetch: on >>> features.cache-invalidation-timeout: 600 >>> features.cache-invalidation: on >>> cluster.readdir-optimize: on >>> performance.io-thread-count: 32 >>> server.event-threads: 4 >>> client.event-threads: 4 >>> performance.read-ahead: off >>> cluster.lookup-optimize: on >>> performance.cache-size: 1GB >>> cluster.self-heal-daemon: enable >>> transport.address-family: inet >>> nfs.disable: on >>> performance.client-io-threads: on >>> cluster.granular-entry-heal: enable >>> cluster.data-self-heal-algorithm: full >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran >>> wrote: >>> >>>> Hi Artem, >>>> >>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>> looking into the crashes and will update when have a fix. >>>> >>>> Also, please provide the gluster volume info for the volume in question. >>>> >>>> >>>> regards, >>>> Nithya >>>> >>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>> wrote: >>>> >>>>> The fuse crash happened two more times, but this time monit helped >>>>> recover within 1 minute, so it's a great workaround for now. >>>>> >>>>> What's odd is that the crashes are only happening on one of 4 servers, >>>>> and I don't know why. 
>>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> The fuse crash happened again yesterday, to another volume. Are there >>>>>> any mount options that could help mitigate this? >>>>>> >>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>> >>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>> "glusterfs --process-name fuse" process. >>>>>> >>>>>> >>>>>> monit check: >>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>> then alert else if succeeded for 10 cycles then alert >>>>>> >>>>>> >>>>>> stack trace: >>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fa0249e4329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fa0249e4329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>> [2019-02-01 23:21:56.164427] >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>> pending frames: >>>>>> frame : type(1) op(LOOKUP) >>>>>> frame : type(0) op(0) >>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>> signal received: 6 >>>>>> time of crash: >>>>>> 2019-02-01 23:22:03 >>>>>> configuration details: >>>>>> argp 1 >>>>>> backtrace 1 >>>>>> dlfcn 1 >>>>>> libpthread 1 >>>>>> llistxattr 1 >>>>>> setfsid 1 >>>>>> spinlock 1 >>>>>> epoll.h 1 >>>>>> xattr.h 1 >>>>>> st_atim.tv_nsec 1 >>>>>> package-string: glusterfs 5.3 >>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>> >>>>>> 
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> The first (and so far only) crash happened at 2am the next day after >>>>>>> we upgraded, on only one of four servers and only to one of two mounts. >>>>>>> >>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>> site (apkmirror.com), and it caused a disruption for any uploads or >>>>>>> downloads from that server until I woke up and fixed the mount. >>>>>>> >>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>> >>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>> >>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>> atumball at redhat.com> wrote: >>>>>>> >>>>>>>> Hi Artem, >>>>>>>> >>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as >>>>>>>> a clone of other bugs where recent discussions happened), and marked it as >>>>>>>> a blocker for glusterfs-5.4 release. >>>>>>>> >>>>>>>> We already have fixes for log flooding - >>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>> >>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? or >>>>>>>> was there any particular pattern you observed before the crash. 
>>>>>>>> >>>>>>>> -Amar >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>> already got a crash which others have mentioned in >>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>> >>>>>>>>> >>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>> pending frames: >>>>>>>>> frame : type(1) op(READ) >>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>> frame : type(0) op(0) >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>> signal received: 6 >>>>>>>>> time of crash: >>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>> configuration details: >>>>>>>>> argp 1 >>>>>>>>> backtrace 1 >>>>>>>>> dlfcn 1 >>>>>>>>> libpthread 1 >>>>>>>>> llistxattr 1 >>>>>>>>> setfsid 1 >>>>>>>>> spinlock 1 >>>>>>>>> epoll.h 1 >>>>>>>>> xattr.h 1 >>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>> 
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>> --------- >>>>>>>>> >>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>> not too sure how to make it core dump. >>>>>>>>> >>>>>>>>> If it's not fixed by the patches above, has anyone already opened >>>>>>>>> a ticket for the crashes that I can join and monitor? This is going to >>>>>>>>> create a massive problem for us since production systems are crashing. >>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>> commenting about this issue here >>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>> spam. 
>>>>>>>>>>>> >>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>> message is logged and send a fix? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-users mailing list >>>>>>>>> Gluster-users at gluster.org >>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Amar Tumballi (amarts) >>>>>>>> >>>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Fri Feb 8 03:18:55 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Fri, 8 Feb 2019 08:48:55 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: One possible reason could be https://review.gluster.org/r/18b6d7ce7d490e807815270918a17a4b392a829d as that changed some code in epoll handler. Though the change is largely on server side, the epoll and socket changes are relevant for client too. I'll try to see whether there is anything wrong with that. On Fri, Feb 8, 2019 at 8:36 AM Nithya Balachandran wrote: > Thanks Artem. Can you send us the coredump or the bt with symbols from the > crash? > > Regards, > Nithya > > On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii > wrote: > >> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >> didn't help. >> >> Here's the snippet of the crash and the subsequent remount by monit. 
>> >> >> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7f4402b99329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >> valid argument] >> The message "I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >> selecting local read_child _data1-client-3" repeated 39 times between >> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >> [2019-02-08 01:13:09.311554] >> pending frames: >> frame : type(1) op(LOOKUP) >> frame : type(0) op(0) >> patchset: git://git.gluster.org/glusterfs.git >> signal received: 6 >> time of crash: >> 2019-02-08 01:13:09 >> configuration details: >> argp 1 >> backtrace 1 >> dlfcn 1 >> libpthread 1 >> llistxattr 1 >> setfsid 1 >> spinlock 1 >> epoll.h 1 >> xattr.h 1 >> st_atim.tv_nsec 1 >> package-string: glusterfs 5.3 >> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >> >> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >> --------- >> [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] >> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 >> (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse >> --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) >> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 1 >> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 2 >> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 3 >> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 4 >> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >> 0-_data1-client-0: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >> 0-_data1-client-1: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >> 
0-_data1-client-2: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >> 0-_data1-client-3: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >> 0-_data1-client-0: changing port to 49153 (from 0) >> Final graph: >> >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii >> wrote: >> >>> I've added the lru-limit=0 parameter to the mounts, and I see it's taken >>> effect correctly: >>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>> --volfile-server=localhost --volfile-id=/ /mnt/" >>> >>> Let's see if it stops crashing or not. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii >>> wrote: >>> >>>> Hi Nithya, >>>> >>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>> crashes, and no further releases have been made yet. >>>> >>>> volume info: >>>> Type: Replicate >>>> Volume ID: ****SNIP**** >>>> Status: Started >>>> Snapshot Count: 0 >>>> Number of Bricks: 1 x 4 = 4 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: ****SNIP**** >>>> Brick2: ****SNIP**** >>>> Brick3: ****SNIP**** >>>> Brick4: ****SNIP**** >>>> Options Reconfigured: >>>> cluster.quorum-count: 1 >>>> cluster.quorum-type: fixed >>>> network.ping-timeout: 5 >>>> network.remote-dio: enable >>>> performance.rda-cache-limit: 256MB >>>> performance.readdir-ahead: on >>>> performance.parallel-readdir: on >>>> network.inode-lru-limit: 500000 >>>> performance.md-cache-timeout: 600 >>>> performance.cache-invalidation: on >>>> performance.stat-prefetch: on >>>> features.cache-invalidation-timeout: 600 >>>> features.cache-invalidation: on >>>> cluster.readdir-optimize: on >>>> performance.io-thread-count: 32 >>>> server.event-threads: 4 >>>> client.event-threads: 4 >>>> performance.read-ahead: off >>>> cluster.lookup-optimize: on >>>> performance.cache-size: 1GB >>>> cluster.self-heal-daemon: enable >>>> transport.address-family: inet >>>> nfs.disable: on >>>> performance.client-io-threads: on >>>> cluster.granular-entry-heal: enable >>>> cluster.data-self-heal-algorithm: full >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>> nbalacha at redhat.com> wrote: >>>> >>>>> Hi Artem, >>>>> >>>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>> looking into the crashes and will update when have a fix. >>>>> >>>>> Also, please provide the gluster volume info for the volume in >>>>> question. >>>>> >>>>> >>>>> regards, >>>>> Nithya >>>>> >>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>>> wrote: >>>>> >>>>>> The fuse crash happened two more times, but this time monit helped >>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>> >>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>> servers, and I don't know why. 
>>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>> there any mount options that could help mitigate this? >>>>>>> >>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>>> >>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>> "glusterfs --process-name fuse" process. >>>>>>> >>>>>>> >>>>>>> monit check: >>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>> >>>>>>> >>>>>>> stack trace: >>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fa0249e4329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fa0249e4329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>> [2019-02-01 23:21:56.164427] >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>> pending frames: >>>>>>> frame : type(1) op(LOOKUP) >>>>>>> frame : type(0) op(0) >>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>> signal received: 6 >>>>>>> time of crash: >>>>>>> 2019-02-01 23:22:03 >>>>>>> configuration details: >>>>>>> argp 1 >>>>>>> backtrace 1 >>>>>>> dlfcn 1 >>>>>>> libpthread 1 >>>>>>> llistxattr 1 >>>>>>> setfsid 1 >>>>>>> spinlock 1 >>>>>>> epoll.h 1 >>>>>>> xattr.h 1 >>>>>>> st_atim.tv_nsec 1 >>>>>>> package-string: glusterfs 5.3 >>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>> >>>>>>> 
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>> mounts. >>>>>>>> >>>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>>> site (apkmirror.com), and it caused a disruption for any uploads >>>>>>>> or downloads from that server until I woke up and fixed the mount. >>>>>>>> >>>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>>> >>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>> >>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>> atumball at redhat.com> wrote: >>>>>>>> >>>>>>>>> Hi Artem, >>>>>>>>> >>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>> >>>>>>>>> We already have fixes for log flooding - >>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>> >>>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? >>>>>>>>> or was there any particular pattern you observed before the crash. 
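A quick way to answer the timing question from logs already on disk is to pull out the crash headers the client writes when it aborts, for example (assuming the default log directory and the mount log naming seen elsewhere in this thread):

grep -A1 'time of crash' /var/log/glusterfs/mnt-SITE_data1.log
grep -c 'Failed to dispatch handler' /var/log/glusterfs/mnt-SITE_data1.log

The first command lists every abort with its timestamp; the second gives a rough count of the dispatch errors that preceded them.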
>>>>>>>>> >>>>>>>>> -Amar >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>> pending frames: >>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>> frame : type(0) op(0) >>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>> signal received: 6 >>>>>>>>>> time of crash: >>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>> configuration details: >>>>>>>>>> argp 1 >>>>>>>>>> backtrace 1 >>>>>>>>>> dlfcn 1 >>>>>>>>>> libpthread 1 >>>>>>>>>> llistxattr 1 >>>>>>>>>> setfsid 1 >>>>>>>>>> spinlock 1 >>>>>>>>>> epoll.h 1 >>>>>>>>>> xattr.h 1 >>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] 
>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>> --------- >>>>>>>>>> >>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>> >>>>>>>>>> If it's not fixed by the patches above, has anyone already opened >>>>>>>>>> a ticket for the crashes that I can join and monitor? This is going to >>>>>>>>>> create a massive problem for us since production systems are crashing. >>>>>>>>>> >>>>>>>>>> Thanks. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>> this. 
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's >>>>>>>>>>>>> a comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>>> spam. 
>>>>>>>>>>>>> >>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>>> message is logged and send a fix? >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks. >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>> Gluster-users mailing list >>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Amar Tumballi (amarts) >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Fri Feb 8 03:20:03 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Fri, 8 Feb 2019 08:50:03 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: On Fri, Feb 8, 2019 at 8:48 AM Raghavendra Gowdappa wrote: > One possible reason could be > https://review.gluster.org/r/18b6d7ce7d490e807815270918a17a4b392a829d > https://review.gluster.org/#/c/glusterfs/+/19997/ as that changed some code in epoll handler. Though the change is largely on > server side, the epoll and socket changes are relevant for client too. I'll > try to see whether there is anything wrong with that. > > On Fri, Feb 8, 2019 at 8:36 AM Nithya Balachandran > wrote: > >> Thanks Artem. Can you send us the coredump or the bt with symbols from >> the crash? >> >> Regards, >> Nithya >> >> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >> wrote: >> >>> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >>> didn't help. >>> >>> Here's the snippet of the crash and the subsequent remount by monit. 
>>> >>> >>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7f4402b99329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>> valid argument] >>> The message "I [MSGID: 108031] >>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>> selecting local read_child _data1-client-3" repeated 39 times between >>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>> The message "E [MSGID: 101191] >>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>> [2019-02-08 01:13:09.311554] >>> pending frames: >>> frame : type(1) op(LOOKUP) >>> frame : type(0) op(0) >>> patchset: git://git.gluster.org/glusterfs.git >>> signal received: 6 >>> time of crash: >>> 2019-02-08 01:13:09 >>> configuration details: >>> argp 1 >>> backtrace 1 >>> dlfcn 1 >>> libpthread 1 >>> llistxattr 1 >>> setfsid 1 >>> spinlock 1 >>> epoll.h 1 >>> xattr.h 1 >>> st_atim.tv_nsec 1 >>> package-string: glusterfs 5.3 >>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>> >>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>> >>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>> >>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>> --------- >>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] >>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 >>> (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>> --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) >>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 1 >>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 2 >>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 3 >>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 4 >>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-0: parent translators are ready, attempting connect >>> on transport >>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-1: parent translators are ready, attempting connect >>> on 
transport >>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-2: parent translators are ready, attempting connect >>> on transport >>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-3: parent translators are ready, attempting connect >>> on transport >>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>> 0-_data1-client-0: changing port to 49153 (from 0) >>> Final graph: >>> >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii >>> wrote: >>> >>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>> taken effect correctly: >>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>> >>>> Let's see if it stops crashing or not. >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii >>>> wrote: >>>> >>>>> Hi Nithya, >>>>> >>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>>> crashes, and no further releases have been made yet. >>>>> >>>>> volume info: >>>>> Type: Replicate >>>>> Volume ID: ****SNIP**** >>>>> Status: Started >>>>> Snapshot Count: 0 >>>>> Number of Bricks: 1 x 4 = 4 >>>>> Transport-type: tcp >>>>> Bricks: >>>>> Brick1: ****SNIP**** >>>>> Brick2: ****SNIP**** >>>>> Brick3: ****SNIP**** >>>>> Brick4: ****SNIP**** >>>>> Options Reconfigured: >>>>> cluster.quorum-count: 1 >>>>> cluster.quorum-type: fixed >>>>> network.ping-timeout: 5 >>>>> network.remote-dio: enable >>>>> performance.rda-cache-limit: 256MB >>>>> performance.readdir-ahead: on >>>>> performance.parallel-readdir: on >>>>> network.inode-lru-limit: 500000 >>>>> performance.md-cache-timeout: 600 >>>>> performance.cache-invalidation: on >>>>> performance.stat-prefetch: on >>>>> features.cache-invalidation-timeout: 600 >>>>> features.cache-invalidation: on >>>>> cluster.readdir-optimize: on >>>>> performance.io-thread-count: 32 >>>>> server.event-threads: 4 >>>>> client.event-threads: 4 >>>>> performance.read-ahead: off >>>>> cluster.lookup-optimize: on >>>>> performance.cache-size: 1GB >>>>> cluster.self-heal-daemon: enable >>>>> transport.address-family: inet >>>>> nfs.disable: on >>>>> performance.client-io-threads: on >>>>> cluster.granular-entry-heal: enable >>>>> cluster.data-self-heal-algorithm: full >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>> nbalacha at redhat.com> wrote: >>>>> >>>>>> Hi Artem, >>>>>> >>>>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>>> looking into the crashes and will update when have a fix. >>>>>> >>>>>> Also, please provide the gluster volume info for the volume in >>>>>> question. 
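For anyone following along, the lru-limit option can be passed straight through the mount command or fstab, and the volume details come from the CLI; a sketch with placeholder volume and mount point names:

mount -t glusterfs -o lru-limit=0 localhost:/SITE_data1 /mnt/SITE_data1
# or persistently in /etc/fstab:
# localhost:/SITE_data1  /mnt/SITE_data1  glusterfs  defaults,_netdev,lru-limit=0  0 0
gluster volume info SITE_data1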
>>>>>> >>>>>> >>>>>> regards, >>>>>> Nithya >>>>>> >>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>>>> wrote: >>>>>> >>>>>>> The fuse crash happened two more times, but this time monit helped >>>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>>> >>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>> servers, and I don't know why. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>> there any mount options that could help mitigate this? >>>>>>>> >>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>>>> >>>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>> >>>>>>>> >>>>>>>> monit check: >>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>> >>>>>>>> >>>>>>>> stack trace: >>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fa0249e4329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fa0249e4329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>> The message "E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>> The message "I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>> pending frames: >>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>> frame : type(0) op(0) >>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>> signal received: 6 >>>>>>>> time of crash: >>>>>>>> 2019-02-01 23:22:03 >>>>>>>> configuration details: >>>>>>>> argp 1 >>>>>>>> backtrace 1 >>>>>>>> dlfcn 1 >>>>>>>> libpthread 1 >>>>>>>> llistxattr 1 >>>>>>>> setfsid 1 >>>>>>>> spinlock 1 >>>>>>>> epoll.h 1 >>>>>>>> xattr.h 1 >>>>>>>> st_atim.tv_nsec 1 >>>>>>>> package-string: glusterfs 5.3 >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>> 
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>> mounts. >>>>>>>>> >>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>>>> site (apkmirror.com), and it caused a disruption for any uploads >>>>>>>>> or downloads from that server until I woke up and fixed the mount. >>>>>>>>> >>>>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>>>> >>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>> >>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Artem, >>>>>>>>>> >>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>>> >>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>> >>>>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? >>>>>>>>>> or was there any particular pattern you observed before the crash. 
>>>>>>>>>> >>>>>>>>>> -Amar >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>> pending frames: >>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>> signal received: 6 >>>>>>>>>>> time of crash: >>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>> configuration details: >>>>>>>>>>> argp 1 >>>>>>>>>>> backtrace 1 >>>>>>>>>>> dlfcn 1 >>>>>>>>>>> libpthread 1 >>>>>>>>>>> llistxattr 1 >>>>>>>>>>> setfsid 1 >>>>>>>>>>> spinlock 1 >>>>>>>>>>> epoll.h 1 >>>>>>>>>>> xattr.h 1 >>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>> 
/lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>> --------- >>>>>>>>>>> >>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>> >>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>>> this. 
>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks. >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's >>>>>>>>>>>>>> a comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>>>> spam. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>>>> message is logged and send a fix? >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>> >>>>>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skoduri at redhat.com Fri Feb 8 06:23:09 2019 From: skoduri at redhat.com (Soumya Koduri) Date: Fri, 8 Feb 2019 11:53:09 +0530 Subject: [Gluster-users] glusterfs 4.1.7 + nfs-ganesha 2.7.1 freeze during write In-Reply-To: <22FDC703-87F4-472D-8229-9B26F440FAB1@gmail.com> References: <1FBA8430-F957-40B3-8422-2E0D25265B68@gmail.com> <22FDC703-87F4-472D-8229-9B26F440FAB1@gmail.com> Message-ID: On 2/8/19 3:20 AM, Maurits Lamers wrote: > Hi, > >> >>> [2019-02-07 10:11:24.812606] E [MSGID: 104055] >>> [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall >>> event_type(1) and gfid(y???? >>> ?????????Mz???SL4_@) failed >>> [2019-02-07 10:11:24.819376] E [MSGID: 104055] >>> [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall >>> event_type(1) and gfid(eTn?EU?H.>> [2019-02-07 10:11:24.833299] E [MSGID: 104055] >>> [glfs-fops.c:4955:glfs_cbk_upcall_data] 0-gfapi: Synctak for Upcall >>> event_type(1) and gfid(g?L??F??0b??k) failed >>> [2019-02-07 10:25:01.642509] C >>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-2: >>> server [node1]:49152 has not responded in the last 42 seconds, >>> disconnecting. >>> [2019-02-07 10:25:01.642805] C >>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-1: >>> server [node2]:49152 has not responded in the last 42 seconds, >>> disconnecting. 
>>> [2019-02-07 10:25:01.642946] C >>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-4: >>> server [node3]:49152 has not responded in the last 42 seconds, >>> disconnecting. >>> [2019-02-07 10:25:02.643120] C >>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-3: >>> server 127.0.1.1:49152 has not responded in the last 42 seconds, >>> disconnecting. >>> [2019-02-07 10:25:02.643314] C >>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-gv0-client-0: >>> server [node4]:49152 has not responded in the last 42 seconds, >>> disconnecting. >> >> Strange that synctask failed. Could you please turn off >> features.cache-invalidation volume option and check if the issue still >> persists. >>> > > Turning the cache invalidation option off seems to have solved the > freeze. Still testing, but it looks promising. > If thats the case, please turn on cache invalidation option back and collect couple of stack traces (using gstack) when the system freezes again. Thanks, Soumya > cheers > > Maurits > From nico at van-royen.nl Fri Feb 8 06:33:36 2019 From: nico at van-royen.nl (Nico van Royen) Date: Fri, 8 Feb 2019 06:33:36 +0000 (UTC) Subject: [Gluster-users] Web Ui for gluster In-Reply-To: References: Message-ID: <1542558386.4523.1549607616040.JavaMail.zimbra@van-royen.nl> Hi Deepu, For monitoring the nodes (and most other GlusterFS related stuff like volumes and bricks) there is Tendrl[1]. Setup is manual (ssh / cli) but parts can be done through Heketi's restAPI if you provide it with a topology first. [1] http://tendrl.org/ Regards, Nico van Roijen ----- Oorspronkelijk bericht ----- Van: "deepu srinivasan" Aan: "gluster-users" Verzonden: Donderdag 7 februari 2019 12:01:24 Onderwerp: [Gluster-users] Web Ui for gluster Dear Users Is there a default Web ui provided by gluster for monitoring the nodes and configuring them _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users From srakonde at redhat.com Fri Feb 8 09:03:24 2019 From: srakonde at redhat.com (Sanju Rakonde) Date: Fri, 8 Feb 2019 14:33:24 +0530 Subject: [Gluster-users] Getting timedout error while rebalancing In-Reply-To: References: Message-ID: Hi Deepu, I can see multiple errors in glusterd log. [2019-02-06 13:22:21.012490] E [glusterd-rpc-ops.c:1429:__glusterd_commit_op_cbk] (-->/lib64/libgfrpc.so.0(+0xec20) [0x7f278d201c20] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x7762a) [0x7f2781f1d62a] -->/usr/lib64/glusterfs/4.1.7/xlator/mgmt/glusterd.so(+0x75213) [0x7f2781f1b213] ) 0-: Assertion failed: rsp.op == txn_op_info.op ----> error has repeated multiple times in log. [2019-02-06 11:16:32.474268] E [MSGID: 106218] [glusterd-rebalance.c:460:glusterd_rebalance_cmd_validate] 0-glusterd: Volume test-volume is not a distribute type or contains only 1 brick [2019-02-06 11:16:32.474361] E [MSGID: 106301] [glusterd-op-sm.c:4669:glusterd_op_ac_send_stage_op] 0-management: Staging of operation 'Volume Rebalance' failed on localhost : Volume test-volume is not a distribute volume or contains only 1 brick. Not performing rebalance [2019-02-06 13:18:35.253045] I [MSGID: 106482] [glusterd-brick-ops.c:448:__glusterd_handle_add_brick] 0-management: Received add brick req [2019-02-06 13:18:35.253080] E [MSGID: 106026] [glusterd-brick-ops.c:483:__glusterd_handle_add_brick] 0-management: Volume 192.168.185.xxx:/home/data/repl does not exist [Invalid argument] ----> Is the add-brick success? 
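That last error shows glusterd trying to look up "192.168.185.xxx:/home/data/repl" as a volume name, which suggests the volume name was missing or misplaced in that particular add-brick command. The general form, with angle brackets marking placeholders, is roughly:

gluster volume add-brick <VOLNAME> <HOST1>:<BRICK1> <HOST2>:<BRICK2> <HOST3>:<BRICK3>
gluster volume info <VOLNAME>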
It is difficult to confirm anything by only looking at the glusterd logs. Please share glusterd, cli and cmd_history logs from all the nodes and also provide output of below commands. 1. gluster --version 2. gluster vol info 3. gluster vol status Thanks, Sanju On Thu, Feb 7, 2019 at 1:26 AM deepu srinivasan wrote: > Please find the glusterd.log file attached. > > On Wed, Feb 6, 2019 at 2:01 PM Atin Mukherjee wrote: > >> >> >> On Tue, Feb 5, 2019 at 8:43 PM Nithya Balachandran >> wrote: >> >>> >>> >>> On Tue, 5 Feb 2019 at 17:26, deepu srinivasan >>> wrote: >>> >>>> HI Nithya >>>> We have a test gluster setup.We are testing the rebalancing option of >>>> gluster. So we started the volume which have 1x3 brick with some data on it >>>> . >>>> command : gluster volume create test-volume replica 3 >>>> 192.168.xxx.xx1:/home/data/repl 192.168.xxx.xx2:/home/data/repl >>>> 192.168.xxx.xx3:/home/data/repl. >>>> >>>> Now we tried to expand the cluster storage by adding three more bricks. >>>> command : gluster volume add-brick test-volume 192.168.xxx.xx4:/home/data/repl >>>> 192.168.xxx.xx5:/home/data/repl 192.168.xxx.xx6:/home/data/repl >>>> >>>> So after the brick addition we tried to rebalance the layout and the >>>> data. >>>> command : gluster volume rebalance test-volume fix-layout start. >>>> The command exited with status "Error : Request timed out". >>>> >>> >>> This sounds like an error in the cli or glusterd. Can you send the >>> glusterd.log from the node on which you ran the command? >>> >> >> It seems to me that glusterd took more than 120 seconds to process the >> command and hence cli timed out. We can confirm the same by checking the >> status of the rebalance below which indicates rebalance did kick in and >> eventually completed. We need to understand why did it take such longer, so >> please pass on the cli and glusterd log from all the nodes as Nithya >> requested for. >> >> >>> regards, >>> Nithya >>> >>>> >>>> After the failure of the command, we tried to view the status of the >>>> command and it is something like this : >>>> >>>> Node Rebalanced-files size >>>> scanned failures skipped status run >>>> time in h:m:s >>>> >>>> --------- ----------- ----------- >>>> ----------- ----------- ----------- ------------ >>>> -------------- >>>> >>>> localhost 41 41.0MB >>>> 8200 0 0 completed >>>> 0:00:09 >>>> >>>> 192.168.xxx.xx4 79 79.0MB >>>> 8231 0 0 completed >>>> 0:00:12 >>>> >>>> 192.168.xxx.xx6 58 58.0MB >>>> 8281 0 0 completed >>>> 0:00:10 >>>> >>>> 192.168.xxx.xx2 136 136.0MB >>>> 8566 0 136 completed >>>> 0:00:07 >>>> >>>> 192.168.xxx.xx4 129 129.0MB >>>> 8566 0 129 completed >>>> 0:00:07 >>>> >>>> 192.168.xxx.xx6 201 201.0MB >>>> 8566 0 201 completed >>>> 0:00:08 >>>> >>>> Is the rebalancing option working fine? Why did gluster throw the >>>> error saying that "Error : Request timed out"? >>>> .On Tue, Feb 5, 2019 at 4:23 PM Nithya Balachandran < >>>> nbalacha at redhat.com> wrote: >>>> >>>>> Hi, >>>>> Please provide the exact step at which you are seeing the error. It >>>>> would be ideal if you could copy-paste the command and the error. >>>>> >>>>> Regards, >>>>> Nithya >>>>> >>>>> >>>>> >>>>> On Tue, 5 Feb 2019 at 15:24, deepu srinivasan >>>>> wrote: >>>>> >>>>>> HI everyone. I am getting "Error : Request timed out " while doing >>>>>> rebalance . I have aded new bricks to my replicated volume.i.e. First it >>>>>> was 1x3 volume and added three more bricks to make it >>>>>> distributed-replicated volume(2x3) . What should i do for the timeout error >>>>>> ? 
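As a quick reference for gathering what is asked for above, the relevant files sit in the default gluster log directory on every node; a minimal collection sketch (the tarball name is only an example, and /var/log/glusterfs is the assumed default location):

# on each node
tar czf gluster-logs-$(hostname).tar.gz \
    /var/log/glusterfs/glusterd.log \
    /var/log/glusterfs/cli.log \
    /var/log/glusterfs/cmd_history.log
gluster --version
gluster volume info test-volume
gluster volume status test-volume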
>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Thanks, Sanju -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Fri Feb 8 11:44:17 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Fri, 8 Feb 2019 17:14:17 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: On Fri, Feb 8, 2019 at 8:50 AM Raghavendra Gowdappa wrote: > > > On Fri, Feb 8, 2019 at 8:48 AM Raghavendra Gowdappa > wrote: > >> One possible reason could be >> https://review.gluster.org/r/18b6d7ce7d490e807815270918a17a4b392a829d >> > > https://review.gluster.org/#/c/glusterfs/+/19997/ > This patch is not in release-5.0 branch. > as that changed some code in epoll handler. Though the change is largely >> on server side, the epoll and socket changes are relevant for client too. >> I'll try to see whether there is anything wrong with that. >> >> On Fri, Feb 8, 2019 at 8:36 AM Nithya Balachandran >> wrote: >> >>> Thanks Artem. Can you send us the coredump or the bt with symbols from >>> the crash? >>> >>> Regards, >>> Nithya >>> >>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >>> wrote: >>> >>>> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >>>> didn't help. >>>> >>>> Here's the snippet of the crash and the subsequent remount by monit. 
>>>> >>>> >>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7f4402b99329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>> valid argument] >>>> The message "I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>> selecting local read_child _data1-client-3" repeated 39 times between >>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>> The message "E [MSGID: 101191] >>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>> [2019-02-08 01:13:09.311554] >>>> pending frames: >>>> frame : type(1) op(LOOKUP) >>>> frame : type(0) op(0) >>>> patchset: git://git.gluster.org/glusterfs.git >>>> signal received: 6 >>>> time of crash: >>>> 2019-02-08 01:13:09 >>>> configuration details: >>>> argp 1 >>>> backtrace 1 >>>> dlfcn 1 >>>> libpthread 1 >>>> llistxattr 1 >>>> setfsid 1 >>>> spinlock 1 >>>> epoll.h 1 >>>> xattr.h 1 >>>> st_atim.tv_nsec 1 >>>> package-string: glusterfs 5.3 >>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>> >>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>> --------- >>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] >>>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 >>>> (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>> --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) >>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>> with index 1 >>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>> with index 2 >>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>> with index 3 >>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>> with index 4 >>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >>>> 0-_data1-client-0: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] 
>>>> 0-_data1-client-1: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >>>> 0-_data1-client-2: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >>>> 0-_data1-client-3: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>> Final graph: >>>> >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii >>>> wrote: >>>> >>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>> taken effect correctly: >>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>> >>>>> Let's see if it stops crashing or not. >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> Hi Nithya, >>>>>> >>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>>>> crashes, and no further releases have been made yet. >>>>>> >>>>>> volume info: >>>>>> Type: Replicate >>>>>> Volume ID: ****SNIP**** >>>>>> Status: Started >>>>>> Snapshot Count: 0 >>>>>> Number of Bricks: 1 x 4 = 4 >>>>>> Transport-type: tcp >>>>>> Bricks: >>>>>> Brick1: ****SNIP**** >>>>>> Brick2: ****SNIP**** >>>>>> Brick3: ****SNIP**** >>>>>> Brick4: ****SNIP**** >>>>>> Options Reconfigured: >>>>>> cluster.quorum-count: 1 >>>>>> cluster.quorum-type: fixed >>>>>> network.ping-timeout: 5 >>>>>> network.remote-dio: enable >>>>>> performance.rda-cache-limit: 256MB >>>>>> performance.readdir-ahead: on >>>>>> performance.parallel-readdir: on >>>>>> network.inode-lru-limit: 500000 >>>>>> performance.md-cache-timeout: 600 >>>>>> performance.cache-invalidation: on >>>>>> performance.stat-prefetch: on >>>>>> features.cache-invalidation-timeout: 600 >>>>>> features.cache-invalidation: on >>>>>> cluster.readdir-optimize: on >>>>>> performance.io-thread-count: 32 >>>>>> server.event-threads: 4 >>>>>> client.event-threads: 4 >>>>>> performance.read-ahead: off >>>>>> cluster.lookup-optimize: on >>>>>> performance.cache-size: 1GB >>>>>> cluster.self-heal-daemon: enable >>>>>> transport.address-family: inet >>>>>> nfs.disable: on >>>>>> performance.client-io-threads: on >>>>>> cluster.granular-entry-heal: enable >>>>>> cluster.data-self-heal-algorithm: full >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>> nbalacha at redhat.com> wrote: >>>>>> >>>>>>> Hi Artem, >>>>>>> >>>>>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>> looking into the crashes and will update when have a fix. >>>>>>> >>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>> question. 
>>>>>>> >>>>>>> >>>>>>> regards, >>>>>>> Nithya >>>>>>> >>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>>>>> wrote: >>>>>>> >>>>>>>> The fuse crash happened two more times, but this time monit helped >>>>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>>>> >>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>> servers, and I don't know why. >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>>> there any mount options that could help mitigate this? >>>>>>>>> >>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>> >>>>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>> >>>>>>>>> >>>>>>>>> monit check: >>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>> >>>>>>>>> >>>>>>>>> stack trace: >>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fa0249e4329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fa0249e4329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>> pending frames: >>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>> frame : type(0) op(0) >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>> signal received: 6 >>>>>>>>> time of crash: >>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>> configuration details: >>>>>>>>> argp 1 >>>>>>>>> backtrace 1 >>>>>>>>> dlfcn 1 >>>>>>>>> libpthread 1 >>>>>>>>> llistxattr 1 >>>>>>>>> setfsid 1 >>>>>>>>> spinlock 1 >>>>>>>>> epoll.h 1 >>>>>>>>> xattr.h 1 >>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>> package-string: glusterfs 
5.3 >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>>> mounts. >>>>>>>>>> >>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>>>>> site (apkmirror.com), and it caused a disruption for any uploads >>>>>>>>>> or downloads from that server until I woke up and fixed the mount. >>>>>>>>>> >>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>> trace. >>>>>>>>>> >>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>> >>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Artem, >>>>>>>>>>> >>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>>>> >>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>> >>>>>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? >>>>>>>>>>> or was there any particular pattern you observed before the crash. 
>>>>>>>>>>> >>>>>>>>>>> -Amar >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>> pending frames: >>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>> signal received: 6 >>>>>>>>>>>> time of crash: >>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>> configuration details: >>>>>>>>>>>> argp 1 >>>>>>>>>>>> backtrace 1 >>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>> libpthread 1 >>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>> setfsid 1 >>>>>>>>>>>> spinlock 1 >>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>> 
/lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>> --------- >>>>>>>>>>>> >>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>> >>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>>>> this. 
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>> seeing the spam. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>>>>> message is logged and send a fix? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>> Gluster-users mailing list >>>>>>>> Gluster-users at gluster.org >>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>> >>>>>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Fri Feb 8 12:57:10 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Fri, 8 Feb 2019 18:27:10 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Hi Artem, We have found the cause of one crash. Unfortunately we have not managed to reproduce the one you reported so we don't know if it is the same cause. Can you disable write-behind on the volume and let us know if it solves the problem? If yes, it is likely to be the same issue. regards, Nithya On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii wrote: > Sorry to disappoint, but the crash just happened again, so lru-limit=0 > didn't help. > > Here's the snippet of the crash and the subsequent remount by monit. 
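For reference, write-behind here refers to the performance.write-behind volume option; a minimal sketch of checking and then disabling it with the gluster CLI (volume name is a placeholder, not taken from this thread):

gluster volume get <VOLNAME> performance.write-behind
gluster volume set <VOLNAME> performance.write-behind off

It can be switched back to "on" the same way if it turns out to be unrelated.
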
> > > [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] > (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) > [0x7f4402b99329] > -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) > [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) > [0x7f440b6b5218] ) 0-dict: dict is NULL [In > valid argument] > The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] > 0-_data1-replicate-0: selecting local read_child > _data1-client-3" repeated 39 times between [2019-02-08 > 01:11:18.043286] and [2019-02-08 01:13:07.915604] > The message "E [MSGID: 101191] > [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch > handler" repeated 515 times between [2019-02-08 01:11:17.932515] and > [2019-02-08 01:13:09.311554] > pending frames: > frame : type(1) op(LOOKUP) > frame : type(0) op(0) > patchset: git://git.gluster.org/glusterfs.git > signal received: 6 > time of crash: > 2019-02-08 01:13:09 > configuration details: > argp 1 > backtrace 1 > dlfcn 1 > libpthread 1 > llistxattr 1 > setfsid 1 > spinlock 1 > epoll.h 1 > xattr.h 1 > st_atim.tv_nsec 1 > package-string: glusterfs 5.3 > /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] > /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] > /lib64/libc.so.6(+0x36160)[0x7f440a887160] > /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] > /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] > /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] > /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] > /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] > > /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] > > /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] > /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] > /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] > /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] > /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] > /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] > /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] > /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] > --------- > [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] > 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 > (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse > --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) > [2019-02-08 01:13:35.637830] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 1 > [2019-02-08 01:13:35.651405] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 2 > [2019-02-08 01:13:35.651628] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 3 > [2019-02-08 01:13:35.651747] I [MSGID: 101190] > [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread > with index 4 > [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-0: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-1: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-2: parent translators are ready, attempting connect > on transport 
> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] > 0-_data1-client-3: parent translators are ready, attempting connect > on transport > [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] > 0-_data1-client-0: changing port to 49153 (from 0) > Final graph: > > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii > wrote: > >> I've added the lru-limit=0 parameter to the mounts, and I see it's taken >> effect correctly: >> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >> --volfile-server=localhost --volfile-id=/ /mnt/" >> >> Let's see if it stops crashing or not. >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii >> wrote: >> >>> Hi Nithya, >>> >>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>> crashes, and no further releases have been made yet. >>> >>> volume info: >>> Type: Replicate >>> Volume ID: ****SNIP**** >>> Status: Started >>> Snapshot Count: 0 >>> Number of Bricks: 1 x 4 = 4 >>> Transport-type: tcp >>> Bricks: >>> Brick1: ****SNIP**** >>> Brick2: ****SNIP**** >>> Brick3: ****SNIP**** >>> Brick4: ****SNIP**** >>> Options Reconfigured: >>> cluster.quorum-count: 1 >>> cluster.quorum-type: fixed >>> network.ping-timeout: 5 >>> network.remote-dio: enable >>> performance.rda-cache-limit: 256MB >>> performance.readdir-ahead: on >>> performance.parallel-readdir: on >>> network.inode-lru-limit: 500000 >>> performance.md-cache-timeout: 600 >>> performance.cache-invalidation: on >>> performance.stat-prefetch: on >>> features.cache-invalidation-timeout: 600 >>> features.cache-invalidation: on >>> cluster.readdir-optimize: on >>> performance.io-thread-count: 32 >>> server.event-threads: 4 >>> client.event-threads: 4 >>> performance.read-ahead: off >>> cluster.lookup-optimize: on >>> performance.cache-size: 1GB >>> cluster.self-heal-daemon: enable >>> transport.address-family: inet >>> nfs.disable: on >>> performance.client-io-threads: on >>> cluster.granular-entry-heal: enable >>> cluster.data-self-heal-algorithm: full >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran >>> wrote: >>> >>>> Hi Artem, >>>> >>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>> looking into the crashes and will update when have a fix. >>>> >>>> Also, please provide the gluster volume info for the volume in question. >>>> >>>> >>>> regards, >>>> Nithya >>>> >>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>> wrote: >>>> >>>>> The fuse crash happened two more times, but this time monit helped >>>>> recover within 1 minute, so it's a great workaround for now. >>>>> >>>>> What's odd is that the crashes are only happening on one of 4 servers, >>>>> and I don't know why. 
>>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> The fuse crash happened again yesterday, to another volume. Are there >>>>>> any mount options that could help mitigate this? >>>>>> >>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>> >>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>> "glusterfs --process-name fuse" process. >>>>>> >>>>>> >>>>>> monit check: >>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>> then alert else if succeeded for 10 cycles then alert >>>>>> >>>>>> >>>>>> stack trace: >>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fa0249e4329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7fa0249e4329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>> [2019-02-01 23:21:56.164427] >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>> pending frames: >>>>>> frame : type(1) op(LOOKUP) >>>>>> frame : type(0) op(0) >>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>> signal received: 6 >>>>>> time of crash: >>>>>> 2019-02-01 23:22:03 >>>>>> configuration details: >>>>>> argp 1 >>>>>> backtrace 1 >>>>>> dlfcn 1 >>>>>> libpthread 1 >>>>>> llistxattr 1 >>>>>> setfsid 1 >>>>>> spinlock 1 >>>>>> epoll.h 1 >>>>>> xattr.h 1 >>>>>> st_atim.tv_nsec 1 >>>>>> package-string: glusterfs 5.3 >>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>> >>>>>> 
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> The first (and so far only) crash happened at 2am the next day after >>>>>>> we upgraded, on only one of four servers and only to one of two mounts. >>>>>>> >>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>> site (apkmirror.com), and it caused a disruption for any uploads or >>>>>>> downloads from that server until I woke up and fixed the mount. >>>>>>> >>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>> >>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>> >>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>> atumball at redhat.com> wrote: >>>>>>> >>>>>>>> Hi Artem, >>>>>>>> >>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as >>>>>>>> a clone of other bugs where recent discussions happened), and marked it as >>>>>>>> a blocker for glusterfs-5.4 release. >>>>>>>> >>>>>>>> We already have fixes for log flooding - >>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>> >>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? or >>>>>>>> was there any particular pattern you observed before the crash. 
>>>>>>>> >>>>>>>> -Amar >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>> already got a crash which others have mentioned in >>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>> >>>>>>>>> >>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fcccafcd329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>> pending frames: >>>>>>>>> frame : type(1) op(READ) >>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>> frame : type(0) op(0) >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>> signal received: 6 >>>>>>>>> time of crash: >>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>> configuration details: >>>>>>>>> argp 1 >>>>>>>>> backtrace 1 >>>>>>>>> dlfcn 1 >>>>>>>>> libpthread 1 >>>>>>>>> llistxattr 1 >>>>>>>>> setfsid 1 >>>>>>>>> spinlock 1 >>>>>>>>> epoll.h 1 >>>>>>>>> xattr.h 1 >>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>> 
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>> --------- >>>>>>>>> >>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>> not too sure how to make it core dump. >>>>>>>>> >>>>>>>>> If it's not fixed by the patches above, has anyone already opened >>>>>>>>> a ticket for the crashes that I can join and monitor? This is going to >>>>>>>>> create a massive problem for us since production systems are crashing. >>>>>>>>> >>>>>>>>> Thanks. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>> commenting about this issue here >>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this. 
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>>>>> >>>>>>>>>>> Thanks. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a >>>>>>>>>>>> comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>> spam. 
>>>>>>>>>>>> >>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>> message is logged and send a fix? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-users mailing list >>>>>>>>> Gluster-users at gluster.org >>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Amar Tumballi (amarts) >>>>>>>> >>>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlaib01 at gmail.com Fri Feb 8 16:04:45 2019 From: jlaib01 at gmail.com (Jim Laib) Date: Fri, 8 Feb 2019 10:04:45 -0600 Subject: [Gluster-users] Client failover question Message-ID: I may have missed this in the documentation, but is there a way to make a Gluster client server redundant? I have several bricks setup being controlled by one cluster client server and if that server blows, there's quite a bit of rebuilding to do. So looking for a way to ovoid that. Any suggestions would be most appreciated. Thanks in advance. -------------- next part -------------- An HTML attachment was scrubbed... URL: From nico at van-royen.nl Fri Feb 8 18:33:36 2019 From: nico at van-royen.nl (Nico van Royen) Date: Fri, 8 Feb 2019 18:33:36 +0000 (UTC) Subject: [Gluster-users] Client failover question In-Reply-To: References: Message-ID: <1211339554.4774.1549650816577.JavaMail.zimbra@van-royen.nl> Hello, What do you mean by a gluster 'client server' ? A node is either a server (running the glusterd and glusterfsd/brick processes), and you'd need at least 2 (3 is better, and the volume to be set as a replica-3) or a gluster client (uses the gluster-client to mount a volume from a gluster-cluster). You could of-course also run the 'client' on the actual gluster-server to mount to itself (wouldn't recommend that). Nico ----- Oorspronkelijk bericht ----- Van: "Jim Laib" Aan: "gluster-users" Verzonden: Vrijdag 8 februari 2019 17:04:45 Onderwerp: [Gluster-users] Client failover question I may have missed this in the documentation, but is there a way to make a Gluster client server redundant? I have several bricks setup being controlled by one cluster client server and if that server blows, there's quite a bit of rebuilding to do. So looking for a way to ovoid that. 
Any suggestions would be most appreciated. Thanks in advance. _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users From archon810 at gmail.com Fri Feb 8 19:21:56 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Fri, 8 Feb 2019 11:21:56 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Hi Nithya, I can try to disable write-behind as long as it doesn't heavily impact performance for us. Which option is it exactly? I don't see it set in my list of changed volume variables that I sent you guys earlier. Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran wrote: > Hi Artem, > > We have found the cause of one crash. Unfortunately we have not managed to > reproduce the one you reported so we don't know if it is the same cause. > > Can you disable write-behind on the volume and let us know if it solves > the problem? If yes, it is likely to be the same issue. > > > regards, > Nithya > > On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii > wrote: > >> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >> didn't help. >> >> Here's the snippet of the crash and the subsequent remount by monit. >> >> >> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >> [0x7f4402b99329] >> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >> valid argument] >> The message "I [MSGID: 108031] >> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >> selecting local read_child _data1-client-3" repeated 39 times between >> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >> The message "E [MSGID: 101191] >> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >> [2019-02-08 01:13:09.311554] >> pending frames: >> frame : type(1) op(LOOKUP) >> frame : type(0) op(0) >> patchset: git://git.gluster.org/glusterfs.git >> signal received: 6 >> time of crash: >> 2019-02-08 01:13:09 >> configuration details: >> argp 1 >> backtrace 1 >> dlfcn 1 >> libpthread 1 >> llistxattr 1 >> setfsid 1 >> spinlock 1 >> epoll.h 1 >> xattr.h 1 >> st_atim.tv_nsec 1 >> package-string: glusterfs 5.3 >> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >> >> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >> >> 
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >> --------- >> [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] >> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 >> (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse >> --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) >> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 1 >> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 2 >> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 3 >> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >> with index 4 >> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >> 0-_data1-client-0: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >> 0-_data1-client-1: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >> 0-_data1-client-2: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >> 0-_data1-client-3: parent translators are ready, attempting connect >> on transport >> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >> 0-_data1-client-0: changing port to 49153 (from 0) >> Final graph: >> >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii >> wrote: >> >>> I've added the lru-limit=0 parameter to the mounts, and I see it's taken >>> effect correctly: >>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>> --volfile-server=localhost --volfile-id=/ /mnt/" >>> >>> Let's see if it stops crashing or not. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii >>> wrote: >>> >>>> Hi Nithya, >>>> >>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>> crashes, and no further releases have been made yet. 
>>>> >>>> volume info: >>>> Type: Replicate >>>> Volume ID: ****SNIP**** >>>> Status: Started >>>> Snapshot Count: 0 >>>> Number of Bricks: 1 x 4 = 4 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: ****SNIP**** >>>> Brick2: ****SNIP**** >>>> Brick3: ****SNIP**** >>>> Brick4: ****SNIP**** >>>> Options Reconfigured: >>>> cluster.quorum-count: 1 >>>> cluster.quorum-type: fixed >>>> network.ping-timeout: 5 >>>> network.remote-dio: enable >>>> performance.rda-cache-limit: 256MB >>>> performance.readdir-ahead: on >>>> performance.parallel-readdir: on >>>> network.inode-lru-limit: 500000 >>>> performance.md-cache-timeout: 600 >>>> performance.cache-invalidation: on >>>> performance.stat-prefetch: on >>>> features.cache-invalidation-timeout: 600 >>>> features.cache-invalidation: on >>>> cluster.readdir-optimize: on >>>> performance.io-thread-count: 32 >>>> server.event-threads: 4 >>>> client.event-threads: 4 >>>> performance.read-ahead: off >>>> cluster.lookup-optimize: on >>>> performance.cache-size: 1GB >>>> cluster.self-heal-daemon: enable >>>> transport.address-family: inet >>>> nfs.disable: on >>>> performance.client-io-threads: on >>>> cluster.granular-entry-heal: enable >>>> cluster.data-self-heal-algorithm: full >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>> nbalacha at redhat.com> wrote: >>>> >>>>> Hi Artem, >>>>> >>>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>> looking into the crashes and will update when have a fix. >>>>> >>>>> Also, please provide the gluster volume info for the volume in >>>>> question. >>>>> >>>>> >>>>> regards, >>>>> Nithya >>>>> >>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>>> wrote: >>>>> >>>>>> The fuse crash happened two more times, but this time monit helped >>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>> >>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>> servers, and I don't know why. >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>> there any mount options that could help mitigate this? >>>>>>> >>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>>> >>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>> "glusterfs --process-name fuse" process. 
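A minimal sketch of that reproduction step, in case it helps anyone else test the recovery path (the PID and match pattern are illustrative; make sure to target the client process for the mount you actually intend to break):

pgrep -af 'glusterfs.*--process-name fuse'
kill -9 <PID-of-the-mount-to-test>

After the kill, accessing the mount point returns "Transport endpoint is not connected" until it is remounted.
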
>>>>>>> >>>>>>> >>>>>>> monit check: >>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>> >>>>>>> >>>>>>> stack trace: >>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fa0249e4329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7fa0249e4329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>> [2019-02-01 23:21:56.164427] >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>> pending frames: >>>>>>> frame : type(1) op(LOOKUP) >>>>>>> frame : type(0) op(0) >>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>> signal received: 6 >>>>>>> time of crash: >>>>>>> 2019-02-01 23:22:03 >>>>>>> configuration details: >>>>>>> argp 1 >>>>>>> backtrace 1 >>>>>>> dlfcn 1 >>>>>>> libpthread 1 >>>>>>> llistxattr 1 >>>>>>> setfsid 1 >>>>>>> spinlock 1 >>>>>>> epoll.h 1 >>>>>>> xattr.h 1 >>>>>>> st_atim.tv_nsec 1 >>>>>>> package-string: glusterfs 5.3 >>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> 
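As an aside on the mount-options question above: the lru-limit parameter discussed elsewhere in this thread can also be made persistent via /etc/fstab. A minimal sketch, with the server, volume name and mount point as placeholders rather than values from this thread:

localhost:/SITE_data1  /mnt/SITE_data1  glusterfs  defaults,_netdev,lru-limit=0  0 0
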
>>>>>>> >>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>> mounts. >>>>>>>> >>>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>>> site (apkmirror.com), and it caused a disruption for any uploads >>>>>>>> or downloads from that server until I woke up and fixed the mount. >>>>>>>> >>>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>>> >>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>> >>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>> atumball at redhat.com> wrote: >>>>>>>> >>>>>>>>> Hi Artem, >>>>>>>>> >>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>> >>>>>>>>> We already have fixes for log flooding - >>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>> >>>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? >>>>>>>>> or was there any particular pattern you observed before the crash. >>>>>>>>> >>>>>>>>> -Amar >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 
2-SITE_data1-replicate-0: >>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>> pending frames: >>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>> frame : type(0) op(0) >>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>> signal received: 6 >>>>>>>>>> time of crash: >>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>> configuration details: >>>>>>>>>> argp 1 >>>>>>>>>> backtrace 1 >>>>>>>>>> dlfcn 1 >>>>>>>>>> libpthread 1 >>>>>>>>>> llistxattr 1 >>>>>>>>>> setfsid 1 >>>>>>>>>> spinlock 1 >>>>>>>>>> epoll.h 1 >>>>>>>>>> xattr.h 1 >>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>> --------- >>>>>>>>>> >>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>> >>>>>>>>>> If it's not fixed by the patches above, has anyone already opened >>>>>>>>>> a ticket for the crashes that I can join and monitor? This is going to >>>>>>>>>> create a massive problem for us since production systems are crashing. >>>>>>>>>> >>>>>>>>>> Thanks. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. 
Many people have been >>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>> this. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may bring >>>>>>>>>>>> some additional eyeballs and get them both fixed. >>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's >>>>>>>>>>>>> a comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>>> spam. 
>>>>>>>>>>>>> >>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>>> message is logged and send a fix? >>>>>>>>>>> >>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks. >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>> Gluster-users mailing list >>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Amar Tumballi (amarts) >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From demic198 at gmail.com Fri Feb 8 23:22:26 2019 From: demic198 at gmail.com (John Quinoz) Date: Fri, 8 Feb 2019 18:22:26 -0500 Subject: [Gluster-users] Glusterfs server Message-ID: Hello Board! I was wondering if someone could help me out. I'm learning gluster and have inherited a server running it. Running the gluster peer status command on any of the four nodes does not provide any results. Each server has two sets of bonded NICs. The primary NIC is not used for the gluster peering. The secondary bond has a static route or lives on the same network as its gluster peer. How do I get the gluster CLI commands to be sourced from the secondary NIC? -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Sat Feb 9 03:22:36 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Sat, 9 Feb 2019 08:52:36 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii wrote: > Hi Nithya, > > I can try to disable write-behind as long as it doesn't heavily impact > performance for us. Which option is it exactly? I don't see it set in my > list of changed volume variables that I sent you guys earlier.
> The option is performance.write-behind > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran > wrote: > >> Hi Artem, >> >> We have found the cause of one crash. Unfortunately we have not managed >> to reproduce the one you reported so we don't know if it is the same cause. >> >> Can you disable write-behind on the volume and let us know if it solves >> the problem? If yes, it is likely to be the same issue. >> >> >> regards, >> Nithya >> >> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >> wrote: >> >>> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >>> didn't help. >>> >>> Here's the snippet of the crash and the subsequent remount by monit. >>> >>> >>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>> [0x7f4402b99329] >>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>> valid argument] >>> The message "I [MSGID: 108031] >>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>> selecting local read_child _data1-client-3" repeated 39 times between >>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>> The message "E [MSGID: 101191] >>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>> [2019-02-08 01:13:09.311554] >>> pending frames: >>> frame : type(1) op(LOOKUP) >>> frame : type(0) op(0) >>> patchset: git://git.gluster.org/glusterfs.git >>> signal received: 6 >>> time of crash: >>> 2019-02-08 01:13:09 >>> configuration details: >>> argp 1 >>> backtrace 1 >>> dlfcn 1 >>> libpthread 1 >>> llistxattr 1 >>> setfsid 1 >>> spinlock 1 >>> epoll.h 1 >>> xattr.h 1 >>> st_atim.tv_nsec 1 >>> package-string: glusterfs 5.3 >>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>> >>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>> >>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>> >>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>> --------- >>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] >>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 >>> (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>> --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) >>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>> 
[event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 1 >>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 2 >>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 3 >>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>> with index 4 >>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-0: parent translators are ready, attempting connect >>> on transport >>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-1: parent translators are ready, attempting connect >>> on transport >>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-2: parent translators are ready, attempting connect >>> on transport >>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >>> 0-_data1-client-3: parent translators are ready, attempting connect >>> on transport >>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>> 0-_data1-client-0: changing port to 49153 (from 0) >>> Final graph: >>> >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii >>> wrote: >>> >>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>> taken effect correctly: >>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>> >>>> Let's see if it stops crashing or not. >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii >>>> wrote: >>>> >>>>> Hi Nithya, >>>>> >>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>>> crashes, and no further releases have been made yet. 
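For anyone applying the same lru-limit=0 workaround, the flag does not have to be edited into the glusterfs command line by hand; it can usually be passed as a mount option. A rough sketch only, with the volume name and mount point as placeholders and assuming the installed mount.glusterfs helper forwards the option (otherwise add --lru-limit=0 wherever the mount command is generated):

  mount -t glusterfs -o lru-limit=0 localhost:/SITE_data1 /mnt/SITE_data1

or persistently via /etc/fstab:

  localhost:/SITE_data1  /mnt/SITE_data1  glusterfs  defaults,_netdev,lru-limit=0  0 0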
>>>>> >>>>> volume info: >>>>> Type: Replicate >>>>> Volume ID: ****SNIP**** >>>>> Status: Started >>>>> Snapshot Count: 0 >>>>> Number of Bricks: 1 x 4 = 4 >>>>> Transport-type: tcp >>>>> Bricks: >>>>> Brick1: ****SNIP**** >>>>> Brick2: ****SNIP**** >>>>> Brick3: ****SNIP**** >>>>> Brick4: ****SNIP**** >>>>> Options Reconfigured: >>>>> cluster.quorum-count: 1 >>>>> cluster.quorum-type: fixed >>>>> network.ping-timeout: 5 >>>>> network.remote-dio: enable >>>>> performance.rda-cache-limit: 256MB >>>>> performance.readdir-ahead: on >>>>> performance.parallel-readdir: on >>>>> network.inode-lru-limit: 500000 >>>>> performance.md-cache-timeout: 600 >>>>> performance.cache-invalidation: on >>>>> performance.stat-prefetch: on >>>>> features.cache-invalidation-timeout: 600 >>>>> features.cache-invalidation: on >>>>> cluster.readdir-optimize: on >>>>> performance.io-thread-count: 32 >>>>> server.event-threads: 4 >>>>> client.event-threads: 4 >>>>> performance.read-ahead: off >>>>> cluster.lookup-optimize: on >>>>> performance.cache-size: 1GB >>>>> cluster.self-heal-daemon: enable >>>>> transport.address-family: inet >>>>> nfs.disable: on >>>>> performance.client-io-threads: on >>>>> cluster.granular-entry-heal: enable >>>>> cluster.data-self-heal-algorithm: full >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>> nbalacha at redhat.com> wrote: >>>>> >>>>>> Hi Artem, >>>>>> >>>>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>>> looking into the crashes and will update when have a fix. >>>>>> >>>>>> Also, please provide the gluster volume info for the volume in >>>>>> question. >>>>>> >>>>>> >>>>>> regards, >>>>>> Nithya >>>>>> >>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>>>> wrote: >>>>>> >>>>>>> The fuse crash happened two more times, but this time monit helped >>>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>>> >>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>> servers, and I don't know why. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>> there any mount options that could help mitigate this? >>>>>>>> >>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) task >>>>>>>> to watch and restart the mount, which works and recovers the mount point >>>>>>>> within a minute. Not ideal, but a temporary workaround. >>>>>>>> >>>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>>> "glusterfs --process-name fuse" process. 
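When reproducing the failure that way, it helps to confirm which client process serves which mount before sending the signal. A small sketch, with the volume name as a placeholder and the match pattern purely illustrative:

  pgrep -af 'glusterfs.*--process-name fuse'    # list fuse clients with their volfile and mount arguments
  pkill -9 -f 'process-name fuse.*SITE_data1'   # kill only the client serving that mount to simulate the crash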
>>>>>>>> >>>>>>>> >>>>>>>> monit check: >>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>> >>>>>>>> >>>>>>>> stack trace: >>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fa0249e4329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7fa0249e4329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>> The message "E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>> The message "I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>> pending frames: >>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>> frame : type(0) op(0) >>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>> signal received: 6 >>>>>>>> time of crash: >>>>>>>> 2019-02-01 23:22:03 >>>>>>>> configuration details: >>>>>>>> argp 1 >>>>>>>> backtrace 1 >>>>>>>> dlfcn 1 >>>>>>>> libpthread 1 >>>>>>>> llistxattr 1 >>>>>>>> setfsid 1 >>>>>>>> spinlock 1 >>>>>>>> epoll.h 1 >>>>>>>> xattr.h 1 >>>>>>>> st_atim.tv_nsec 1 >>>>>>>> package-string: glusterfs 5.3 >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC 
>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>> mounts. >>>>>>>>> >>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>>>> site (apkmirror.com), and it caused a disruption for any uploads >>>>>>>>> or downloads from that server until I woke up and fixed the mount. >>>>>>>>> >>>>>>>>> I wish I could be more helpful but all I have is that stack trace. >>>>>>>>> >>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>> >>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Artem, >>>>>>>>>> >>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>>> >>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>> >>>>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? >>>>>>>>>> or was there any particular pattern you observed before the crash. >>>>>>>>>> >>>>>>>>>> -Amar >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fccd705b218] 
) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>> pending frames: >>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>> signal received: 6 >>>>>>>>>>> time of crash: >>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>> configuration details: >>>>>>>>>>> argp 1 >>>>>>>>>>> backtrace 1 >>>>>>>>>>> dlfcn 1 >>>>>>>>>>> libpthread 1 >>>>>>>>>>> llistxattr 1 >>>>>>>>>>> setfsid 1 >>>>>>>>>>> spinlock 1 >>>>>>>>>>> epoll.h 1 >>>>>>>>>>> xattr.h 1 >>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>> --------- >>>>>>>>>>> >>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>> >>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>> >>>>>>>>>>> Thanks. 
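On the "not too sure how to make it core dump" point, a generic approach (not specific to the openSUSE packages, and the paths below are only examples) is to raise the core size limit in the environment that starts the client and point the kernel at a writable core pattern:

  ulimit -c unlimited
  sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p.%t

On systems running systemd-coredump the dump is captured automatically and can be retrieved afterwards:

  coredumpctl list glusterfs
  coredumpctl dump <PID> -o /var/tmp/glusterfs.core

Note the ulimit only applies to processes started from that shell or service unit, so the mount may need to be restarted for it to take effect.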
>>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>>> this. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks. 
>>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's >>>>>>>>>>>>>> a comment from 3 days ago from someone else with 5.3 who started seeing the >>>>>>>>>>>>>> spam. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>>>> message is logged and send a fix? >>>>>>>>>>>> >>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>> >>>>>> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlaib01 at gmail.com Sat Feb 9 14:59:47 2019 From: jlaib01 at gmail.com (Jim Laib) Date: Sat, 9 Feb 2019 08:59:47 -0600 Subject: [Gluster-users] Client failover question In-Reply-To: <1211339554.4774.1549650816577.JavaMail.zimbra@van-royen.nl> References: <1211339554.4774.1549650816577.JavaMail.zimbra@van-royen.nl> Message-ID: I'm using a single server (gluster client) to mount a replicated gluster volume using two servers, so the single server is acting as the head to the gluster volume. My users access the gluster client head using Samba. What I'm wondering is: what's the best way to ensure continuity if my single head server takes a nap? And yes, I agree 3 is better and I plan to move down that path when I reconfigure my system. On Fri, Feb 8, 2019 at 12:51 PM Nico van Royen wrote: > Hello, > > What do you mean by a gluster 'client server' ?
> A node is either a server (running the glusterd and glusterfsd/brick > processes), and you'd need at least 2 (3 is better, and the volume to be > set as a replica-3) or a gluster client (uses the gluster-client to mount a > volume from a gluster-cluster). > You could of-course also run the 'client' on the actual gluster-server to > mount to itself (wouldn't recommend that). > > Nico > > ----- Oorspronkelijk bericht ----- > Van: "Jim Laib" > Aan: "gluster-users" > Verzonden: Vrijdag 8 februari 2019 17:04:45 > Onderwerp: [Gluster-users] Client failover question > > I may have missed this in the documentation, but is there a way to make a > Gluster client server redundant? I have several bricks setup being > controlled by one cluster client server and if that server blows, there's > quite a bit of rebuilding to do. So looking for a way to ovoid that. Any > suggestions would be most appreciated. > > Thanks in advance. > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hunter86_bg at yahoo.com Sat Feb 9 18:15:09 2019 From: hunter86_bg at yahoo.com (Strahil) Date: Sat, 09 Feb 2019 20:15:09 +0200 Subject: [Gluster-users] Client failover question In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From archon810 at gmail.com Sat Feb 9 22:17:57 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Sat, 9 Feb 2019 14:17:57 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Alright. I've enabled core-dumping (hopefully), so now I'm waiting for the next crash to see if it dumps a core for you guys to remotely debug. Then I can consider setting performance.write-behind to off and monitoring for further crashes. Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa wrote: > > > On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii > wrote: > >> Hi Nithya, >> >> I can try to disable write-behind as long as it doesn't heavily impact >> performance for us. Which option is it exactly? I don't see it set in my >> list of changed volume variables that I sent you guys earlier. >> > > The option is performance.write-behind > > >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran >> wrote: >> >>> Hi Artem, >>> >>> We have found the cause of one crash. Unfortunately we have not managed >>> to reproduce the one you reported so we don't know if it is the same cause. >>> >>> Can you disable write-behind on the volume and let us know if it solves >>> the problem? If yes, it is likely to be the same issue. 
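For reference, the write-behind test being requested here is a single volume-set operation and can be reverted the same way; a sketch with the volume name as a placeholder (on is the default value):

  gluster volume get SITE_data1 performance.write-behind      # check the current setting
  gluster volume set SITE_data1 performance.write-behind off  # disable for the test
  gluster volume set SITE_data1 performance.write-behind on   # restore once done

Existing client mounts should pick up the changed volume graph automatically, so a remount should not be needed.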
>>> >>> >>> regards, >>> Nithya >>> >>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >>> wrote: >>> >>>> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >>>> didn't help. >>>> >>>> Here's the snippet of the crash and the subsequent remount by monit. >>>> >>>> >>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>> [0x7f4402b99329] >>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>> valid argument] >>>> The message "I [MSGID: 108031] >>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>> selecting local read_child _data1-client-3" repeated 39 times between >>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>> The message "E [MSGID: 101191] >>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>> [2019-02-08 01:13:09.311554] >>>> pending frames: >>>> frame : type(1) op(LOOKUP) >>>> frame : type(0) op(0) >>>> patchset: git://git.gluster.org/glusterfs.git >>>> signal received: 6 >>>> time of crash: >>>> 2019-02-08 01:13:09 >>>> configuration details: >>>> argp 1 >>>> backtrace 1 >>>> dlfcn 1 >>>> libpthread 1 >>>> llistxattr 1 >>>> setfsid 1 >>>> spinlock 1 >>>> epoll.h 1 >>>> xattr.h 1 >>>> st_atim.tv_nsec 1 >>>> package-string: glusterfs 5.3 >>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>> >>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>> >>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>> --------- >>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] [glusterfsd.c:2715:main] >>>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 >>>> (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>> --volfile-server=localhost --volfile-id=/_data1 /mnt/_data1) >>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>> with index 1 >>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>> with index 2 >>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>> with index 3 >>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: 
Started thread >>>> with index 4 >>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >>>> 0-_data1-client-0: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >>>> 0-_data1-client-1: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >>>> 0-_data1-client-2: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >>>> 0-_data1-client-3: parent translators are ready, attempting connect >>>> on transport >>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>> Final graph: >>>> >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii >>>> wrote: >>>> >>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>> taken effect correctly: >>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>> >>>>> Let's see if it stops crashing or not. >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> Hi Nithya, >>>>>> >>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>>>> crashes, and no further releases have been made yet. >>>>>> >>>>>> volume info: >>>>>> Type: Replicate >>>>>> Volume ID: ****SNIP**** >>>>>> Status: Started >>>>>> Snapshot Count: 0 >>>>>> Number of Bricks: 1 x 4 = 4 >>>>>> Transport-type: tcp >>>>>> Bricks: >>>>>> Brick1: ****SNIP**** >>>>>> Brick2: ****SNIP**** >>>>>> Brick3: ****SNIP**** >>>>>> Brick4: ****SNIP**** >>>>>> Options Reconfigured: >>>>>> cluster.quorum-count: 1 >>>>>> cluster.quorum-type: fixed >>>>>> network.ping-timeout: 5 >>>>>> network.remote-dio: enable >>>>>> performance.rda-cache-limit: 256MB >>>>>> performance.readdir-ahead: on >>>>>> performance.parallel-readdir: on >>>>>> network.inode-lru-limit: 500000 >>>>>> performance.md-cache-timeout: 600 >>>>>> performance.cache-invalidation: on >>>>>> performance.stat-prefetch: on >>>>>> features.cache-invalidation-timeout: 600 >>>>>> features.cache-invalidation: on >>>>>> cluster.readdir-optimize: on >>>>>> performance.io-thread-count: 32 >>>>>> server.event-threads: 4 >>>>>> client.event-threads: 4 >>>>>> performance.read-ahead: off >>>>>> cluster.lookup-optimize: on >>>>>> performance.cache-size: 1GB >>>>>> cluster.self-heal-daemon: enable >>>>>> transport.address-family: inet >>>>>> nfs.disable: on >>>>>> performance.client-io-threads: on >>>>>> cluster.granular-entry-heal: enable >>>>>> cluster.data-self-heal-algorithm: full >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>> nbalacha at redhat.com> wrote: >>>>>> >>>>>>> Hi Artem, >>>>>>> >>>>>>> Do you still see the crashes with 5.3? 
If yes, please try mount the >>>>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>> looking into the crashes and will update when have a fix. >>>>>>> >>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>> question. >>>>>>> >>>>>>> >>>>>>> regards, >>>>>>> Nithya >>>>>>> >>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii >>>>>>> wrote: >>>>>>> >>>>>>>> The fuse crash happened two more times, but this time monit helped >>>>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>>>> >>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>> servers, and I don't know why. >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>>> there any mount options that could help mitigate this? >>>>>>>>> >>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>> >>>>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>> >>>>>>>>> >>>>>>>>> monit check: >>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>> >>>>>>>>> >>>>>>>>> stack trace: >>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fa0249e4329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7fa0249e4329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>> pending frames: >>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>> frame : type(0) op(0) >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>> signal received: 6 >>>>>>>>> time of crash: >>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>> 
configuration details: >>>>>>>>> argp 1 >>>>>>>>> backtrace 1 >>>>>>>>> dlfcn 1 >>>>>>>>> libpthread 1 >>>>>>>>> llistxattr 1 >>>>>>>>> setfsid 1 >>>>>>>>> spinlock 1 >>>>>>>>> epoll.h 1 >>>>>>>>> xattr.h 1 >>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>>> mounts. >>>>>>>>>> >>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty busy >>>>>>>>>> site (apkmirror.com), and it caused a disruption for any uploads >>>>>>>>>> or downloads from that server until I woke up and fixed the mount. >>>>>>>>>> >>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>> trace. >>>>>>>>>> >>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>> >>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Artem, >>>>>>>>>>> >>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>>>> >>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>> >>>>>>>>>>> Can you please tell if the crashes happened as soon as upgrade ? >>>>>>>>>>> or was there any particular pattern you observed before the crash. 
>>>>>>>>>>> >>>>>>>>>>> -Amar >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to >>>>>>>>>>>> unmount, kill gluster, and remount: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>> pending frames: >>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>> signal received: 6 >>>>>>>>>>>> time of crash: >>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>> configuration details: >>>>>>>>>>>> argp 1 >>>>>>>>>>>> backtrace 1 >>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>> libpthread 1 >>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>> setfsid 1 >>>>>>>>>>>> spinlock 1 >>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>> 
/lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>> --------- >>>>>>>>>>>> >>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>> >>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>> >>>>>>>>>>>> Thanks. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>>>> this. 
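As an aside, a rough sense of how heavy the "Failed to dispatch handler" flood is on a given client can be had by summing the suppressed-repeat lines, like the ones quoted below. A small sketch, assuming the usual /var/log/glusterfs/mnt-<mount>.log naming seen elsewhere in this thread; the log path here is only an example:

    # raw count of lines mentioning the epoll warning
    grep -c "Failed to dispatch handler" /var/log/glusterfs/mnt-SITE_data1.log

    # add up the "repeated N times" summaries for a truer total
    awk '/Failed to dispatch handler/ && /repeated/ {
             for (i = 1; i <= NF; i++) if ($i == "repeated") total += $(i + 1)
         } END { print total + 0 }' /var/log/glusterfs/mnt-SITE_data1.log

This only quantifies the flooding; the review linked above is what actually addresses it.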
>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>> seeing the spam. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> +Milind Changire Can you check why this >>>>>>>>>>>>> message is logged and send a fix? >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>> Gluster-users mailing list >>>>>>>> Gluster-users at gluster.org >>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>> >>>>>>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgurusid at redhat.com Sun Feb 10 04:53:25 2019 From: pgurusid at redhat.com (Poornima Gurusiddaiah) Date: Sun, 10 Feb 2019 10:23:25 +0530 Subject: [Gluster-users] Client failover question In-Reply-To: References: <1211339554.4774.1549650816577.JavaMail.zimbra@van-royen.nl> Message-ID: If you are referring single head server as the samba node, then have samba deployed on other server nodes and create a samba cluster using ctdb. Regards, Poornima On Sat, Feb 9, 2019, 8:31 PM Jim Laib I'm using a single server (gluster client) to mount a gluster replicated > gluster cvolume using two servers. so the single server is acting as the > head to the gluster volume. My users access the cluster client head using > Samba. What I'm wondering is. what's the best way to insure continuity if > my single head server takes a nap? > And yes, I agree 3 is better and I plan to move down that path when I > reconfigure my system. > > On Fri, Feb 8, 2019 at 12:51 PM Nico van Royen wrote: > >> Hello, >> >> What do you mean by a gluster 'client server' ? >> A node is either a server (running the glusterd and glusterfsd/brick >> processes), and you'd need at least 2 (3 is better, and the volume to be >> set as a replica-3) or a gluster client (uses the gluster-client to mount a >> volume from a gluster-cluster). >> You could of-course also run the 'client' on the actual gluster-server to >> mount to itself (wouldn't recommend that). 
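To make the ctdb suggestion above concrete: a minimal sketch of a two-node Samba/CTDB front end for an existing Gluster volume. Every name here is a placeholder (volume gv0, node and floating IPs), the CTDB recovery-lock setting is omitted, and the exact options should be checked against the Samba and CTDB versions actually deployed:

    # /etc/ctdb/nodes (identical on every Samba node) - private IPs of the nodes
    10.0.0.11
    10.0.0.12

    # /etc/ctdb/public_addresses - floating IPs handed out to SMB clients
    192.168.1.50/24 eth0
    192.168.1.51/24 eth0

    # /etc/samba/smb.conf (relevant parts only)
    [global]
        clustering = yes

    [share]
        path = /
        read only = no
        vfs objects = glusterfs
        glusterfs:volume = gv0
        glusterfs:volfile_server = localhost
        kernel share modes = no

With this layout the Samba head is no longer a single machine: if one node fails, CTDB moves its public address to a surviving node and SMB clients reconnect there. Exporting a FUSE mount instead of vfs_glusterfs works too; the key point is running Samba plus CTDB on more than one server.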
>> >> Nico >> >> ----- Oorspronkelijk bericht ----- >> Van: "Jim Laib" >> Aan: "gluster-users" >> Verzonden: Vrijdag 8 februari 2019 17:04:45 >> Onderwerp: [Gluster-users] Client failover question >> >> I may have missed this in the documentation, but is there a way to make a >> Gluster client server redundant? I have several bricks setup being >> controlled by one cluster client server and if that server blows, there's >> quite a bit of rebuilding to do. So looking for a way to ovoid that. Any >> suggestions would be most appreciated. >> >> Thanks in advance. >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From pnixon at gmail.com Mon Feb 11 01:19:58 2019 From: pnixon at gmail.com (Patrick Nixon) Date: Sun, 10 Feb 2019 20:19:58 -0500 Subject: [Gluster-users] Files on Brick not showing up in ls command Message-ID: Hello! I have an 8 node distribute volume setup. I have one node that accept files and stores them on disk, but when doing an ls, none of the files on that specific node are being returned. Can someone give some guidance on what should be the best place to start troubleshooting this? # gluster volume info Volume Name: gfs Type: Distribute Volume ID: 44c8c4f1-2dfb-4c03-9bca-d1ae4f314a78 Status: Started Snapshot Count: 0 Number of Bricks: 8 Transport-type: tcp Bricks: Brick1: gfs01:/data/brick1/gv0 Brick2: gfs02:/data/brick1/gv0 Brick3: gfs03:/data/brick1/gv0 Brick4: gfs05:/data/brick1/gv0 Brick5: gfs06:/data/brick1/gv0 Brick6: gfs07:/data/brick1/gv0 Brick7: gfs08:/data/brick1/gv0 Brick8: gfs04:/data/brick1/gv0 Options Reconfigured: cluster.min-free-disk: 10% nfs.disable: on performance.readdir-ahead: on # gluster peer status Number of Peers: 7 Hostname: gfs03 Uuid: 4a2d4deb-f8dd-49fc-a2ab-74e39dc25e20 State: Peer in Cluster (Connected) Hostname: gfs08 Uuid: 17705b3a-ed6f-4123-8e2e-4dc5ab6d807d State: Peer in Cluster (Connected) Hostname: gfs07 Uuid: dd699f55-1a27-4e51-b864-b4600d630732 State: Peer in Cluster (Connected) Hostname: gfs06 Uuid: 8eb2a965-2c1e-4a64-b5b5-b7b7136ddede State: Peer in Cluster (Connected) Hostname: gfs04 Uuid: cd866191-f767-40d0-bf7b-81ca0bc032b7 State: Peer in Cluster (Connected) Hostname: gfs02 Uuid: 6864c6ac-6ff4-423a-ae3c-f5fd25621851 State: Peer in Cluster (Connected) Hostname: gfs05 Uuid: dcecb55a-87b8-4441-ab09-b52e485e5f62 State: Peer in Cluster (Connected) All gluster nodes are running glusterfs 4.0.2 The clients accessing the files are also running glusterfs 4.0.2 Both are Ubuntu Thanks! -------------- next part -------------- An HTML attachment was scrubbed... URL: From atin.mukherjee83 at gmail.com Mon Feb 11 05:01:03 2019 From: atin.mukherjee83 at gmail.com (Atin Mukherjee) Date: Mon, 11 Feb 2019 10:31:03 +0530 Subject: [Gluster-users] Glusterfs server In-Reply-To: References: Message-ID: On Sat, 9 Feb 2019 at 04:52, John Quinoz wrote: > Hello Board! > > Was wonder is someone could help me out. > > Im learning gluster and have inherited a server running it. 
> > Running the gluster peer status cmd on any of the four nodes does not > provide any results. > > Each server has two sets of bonded nics. > > The primary nic is not used for the gluster peering. > > The secondary bond has a static route or lives on the same network as its > gluster peer. > > How do i get the gluster cli commands to be sourced from the secondary nic? > Can you probe the hosts with the secondary nic and retry. Gluster provides a mechanism to have multiple nics probed for the same host. _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- --Atin -------------- next part -------------- An HTML attachment was scrubbed... URL: From joao.bauto at neuro.fchampalimaud.org Mon Feb 11 10:18:51 2019 From: joao.bauto at neuro.fchampalimaud.org (=?UTF-8?B?Sm/Do28gQmHDunRv?=) Date: Mon, 11 Feb 2019 10:18:51 +0000 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Although I don't have these error messages, I'm having fuse crashes as frequent as you. I have disabled write-behind and the mount has been running over the weekend with heavy usage and no issues. I can provide coredumps before disabling write-behind if needed. I opened a BZ report with the crashes that I was having. *Jo?o Ba?to* --------------- *Scientific Computing and Software Platform* Champalimaud Research Champalimaud Center for the Unknown Av. Bras?lia, Doca de Pedrou?os 1400-038 Lisbon, Portugal fchampalimaud.org Artem Russakovskii escreveu no dia s?bado, 9/02/2019 ?(s) 22:18: > Alright. I've enabled core-dumping (hopefully), so now I'm waiting for the > next crash to see if it dumps a core for you guys to remotely debug. > > Then I can consider setting performance.write-behind to off and monitoring > for further crashes. > > Sincerely, > Artem > > -- > Founder, Android Police , APK Mirror > , Illogical Robot LLC > beerpla.net | +ArtemRussakovskii > | @ArtemR > > > > On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa > wrote: > >> >> >> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii >> wrote: >> >>> Hi Nithya, >>> >>> I can try to disable write-behind as long as it doesn't heavily impact >>> performance for us. Which option is it exactly? I don't see it set in my >>> list of changed volume variables that I sent you guys earlier. >>> >> >> The option is performance.write-behind >> >> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran >>> wrote: >>> >>>> Hi Artem, >>>> >>>> We have found the cause of one crash. Unfortunately we have not managed >>>> to reproduce the one you reported so we don't know if it is the same cause. >>>> >>>> Can you disable write-behind on the volume and let us know if it solves >>>> the problem? If yes, it is likely to be the same issue. 
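For anyone following along, toggling that option is a single volume-set command on any server node; a quick sketch, using SITE_data1 purely as a placeholder volume name:

    # check the current value
    gluster volume get SITE_data1 performance.write-behind

    # turn write-behind off while the crash is being investigated
    gluster volume set SITE_data1 performance.write-behind off

    # turn it back on once a fixed release is in place
    gluster volume set SITE_data1 performance.write-behind on

Existing fuse mounts should pick up the changed client graph on their own, without a remount.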
>>>> >>>> >>>> regards, >>>> Nithya >>>> >>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >>>> wrote: >>>> >>>>> Sorry to disappoint, but the crash just happened again, so lru-limit=0 >>>>> didn't help. >>>>> >>>>> Here's the snippet of the crash and the subsequent remount by monit. >>>>> >>>>> >>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>> [0x7f4402b99329] >>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>> valid argument] >>>>> The message "I [MSGID: 108031] >>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>> The message "E [MSGID: 101191] >>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>> [2019-02-08 01:13:09.311554] >>>>> pending frames: >>>>> frame : type(1) op(LOOKUP) >>>>> frame : type(0) op(0) >>>>> patchset: git://git.gluster.org/glusterfs.git >>>>> signal received: 6 >>>>> time of crash: >>>>> 2019-02-08 01:13:09 >>>>> configuration details: >>>>> argp 1 >>>>> backtrace 1 >>>>> dlfcn 1 >>>>> libpthread 1 >>>>> llistxattr 1 >>>>> setfsid 1 >>>>> spinlock 1 >>>>> epoll.h 1 >>>>> xattr.h 1 >>>>> st_atim.tv_nsec 1 >>>>> package-string: glusterfs 5.3 >>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>> >>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>> >>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>> --------- >>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>> /mnt/_data1) >>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>> with index 1 >>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>> with index 2 >>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>> with index 3 >>>>> [2019-02-08 
01:13:35.651747] I [MSGID: 101190] >>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>> with index 4 >>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >>>>> 0-_data1-client-0: parent translators are ready, attempting connect >>>>> on transport >>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >>>>> 0-_data1-client-1: parent translators are ready, attempting connect >>>>> on transport >>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >>>>> 0-_data1-client-2: parent translators are ready, attempting connect >>>>> on transport >>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >>>>> 0-_data1-client-3: parent translators are ready, attempting connect >>>>> on transport >>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>>> Final graph: >>>>> >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii >>>>> wrote: >>>>> >>>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>>> taken effect correctly: >>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>> >>>>>> Let's see if it stops crashing or not. >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> Hi Nithya, >>>>>>> >>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>>>>> crashes, and no further releases have been made yet. 
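For reference, the lru-limit=0 setting mentioned above can also be carried as a mount option, as Nithya suggested, so it survives remounts. A sketch with placeholder volume and mount-point names, assuming the mount.glusterfs helper in this build passes the option through; the running process should then show --lru-limit=0 as it does above:

    # /etc/fstab entry (placeholders: volume SITE_data1, mount point /mnt/SITE_data1)
    localhost:/SITE_data1  /mnt/SITE_data1  glusterfs  defaults,_netdev,lru-limit=0  0 0

    # or an ad-hoc mount
    mount -t glusterfs -o lru-limit=0 localhost:/SITE_data1 /mnt/SITE_data1

    # confirm the option took effect (bracket trick avoids matching the grep itself)
    ps ax | grep '[l]ru-limit'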
>>>>>>> >>>>>>> volume info: >>>>>>> Type: Replicate >>>>>>> Volume ID: ****SNIP**** >>>>>>> Status: Started >>>>>>> Snapshot Count: 0 >>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>> Transport-type: tcp >>>>>>> Bricks: >>>>>>> Brick1: ****SNIP**** >>>>>>> Brick2: ****SNIP**** >>>>>>> Brick3: ****SNIP**** >>>>>>> Brick4: ****SNIP**** >>>>>>> Options Reconfigured: >>>>>>> cluster.quorum-count: 1 >>>>>>> cluster.quorum-type: fixed >>>>>>> network.ping-timeout: 5 >>>>>>> network.remote-dio: enable >>>>>>> performance.rda-cache-limit: 256MB >>>>>>> performance.readdir-ahead: on >>>>>>> performance.parallel-readdir: on >>>>>>> network.inode-lru-limit: 500000 >>>>>>> performance.md-cache-timeout: 600 >>>>>>> performance.cache-invalidation: on >>>>>>> performance.stat-prefetch: on >>>>>>> features.cache-invalidation-timeout: 600 >>>>>>> features.cache-invalidation: on >>>>>>> cluster.readdir-optimize: on >>>>>>> performance.io-thread-count: 32 >>>>>>> server.event-threads: 4 >>>>>>> client.event-threads: 4 >>>>>>> performance.read-ahead: off >>>>>>> cluster.lookup-optimize: on >>>>>>> performance.cache-size: 1GB >>>>>>> cluster.self-heal-daemon: enable >>>>>>> transport.address-family: inet >>>>>>> nfs.disable: on >>>>>>> performance.client-io-threads: on >>>>>>> cluster.granular-entry-heal: enable >>>>>>> cluster.data-self-heal-algorithm: full >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>> nbalacha at redhat.com> wrote: >>>>>>> >>>>>>>> Hi Artem, >>>>>>>> >>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount the >>>>>>>> volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>> >>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>> question. >>>>>>>> >>>>>>>> >>>>>>>> regards, >>>>>>>> Nithya >>>>>>>> >>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> The fuse crash happened two more times, but this time monit helped >>>>>>>>> recover within 1 minute, so it's a great workaround for now. >>>>>>>>> >>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>> servers, and I don't know why. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>>>> there any mount options that could help mitigate this? >>>>>>>>>> >>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>>> >>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>>>>> "glusterfs --process-name fuse" process. 
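A minimal sketch of that reproduce-and-recover test, with the mount point as a placeholder (the monit stanza it exercises is quoted just below):

    # find the fuse client process backing the mount (-f matches the full command line)
    pgrep -af 'glusterfs.*--process-name fuse.*glusterfs_data1'

    # kill it hard; the mount should now fail with "Transport endpoint is not connected"
    pkill -9 -f 'glusterfs.*--process-name fuse.*glusterfs_data1'
    ls /mnt/glusterfs_data1

    # after monit's next cycle the mount should be back
    sleep 60; mount | grep glusterfs_data1 && ls /mnt/glusterfs_data1 > /dev/null && echo recovered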
>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> monit check: >>>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> stack trace: >>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>> pending frames: >>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>> frame : type(0) op(0) >>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>> signal received: 6 >>>>>>>>>> time of crash: >>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>> configuration details: >>>>>>>>>> argp 1 >>>>>>>>>> backtrace 1 >>>>>>>>>> dlfcn 1 >>>>>>>>>> libpthread 1 >>>>>>>>>> llistxattr 1 >>>>>>>>>> setfsid 1 >>>>>>>>>> spinlock 1 >>>>>>>>>> epoll.h 1 >>>>>>>>>> xattr.h 1 >>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>> 
/lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>> mounts. >>>>>>>>>>> >>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for any >>>>>>>>>>> uploads or downloads from that server until I woke up and fixed the mount. >>>>>>>>>>> >>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>> trace. >>>>>>>>>>> >>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>> >>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Artem, >>>>>>>>>>>> >>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>> >>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>> >>>>>>>>>>>> Can you please tell if the crashes happened as soon as upgrade >>>>>>>>>>>> ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>> >>>>>>>>>>>> -Amar >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had >>>>>>>>>>>>> to unmount, kill gluster, and remount: >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>> pending frames: >>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>> time of crash: >>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>> configuration details: >>>>>>>>>>>>> argp 1 >>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] 
>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>> >>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>> --------- >>>>>>>>>>>>> >>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>> >>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>> >>>>>>>>>>>>> Thanks. >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>>>>> this. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>> seeing the spam. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>> +Milind Changire Can you check why >>>>>>>>>>>>>> this message is logged and send a fix? >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-users mailing list >>>>>>>>> Gluster-users at gluster.org >>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>> >>>>>>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From vbellur at redhat.com Tue Feb 12 00:34:49 2019 From: vbellur at redhat.com (Vijay Bellur) Date: Mon, 11 Feb 2019 16:34:49 -0800 Subject: [Gluster-users] Files on Brick not showing up in ls command In-Reply-To: References: Message-ID: On Sun, Feb 10, 2019 at 5:20 PM Patrick Nixon wrote: > Hello! > > I have an 8 node distribute volume setup. I have one node that accept > files and stores them on disk, but when doing an ls, none of the files on > that specific node are being returned. > > Can someone give some guidance on what should be the best place to start > troubleshooting this? > Are the files being written from a glusterfs mount? If so, it might be worth checking if the network connectivity is fine between the client (that does ls) and the server/brick that contains these files. You could look up the client log file to check if there are any messages related to rpc disconnections. 
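A few concrete checks along those lines for the missing-files case, with placeholder paths (the fuse log name follows the usual mount-path-based convention, and gfs is the volume from the listing above):

    # on the client: look for brick disconnect/reconnect messages in the fuse log
    grep -iE 'disconnected from|connected to' /var/log/glusterfs/mnt-gfs.log | tail -20

    # confirm all eight bricks are online and reachable
    gluster volume status gfs

    # a file present on a brick but absent from ls may still resolve by exact name;
    # if it does, directory listing from that brick is the problem, not the file itself
    stat /mnt/gfs/path/to/file

    # ask the mount which brick DHT places the file on
    getfattr -n trusted.glusterfs.pathinfo /mnt/gfs/path/to/file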
Regards, Vijay > # gluster volume info > > Volume Name: gfs > Type: Distribute > Volume ID: 44c8c4f1-2dfb-4c03-9bca-d1ae4f314a78 > Status: Started > Snapshot Count: 0 > Number of Bricks: 8 > Transport-type: tcp > Bricks: > Brick1: gfs01:/data/brick1/gv0 > Brick2: gfs02:/data/brick1/gv0 > Brick3: gfs03:/data/brick1/gv0 > Brick4: gfs05:/data/brick1/gv0 > Brick5: gfs06:/data/brick1/gv0 > Brick6: gfs07:/data/brick1/gv0 > Brick7: gfs08:/data/brick1/gv0 > Brick8: gfs04:/data/brick1/gv0 > Options Reconfigured: > cluster.min-free-disk: 10% > nfs.disable: on > performance.readdir-ahead: on > > # gluster peer status > Number of Peers: 7 > Hostname: gfs03 > Uuid: 4a2d4deb-f8dd-49fc-a2ab-74e39dc25e20 > State: Peer in Cluster (Connected) > Hostname: gfs08 > Uuid: 17705b3a-ed6f-4123-8e2e-4dc5ab6d807d > State: Peer in Cluster (Connected) > Hostname: gfs07 > Uuid: dd699f55-1a27-4e51-b864-b4600d630732 > State: Peer in Cluster (Connected) > Hostname: gfs06 > Uuid: 8eb2a965-2c1e-4a64-b5b5-b7b7136ddede > State: Peer in Cluster (Connected) > Hostname: gfs04 > Uuid: cd866191-f767-40d0-bf7b-81ca0bc032b7 > State: Peer in Cluster (Connected) > Hostname: gfs02 > Uuid: 6864c6ac-6ff4-423a-ae3c-f5fd25621851 > State: Peer in Cluster (Connected) > Hostname: gfs05 > Uuid: dcecb55a-87b8-4441-ab09-b52e485e5f62 > State: Peer in Cluster (Connected) > > All gluster nodes are running glusterfs 4.0.2 > The clients accessing the files are also running glusterfs 4.0.2 > Both are Ubuntu > > Thanks! > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From pnixon at gmail.com Tue Feb 12 02:58:57 2019 From: pnixon at gmail.com (Patrick Nixon) Date: Mon, 11 Feb 2019 21:58:57 -0500 Subject: [Gluster-users] Files on Brick not showing up in ls command In-Reply-To: References: Message-ID: The files are being written to via the glusterfs mount (and read on the same client and a different client). I try not to do anything on the nodes directly because I understand that can cause weirdness. As far as I can tell, there haven't been any network disconnections, but I'll review the client log to see if there any indication. I don't recall any issues last time I was in there. Thanks for the response! On Mon, Feb 11, 2019 at 7:35 PM Vijay Bellur wrote: > > > On Sun, Feb 10, 2019 at 5:20 PM Patrick Nixon wrote: > >> Hello! >> >> I have an 8 node distribute volume setup. I have one node that accept >> files and stores them on disk, but when doing an ls, none of the files on >> that specific node are being returned. >> >> Can someone give some guidance on what should be the best place to start >> troubleshooting this? >> > > > Are the files being written from a glusterfs mount? If so, it might be > worth checking if the network connectivity is fine between the client (that > does ls) and the server/brick that contains these files. You could look up > the client log file to check if there are any messages related to > rpc disconnections. 
> > Regards, > Vijay > > >> # gluster volume info >> >> Volume Name: gfs >> Type: Distribute >> Volume ID: 44c8c4f1-2dfb-4c03-9bca-d1ae4f314a78 >> Status: Started >> Snapshot Count: 0 >> Number of Bricks: 8 >> Transport-type: tcp >> Bricks: >> Brick1: gfs01:/data/brick1/gv0 >> Brick2: gfs02:/data/brick1/gv0 >> Brick3: gfs03:/data/brick1/gv0 >> Brick4: gfs05:/data/brick1/gv0 >> Brick5: gfs06:/data/brick1/gv0 >> Brick6: gfs07:/data/brick1/gv0 >> Brick7: gfs08:/data/brick1/gv0 >> Brick8: gfs04:/data/brick1/gv0 >> Options Reconfigured: >> cluster.min-free-disk: 10% >> nfs.disable: on >> performance.readdir-ahead: on >> >> # gluster peer status >> Number of Peers: 7 >> Hostname: gfs03 >> Uuid: 4a2d4deb-f8dd-49fc-a2ab-74e39dc25e20 >> State: Peer in Cluster (Connected) >> Hostname: gfs08 >> Uuid: 17705b3a-ed6f-4123-8e2e-4dc5ab6d807d >> State: Peer in Cluster (Connected) >> Hostname: gfs07 >> Uuid: dd699f55-1a27-4e51-b864-b4600d630732 >> State: Peer in Cluster (Connected) >> Hostname: gfs06 >> Uuid: 8eb2a965-2c1e-4a64-b5b5-b7b7136ddede >> State: Peer in Cluster (Connected) >> Hostname: gfs04 >> Uuid: cd866191-f767-40d0-bf7b-81ca0bc032b7 >> State: Peer in Cluster (Connected) >> Hostname: gfs02 >> Uuid: 6864c6ac-6ff4-423a-ae3c-f5fd25621851 >> State: Peer in Cluster (Connected) >> Hostname: gfs05 >> Uuid: dcecb55a-87b8-4441-ab09-b52e485e5f62 >> State: Peer in Cluster (Connected) >> >> All gluster nodes are running glusterfs 4.0.2 >> The clients accessing the files are also running glusterfs 4.0.2 >> Both are Ubuntu >> >> Thanks! >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Tue Feb 12 03:19:30 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 12 Feb 2019 08:49:30 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: On Mon, Feb 11, 2019 at 3:49 PM Jo?o Ba?to < joao.bauto at neuro.fchampalimaud.org> wrote: > Although I don't have these error messages, I'm having fuse crashes as > frequent as you. I have disabled write-behind and the mount has been > running over the weekend with heavy usage and no issues. > The issue you are facing will likely be fixed by patch [1]. Me, Xavi and Nithya were able to identify the corruption in write-behind. [1] https://review.gluster.org/22189 > I can provide coredumps before disabling write-behind if needed. I opened > a BZ report with > the crashes that I was having. > > *Jo?o Ba?to* > --------------- > > *Scientific Computing and Software Platform* > Champalimaud Research > Champalimaud Center for the Unknown > Av. Bras?lia, Doca de Pedrou?os > 1400-038 Lisbon, Portugal > fchampalimaud.org > > > Artem Russakovskii escreveu no dia s?bado, > 9/02/2019 ?(s) 22:18: > >> Alright. I've enabled core-dumping (hopefully), so now I'm waiting for >> the next crash to see if it dumps a core for you guys to remotely debug. >> >> Then I can consider setting performance.write-behind to off and >> monitoring for further crashes. 
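For anyone else trying to capture a usable core from the fuse client, a generic sketch; distribution specifics vary, systems running systemd-coredump already collect these via coredumpctl, and the matching glusterfs debuginfo packages are needed for readable backtraces:

    # allow unlimited core size in the shell that will (re)mount the volume
    ulimit -c unlimited

    # write cores to a known location (the pattern is just an example)
    mkdir -p /var/crash
    sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t

    # remount the volume from that shell; after the next crash, inspect the core
    ls -l /var/crash/
    gdb -batch -ex 'thread apply all bt full' /usr/sbin/glusterfs /var/crash/core.glusterfs.<pid>.<time>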
>> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii >>> wrote: >>> >>>> Hi Nithya, >>>> >>>> I can try to disable write-behind as long as it doesn't heavily impact >>>> performance for us. Which option is it exactly? I don't see it set in my >>>> list of changed volume variables that I sent you guys earlier. >>>> >>> >>> The option is performance.write-behind >>> >>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran >>>> wrote: >>>> >>>>> Hi Artem, >>>>> >>>>> We have found the cause of one crash. Unfortunately we have not >>>>> managed to reproduce the one you reported so we don't know if it is the >>>>> same cause. >>>>> >>>>> Can you disable write-behind on the volume and let us know if it >>>>> solves the problem? If yes, it is likely to be the same issue. >>>>> >>>>> >>>>> regards, >>>>> Nithya >>>>> >>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >>>>> wrote: >>>>> >>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>> lru-limit=0 didn't help. >>>>>> >>>>>> Here's the snippet of the crash and the subsequent remount by monit. >>>>>> >>>>>> >>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7f4402b99329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>> valid argument] >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>> [2019-02-08 01:13:09.311554] >>>>>> pending frames: >>>>>> frame : type(1) op(LOOKUP) >>>>>> frame : type(0) op(0) >>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>> signal received: 6 >>>>>> time of crash: >>>>>> 2019-02-08 01:13:09 >>>>>> configuration details: >>>>>> argp 1 >>>>>> backtrace 1 >>>>>> dlfcn 1 >>>>>> libpthread 1 >>>>>> llistxattr 1 >>>>>> setfsid 1 >>>>>> spinlock 1 >>>>>> epoll.h 1 >>>>>> xattr.h 1 >>>>>> st_atim.tv_nsec 1 >>>>>> package-string: glusterfs 5.3 >>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>> >>>>>> 
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>> --------- >>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>>> /mnt/_data1) >>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 1 >>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 2 >>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 3 >>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 4 >>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 0-_data1-client-0: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 0-_data1-client-1: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 0-_data1-client-2: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 0-_data1-client-3: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>>>> Final graph: >>>>>> >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>>>> taken effect correctly: >>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>>> >>>>>>> Let's see if it stops crashing or not. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Hi Nithya, >>>>>>>> >>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>>>>>> crashes, and no further releases have been made yet. 
>>>>>>>> >>>>>>>> volume info: >>>>>>>> Type: Replicate >>>>>>>> Volume ID: ****SNIP**** >>>>>>>> Status: Started >>>>>>>> Snapshot Count: 0 >>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>> Transport-type: tcp >>>>>>>> Bricks: >>>>>>>> Brick1: ****SNIP**** >>>>>>>> Brick2: ****SNIP**** >>>>>>>> Brick3: ****SNIP**** >>>>>>>> Brick4: ****SNIP**** >>>>>>>> Options Reconfigured: >>>>>>>> cluster.quorum-count: 1 >>>>>>>> cluster.quorum-type: fixed >>>>>>>> network.ping-timeout: 5 >>>>>>>> network.remote-dio: enable >>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>> performance.readdir-ahead: on >>>>>>>> performance.parallel-readdir: on >>>>>>>> network.inode-lru-limit: 500000 >>>>>>>> performance.md-cache-timeout: 600 >>>>>>>> performance.cache-invalidation: on >>>>>>>> performance.stat-prefetch: on >>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>> features.cache-invalidation: on >>>>>>>> cluster.readdir-optimize: on >>>>>>>> performance.io-thread-count: 32 >>>>>>>> server.event-threads: 4 >>>>>>>> client.event-threads: 4 >>>>>>>> performance.read-ahead: off >>>>>>>> cluster.lookup-optimize: on >>>>>>>> performance.cache-size: 1GB >>>>>>>> cluster.self-heal-daemon: enable >>>>>>>> transport.address-family: inet >>>>>>>> nfs.disable: on >>>>>>>> performance.client-io-threads: on >>>>>>>> cluster.granular-entry-heal: enable >>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>> >>>>>>>>> Hi Artem, >>>>>>>>> >>>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount >>>>>>>>> the volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>>> >>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>> question. >>>>>>>>> >>>>>>>>> >>>>>>>>> regards, >>>>>>>>> Nithya >>>>>>>>> >>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>> >>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>> servers, and I don't know why. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>>>>> there any mount options that could help mitigate this? >>>>>>>>>>> >>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>>>> >>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>> "glusterfs --process-name fuse" process. 
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> monit check: >>>>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> stack trace: >>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>> pending frames: >>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>> signal received: 6 >>>>>>>>>>> time of crash: >>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>> configuration details: >>>>>>>>>>> argp 1 >>>>>>>>>>> backtrace 1 >>>>>>>>>>> dlfcn 1 >>>>>>>>>>> libpthread 1 >>>>>>>>>>> llistxattr 1 >>>>>>>>>>> setfsid 1 >>>>>>>>>>> spinlock 1 >>>>>>>>>>> epoll.h 1 >>>>>>>>>>> xattr.h 1 >>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>> 
/lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>> mounts. >>>>>>>>>>>> >>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for any >>>>>>>>>>>> uploads or downloads from that server until I woke up and fixed the mount. >>>>>>>>>>>> >>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>> trace. >>>>>>>>>>>> >>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>> >>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>> >>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>> >>>>>>>>>>>>> Can you please tell if the crashes happened as soon as upgrade >>>>>>>>>>>>> ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>> >>>>>>>>>>>>> -Amar >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had >>>>>>>>>>>>>> to unmount, kill gluster, and remount: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>> >>>>>>>>>>>>>> 
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>> --------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>>>>>> this. 
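Regarding the earlier question in this message about how to make the fuse client core dump, a hedged sketch of one common approach (the core_pattern path and the systemd-coredump alternative are assumptions, not confirmed in this thread):

  # write cores to a known location
  sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p.%t
  # raise the core limit of the already-running fuse client
  prlimit --core=unlimited:unlimited --pid <fuse-client-pid>
  # on systemd systems with systemd-coredump installed, crashes can instead be listed with:
  # coredumpctl list glusterfs
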
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>> seeing the spam. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> +Milind Changire Can you check why >>>>>>>>>>>>>>> this message is logged and send a fix? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>> Gluster-users mailing list >>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>> >>>>>>>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Tue Feb 12 03:32:04 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 12 Feb 2019 09:02:04 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: On Mon, Feb 11, 2019 at 3:49 PM Jo?o Ba?to < joao.bauto at neuro.fchampalimaud.org> wrote: > Although I don't have these error messages, I'm having fuse crashes as > frequent as you. I have disabled write-behind and the mount has been > running over the weekend with heavy usage and no issues. > > I can provide coredumps before disabling write-behind if needed. I opened > a BZ report with > the crashes that I was having. > I've created a bug and marked it as a blocker for release-6. I've marked bz 1671014 as a duplicate of this bug report on master. If you disagree about the bug you filed being a duplicate, please reopen. 
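For reference, the write-behind workaround mentioned just above is a volume-level toggle; a sketch of the usual commands, assuming a volume named SITE_data1 (substitute the real volume name):

  # show the current setting
  gluster volume get SITE_data1 performance.write-behind
  # disable write-behind as a workaround
  gluster volume set SITE_data1 performance.write-behind off
  # re-enable once a release with the fix is installed
  gluster volume set SITE_data1 performance.write-behind on

Clients normally pick up the resulting graph change on the fly; remounting afterwards is a conservative way to confirm it took effect.
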
> *Jo?o Ba?to* > --------------- > > *Scientific Computing and Software Platform* > Champalimaud Research > Champalimaud Center for the Unknown > Av. Bras?lia, Doca de Pedrou?os > 1400-038 Lisbon, Portugal > fchampalimaud.org > > > Artem Russakovskii escreveu no dia s?bado, > 9/02/2019 ?(s) 22:18: > >> Alright. I've enabled core-dumping (hopefully), so now I'm waiting for >> the next crash to see if it dumps a core for you guys to remotely debug. >> >> Then I can consider setting performance.write-behind to off and >> monitoring for further crashes. >> >> Sincerely, >> Artem >> >> -- >> Founder, Android Police , APK Mirror >> , Illogical Robot LLC >> beerpla.net | +ArtemRussakovskii >> | @ArtemR >> >> >> >> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii >>> wrote: >>> >>>> Hi Nithya, >>>> >>>> I can try to disable write-behind as long as it doesn't heavily impact >>>> performance for us. Which option is it exactly? I don't see it set in my >>>> list of changed volume variables that I sent you guys earlier. >>>> >>> >>> The option is performance.write-behind >>> >>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran >>>> wrote: >>>> >>>>> Hi Artem, >>>>> >>>>> We have found the cause of one crash. Unfortunately we have not >>>>> managed to reproduce the one you reported so we don't know if it is the >>>>> same cause. >>>>> >>>>> Can you disable write-behind on the volume and let us know if it >>>>> solves the problem? If yes, it is likely to be the same issue. >>>>> >>>>> >>>>> regards, >>>>> Nithya >>>>> >>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >>>>> wrote: >>>>> >>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>> lru-limit=0 didn't help. >>>>>> >>>>>> Here's the snippet of the crash and the subsequent remount by monit. 
>>>>>> >>>>>> >>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>> [0x7f4402b99329] >>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>> valid argument] >>>>>> The message "I [MSGID: 108031] >>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>> The message "E [MSGID: 101191] >>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>> [2019-02-08 01:13:09.311554] >>>>>> pending frames: >>>>>> frame : type(1) op(LOOKUP) >>>>>> frame : type(0) op(0) >>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>> signal received: 6 >>>>>> time of crash: >>>>>> 2019-02-08 01:13:09 >>>>>> configuration details: >>>>>> argp 1 >>>>>> backtrace 1 >>>>>> dlfcn 1 >>>>>> libpthread 1 >>>>>> llistxattr 1 >>>>>> setfsid 1 >>>>>> spinlock 1 >>>>>> epoll.h 1 >>>>>> xattr.h 1 >>>>>> st_atim.tv_nsec 1 >>>>>> package-string: glusterfs 5.3 >>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>> >>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>> --------- >>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>>> /mnt/_data1) >>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 1 >>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 2 >>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 3 >>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>> with index 4 >>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 
0-_data1-client-0: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 0-_data1-client-1: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 0-_data1-client-2: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] [client.c:2354:notify] >>>>>> 0-_data1-client-3: parent translators are ready, attempting connect >>>>>> on transport >>>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>>>> Final graph: >>>>>> >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>>>> taken effect correctly: >>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>>> >>>>>>> Let's see if it stops crashing or not. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Hi Nithya, >>>>>>>> >>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing >>>>>>>> crashes, and no further releases have been made yet. 
>>>>>>>> >>>>>>>> volume info: >>>>>>>> Type: Replicate >>>>>>>> Volume ID: ****SNIP**** >>>>>>>> Status: Started >>>>>>>> Snapshot Count: 0 >>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>> Transport-type: tcp >>>>>>>> Bricks: >>>>>>>> Brick1: ****SNIP**** >>>>>>>> Brick2: ****SNIP**** >>>>>>>> Brick3: ****SNIP**** >>>>>>>> Brick4: ****SNIP**** >>>>>>>> Options Reconfigured: >>>>>>>> cluster.quorum-count: 1 >>>>>>>> cluster.quorum-type: fixed >>>>>>>> network.ping-timeout: 5 >>>>>>>> network.remote-dio: enable >>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>> performance.readdir-ahead: on >>>>>>>> performance.parallel-readdir: on >>>>>>>> network.inode-lru-limit: 500000 >>>>>>>> performance.md-cache-timeout: 600 >>>>>>>> performance.cache-invalidation: on >>>>>>>> performance.stat-prefetch: on >>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>> features.cache-invalidation: on >>>>>>>> cluster.readdir-optimize: on >>>>>>>> performance.io-thread-count: 32 >>>>>>>> server.event-threads: 4 >>>>>>>> client.event-threads: 4 >>>>>>>> performance.read-ahead: off >>>>>>>> cluster.lookup-optimize: on >>>>>>>> performance.cache-size: 1GB >>>>>>>> cluster.self-heal-daemon: enable >>>>>>>> transport.address-family: inet >>>>>>>> nfs.disable: on >>>>>>>> performance.client-io-threads: on >>>>>>>> cluster.granular-entry-heal: enable >>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>> >>>>>>>>> Hi Artem, >>>>>>>>> >>>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount >>>>>>>>> the volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>>> >>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>> question. >>>>>>>>> >>>>>>>>> >>>>>>>>> regards, >>>>>>>>> Nithya >>>>>>>>> >>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>> >>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>> servers, and I don't know why. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>>>>> there any mount options that could help mitigate this? >>>>>>>>>>> >>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>>>> >>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is not >>>>>>>>>>> connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>> "glusterfs --process-name fuse" process. 
>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> monit check: >>>>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> stack trace: >>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>> pending frames: >>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>> signal received: 6 >>>>>>>>>>> time of crash: >>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>> configuration details: >>>>>>>>>>> argp 1 >>>>>>>>>>> backtrace 1 >>>>>>>>>>> dlfcn 1 >>>>>>>>>>> libpthread 1 >>>>>>>>>>> llistxattr 1 >>>>>>>>>>> setfsid 1 >>>>>>>>>>> spinlock 1 >>>>>>>>>>> epoll.h 1 >>>>>>>>>>> xattr.h 1 >>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>> 
/lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>> mounts. >>>>>>>>>>>> >>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for any >>>>>>>>>>>> uploads or downloads from that server until I woke up and fixed the mount. >>>>>>>>>>>> >>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>> trace. >>>>>>>>>>>> >>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>> >>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>> >>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>> >>>>>>>>>>>>> Can you please tell if the crashes happened as soon as upgrade >>>>>>>>>>>>> ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>> >>>>>>>>>>>>> -Amar >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had >>>>>>>>>>>>>> to unmount, kill gluster, and remount: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>> >>>>>>>>>>>>>> 
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>> --------- >>>>>>>>>>>>>> >>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>> >>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ addresses >>>>>>>>>>>>>>> this. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>> seeing the spam. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> +Milind Changire Can you check why >>>>>>>>>>>>>>> this message is logged and send a fix? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>> Gluster-users mailing list >>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>> >>>>>>>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From archon810 at gmail.com Tue Feb 12 04:53:53 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Mon, 11 Feb 2019 20:53:53 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Great job identifying the issue! Any ETA on the next release with the logging and crash fixes in it? On Mon, Feb 11, 2019, 7:19 PM Raghavendra Gowdappa wrote: > > > On Mon, Feb 11, 2019 at 3:49 PM Jo?o Ba?to < > joao.bauto at neuro.fchampalimaud.org> wrote: > >> Although I don't have these error messages, I'm having fuse crashes as >> frequent as you. I have disabled write-behind and the mount has been >> running over the weekend with heavy usage and no issues. >> > > The issue you are facing will likely be fixed by patch [1]. Me, Xavi and > Nithya were able to identify the corruption in write-behind. 
> > [1] https://review.gluster.org/22189 > > >> I can provide coredumps before disabling write-behind if needed. I opened >> a BZ report with >> the crashes that I was having. >> >> *Jo?o Ba?to* >> --------------- >> >> *Scientific Computing and Software Platform* >> Champalimaud Research >> Champalimaud Center for the Unknown >> Av. Bras?lia, Doca de Pedrou?os >> 1400-038 Lisbon, Portugal >> fchampalimaud.org >> >> >> Artem Russakovskii escreveu no dia s?bado, >> 9/02/2019 ?(s) 22:18: >> >>> Alright. I've enabled core-dumping (hopefully), so now I'm waiting for >>> the next crash to see if it dumps a core for you guys to remotely debug. >>> >>> Then I can consider setting performance.write-behind to off and >>> monitoring for further crashes. >>> >>> Sincerely, >>> Artem >>> >>> -- >>> Founder, Android Police , APK Mirror >>> , Illogical Robot LLC >>> beerpla.net | +ArtemRussakovskii >>> | @ArtemR >>> >>> >>> >>> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa >>> wrote: >>> >>>> >>>> >>>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii >>>> wrote: >>>> >>>>> Hi Nithya, >>>>> >>>>> I can try to disable write-behind as long as it doesn't heavily impact >>>>> performance for us. Which option is it exactly? I don't see it set in my >>>>> list of changed volume variables that I sent you guys earlier. >>>>> >>>> >>>> The option is performance.write-behind >>>> >>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran < >>>>> nbalacha at redhat.com> wrote: >>>>> >>>>>> Hi Artem, >>>>>> >>>>>> We have found the cause of one crash. Unfortunately we have not >>>>>> managed to reproduce the one you reported so we don't know if it is the >>>>>> same cause. >>>>>> >>>>>> Can you disable write-behind on the volume and let us know if it >>>>>> solves the problem? If yes, it is likely to be the same issue. >>>>>> >>>>>> >>>>>> regards, >>>>>> Nithya >>>>>> >>>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >>>>>> wrote: >>>>>> >>>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>>> lru-limit=0 didn't help. >>>>>>> >>>>>>> Here's the snippet of the crash and the subsequent remount by monit. 
>>>>>>> >>>>>>> >>>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>> [0x7f4402b99329] >>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>>> valid argument] >>>>>>> The message "I [MSGID: 108031] >>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>>> The message "E [MSGID: 101191] >>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>>> [2019-02-08 01:13:09.311554] >>>>>>> pending frames: >>>>>>> frame : type(1) op(LOOKUP) >>>>>>> frame : type(0) op(0) >>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>> signal received: 6 >>>>>>> time of crash: >>>>>>> 2019-02-08 01:13:09 >>>>>>> configuration details: >>>>>>> argp 1 >>>>>>> backtrace 1 >>>>>>> dlfcn 1 >>>>>>> libpthread 1 >>>>>>> llistxattr 1 >>>>>>> setfsid 1 >>>>>>> spinlock 1 >>>>>>> epoll.h 1 >>>>>>> xattr.h 1 >>>>>>> st_atim.tv_nsec 1 >>>>>>> package-string: glusterfs 5.3 >>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>>> >>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>>> --------- >>>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>>>> /mnt/_data1) >>>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> with index 1 >>>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> with index 2 >>>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> with index 3 >>>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>> with index 4 >>>>>>> 
[2019-02-08 01:13:35.652575] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-_data1-client-0: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-_data1-client-1: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-_data1-client-2: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] >>>>>>> [client.c:2354:notify] 0-_data1-client-3: parent translators are >>>>>>> ready, attempting connect on transport >>>>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>>>>> Final graph: >>>>>>> >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>>>>> taken effect correctly: >>>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>>>> >>>>>>>> Let's see if it stops crashing or not. >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Nithya, >>>>>>>>> >>>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started >>>>>>>>> seeing crashes, and no further releases have been made yet. 
>>>>>>>>> >>>>>>>>> volume info: >>>>>>>>> Type: Replicate >>>>>>>>> Volume ID: ****SNIP**** >>>>>>>>> Status: Started >>>>>>>>> Snapshot Count: 0 >>>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>>> Transport-type: tcp >>>>>>>>> Bricks: >>>>>>>>> Brick1: ****SNIP**** >>>>>>>>> Brick2: ****SNIP**** >>>>>>>>> Brick3: ****SNIP**** >>>>>>>>> Brick4: ****SNIP**** >>>>>>>>> Options Reconfigured: >>>>>>>>> cluster.quorum-count: 1 >>>>>>>>> cluster.quorum-type: fixed >>>>>>>>> network.ping-timeout: 5 >>>>>>>>> network.remote-dio: enable >>>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>>> performance.readdir-ahead: on >>>>>>>>> performance.parallel-readdir: on >>>>>>>>> network.inode-lru-limit: 500000 >>>>>>>>> performance.md-cache-timeout: 600 >>>>>>>>> performance.cache-invalidation: on >>>>>>>>> performance.stat-prefetch: on >>>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>>> features.cache-invalidation: on >>>>>>>>> cluster.readdir-optimize: on >>>>>>>>> performance.io-thread-count: 32 >>>>>>>>> server.event-threads: 4 >>>>>>>>> client.event-threads: 4 >>>>>>>>> performance.read-ahead: off >>>>>>>>> cluster.lookup-optimize: on >>>>>>>>> performance.cache-size: 1GB >>>>>>>>> cluster.self-heal-daemon: enable >>>>>>>>> transport.address-family: inet >>>>>>>>> nfs.disable: on >>>>>>>>> performance.client-io-threads: on >>>>>>>>> cluster.granular-entry-heal: enable >>>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Artem, >>>>>>>>>> >>>>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount >>>>>>>>>> the volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>>>> >>>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>>> question. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> regards, >>>>>>>>>> Nithya >>>>>>>>>> >>>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>>> >>>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>>> servers, and I don't know why. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> The fuse crash happened again yesterday, to another volume. Are >>>>>>>>>>>> there any mount options that could help mitigate this? >>>>>>>>>>>> >>>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. 
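As a side note on the monit workaround, activating a check like the one quoted below usually amounts to the following (the include-directory location varies by distro; these are standard monit commands, not taken from this thread):

  # drop the check into monit's include directory, e.g. /etc/monit.d/glusterfs_data1
  monit -t        # validate the configuration syntax
  monit reload    # load the new check
  monit summary   # confirm the filesystem check is active
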
>>>>>>>>>>>> >>>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is >>>>>>>>>>>> not connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> monit check: >>>>>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> stack trace: >>>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>>> pending frames: >>>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>> signal received: 6 >>>>>>>>>>>> time of crash: >>>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>>> configuration details: >>>>>>>>>>>> argp 1 >>>>>>>>>>>> backtrace 1 >>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>> libpthread 1 >>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>> setfsid 1 >>>>>>>>>>>> spinlock 1 >>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>>> 
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>>> >>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> The first (and so far only) crash happened at 2am the next day >>>>>>>>>>>>> after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>>> mounts. >>>>>>>>>>>>> >>>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for any >>>>>>>>>>>>> uploads or downloads from that server until I woke up and fixed the mount. >>>>>>>>>>>>> >>>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>>> trace. >>>>>>>>>>>>> >>>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>>>> >>>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>>> >>>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Can you please tell if the crashes happened as soon as >>>>>>>>>>>>>> upgrade ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> -Amar >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, I >>>>>>>>>>>>>>> already got a crash which others have mentioned in >>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had >>>>>>>>>>>>>>> to unmount, kill gluster, and remount: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>>> --------- >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of these >>>>>>>>>>>>>>>>> "Failed to dispatch handler" in my logs as well. Many people have been >>>>>>>>>>>>>>>>> commenting about this issue here >>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ >>>>>>>>>>>>>>>> addresses this. 
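On the core dump question above: nothing gluster-specific is needed, just the usual Linux knobs. A rough sketch for a systemd-based distro such as openSUSE Leap 15 follows; the core path and the mount target are placeholders, and if systemd-coredump is enabled the cores end up in coredumpctl rather than under core_pattern:

# allow cores for processes started from this shell, then remount so the
# glusterfs fuse client inherits the limit
ulimit -c unlimited
echo '/var/tmp/core.%e.%p.%t' > /proc/sys/kernel/core_pattern
umount /mnt/<volname>
mount -t glusterfs localhost:/<volname> /mnt/<volname>
# after a crash
ls -l /var/tmp/core.glusterfs.*    # or: coredumpctl list glusterfs
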
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>>> seeing the spam. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> +Milind Changire Can you check why >>>>>>>>>>>>>>>> this message is logged and send a fix? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Tue Feb 12 05:34:22 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 12 Feb 2019 11:04:22 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: On Tue, Feb 12, 2019 at 10:24 AM Artem Russakovskii wrote: > Great job identifying the issue! > > Any ETA on the next release with the logging and crash fixes in it? > I've marked write-behind corruption as a blocker for release-6. Logging fixes are already in codebase. > On Mon, Feb 11, 2019, 7:19 PM Raghavendra Gowdappa > wrote: > >> >> >> On Mon, Feb 11, 2019 at 3:49 PM Jo?o Ba?to < >> joao.bauto at neuro.fchampalimaud.org> wrote: >> >>> Although I don't have these error messages, I'm having fuse crashes as >>> frequent as you. 
I have disabled write-behind and the mount has been >>> running over the weekend with heavy usage and no issues. >>> >> >> The issue you are facing will likely be fixed by patch [1]. Me, Xavi and >> Nithya were able to identify the corruption in write-behind. >> >> [1] https://review.gluster.org/22189 >> >> >>> I can provide coredumps before disabling write-behind if needed. I >>> opened a BZ report with >>> the crashes that I was having. >>> >>> *Jo?o Ba?to* >>> --------------- >>> >>> *Scientific Computing and Software Platform* >>> Champalimaud Research >>> Champalimaud Center for the Unknown >>> Av. Bras?lia, Doca de Pedrou?os >>> 1400-038 Lisbon, Portugal >>> fchampalimaud.org >>> >>> >>> Artem Russakovskii escreveu no dia s?bado, >>> 9/02/2019 ?(s) 22:18: >>> >>>> Alright. I've enabled core-dumping (hopefully), so now I'm waiting for >>>> the next crash to see if it dumps a core for you guys to remotely debug. >>>> >>>> Then I can consider setting performance.write-behind to off and >>>> monitoring for further crashes. >>>> >>>> Sincerely, >>>> Artem >>>> >>>> -- >>>> Founder, Android Police , APK Mirror >>>> , Illogical Robot LLC >>>> beerpla.net | +ArtemRussakovskii >>>> | @ArtemR >>>> >>>> >>>> >>>> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa < >>>> rgowdapp at redhat.com> wrote: >>>> >>>>> >>>>> >>>>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii < >>>>> archon810 at gmail.com> wrote: >>>>> >>>>>> Hi Nithya, >>>>>> >>>>>> I can try to disable write-behind as long as it doesn't heavily >>>>>> impact performance for us. Which option is it exactly? I don't see it set >>>>>> in my list of changed volume variables that I sent you guys earlier. >>>>>> >>>>> >>>>> The option is performance.write-behind >>>>> >>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran < >>>>>> nbalacha at redhat.com> wrote: >>>>>> >>>>>>> Hi Artem, >>>>>>> >>>>>>> We have found the cause of one crash. Unfortunately we have not >>>>>>> managed to reproduce the one you reported so we don't know if it is the >>>>>>> same cause. >>>>>>> >>>>>>> Can you disable write-behind on the volume and let us know if it >>>>>>> solves the problem? If yes, it is likely to be the same issue. >>>>>>> >>>>>>> >>>>>>> regards, >>>>>>> Nithya >>>>>>> >>>>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii >>>>>>> wrote: >>>>>>> >>>>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>>>> lru-limit=0 didn't help. >>>>>>>> >>>>>>>> Here's the snippet of the crash and the subsequent remount by monit. 
>>>>>>>> >>>>>>>> >>>>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>> [0x7f4402b99329] >>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>>>> valid argument] >>>>>>>> The message "I [MSGID: 108031] >>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>>>> The message "E [MSGID: 101191] >>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>>>> [2019-02-08 01:13:09.311554] >>>>>>>> pending frames: >>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>> frame : type(0) op(0) >>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>> signal received: 6 >>>>>>>> time of crash: >>>>>>>> 2019-02-08 01:13:09 >>>>>>>> configuration details: >>>>>>>> argp 1 >>>>>>>> backtrace 1 >>>>>>>> dlfcn 1 >>>>>>>> libpthread 1 >>>>>>>> llistxattr 1 >>>>>>>> setfsid 1 >>>>>>>> spinlock 1 >>>>>>>> epoll.h 1 >>>>>>>> xattr.h 1 >>>>>>>> st_atim.tv_nsec 1 >>>>>>>> package-string: glusterfs 5.3 >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>>>> >>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>>>> --------- >>>>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>>>>> /mnt/_data1) >>>>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 1 >>>>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 2 >>>>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 3 >>>>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>>>> 
[event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>> with index 4 >>>>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-_data1-client-0: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-_data1-client-1: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-_data1-client-2: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] >>>>>>>> [client.c:2354:notify] 0-_data1-client-3: parent translators are >>>>>>>> ready, attempting connect on transport >>>>>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>>>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>>>>>> Final graph: >>>>>>>> >>>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see it's >>>>>>>>> taken effect correctly: >>>>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>>>>> >>>>>>>>> Let's see if it stops crashing or not. >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Nithya, >>>>>>>>>> >>>>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started >>>>>>>>>> seeing crashes, and no further releases have been made yet. 
>>>>>>>>>> >>>>>>>>>> volume info: >>>>>>>>>> Type: Replicate >>>>>>>>>> Volume ID: ****SNIP**** >>>>>>>>>> Status: Started >>>>>>>>>> Snapshot Count: 0 >>>>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>>>> Transport-type: tcp >>>>>>>>>> Bricks: >>>>>>>>>> Brick1: ****SNIP**** >>>>>>>>>> Brick2: ****SNIP**** >>>>>>>>>> Brick3: ****SNIP**** >>>>>>>>>> Brick4: ****SNIP**** >>>>>>>>>> Options Reconfigured: >>>>>>>>>> cluster.quorum-count: 1 >>>>>>>>>> cluster.quorum-type: fixed >>>>>>>>>> network.ping-timeout: 5 >>>>>>>>>> network.remote-dio: enable >>>>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>>>> performance.readdir-ahead: on >>>>>>>>>> performance.parallel-readdir: on >>>>>>>>>> network.inode-lru-limit: 500000 >>>>>>>>>> performance.md-cache-timeout: 600 >>>>>>>>>> performance.cache-invalidation: on >>>>>>>>>> performance.stat-prefetch: on >>>>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>>>> features.cache-invalidation: on >>>>>>>>>> cluster.readdir-optimize: on >>>>>>>>>> performance.io-thread-count: 32 >>>>>>>>>> server.event-threads: 4 >>>>>>>>>> client.event-threads: 4 >>>>>>>>>> performance.read-ahead: off >>>>>>>>>> cluster.lookup-optimize: on >>>>>>>>>> performance.cache-size: 1GB >>>>>>>>>> cluster.self-heal-daemon: enable >>>>>>>>>> transport.address-family: inet >>>>>>>>>> nfs.disable: on >>>>>>>>>> performance.client-io-threads: on >>>>>>>>>> cluster.granular-entry-heal: enable >>>>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Artem, >>>>>>>>>>> >>>>>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount >>>>>>>>>>> the volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>>>>> >>>>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>>>> question. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> regards, >>>>>>>>>>> Nithya >>>>>>>>>>> >>>>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>>>> >>>>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>>>> servers, and I don't know why. >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> The fuse crash happened again yesterday, to another volume. >>>>>>>>>>>>> Are there any mount options that could help mitigate this? >>>>>>>>>>>>> >>>>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. 
>>>>>>>>>>>>> >>>>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is >>>>>>>>>>>>> not connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> monit check: >>>>>>>>>>>>> check filesystem glusterfs_data1 with path /mnt/glusterfs_data1 >>>>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> stack trace: >>>>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>>>> pending frames: >>>>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>> time of crash: >>>>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>>>> configuration details: >>>>>>>>>>>>> argp 1 >>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>>>> >>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>>>> 
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>>>> >>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>> >>>>>>>>>>>>>> The first (and so far only) crash happened at 2am the next >>>>>>>>>>>>>> day after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>>>> mounts. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for >>>>>>>>>>>>>> any uploads or downloads from that server until I woke up and fixed the >>>>>>>>>>>>>> mount. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>>>> trace. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Can you please tell if the crashes happened as soon as >>>>>>>>>>>>>>> upgrade ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -Amar >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, >>>>>>>>>>>>>>>> I already got a crash which others have mentioned in >>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and >>>>>>>>>>>>>>>> had to unmount, kill gluster, and remount: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>>> 
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>>>> --------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of >>>>>>>>>>>>>>>>>> these "Failed to dispatch handler" in my logs as well. Many people have >>>>>>>>>>>>>>>>>> been commenting about this issue here >>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ >>>>>>>>>>>>>>>>> addresses this. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list may >>>>>>>>>>>>>>>>>> bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>>>> seeing the spam. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> +Milind Changire Can you check why >>>>>>>>>>>>>>>>> this message is logged and send a fix? >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From archon810 at gmail.com Tue Feb 12 06:14:43 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Mon, 11 Feb 2019 22:14:43 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Awesome. But is there a release schedule and an ETA for when these will be out in the repos? On Mon, Feb 11, 2019, 9:34 PM Raghavendra Gowdappa wrote: > > > On Tue, Feb 12, 2019 at 10:24 AM Artem Russakovskii > wrote: > >> Great job identifying the issue! >> >> Any ETA on the next release with the logging and crash fixes in it? >> > > I've marked write-behind corruption as a blocker for release-6. Logging > fixes are already in codebase. 
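For anyone else hitting this while waiting for the fixed release, the write-behind workaround mentioned earlier in the thread boils down to a single volume option. A sketch, with <volname> as a placeholder and the usual caveat that turning write-behind off can cost some write performance:

# check the current value
gluster volume get <volname> performance.write-behind
# disable it as a temporary workaround; clients normally pick up the
# new graph without a remount, but verify on a test mount first
gluster volume set <volname> performance.write-behind off
# re-enable once a release containing the fix is installed
gluster volume set <volname> performance.write-behind on
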
> > >> On Mon, Feb 11, 2019, 7:19 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Mon, Feb 11, 2019 at 3:49 PM Jo?o Ba?to < >>> joao.bauto at neuro.fchampalimaud.org> wrote: >>> >>>> Although I don't have these error messages, I'm having fuse crashes as >>>> frequent as you. I have disabled write-behind and the mount has been >>>> running over the weekend with heavy usage and no issues. >>>> >>> >>> The issue you are facing will likely be fixed by patch [1]. Me, Xavi and >>> Nithya were able to identify the corruption in write-behind. >>> >>> [1] https://review.gluster.org/22189 >>> >>> >>>> I can provide coredumps before disabling write-behind if needed. I >>>> opened a BZ report >>>> with the crashes >>>> that I was having. >>>> >>>> *Jo?o Ba?to* >>>> --------------- >>>> >>>> *Scientific Computing and Software Platform* >>>> Champalimaud Research >>>> Champalimaud Center for the Unknown >>>> Av. Bras?lia, Doca de Pedrou?os >>>> 1400-038 Lisbon, Portugal >>>> fchampalimaud.org >>>> >>>> >>>> Artem Russakovskii escreveu no dia s?bado, >>>> 9/02/2019 ?(s) 22:18: >>>> >>>>> Alright. I've enabled core-dumping (hopefully), so now I'm waiting for >>>>> the next crash to see if it dumps a core for you guys to remotely debug. >>>>> >>>>> Then I can consider setting performance.write-behind to off and >>>>> monitoring for further crashes. >>>>> >>>>> Sincerely, >>>>> Artem >>>>> >>>>> -- >>>>> Founder, Android Police , APK Mirror >>>>> , Illogical Robot LLC >>>>> beerpla.net | +ArtemRussakovskii >>>>> | @ArtemR >>>>> >>>>> >>>>> >>>>> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa < >>>>> rgowdapp at redhat.com> wrote: >>>>> >>>>>> >>>>>> >>>>>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii < >>>>>> archon810 at gmail.com> wrote: >>>>>> >>>>>>> Hi Nithya, >>>>>>> >>>>>>> I can try to disable write-behind as long as it doesn't heavily >>>>>>> impact performance for us. Which option is it exactly? I don't see it set >>>>>>> in my list of changed volume variables that I sent you guys earlier. >>>>>>> >>>>>> >>>>>> The option is performance.write-behind >>>>>> >>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran < >>>>>>> nbalacha at redhat.com> wrote: >>>>>>> >>>>>>>> Hi Artem, >>>>>>>> >>>>>>>> We have found the cause of one crash. Unfortunately we have not >>>>>>>> managed to reproduce the one you reported so we don't know if it is the >>>>>>>> same cause. >>>>>>>> >>>>>>>> Can you disable write-behind on the volume and let us know if it >>>>>>>> solves the problem? If yes, it is likely to be the same issue. >>>>>>>> >>>>>>>> >>>>>>>> regards, >>>>>>>> Nithya >>>>>>>> >>>>>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>>>>> lru-limit=0 didn't help. >>>>>>>>> >>>>>>>>> Here's the snippet of the crash and the subsequent remount by >>>>>>>>> monit. 
>>>>>>>>> >>>>>>>>> >>>>>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>> [0x7f4402b99329] >>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>>>>> valid argument] >>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>>>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>>>>> [2019-02-08 01:13:09.311554] >>>>>>>>> pending frames: >>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>> frame : type(0) op(0) >>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>> signal received: 6 >>>>>>>>> time of crash: >>>>>>>>> 2019-02-08 01:13:09 >>>>>>>>> configuration details: >>>>>>>>> argp 1 >>>>>>>>> backtrace 1 >>>>>>>>> dlfcn 1 >>>>>>>>> libpthread 1 >>>>>>>>> llistxattr 1 >>>>>>>>> setfsid 1 >>>>>>>>> spinlock 1 >>>>>>>>> epoll.h 1 >>>>>>>>> xattr.h 1 >>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>>>>> >>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>>>>> --------- >>>>>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>>>>>> /mnt/_data1) >>>>>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>> with index 1 >>>>>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>> with index 2 >>>>>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>> with index 3 >>>>>>>>> [2019-02-08 
01:13:35.651747] I [MSGID: 101190] >>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>> with index 4 >>>>>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] >>>>>>>>> [client.c:2354:notify] 0-_data1-client-0: parent translators are >>>>>>>>> ready, attempting connect on transport >>>>>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] >>>>>>>>> [client.c:2354:notify] 0-_data1-client-1: parent translators are >>>>>>>>> ready, attempting connect on transport >>>>>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] >>>>>>>>> [client.c:2354:notify] 0-_data1-client-2: parent translators are >>>>>>>>> ready, attempting connect on transport >>>>>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] >>>>>>>>> [client.c:2354:notify] 0-_data1-client-3: parent translators are >>>>>>>>> ready, attempting connect on transport >>>>>>>>> [2019-02-08 01:13:35.655527] I [rpc-clnt.c:2042:rpc_clnt_reconfig] >>>>>>>>> 0-_data1-client-0: changing port to 49153 (from 0) >>>>>>>>> Final graph: >>>>>>>>> >>>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see >>>>>>>>>> it's taken effect correctly: >>>>>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>>>>>> >>>>>>>>>> Let's see if it stops crashing or not. >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Nithya, >>>>>>>>>>> >>>>>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started >>>>>>>>>>> seeing crashes, and no further releases have been made yet. 
>>>>>>>>>>> >>>>>>>>>>> volume info: >>>>>>>>>>> Type: Replicate >>>>>>>>>>> Volume ID: ****SNIP**** >>>>>>>>>>> Status: Started >>>>>>>>>>> Snapshot Count: 0 >>>>>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>>>>> Transport-type: tcp >>>>>>>>>>> Bricks: >>>>>>>>>>> Brick1: ****SNIP**** >>>>>>>>>>> Brick2: ****SNIP**** >>>>>>>>>>> Brick3: ****SNIP**** >>>>>>>>>>> Brick4: ****SNIP**** >>>>>>>>>>> Options Reconfigured: >>>>>>>>>>> cluster.quorum-count: 1 >>>>>>>>>>> cluster.quorum-type: fixed >>>>>>>>>>> network.ping-timeout: 5 >>>>>>>>>>> network.remote-dio: enable >>>>>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>>>>> performance.readdir-ahead: on >>>>>>>>>>> performance.parallel-readdir: on >>>>>>>>>>> network.inode-lru-limit: 500000 >>>>>>>>>>> performance.md-cache-timeout: 600 >>>>>>>>>>> performance.cache-invalidation: on >>>>>>>>>>> performance.stat-prefetch: on >>>>>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>>>>> features.cache-invalidation: on >>>>>>>>>>> cluster.readdir-optimize: on >>>>>>>>>>> performance.io-thread-count: 32 >>>>>>>>>>> server.event-threads: 4 >>>>>>>>>>> client.event-threads: 4 >>>>>>>>>>> performance.read-ahead: off >>>>>>>>>>> cluster.lookup-optimize: on >>>>>>>>>>> performance.cache-size: 1GB >>>>>>>>>>> cluster.self-heal-daemon: enable >>>>>>>>>>> transport.address-family: inet >>>>>>>>>>> nfs.disable: on >>>>>>>>>>> performance.client-io-threads: on >>>>>>>>>>> cluster.granular-entry-heal: enable >>>>>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Artem, >>>>>>>>>>>> >>>>>>>>>>>> Do you still see the crashes with 5.3? If yes, please try mount >>>>>>>>>>>> the volume using the mount option lru-limit=0 and see if that helps. We are >>>>>>>>>>>> looking into the crashes and will update when have a fix. >>>>>>>>>>>> >>>>>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>>>>> question. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> regards, >>>>>>>>>>>> Nithya >>>>>>>>>>>> >>>>>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>>>>> >>>>>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>>>>> servers, and I don't know why. >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> The fuse crash happened again yesterday, to another volume. >>>>>>>>>>>>>> Are there any mount options that could help mitigate this? 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>>>>>>> >>>>>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is >>>>>>>>>>>>>> not connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> monit check: >>>>>>>>>>>>>> check filesystem glusterfs_data1 with path >>>>>>>>>>>>>> /mnt/glusterfs_data1 >>>>>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> stack trace: >>>>>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] 
>>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>>>>> >>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The first (and so far only) crash happened at 2am the next >>>>>>>>>>>>>>> day after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>>>>> mounts. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a pretty >>>>>>>>>>>>>>> busy site (apkmirror.com), and it caused a disruption for >>>>>>>>>>>>>>> any uploads or downloads from that server until I woke up and fixed the >>>>>>>>>>>>>>> mount. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>>>>> trace. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved soon. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Can you please tell if the crashes happened as soon as >>>>>>>>>>>>>>>> upgrade ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -Amar >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to 5.3, >>>>>>>>>>>>>>>>> I already got a crash which others have mentioned in >>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and >>>>>>>>>>>>>>>>> had to unmount, kill gluster, and remount: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>>>> st_atim.tv_nsec 1 
>>>>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>>>>> --------- >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone already >>>>>>>>>>>>>>>>> opened a ticket for the crashes that I can join and monitor? This is going >>>>>>>>>>>>>>>>> to create a massive problem for us since production systems are crashing. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of >>>>>>>>>>>>>>>>>>> these "Failed to dispatch handler" in my logs as well. Many people have >>>>>>>>>>>>>>>>>>> been commenting about this issue here >>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ >>>>>>>>>>>>>>>>>> addresses this. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list >>>>>>>>>>>>>>>>>>> may bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. 
>>>>>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>>>>> seeing the spam. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> +Milind Changire Can you check why >>>>>>>>>>>>>>>>>> this message is logged and send a fix? >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Founder, Android Police >>>>>>>>>>>>>>>>>>>> , APK Mirror , Illogical >>>>>>>>>>>>>>>>>>>> Robot LLC >>>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>> >>>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Tue Feb 12 08:37:43 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Tue, 12 Feb 2019 14:07:43 +0530 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Not yet but we are discussing an interim release. It is going to take a couple of days to review the fixes so not before then. We will update on the list with dates once we decide. On Tue, 12 Feb 2019 at 11:46, Artem Russakovskii wrote: > Awesome. 
But is there a release schedule and an ETA for when these will be > out in the repos? > > On Mon, Feb 11, 2019, 9:34 PM Raghavendra Gowdappa > wrote: > >> >> >> On Tue, Feb 12, 2019 at 10:24 AM Artem Russakovskii >> wrote: >> >>> Great job identifying the issue! >>> >>> Any ETA on the next release with the logging and crash fixes in it? >>> >> >> I've marked write-behind corruption as a blocker for release-6. Logging >> fixes are already in the codebase. >> >> >>> On Mon, Feb 11, 2019, 7:19 PM Raghavendra Gowdappa >>> wrote: >>> >>>> >>>> >>>> On Mon, Feb 11, 2019 at 3:49 PM João Baúto < >>>> joao.bauto at neuro.fchampalimaud.org> wrote: >>>> >>>>> Although I don't have these error messages, I'm having fuse crashes as >>>>> frequently as you. I have disabled write-behind and the mount has been >>>>> running over the weekend with heavy usage and no issues. >>>>> >>>> >>>> The issue you are facing will likely be fixed by patch [1]. Me, Xavi >>>> and Nithya were able to identify the corruption in write-behind. >>>> >>>> [1] https://review.gluster.org/22189 >>>> >>>> >>>>> I can provide coredumps before disabling write-behind if needed. I >>>>> opened a BZ report >>>>> with the >>>>> crashes that I was having. >>>>> >>>>> *João Baúto* >>>>> --------------- >>>>> >>>>> *Scientific Computing and Software Platform* >>>>> Champalimaud Research >>>>> Champalimaud Center for the Unknown >>>>> Av. Brasília, Doca de Pedrouços >>>>> 1400-038 Lisbon, Portugal >>>>> fchampalimaud.org >>>>> >>>>> >>>>> Artem Russakovskii wrote on Saturday, >>>>> 9/02/2019 at 22:18: >>>>> >>>>>> Alright. I've enabled core-dumping (hopefully), so now I'm waiting >>>>>> for the next crash to see if it dumps a core for you guys to remotely debug. >>>>>> >>>>>> Then I can consider setting performance.write-behind to off and >>>>>> monitoring for further crashes. >>>>>> >>>>>> Sincerely, >>>>>> Artem >>>>>> >>>>>> -- >>>>>> Founder, Android Police , APK Mirror >>>>>> , Illogical Robot LLC >>>>>> beerpla.net | +ArtemRussakovskii >>>>>> | @ArtemR >>>>>> >>>>>> >>>>>> >>>>>> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa < >>>>>> rgowdapp at redhat.com> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii < >>>>>>> archon810 at gmail.com> wrote: >>>>>>> >>>>>>>> Hi Nithya, >>>>>>>> >>>>>>>> I can try to disable write-behind as long as it doesn't heavily >>>>>>>> impact performance for us. Which option is it exactly? I don't see it set >>>>>>>> in my list of changed volume variables that I sent you guys earlier. >>>>>>>> >>>>>>> >>>>>>> The option is performance.write-behind >>>>>>> >>>>>>>> Sincerely, >>>>>>>> Artem >>>>>>>> >>>>>>>> -- >>>>>>>> Founder, Android Police , APK Mirror >>>>>>>> , Illogical Robot LLC >>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>> | @ArtemR >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran < >>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>> >>>>>>>>> Hi Artem, >>>>>>>>> >>>>>>>>> We have found the cause of one crash. Unfortunately we have not >>>>>>>>> managed to reproduce the one you reported so we don't know if it is the >>>>>>>>> same cause. >>>>>>>>> >>>>>>>>> Can you disable write-behind on the volume and let us know if it >>>>>>>>> solves the problem? If yes, it is likely to be the same issue.
>>>>>>>>> >>>>>>>>> >>>>>>>>> regards, >>>>>>>>> Nithya >>>>>>>>> >>>>>>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii < >>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>>>>>> lru-limit=0 didn't help. >>>>>>>>>> >>>>>>>>>> Here's the snippet of the crash and the subsequent remount by >>>>>>>>>> monit. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>> [0x7f4402b99329] >>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>>>>>> valid argument] >>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>>>>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>>>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>>>>>> [2019-02-08 01:13:09.311554] >>>>>>>>>> pending frames: >>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>> frame : type(0) op(0) >>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>> signal received: 6 >>>>>>>>>> time of crash: >>>>>>>>>> 2019-02-08 01:13:09 >>>>>>>>>> configuration details: >>>>>>>>>> argp 1 >>>>>>>>>> backtrace 1 >>>>>>>>>> dlfcn 1 >>>>>>>>>> libpthread 1 >>>>>>>>>> llistxattr 1 >>>>>>>>>> setfsid 1 >>>>>>>>>> spinlock 1 >>>>>>>>>> epoll.h 1 >>>>>>>>>> xattr.h 1 >>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>>>>>> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>>>>>> >>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>>>>>> --------- >>>>>>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>>>>>>> /mnt/_data1) >>>>>>>>>> [2019-02-08 01:13:35.637830] I 
[MSGID: 101190] >>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>> with index 1 >>>>>>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>> with index 2 >>>>>>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>> with index 3 >>>>>>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>> with index 4 >>>>>>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] >>>>>>>>>> [client.c:2354:notify] 0-_data1-client-0: parent translators are >>>>>>>>>> ready, attempting connect on transport >>>>>>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] >>>>>>>>>> [client.c:2354:notify] 0-_data1-client-1: parent translators are >>>>>>>>>> ready, attempting connect on transport >>>>>>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] >>>>>>>>>> [client.c:2354:notify] 0-_data1-client-2: parent translators are >>>>>>>>>> ready, attempting connect on transport >>>>>>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] >>>>>>>>>> [client.c:2354:notify] 0-_data1-client-3: parent translators are >>>>>>>>>> ready, attempting connect on transport >>>>>>>>>> [2019-02-08 01:13:35.655527] I >>>>>>>>>> [rpc-clnt.c:2042:rpc_clnt_reconfig] 0-_data1-client-0: changing port >>>>>>>>>> to 49153 (from 0) >>>>>>>>>> Final graph: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Sincerely, >>>>>>>>>> Artem >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Founder, Android Police , APK >>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>> | @ArtemR >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see >>>>>>>>>>> it's taken effect correctly: >>>>>>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>>>>>>> >>>>>>>>>>> Let's see if it stops crashing or not. >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Nithya, >>>>>>>>>>>> >>>>>>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started >>>>>>>>>>>> seeing crashes, and no further releases have been made yet. 
>>>>>>>>>>>> >>>>>>>>>>>> volume info: >>>>>>>>>>>> Type: Replicate >>>>>>>>>>>> Volume ID: ****SNIP**** >>>>>>>>>>>> Status: Started >>>>>>>>>>>> Snapshot Count: 0 >>>>>>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>>>>>> Transport-type: tcp >>>>>>>>>>>> Bricks: >>>>>>>>>>>> Brick1: ****SNIP**** >>>>>>>>>>>> Brick2: ****SNIP**** >>>>>>>>>>>> Brick3: ****SNIP**** >>>>>>>>>>>> Brick4: ****SNIP**** >>>>>>>>>>>> Options Reconfigured: >>>>>>>>>>>> cluster.quorum-count: 1 >>>>>>>>>>>> cluster.quorum-type: fixed >>>>>>>>>>>> network.ping-timeout: 5 >>>>>>>>>>>> network.remote-dio: enable >>>>>>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>>>>>> performance.readdir-ahead: on >>>>>>>>>>>> performance.parallel-readdir: on >>>>>>>>>>>> network.inode-lru-limit: 500000 >>>>>>>>>>>> performance.md-cache-timeout: 600 >>>>>>>>>>>> performance.cache-invalidation: on >>>>>>>>>>>> performance.stat-prefetch: on >>>>>>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>>>>>> features.cache-invalidation: on >>>>>>>>>>>> cluster.readdir-optimize: on >>>>>>>>>>>> performance.io-thread-count: 32 >>>>>>>>>>>> server.event-threads: 4 >>>>>>>>>>>> client.event-threads: 4 >>>>>>>>>>>> performance.read-ahead: off >>>>>>>>>>>> cluster.lookup-optimize: on >>>>>>>>>>>> performance.cache-size: 1GB >>>>>>>>>>>> cluster.self-heal-daemon: enable >>>>>>>>>>>> transport.address-family: inet >>>>>>>>>>>> nfs.disable: on >>>>>>>>>>>> performance.client-io-threads: on >>>>>>>>>>>> cluster.granular-entry-heal: enable >>>>>>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>> >>>>>>>>>>>>> Do you still see the crashes with 5.3? If yes, please try >>>>>>>>>>>>> mount the volume using the mount option lru-limit=0 and see if that helps. >>>>>>>>>>>>> We are looking into the crashes and will update when have a fix. >>>>>>>>>>>>> >>>>>>>>>>>>> Also, please provide the gluster volume info for the volume in >>>>>>>>>>>>> question. >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> regards, >>>>>>>>>>>>> Nithya >>>>>>>>>>>>> >>>>>>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>>>>>> >>>>>>>>>>>>>> What's odd is that the crashes are only happening on one of 4 >>>>>>>>>>>>>> servers, and I don't know why. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>> Artem >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The fuse crash happened again yesterday, to another volume. >>>>>>>>>>>>>>> Are there any mount options that could help mitigate this? 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> In the meantime, I set up a monit (https://mmonit.com/monit/) >>>>>>>>>>>>>>> task to watch and restart the mount, which works and recovers the mount >>>>>>>>>>>>>>> point within a minute. Not ideal, but a temporary workaround. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint is >>>>>>>>>>>>>>> not connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> monit check: >>>>>>>>>>>>>>> check filesystem glusterfs_data1 with path >>>>>>>>>>>>>>> /mnt/glusterfs_data1 >>>>>>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> stack trace: >>>>>>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The first (and so far only) crash happened at 2am the next >>>>>>>>>>>>>>>> day after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>>>>>> mounts. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a >>>>>>>>>>>>>>>> pretty busy site (apkmirror.com), and it caused a >>>>>>>>>>>>>>>> disruption for any uploads or downloads from that server until I woke up >>>>>>>>>>>>>>>> and fixed the mount. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I wish I could be more helpful but all I have is that stack >>>>>>>>>>>>>>>> trace. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved >>>>>>>>>>>>>>>> soon. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 >>>>>>>>>>>>>>>>> (ie, as a clone of other bugs where recent discussions happened), and >>>>>>>>>>>>>>>>> marked it as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>>>>>> identifying and fixing the issue seen with crash. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Can you please tell if the crashes happened as soon as >>>>>>>>>>>>>>>>> upgrade ? or was there any particular pattern you observed before the crash. 
>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -Amar >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to >>>>>>>>>>>>>>>>>> 5.3, I already got a crash which others have mentioned in >>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and >>>>>>>>>>>>>>>>>> had to unmount, kill gluster, and remount: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>>>>> 
xattr.h 1 >>>>>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>>>>>> --------- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Do the pending patches fix the crash or only the repeated >>>>>>>>>>>>>>>>>> warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone >>>>>>>>>>>>>>>>>> already opened a ticket for the crashes that I can join and monitor? This >>>>>>>>>>>>>>>>>> is going to create a massive problem for us since production systems are >>>>>>>>>>>>>>>>>> crashing. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of >>>>>>>>>>>>>>>>>>>> these "Failed to dispatch handler" in my logs as well. Many people have >>>>>>>>>>>>>>>>>>>> been commenting about this issue here >>>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ >>>>>>>>>>>>>>>>>>> addresses this. 
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list >>>>>>>>>>>>>>>>>>>> may bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>> Founder, Android Police >>>>>>>>>>>>>>>>>>>> , APK Mirror , Illogical >>>>>>>>>>>>>>>>>>>> Robot LLC >>>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>>>>>> seeing the spam. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> +Milind Changire Can you check >>>>>>>>>>>>>>>>>>> why this message is logged and send a fix? >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Founder, Android Police >>>>>>>>>>>>>>>>>>>>> , APK Mirror , Illogical >>>>>>>>>>>>>>>>>>>>> Robot LLC >>>>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>> Gluster-users mailing list >>>>>>>> Gluster-users at gluster.org >>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>> >>>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML 
attachment was scrubbed... URL: From rgowdapp at redhat.com Tue Feb 12 12:08:00 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 12 Feb 2019 17:38:00 +0530 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts Message-ID: All, We've found perf xlators io-cache and read-ahead not adding any performance improvement. At best read-ahead is redundant due to kernel read-ahead and at worst io-cache is degrading the performance for workloads that doesn't involve re-read. Given that VFS already have both these functionalities, I am proposing to have these two translators turned off by default for native fuse mounts. For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have these xlators on by having custom profiles. Comments? [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 regards, Raghavendra -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Tue Feb 12 13:22:43 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Tue, 12 Feb 2019 18:52:43 +0530 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: References: Message-ID: https://review.gluster.org/22203 On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa wrote: > All, > > We've found perf xlators io-cache and read-ahead not adding any > performance improvement. At best read-ahead is redundant due to kernel > read-ahead and at worst io-cache is degrading the performance for workloads > that doesn't involve re-read. Given that VFS already have both these > functionalities, I am proposing to have these two translators turned off by > default for native fuse mounts. > > For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have > these xlators on by having custom profiles. Comments? > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 > > regards, > Raghavendra > -------------- next part -------------- An HTML attachment was scrubbed... URL: From budic at onholyground.com Tue Feb 12 17:39:12 2019 From: budic at onholyground.com (Darrell Budic) Date: Tue, 12 Feb 2019 11:39:12 -0600 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: References: Message-ID: <59A9002B-F427-4D94-A653-31A99DEF6CD8@onholyground.com> Is there an example of a custom profile you can share for my ovirt use case (with gfapi enabled)? Or are you just talking about the standard group settings for virt as a custom profile? > On Feb 12, 2019, at 7:22 AM, Raghavendra Gowdappa wrote: > > https://review.gluster.org/22203 > > On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa > wrote: > All, > > We've found perf xlators io-cache and read-ahead not adding any performance improvement. At best read-ahead is redundant due to kernel read-ahead and at worst io-cache is degrading the performance for workloads that doesn't involve re-read. Given that VFS already have both these functionalities, I am proposing to have these two translators turned off by default for native fuse mounts. > > For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have these xlators on by having custom profiles. Comments? > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 > > regards, > Raghavendra > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... 
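For reference, the two translators under discussion can already be toggled per volume from the CLI. A minimal sketch, assuming a volume named "myvol" (the volume name is a placeholder; defaults and available options can differ between releases):

  # disable client-side read-ahead and io-cache on one volume
  gluster volume set myvol performance.read-ahead off
  gluster volume set myvol performance.io-cache off

  # verify the effective values
  gluster volume get myvol performance.read-ahead
  gluster volume get myvol performance.io-cache

A "custom profile" in the sense discussed below is essentially a named group of such options: predefined groups typically ship as plain key=value files under /var/lib/glusterd/groups/ on the servers and are applied with, for example, "gluster volume set myvol group metadata-cache".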
URL: From rgowdapp at redhat.com Wed Feb 13 05:14:37 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Wed, 13 Feb 2019 10:44:37 +0530 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: <59A9002B-F427-4D94-A653-31A99DEF6CD8@onholyground.com> References: <59A9002B-F427-4D94-A653-31A99DEF6CD8@onholyground.com> Message-ID: On Tue, Feb 12, 2019 at 11:09 PM Darrell Budic wrote: > Is there an example of a custom profile you can share for my ovirt use > case (with gfapi enabled)? > I was speaking about a group setting like "group metadata-cache". Its just that custom options one would turn on for a class of applications or problems. Or are you just talking about the standard group settings for virt as a > custom profile? > > On Feb 12, 2019, at 7:22 AM, Raghavendra Gowdappa > wrote: > > https://review.gluster.org/22203 > > On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa > wrote: > >> All, >> >> We've found perf xlators io-cache and read-ahead not adding any >> performance improvement. At best read-ahead is redundant due to kernel >> read-ahead and at worst io-cache is degrading the performance for workloads >> that doesn't involve re-read. Given that VFS already have both these >> functionalities, I am proposing to have these two translators turned off by >> default for native fuse mounts. >> >> For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have >> these xlators on by having custom profiles. Comments? >> >> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 >> >> regards, >> Raghavendra >> > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Wed Feb 13 05:21:38 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Wed, 13 Feb 2019 10:51:38 +0530 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: References: Message-ID: On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa wrote: > All, > > We've found perf xlators io-cache and read-ahead not adding any > performance improvement. At best read-ahead is redundant due to kernel > read-ahead > One thing we are still figuring out is whether kernel read-ahead is tunable. From what we've explored, it _looks_ like (may not be entirely correct), ra is capped at 128KB. If that's the case, I am interested in few things: * Are there any realworld applications/usecases, which would benefit from larger read-ahead (Manoj says block devices can do ra of 4MB)? * Is the limit on kernel ra tunable a hard one? IOW, what does it take to make it to do higher ra? If its difficult, can glusterfs read-ahead provide the expected performance improvement for these applications that would benefit from aggressive ra (as glusterfs can support larger ra sizes)? I am still inclined to prefer kernel ra as I think its more intelligent and can identify more sequential patterns than Glusterfs read-ahead [1][2]. [1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-273-284.pdf [2] https://lwn.net/Articles/155510/ and at worst io-cache is degrading the performance for workloads that > doesn't involve re-read. Given that VFS already have both these > functionalities, I am proposing to have these two translators turned off by > default for native fuse mounts. 
> > For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have > these xlators on by having custom profiles. Comments? > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 > > regards, > Raghavendra > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mpillai at redhat.com Wed Feb 13 05:45:32 2019 From: mpillai at redhat.com (Manoj Pillai) Date: Wed, 13 Feb 2019 11:15:32 +0530 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: References: Message-ID: On Wed, Feb 13, 2019 at 10:51 AM Raghavendra Gowdappa wrote: > > > On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa > wrote: > >> All, >> >> We've found perf xlators io-cache and read-ahead not adding any >> performance improvement. At best read-ahead is redundant due to kernel >> read-ahead >> > > One thing we are still figuring out is whether kernel read-ahead is > tunable. From what we've explored, it _looks_ like (may not be entirely > correct), ra is capped at 128KB. If that's the case, I am interested in few > things: > * Are there any realworld applications/usecases, which would benefit from > larger read-ahead (Manoj says block devices can do ra of 4MB)? > kernel read-ahead is adaptive but influenced by the read-ahead setting on the block device (/sys/block//queue/read_ahead_kb), which can be tuned. For RHEL specifically, the default is 128KB (last I checked) but the default RHEL tuned-profile, throughput-performance, bumps that up to 4MB. It should be fairly easy to rig up a test where 4MB read-ahead on the block device gives better performance than 128KB read-ahead. -- Manoj * Is the limit on kernel ra tunable a hard one? IOW, what does it take to > make it to do higher ra? If its difficult, can glusterfs read-ahead provide > the expected performance improvement for these applications that would > benefit from aggressive ra (as glusterfs can support larger ra sizes)? > > I am still inclined to prefer kernel ra as I think its more intelligent > and can identify more sequential patterns than Glusterfs read-ahead [1][2]. > [1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-273-284.pdf > [2] https://lwn.net/Articles/155510/ > > and at worst io-cache is degrading the performance for workloads that >> doesn't involve re-read. Given that VFS already have both these >> functionalities, I am proposing to have these two translators turned off by >> default for native fuse mounts. >> >> For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have >> these xlators on by having custom profiles. Comments? >> >> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 >> >> regards, >> Raghavendra >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgowdapp at redhat.com Wed Feb 13 06:03:05 2019 From: rgowdapp at redhat.com (Raghavendra Gowdappa) Date: Wed, 13 Feb 2019 11:33:05 +0530 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: References: Message-ID: On Wed, Feb 13, 2019 at 11:16 AM Manoj Pillai wrote: > > > On Wed, Feb 13, 2019 at 10:51 AM Raghavendra Gowdappa > wrote: > >> >> >> On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa >> wrote: >> >>> All, >>> >>> We've found perf xlators io-cache and read-ahead not adding any >>> performance improvement. At best read-ahead is redundant due to kernel >>> read-ahead >>> >> >> One thing we are still figuring out is whether kernel read-ahead is >> tunable. 
From what we've explored, it _looks_ like (may not be entirely >> correct), ra is capped at 128KB. If that's the case, I am interested in few >> things: >> * Are there any realworld applications/usecases, which would benefit from >> larger read-ahead (Manoj says block devices can do ra of 4MB)? >> > > kernel read-ahead is adaptive but influenced by the read-ahead setting on > the block device (/sys/block//queue/read_ahead_kb), which can be > tuned. For RHEL specifically, the default is 128KB (last I checked) but the > default RHEL tuned-profile, throughput-performance, bumps that up to 4MB. > It should be fairly easy to rig up a test where 4MB read-ahead on the > block device gives better performance than 128KB read-ahead. > Thanks Manoj. To add to what Manoj said and give more context here, Glusterfs being a fuse-based fs is not exposed as a block device. So, that's the first problem of where/how to tune and I've listed other problems earlier. > -- Manoj > > * Is the limit on kernel ra tunable a hard one? IOW, what does it take to >> make it to do higher ra? If its difficult, can glusterfs read-ahead provide >> the expected performance improvement for these applications that would >> benefit from aggressive ra (as glusterfs can support larger ra sizes)? >> >> I am still inclined to prefer kernel ra as I think its more intelligent >> and can identify more sequential patterns than Glusterfs read-ahead [1][2]. >> [1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-273-284.pdf >> [2] https://lwn.net/Articles/155510/ >> >> and at worst io-cache is degrading the performance for workloads that >>> doesn't involve re-read. Given that VFS already have both these >>> functionalities, I am proposing to have these two translators turned off by >>> default for native fuse mounts. >>> >>> For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have >>> these xlators on by having custom profiles. Comments? >>> >>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 >>> >>> regards, >>> Raghavendra >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Wed Feb 13 07:43:03 2019 From: revirii at googlemail.com (Hu Bert) Date: Wed, 13 Feb 2019 08:43:03 +0100 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: References: Message-ID: fyi: we have 3 servers, each with 2 SW RAID10 used as bricks in a replicate 3 setup (so 2 volumes); the default values set by OS (debian stretch) are: /dev/md3 Array Size : 29298911232 (27941.62 GiB 30002.09 GB) /sys/block/md3/queue/read_ahead_kb : 3027 /dev/md4 Array Size : 19532607488 (18627.75 GiB 20001.39 GB) /sys/block/md4/queue/read_ahead_kb : 2048 maybe that helps somehow :) Hubert Am Mi., 13. Feb. 2019 um 06:46 Uhr schrieb Manoj Pillai : > > > > On Wed, Feb 13, 2019 at 10:51 AM Raghavendra Gowdappa wrote: >> >> >> >> On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa wrote: >>> >>> All, >>> >>> We've found perf xlators io-cache and read-ahead not adding any performance improvement. At best read-ahead is redundant due to kernel read-ahead >> >> >> One thing we are still figuring out is whether kernel read-ahead is tunable. From what we've explored, it _looks_ like (may not be entirely correct), ra is capped at 128KB. If that's the case, I am interested in few things: >> * Are there any realworld applications/usecases, which would benefit from larger read-ahead (Manoj says block devices can do ra of 4MB)? 
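As a side note on the block-device read-ahead tunable referenced above, a rough sketch of how it can be inspected and raised on an ordinary block device follows (sda is a placeholder device name; sysfs paths, units and defaults vary by distribution and kernel, and none of this applies directly to a FUSE mount, which is the point being made in this thread):

  # current read-ahead for a block device, in KB
  cat /sys/block/sda/queue/read_ahead_kb

  # raise it to 4 MB (value is in KB, needs root)
  echo 4096 > /sys/block/sda/queue/read_ahead_kb

  # the same setting via blockdev, which reports/accepts 512-byte sectors
  blockdev --getra /dev/sda
  blockdev --setra 8192 /dev/sda

  # on RHEL/CentOS the throughput-performance tuned profile raises it as well
  tuned-adm profile throughput-performance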
> > > kernel read-ahead is adaptive but influenced by the read-ahead setting on the block device (/sys/block//queue/read_ahead_kb), which can be tuned. For RHEL specifically, the default is 128KB (last I checked) but the default RHEL tuned-profile, throughput-performance, bumps that up to 4MB. It should be fairly easy to rig up a test where 4MB read-ahead on the block device gives better performance than 128KB read-ahead. > > -- Manoj > >> * Is the limit on kernel ra tunable a hard one? IOW, what does it take to make it to do higher ra? If its difficult, can glusterfs read-ahead provide the expected performance improvement for these applications that would benefit from aggressive ra (as glusterfs can support larger ra sizes)? >> >> I am still inclined to prefer kernel ra as I think its more intelligent and can identify more sequential patterns than Glusterfs read-ahead [1][2]. >> [1] https://www.kernel.org/doc/ols/2007/ols2007v2-pages-273-284.pdf >> [2] https://lwn.net/Articles/155510/ >> >>> and at worst io-cache is degrading the performance for workloads that doesn't involve re-read. Given that VFS already have both these functionalities, I am proposing to have these two translators turned off by default for native fuse mounts. >>> >>> For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have these xlators on by having custom profiles. Comments? >>> >>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 >>> >>> regards, >>> Raghavendra > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users From budic at onholyground.com Wed Feb 13 14:51:06 2019 From: budic at onholyground.com (Darrell Budic) Date: Wed, 13 Feb 2019 08:51:06 -0600 Subject: [Gluster-users] Disabling read-ahead and io-cache for native fuse mounts In-Reply-To: References: <59A9002B-F427-4D94-A653-31A99DEF6CD8@onholyground.com> Message-ID: <869C2772-A443-4668-AA0B-B7ACB7A865B5@onholyground.com> Ah, ok, that?s what I thought. Then I have no complaints about improved defaults for the fuse case as long as the use case groups retain appropriately optimized settings. Thanks! > On Feb 12, 2019, at 11:14 PM, Raghavendra Gowdappa wrote: > > > > On Tue, Feb 12, 2019 at 11:09 PM Darrell Budic > wrote: > Is there an example of a custom profile you can share for my ovirt use case (with gfapi enabled)? > > I was speaking about a group setting like "group metadata-cache". Its just that custom options one would turn on for a class of applications or problems. > > Or are you just talking about the standard group settings for virt as a custom profile? > >> On Feb 12, 2019, at 7:22 AM, Raghavendra Gowdappa > wrote: >> >> https://review.gluster.org/22203 >> >> On Tue, Feb 12, 2019 at 5:38 PM Raghavendra Gowdappa > wrote: >> All, >> >> We've found perf xlators io-cache and read-ahead not adding any performance improvement. At best read-ahead is redundant due to kernel read-ahead and at worst io-cache is degrading the performance for workloads that doesn't involve re-read. Given that VFS already have both these functionalities, I am proposing to have these two translators turned off by default for native fuse mounts. >> >> For non-native fuse mounts like gfapi (NFS-ganesha/samba) we can have these xlators on by having custom profiles. Comments? 
>> >> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1665029 >> >> regards, >> Raghavendra >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From nbalacha at redhat.com Thu Feb 14 03:28:21 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Thu, 14 Feb 2019 08:58:21 +0530 Subject: [Gluster-users] Files on Brick not showing up in ls command In-Reply-To: References: Message-ID: On Tue, 12 Feb 2019 at 08:30, Patrick Nixon wrote: > The files are being written to via the glusterfs mount (and read on the > same client and a different client). I try not to do anything on the nodes > directly because I understand that can cause weirdness. As far as I can > tell, there haven't been any network disconnections, but I'll review the > client log to see if there any indication. I don't recall any issues last > time I was in there. > > If I understand correctly, the files are written to the volume from the client , but when the same client tries to list them again, those entries are not listed. Is that right? Do the files exist on the bricks? Would you be willing to provide a tcpdump of the client when doing this? If yes, please do the following: On the client system: - tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22 - Copy the files to the volume using the client - List the contents of the directory in which the files should exist - Stop the tcpdump capture and send it to us. Also provide the name of the directory and the missing files. Regards, NIthya > Thanks for the response! > > On Mon, Feb 11, 2019 at 7:35 PM Vijay Bellur wrote: > >> >> >> On Sun, Feb 10, 2019 at 5:20 PM Patrick Nixon wrote: >> >>> Hello! >>> >>> I have an 8 node distribute volume setup. I have one node that accept >>> files and stores them on disk, but when doing an ls, none of the files on >>> that specific node are being returned. >>> >>> Can someone give some guidance on what should be the best place to >>> start troubleshooting this? >>> >> >> >> Are the files being written from a glusterfs mount? If so, it might be >> worth checking if the network connectivity is fine between the client (that >> does ls) and the server/brick that contains these files. You could look up >> the client log file to check if there are any messages related to >> rpc disconnections. 
>> >> Regards, >> Vijay >> >> >>> # gluster volume info >>> >>> Volume Name: gfs >>> Type: Distribute >>> Volume ID: 44c8c4f1-2dfb-4c03-9bca-d1ae4f314a78 >>> Status: Started >>> Snapshot Count: 0 >>> Number of Bricks: 8 >>> Transport-type: tcp >>> Bricks: >>> Brick1: gfs01:/data/brick1/gv0 >>> Brick2: gfs02:/data/brick1/gv0 >>> Brick3: gfs03:/data/brick1/gv0 >>> Brick4: gfs05:/data/brick1/gv0 >>> Brick5: gfs06:/data/brick1/gv0 >>> Brick6: gfs07:/data/brick1/gv0 >>> Brick7: gfs08:/data/brick1/gv0 >>> Brick8: gfs04:/data/brick1/gv0 >>> Options Reconfigured: >>> cluster.min-free-disk: 10% >>> nfs.disable: on >>> performance.readdir-ahead: on >>> >>> # gluster peer status >>> Number of Peers: 7 >>> Hostname: gfs03 >>> Uuid: 4a2d4deb-f8dd-49fc-a2ab-74e39dc25e20 >>> State: Peer in Cluster (Connected) >>> Hostname: gfs08 >>> Uuid: 17705b3a-ed6f-4123-8e2e-4dc5ab6d807d >>> State: Peer in Cluster (Connected) >>> Hostname: gfs07 >>> Uuid: dd699f55-1a27-4e51-b864-b4600d630732 >>> State: Peer in Cluster (Connected) >>> Hostname: gfs06 >>> Uuid: 8eb2a965-2c1e-4a64-b5b5-b7b7136ddede >>> State: Peer in Cluster (Connected) >>> Hostname: gfs04 >>> Uuid: cd866191-f767-40d0-bf7b-81ca0bc032b7 >>> State: Peer in Cluster (Connected) >>> Hostname: gfs02 >>> Uuid: 6864c6ac-6ff4-423a-ae3c-f5fd25621851 >>> State: Peer in Cluster (Connected) >>> Hostname: gfs05 >>> Uuid: dcecb55a-87b8-4441-ab09-b52e485e5f62 >>> State: Peer in Cluster (Connected) >>> >>> All gluster nodes are running glusterfs 4.0.2 >>> The clients accessing the files are also running glusterfs 4.0.2 >>> Both are Ubuntu >>> >>> Thanks! >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From pnixon at gmail.com Thu Feb 14 03:35:26 2019 From: pnixon at gmail.com (Patrick Nixon) Date: Wed, 13 Feb 2019 22:35:26 -0500 Subject: [Gluster-users] Files on Brick not showing up in ls command In-Reply-To: References: Message-ID: Thanks for the follow up. After reviewing the logs Vijay mentioned, nothing useful was found. I wiped removed and wiped the brick tonight. I'm in the process of balancing the new brick and will resync the files onto the full gluster volume when that completes On Wed, Feb 13, 2019, 10:28 PM Nithya Balachandran wrote: > > > On Tue, 12 Feb 2019 at 08:30, Patrick Nixon wrote: > >> The files are being written to via the glusterfs mount (and read on the >> same client and a different client). I try not to do anything on the nodes >> directly because I understand that can cause weirdness. As far as I can >> tell, there haven't been any network disconnections, but I'll review the >> client log to see if there any indication. I don't recall any issues last >> time I was in there. >> >> > If I understand correctly, the files are written to the volume from the > client , but when the same client tries to list them again, those entries > are not listed. Is that right? > > Do the files exist on the bricks? > Would you be willing to provide a tcpdump of the client when doing this? 
> If yes, please do the following: > > On the client system: > > - tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22 > - Copy the files to the volume using the client > - List the contents of the directory in which the files should exist > - Stop the tcpdump capture and send it to us. > > > Also provide the name of the directory and the missing files. > > Regards, > NIthya > > > > > >> Thanks for the response! >> >> On Mon, Feb 11, 2019 at 7:35 PM Vijay Bellur wrote: >> >>> >>> >>> On Sun, Feb 10, 2019 at 5:20 PM Patrick Nixon wrote: >>> >>>> Hello! >>>> >>>> I have an 8 node distribute volume setup. I have one node that accept >>>> files and stores them on disk, but when doing an ls, none of the files on >>>> that specific node are being returned. >>>> >>>> Can someone give some guidance on what should be the best place to >>>> start troubleshooting this? >>>> >>> >>> >>> Are the files being written from a glusterfs mount? If so, it might be >>> worth checking if the network connectivity is fine between the client (that >>> does ls) and the server/brick that contains these files. You could look up >>> the client log file to check if there are any messages related to >>> rpc disconnections. >>> >>> Regards, >>> Vijay >>> >>> >>>> # gluster volume info >>>> >>>> Volume Name: gfs >>>> Type: Distribute >>>> Volume ID: 44c8c4f1-2dfb-4c03-9bca-d1ae4f314a78 >>>> Status: Started >>>> Snapshot Count: 0 >>>> Number of Bricks: 8 >>>> Transport-type: tcp >>>> Bricks: >>>> Brick1: gfs01:/data/brick1/gv0 >>>> Brick2: gfs02:/data/brick1/gv0 >>>> Brick3: gfs03:/data/brick1/gv0 >>>> Brick4: gfs05:/data/brick1/gv0 >>>> Brick5: gfs06:/data/brick1/gv0 >>>> Brick6: gfs07:/data/brick1/gv0 >>>> Brick7: gfs08:/data/brick1/gv0 >>>> Brick8: gfs04:/data/brick1/gv0 >>>> Options Reconfigured: >>>> cluster.min-free-disk: 10% >>>> nfs.disable: on >>>> performance.readdir-ahead: on >>>> >>>> # gluster peer status >>>> Number of Peers: 7 >>>> Hostname: gfs03 >>>> Uuid: 4a2d4deb-f8dd-49fc-a2ab-74e39dc25e20 >>>> State: Peer in Cluster (Connected) >>>> Hostname: gfs08 >>>> Uuid: 17705b3a-ed6f-4123-8e2e-4dc5ab6d807d >>>> State: Peer in Cluster (Connected) >>>> Hostname: gfs07 >>>> Uuid: dd699f55-1a27-4e51-b864-b4600d630732 >>>> State: Peer in Cluster (Connected) >>>> Hostname: gfs06 >>>> Uuid: 8eb2a965-2c1e-4a64-b5b5-b7b7136ddede >>>> State: Peer in Cluster (Connected) >>>> Hostname: gfs04 >>>> Uuid: cd866191-f767-40d0-bf7b-81ca0bc032b7 >>>> State: Peer in Cluster (Connected) >>>> Hostname: gfs02 >>>> Uuid: 6864c6ac-6ff4-423a-ae3c-f5fd25621851 >>>> State: Peer in Cluster (Connected) >>>> Hostname: gfs05 >>>> Uuid: dcecb55a-87b8-4441-ab09-b52e485e5f62 >>>> State: Peer in Cluster (Connected) >>>> >>>> All gluster nodes are running glusterfs 4.0.2 >>>> The clients accessing the files are also running glusterfs 4.0.2 >>>> Both are Ubuntu >>>> >>>> Thanks! >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... 
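Alongside the tcpdump capture requested above, it can also help to confirm whether the "missing" entries actually exist on the individual bricks and carry sane gluster xattrs. A rough sketch, assuming the brick paths from the volume info in this thread and a directory/file of your choosing (hostnames and paths are placeholders; reading from the brick is fine, just avoid writing to it directly):

  # run on each brick host (gfs01..gfs08), for the directory that looks incomplete
  ls -l /data/brick1/gv0/path/to/dir

  # inspect the gluster xattrs on the directory and on one of the missing files
  getfattr -d -m . -e hex /data/brick1/gv0/path/to/dir
  getfattr -d -m . -e hex /data/brick1/gv0/path/to/dir/missing-file

A zero-byte entry with mode ---------T and a trusted.glusterfs.dht.linkto xattr is a DHT link file pointing at another brick, not real data; the actual file should exist on exactly one of the bricks of a pure distribute volume.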
URL: From nbalacha at redhat.com Thu Feb 14 05:03:14 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Thu, 14 Feb 2019 10:33:14 +0530 Subject: [Gluster-users] Files on Brick not showing up in ls command In-Reply-To: References: Message-ID: Let me know if you still see problems. Thanks, Nithya On Thu, 14 Feb 2019 at 09:05, Patrick Nixon wrote: > Thanks for the follow up. After reviewing the logs Vijay mentioned, > nothing useful was found. > > I wiped removed and wiped the brick tonight. I'm in the process of > balancing the new brick and will resync the files onto the full gluster > volume when that completes > > On Wed, Feb 13, 2019, 10:28 PM Nithya Balachandran > wrote: > >> >> >> On Tue, 12 Feb 2019 at 08:30, Patrick Nixon wrote: >> >>> The files are being written to via the glusterfs mount (and read on the >>> same client and a different client). I try not to do anything on the nodes >>> directly because I understand that can cause weirdness. As far as I can >>> tell, there haven't been any network disconnections, but I'll review the >>> client log to see if there any indication. I don't recall any issues last >>> time I was in there. >>> >>> >> If I understand correctly, the files are written to the volume from the >> client , but when the same client tries to list them again, those entries >> are not listed. Is that right? >> >> Do the files exist on the bricks? >> Would you be willing to provide a tcpdump of the client when doing this? >> If yes, please do the following: >> >> On the client system: >> >> - tcpdump -i any -s 0 -w /var/tmp/dirls.pcap tcp and not port 22 >> - Copy the files to the volume using the client >> - List the contents of the directory in which the files should exist >> - Stop the tcpdump capture and send it to us. >> >> >> Also provide the name of the directory and the missing files. >> >> Regards, >> NIthya >> >> >> >> >> >>> Thanks for the response! >>> >>> On Mon, Feb 11, 2019 at 7:35 PM Vijay Bellur wrote: >>> >>>> >>>> >>>> On Sun, Feb 10, 2019 at 5:20 PM Patrick Nixon wrote: >>>> >>>>> Hello! >>>>> >>>>> I have an 8 node distribute volume setup. I have one node that >>>>> accept files and stores them on disk, but when doing an ls, none of the >>>>> files on that specific node are being returned. >>>>> >>>>> Can someone give some guidance on what should be the best place to >>>>> start troubleshooting this? >>>>> >>>> >>>> >>>> Are the files being written from a glusterfs mount? If so, it might be >>>> worth checking if the network connectivity is fine between the client (that >>>> does ls) and the server/brick that contains these files. You could look up >>>> the client log file to check if there are any messages related to >>>> rpc disconnections. 
>>>> >>>> Regards, >>>> Vijay >>>> >>>> >>>>> # gluster volume info >>>>> >>>>> Volume Name: gfs >>>>> Type: Distribute >>>>> Volume ID: 44c8c4f1-2dfb-4c03-9bca-d1ae4f314a78 >>>>> Status: Started >>>>> Snapshot Count: 0 >>>>> Number of Bricks: 8 >>>>> Transport-type: tcp >>>>> Bricks: >>>>> Brick1: gfs01:/data/brick1/gv0 >>>>> Brick2: gfs02:/data/brick1/gv0 >>>>> Brick3: gfs03:/data/brick1/gv0 >>>>> Brick4: gfs05:/data/brick1/gv0 >>>>> Brick5: gfs06:/data/brick1/gv0 >>>>> Brick6: gfs07:/data/brick1/gv0 >>>>> Brick7: gfs08:/data/brick1/gv0 >>>>> Brick8: gfs04:/data/brick1/gv0 >>>>> Options Reconfigured: >>>>> cluster.min-free-disk: 10% >>>>> nfs.disable: on >>>>> performance.readdir-ahead: on >>>>> >>>>> # gluster peer status >>>>> Number of Peers: 7 >>>>> Hostname: gfs03 >>>>> Uuid: 4a2d4deb-f8dd-49fc-a2ab-74e39dc25e20 >>>>> State: Peer in Cluster (Connected) >>>>> Hostname: gfs08 >>>>> Uuid: 17705b3a-ed6f-4123-8e2e-4dc5ab6d807d >>>>> State: Peer in Cluster (Connected) >>>>> Hostname: gfs07 >>>>> Uuid: dd699f55-1a27-4e51-b864-b4600d630732 >>>>> State: Peer in Cluster (Connected) >>>>> Hostname: gfs06 >>>>> Uuid: 8eb2a965-2c1e-4a64-b5b5-b7b7136ddede >>>>> State: Peer in Cluster (Connected) >>>>> Hostname: gfs04 >>>>> Uuid: cd866191-f767-40d0-bf7b-81ca0bc032b7 >>>>> State: Peer in Cluster (Connected) >>>>> Hostname: gfs02 >>>>> Uuid: 6864c6ac-6ff4-423a-ae3c-f5fd25621851 >>>>> State: Peer in Cluster (Connected) >>>>> Hostname: gfs05 >>>>> Uuid: dcecb55a-87b8-4441-ab09-b52e485e5f62 >>>>> State: Peer in Cluster (Connected) >>>>> >>>>> All gluster nodes are running glusterfs 4.0.2 >>>>> The clients accessing the files are also running glusterfs 4.0.2 >>>>> Both are Ubuntu >>>>> >>>>> Thanks! >>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Thu Feb 14 05:41:36 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Thu, 14 Feb 2019 11:11:36 +0530 Subject: [Gluster-users] Gluster Container Storage: Release Update Message-ID: Hello everyone, We are announcing v1.0RC release of GlusterCS this week!** The version 1.0 is due along with *glusterfs-6.0* next month. Below are the Goals for v1.0: - RWX PVs - Scale and Performance - RWO PVs - Simple, leaner stack with Gluster?s Virtual Block. - Thin Arbiter (2 DataCenter Replicate) Support for RWX volume. - RWO hosting volume to use Thin Arbiter volume type would be still in Alpha. - Integrated monitoring. - Simple Install / Overall user-experience. Along with above, we are in Alpha state to support GCS on ARM architecture. We are also trying to get the website done for GCS @ https://gluster.github.io/gcs We are looking for some validation of the GCS containers, and the overall gluster stack, in your k8s setup. While we are focusing more on getting stability, and better user-experience, we are also trying to ship few tech-preview items, for early preview. The main item on this is loopback based bricks ( https://github.com/gluster/glusterd2/pull/1473), which allows us to bring more data services on top of Gluster with more options in container world, specially with backup and recovery. 
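For readers unfamiliar with the loopback-brick idea mentioned above, the general technique (independent of how glusterd2 wires it up) looks roughly like the sketch below; the size, paths and the reflink option are illustrative assumptions:

  # back a brick with a sparse file instead of a dedicated device
  truncate -s 100G /srv/brick1.img
  mkfs.xfs -m reflink=1 /srv/brick1.img    # reflink needs a recent xfsprogs/kernel
  mkdir -p /bricks/brick1
  mount -o loop /srv/brick1.img /bricks/brick1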
The above feature also makes better snapshot/clone story for gluster in containers with reflink support on XFS. *(NOTE: this will be a future improvement)* This email is a request for help with regard to testing and feedback on this new stack, in its alpha release tag. Do let us know if there are any concerns. We are ready to take anything from ?This is BS!!? to ?Wow! this looks really simple, works without hassle? [image: :smile:] Btw, if you are interested to try / help, few things to note: - GCS uses CSI spec v1.0, which is only available from k8s 1.13+ - We do have weekly meetings on GCS as announced in https://lists.gluster.org/pipermail/gluster-devel/2019-January/055774.html - Feel free to jump in if interested. - ie, Every Thursday, 15:00 UTC. - GCS doesn?t have any operator support yet, but for simplicity, you can also try using https://github.com/aravindavk/kubectl-gluster - Planned to be integrated in later versions. - We are not great at creating cool website, help in making GCS homepage would be great too :-) Interested? feel free to jump into Architecture call today. Regards, Gluster Container Storage Team PS: The meeting minutes, where the release pointers were discussed is @ https://hackmd.io/sj9ik9SCTYm81YcQDOOrtw?both ** - subject to resolving some blockers @ https://waffle.io/gluster/gcs?label=GCS%2F1.0 -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaik.salam at tcs.com Thu Feb 14 06:27:01 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 14 Feb 2019 11:57:01 +0530 Subject: [Gluster-users] (PLEASE UNDERSTAND our concern as TOP PRIORITY) : Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID: Hi John/Michael Could you please at least provide workaround for this issue. We stuck from last two months and unable to use our setup We tried following ways. 1. heketi pod restart 2. export pending operations and clear and import but still same issue when I try to create single volume at a time. Please understand our concern and provide workaround. BR Salam From: Shaik Salam/HYD/TCS To: "John Mulligan" Cc: "gluster-users at gluster.org List" , "Michael Adam" , Madhu Rajanna Date: 01/25/2019 04:03 PM Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi John, Could you please have look my issue If you have time (atleast provide workaround). Thanks in advance. BR Salam From: "Shaik Salam" To: Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/25/2019 02:55 PM Subject: Re: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Sent by: gluster-users-bounces at gluster.org "External email. Open with Caution" Hi John, Please find db dump and heketi log. Here kernel version. Please let me know If you need more information. 
[root at app2 ~]# uname -a Linux app2.matrix.nokia.com 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux Hardware: HP GEN8 OS; NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7" From: "Madhu Rajanna" To: "Shaik Salam" , "John Mulligan" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 10:52 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" Adding John who is having more idea about how to debug this one. @Shaik Salam can you some more info on the hardware on which you are running heketi (kernel details) On Thu, Jan 24, 2019 at 7:42 PM Shaik Salam wrote: Hi Madhu, Sorry to disturb could you please provide atleast work around (to clear requests which stuck) to move further. We are also not able to find root cause from glusterd logs. Please find attachment. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 04:12 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, Please let me know If any other information required. BR Salam From: Shaik Salam/HYD/TCS To: "Madhu Rajanna" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 03:23 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Madhu, This is complete one after restart of heketi pod and process log. BR Salam [attachment "heketi-pod-complete.log" deleted by Shaik Salam/HYD/TCS] [attachment "ps-aux.txt" deleted by Shaik Salam/HYD/TCS] From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 01:55 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" the logs you provided is not complete, not able to figure out which command is struck, can you reattach the complete output of `ps aux` and also attach complete heketi logs. On Thu, Jan 24, 2019 at 1:41 PM Shaik Salam wrote: Hi Madhu, Please find requested info. BR Salam From: Madhu Rajanna To: Shaik Salam Cc: "gluster-users at gluster.org List" , Michael Adam Date: 01/24/2019 01:33 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" the heketi logs you have attached is not complete i believe, can you povide the complete heketi logs and also an we get the output of "ps aux" from the gluster pods ? I want to see if any lvm commands or gluster commands are "stuck". On Thu, Jan 24, 2019 at 1:16 PM Shaik Salam wrote: Hi Madhu. I tried lot of times restarted heketi pod but not resolved. sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 0 New: 0 Stale: 0 Now you can see all operations are zero. Now I try to create single volume below is observation in-flight reaching slowly to 8. 
sh-4.4# heketi-cli server operations infoCLI_SERVER=http://localhost:8080 ; export HEKETI_CLI_USE Operation Counts: Total: 0 In-Flight: 6 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 sh-4.4# heketi-cli server operations info Operation Counts: Total: 0 In-Flight: 7 New: 0 Stale: 0 [negroni] Completed 200 OK in 186.286?s [negroni] Started POST /volumes [negroni] Started GET /operations [negroni] Completed 200 OK in 166.294?s [negroni] Started GET /operations [negroni] Completed 200 OK in 186.411?s [negroni] Started GET /operations [negroni] Completed 200 OK in 179.796?s [negroni] Started POST /volumes [negroni] Started POST /volumes [negroni] Started POST /volumes [negroni] Started POST /volumes [negroni] Started GET /operations [negroni] Completed 200 OK in 131.108?s [negroni] Started POST /volumes [negroni] Started GET /operations [negroni] Completed 200 OK in 111.392?s [negroni] Started GET /operations [negroni] Completed 200 OK in 265.023?s [negroni] Started GET /operations [negroni] Completed 200 OK in 179.364?s [negroni] Started GET /operations [negroni] Completed 200 OK in 295.058?s [negroni] Started GET /operations [negroni] Completed 200 OK in 146.857?s [negroni] Started POST /volumes [negroni] Started POST /volumes [heketi] WARNING 2019/01/24 07:43:36 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 403.166?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/24 07:43:51 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 193.554?s But for pod volume is not creating. 1:15:36 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. 9 times in the last 2 minutes 1:13:21 PM Warning Provisioning failed Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume . 8 times in the last From: "Madhu Rajanna" To: "Shaik Salam" Cc: "gluster-users at gluster.org List" , "Michael Adam" Date: 01/24/2019 12:51 PM Subject: Re: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy "External email. Open with Caution" HI Shaik, can you provide me the outpout of $heketi-cli server operations info from heketi pod as a workround you can try restarting the heketi pod. This will cause the current operations to go stale, but other pending pvcs may go to Bound state Regards, Madhu R On Thu, Jan 24, 2019 at 12:36 PM Shaik Salam wrote: H Madhu, Could you please have look my issue If you have time (atleast workaround). I am unable to send mail to "John Mulligan" " who is currently handling issue https://bugzilla.redhat.com/show_bug.cgi?id=1636912 BR Salam From: Shaik Salam/HYD/TCS To: "John Mulligan" , "Michael Adam" < madam at redhat.com>, "Madhu Rajanna" Cc: "gluster-users at gluster.org List" Date: 01/24/2019 12:21 PM Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi All, We are facing also following issue on openshift origin while we are creating pvc for pods. 
(atlease provide workaround to move further) Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. Please find heketidb dump and log [negroni] Completed 429 Too Many Requests in 250.763?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:07:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 169.08?s [negroni] Started DELETE /volumes/520bc5f4e1bfd029855a72f9ca7ebf6c [negroni] Completed 404 Not Found in 148.125?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 496.624?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 101.673?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 209.681?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 103.595?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 297.594?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:34 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 96.75?s [negroni] Started POST /volumes [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 477.007?s [heketi] WARNING 2019/01/23 12:08:49 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 165.38?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 488.253?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:04 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 171.836?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 208.59?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/23 12:09:19 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 125.141?s [negroni] Started DELETE /volumes/99e87ecd0a816ac34ae5a04eabc1d606 [negroni] Completed 404 Not Found in 138.687?s [negroni] Started POST /volumes BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. 
Thank you -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 -- Madhu Rajanna Software Engineer Red Hat Bangalore, India mrajanna at redhat.com M: +91-9741133155 [attachment "heketi-complete.log.txt" deleted by Shaik Salam/HYD/TCS] [attachment "heketi-gluster.db.txt" deleted by Shaik Salam/HYD/TCS] _______________________________________________ Gluster-users mailing list Gluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From shaik.salam at tcs.com Thu Feb 14 10:50:06 2019 From: shaik.salam at tcs.com (Shaik Salam) Date: Thu, 14 Feb 2019 16:20:06 +0530 Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy In-Reply-To: References: Message-ID: Hi Raghavendra, We are also facing following issue which is mentioned in case on openshift origin while we are creating pvc for pods. (Please provide workaround to move further (pod restart doesn't workout) https://bugzilla.redhat.com/show_bug.cgi?id=1630117 https://bugzilla.redhat.com/show_bug.cgi?id=1636912 Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. at a time to create only one volume when in-flight operations are zero. Once volume requested it reaches to 8. Now single volume not able to create and we are till now mostly 10 volumes are created. 
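A rough sketch of the usual inspection steps when heketi keeps reporting "operations in-flight (8) exceeds limit (8)" is given below. The server URL, db path and the availability of the cleanup command depend on the heketi version and deployment, so treat these as assumptions to verify, and back up heketi.db before touching it:

  # from inside the heketi pod
  export HEKETI_CLI_SERVER=http://localhost:8080
  heketi-cli server operations info        # total / in-flight / new / stale counts

  # newer heketi releases can clear stale or failed pending operations online
  heketi-cli server operations cleanup

  # otherwise, with the heketi server stopped, the db can be exported to JSON,
  # the pending-operation entries reviewed and removed, and the db imported back
  heketi db export --dbfile /var/lib/heketi/heketi.db --jsonfile /tmp/heketi-db.json
  heketi db import --dbfile /var/lib/heketi/heketi.db.new --jsonfile /tmp/heketi-db.json
  # then move the new db into place and restart the heketi pod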
Please find heketidb dump and log [negroni] Completed 200 OK in 98.699?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 106.654?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 185.406?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 102.664?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 192.658?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 198.611?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 124.254?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 101.491?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 116.997?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 100.171?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 109.238?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/28 06:50:57 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 191.118?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 188.791?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 94.436?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 110.893?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.132?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 96.15?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.682?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 140.543?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 182.066?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 151.572?s BR Salam From: Shaik Salam/HYD/TCS To: rtalur at redhat.com Cc: "gluster-users at gluster.org List" , "John Mulligan" , "Michael Adam" Date: 02/08/2019 12:05 PM Subject: Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy Hi Raghavendra, We are also facing following issue which is mentioned in case on openshift origin while we are creating pvc for pods. (Please provide workaround to move further (pod restart doesn't workout) https://bugzilla.redhat.com/show_bug.cgi?id=1630117 https://bugzilla.redhat.com/show_bug.cgi?id=1636912 Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later.. at a time to create only one volume when in-flight operations are zero. Once volume requested it reaches to 8. Now single volume not able to create and we are till now mostly 10 volumes are created. 
Please find heketidb dump and log [negroni] Completed 200 OK in 98.699?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 106.654?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 185.406?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 102.664?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 192.658?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 198.611?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 124.254?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 101.491?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 116.997?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 100.171?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 109.238?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/28 06:50:57 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 191.118?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 188.791?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 94.436?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 110.893?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.132?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 96.15?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.682?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 140.543?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 182.066?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 151.572?s BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: heketi-log.txt URL: -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: db dump.txt URL: From bengoa at gmail.com Mon Feb 18 16:58:21 2019 From: bengoa at gmail.com (Alberto Bengoa) Date: Mon, 18 Feb 2019 16:58:21 +0000 Subject: [Gluster-users] High network traffic with performance.readdir-ahead on Message-ID: Hello folks, We are working on a migration from Gluster 3.8 to 5.3. Because it has a long migration path, we decided to install new servers running version 5.3 and then migrate the clients updating and pointing them to the new cluster. 
As a bonus, we still keep a rollback option in case of problems. We made our first migration attempt today and, unfortunately, we had to rollback to the old cluster. Since the very few minutes after switching clients from old to the new cluster, we noticed an unusual network traffic on glusters servers (around 320mbps) for that time. Near to 08:05 (our first daily peak is 8AM) we reached near to 1gbps during some minutes, and the traffic kept sustaining really high (over 800mbps) up to our second daily peak (at 9AM) when we reached again 1gbps. We decided to rollback the main production servers to old cluster, and kept some servers on the new one. We observed the network traffic going down again to around 300mbps. Talking with @nbalacha (Thank you again, man!) on IRC channel he suggested disabling performance.readdir-ahead option and the traffic went instantly down to near to 10mbps. A graph showing all these events can be found here: https://pasteboard.co/I1JR7ck.png So, the first point here, should performance.readdir-ahead be on by default? Maybe our scenario isn't the best use scenario, because, in fact, we do have hundreds of thousands of directories and it looks to be causing much more problems than benefits. Another thing we noticed is that when we point clients running new gluster version (5.3) to the old cluster (version 3.8) we also ran into the high traffic scenario, even already having performance.readdir-ahead switched to "off" (the default option for this version). You can see these high traffics on old cluster here: https://pasteboard.co/I1KdTUd.png . We are aware that having clients and servers running different versions isn't recommended and we are doing that just for debug/tests purposes. About our setup, we have ~= 1.5T volume running in Replicated mode (2 servers each cluster). We have around 30 clients mounting these volumes through fuse.glusterfs. # gluster volume info of new cluster Volume Name: X Type: Replicate Volume ID: 1d8f7d2d-bda6-4f1c-aa10-6ad29e0b7f5e Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: fs02tmp.x.net:/var/data/glusterfs/x/brick Brick2: fs01tmp.x.net:/var/data/glusterfs/x/brick Options Reconfigured: performance.readdir-ahead: off client.event-threads: 4 server.event-threads: 4 server.allow-insecure: on performance.client-io-threads: off nfs.disable: on transport.address-family: inet performance.io-thread-count: 32 performance.cache-size: 1900MB performance.write-behind-window-size: 16MB performance.flush-behind: on network.ping-timeout: 10 # gluster volume info of old cluster Volume Name: X Type: Replicate Volume ID: 1bd3b5d8-b10f-4c4b-a28a-06ea4cfa1d89 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: fs1.x.net:/var/local/gfs Brick2: fs2.x.net:/var/local/gfs Options Reconfigured: network.ping-timeout: 10 performance.cache-size: 512MB server.allow-insecure: on client.bind-insecure: on I was able to collect a profile from new gluster and pasted here: https://pastebin.com/ffF8RVH4 . The sad part is that I was unable to reproduce the issue after reenabling performance.readdir-ahead after. Not sure if the clients connected to the cluster were unable to create a workload near to the one that we had this morning. We'll try to recreate the condition that we had soon. I can provide more info and tests if you guys need it. Cheers, Alberto Bengoa -------------- next part -------------- An HTML attachment was scrubbed... 
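A quick way to reproduce the toggle described above on either cluster, and to narrow down which file operations generate the traffic, is sketched below (the volume name X is taken from the volume info in this message; interpreting the profile output is left to the thread):

  # check and disable the translator on the volume
  gluster volume get X performance.readdir-ahead
  gluster volume set X performance.readdir-ahead off

  # capture a per-brick fop profile while the traffic is high
  gluster volume profile X start
  gluster volume profile X info incremental   # run periodically; counts/latency since last call
  gluster volume profile X stop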
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From meira at cesup.ufrgs.br Mon Feb 18 17:53:19 2019
From: meira at cesup.ufrgs.br (Lindolfo Meira)
Date: Mon, 18 Feb 2019 14:53:19 -0300 (-03)
Subject: [Gluster-users] GlusterFS Scale
Message-ID:

We're running some benchmarks on a striped glusterfs volume.

We have 6 identical servers acting as bricks. Measured link speed between these servers is 3.36GB/s. Link speed between clients of the parallel file system and its servers is also 3.36GB/s. So we're expecting this system to have a write performance of around 20.16GB/s (6 times 3.36GB/s), minus some write overhead.

If we write to the system from a single client, we manage to write at around 3.36GB/s. That's okay, because we're limited by the max throughput of that client's network adapter. But when we account for that and write from 6 or more clients, we can never get past 11GB/s. Is that right? Is this really the overhead to be expected? We'd appreciate any inputs.

Output of gluster volume info:

Volume Name: gfs0
Type: Stripe
Volume ID: 2ca3dd45-6209-43ff-a164-7f2694097c64
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 6 = 6
Transport-type: tcp
Bricks:
Brick1: pfs01-ib:/mnt/data
Brick2: pfs02-ib:/mnt/data
Brick3: pfs03-ib:/mnt/data
Brick4: pfs04-ib:/mnt/data
Brick5: pfs05-ib:/mnt/data
Brick6: pfs06-ib:/mnt/data
Options Reconfigured:
cluster.stripe-block-size: 128KB
performance.cache-size: 32MB
performance.write-behind-window-size: 1MB
performance.strict-write-ordering: off
performance.strict-o-direct: off
performance.stat-prefetch: off
server.event-threads: 4
client.event-threads: 2
performance.io-thread-count: 16
transport.address-family: inet
nfs.disable: on
cluster.localtime-logging: enable

Thanks,

Lindolfo Meira, MSc
Diretor Geral, Centro Nacional de Supercomputação
Universidade Federal do Rio Grande do Sul
+55 (51) 3308-3139

From shaik.salam at tcs.com Fri Feb 8 06:35:02 2019
From: shaik.salam at tcs.com (Shaik Salam)
Date: Fri, 8 Feb 2019 12:05:02 +0530
Subject: [Gluster-users] Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: server busy
Message-ID:

Hi Raghavendra,

We are also facing the following issue, mentioned in the OpenShift Origin cases below, while creating PVCs for pods. (Please provide a workaround so we can move forward; restarting the pod doesn't help.)

https://bugzilla.redhat.com/show_bug.cgi?id=1630117
https://bugzilla.redhat.com/show_bug.cgi?id=1636912

Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume
Failed to provision volume with StorageClass "glusterfs-storage": glusterfs: create volume err: error creating volume Server busy. Retry operation later..

We only create one volume at a time, and only when in-flight operations are at zero; as soon as a volume is requested, the in-flight count jumps to 8. Right now even a single volume cannot be created, and so far only about 10 volumes have been created in total.
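(A quick way to see how often heketi is hitting the in-flight limit visible in the log excerpt below is to count the corresponding warnings; the log location is an assumption, since it depends on how heketi is deployed:

# count rejected requests caused by the in-flight operations limit
# (log path is an assumption; in OpenShift the same text appears in the heketi pod's logs)
grep -c "operations in-flight" /var/log/heketi/heketi.log

# show the most recent occurrences to see whether the count stays pinned at the limit
grep "operations in-flight" /var/log/heketi/heketi.log | tail -n 5

If the count never drops, the pending operations recorded in the heketi database usually need to be examined and cleaned up, as discussed in the bugzillas linked above.)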
Please find heketidb dump and log [negroni] Completed 200 OK in 98.699?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 106.654?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 185.406?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 102.664?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 192.658?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 198.611?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 124.254?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 101.491?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 116.997?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 100.171?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 109.238?s [negroni] Started POST /volumes [heketi] WARNING 2019/01/28 06:50:57 operations in-flight (8) exceeds limit (8) [negroni] Completed 429 Too Many Requests in 191.118?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 188.791?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 94.436?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 110.893?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.132?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 96.15?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 112.682?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 140.543?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 182.066?s [negroni] Started GET /queue/2604dd5965445711a4b6bc28592cb0f6 [negroni] Completed 200 OK in 151.572?s BR Salam =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you -------------- next part -------------- An HTML attachment was scrubbed... URL: From jaymef at gmail.com Thu Feb 14 18:36:38 2019 From: jaymef at gmail.com (Jayme) Date: Thu, 14 Feb 2019 14:36:38 -0400 Subject: [Gluster-users] Tracking down high writes in GlusterFS volume Message-ID: Running an oVirt 4.3 HCI 3-way replica cluster with SSD backed storage. I've noticed that my SSD writes (smart Total_LBAs_Written) are quite high on one particular drive. Specifically I've noticed one volume is much much higher total bytes written than others (despite using less overall space). My volume is writing over 1TB of data per day (by my manual calculation, and with glusterfs profiling) and wearing my SSDs quickly, how can I best determine which VM or process is at fault here? 
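(A possible way to narrow this down, sketched with placeholder names — the volume name, device and one-hour interval below are assumptions, not values from this setup. Gluster's own profiling shows which file operations dominate per brick, the SMART counter can be sampled over time to measure the real write rate per disk, and iotop inside each guest points at the writing process:

# start profiling on the suspect volume and dump per-brick FOP statistics later
gluster volume profile myvol start
gluster volume profile myvol info

# sample the SSD write counter twice, an hour apart, to estimate bytes written per hour
# (the LBA unit is vendor-specific, often 512 bytes, so convert accordingly)
smartctl -A /dev/sda | grep -i Total_LBAs_Written
sleep 3600
smartctl -A /dev/sda | grep -i Total_LBAs_Written

# inside each guest, accumulate per-process I/O to spot the heavy writer
iotop -a -o

Comparing the per-guest totals against the per-brick profile output should show whether one VM accounts for the bulk of the 1TB/day.)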
There are 5 low use VMs using the volume in question. I'm attempting to track iostats on each of the vm's individually but so far I'm not seeing anything obvious that would account for 1TB of writes per day that the gluster volume is reporting. -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Tue Feb 19 09:47:09 2019 From: revirii at googlemail.com (Hu Bert) Date: Tue, 19 Feb 2019 10:47:09 +0100 Subject: [Gluster-users] gluster 5.3: file or directory not read-/writeable, although it exists - cache? Message-ID: Hello @ll, one of our backend developers told me that, in the tomcat logs, he sees errors that directories on a glusterfs mount aren't readable. Within tomcat the errors look like this: 2019-02-19 07:39:27,124 WARN Path /data/repository/shared/public/staticmap/370/626 is existed but it is not directory java.nio.file.FileAlreadyExistsException: /data/repository/shared/public/staticmap/370/626 But the basic directory does exist, has been created on 2019-02-18 (and is readable on other clients): ls -lah /data/repository/shared/public/staticmap/370/626/ total 36K drwxr-xr-x 9 tomcat8 tomcat8 4.0K Feb 18 12:15 . drwxr-xr-x 522 tomcat8 tomcat8 4.0K Feb 19 10:29 .. drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:45 37062632 drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 09:29 37062647 drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 12:15 37062663 drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:18 37062668 drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:36 37062681 drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 16:53 37062682 drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 08:19 37062688 gluster v5.3, debian stretch. gluster volume info: https://pastebin.com/UBVWSUex gluster volume status: https://pastebin.com/3guxFq5m mount options on client: gluster1:/workdata /data/repository/shared/public glusterfs defaults,_netdev,lru-limit=0,backup-volfile-servers=gluster2:gluster3 0 0 brick mount options: /dev/md/4 /gluster/md4 xfs inode64,noatime,nodiratime 0 0 Hmm... problem with mount options? Or is some cache involved? Best regards, Hubert From revirii at googlemail.com Tue Feb 19 13:10:53 2019 From: revirii at googlemail.com (Hu Bert) Date: Tue, 19 Feb 2019 14:10:53 +0100 Subject: [Gluster-users] gluster 5.3: file or directory not read-/writeable, although it exists - cache? In-Reply-To: References: Message-ID: a little update: it seems to be only one client. Whenever there's a "no such file or directory" error in the logs, the directory can be read/opened on all the other clients. very strange... Nothing in glusterfs logs besides the zillion "dict is NULL [Invalid argument]" warnings. Must be something on the client itself. Am Di., 19. Feb. 2019 um 10:47 Uhr schrieb Hu Bert : > > Hello @ll, > > one of our backend developers told me that, in the tomcat logs, he > sees errors that directories on a glusterfs mount aren't readable. > Within tomcat the errors look like this: > > 2019-02-19 07:39:27,124 WARN Path > /data/repository/shared/public/staticmap/370/626 is existed but it is > not directory > java.nio.file.FileAlreadyExistsException: > /data/repository/shared/public/staticmap/370/626 > > But the basic directory does exist, has been created on 2019-02-18 > (and is readable on other clients): > > ls -lah /data/repository/shared/public/staticmap/370/626/ > total 36K > drwxr-xr-x 9 tomcat8 tomcat8 4.0K Feb 18 12:15 . > drwxr-xr-x 522 tomcat8 tomcat8 4.0K Feb 19 10:29 .. 
> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:45 37062632 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 09:29 37062647 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 12:15 37062663 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:18 37062668 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:36 37062681 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 16:53 37062682 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 08:19 37062688 > > gluster v5.3, debian stretch. > gluster volume info: https://pastebin.com/UBVWSUex > gluster volume status: https://pastebin.com/3guxFq5m > > mount options on client: > gluster1:/workdata /data/repository/shared/public glusterfs > defaults,_netdev,lru-limit=0,backup-volfile-servers=gluster2:gluster3 > 0 0 > > brick mount options: > /dev/md/4 /gluster/md4 xfs inode64,noatime,nodiratime 0 0 > > Hmm... problem with mount options? Or is some cache involved? > > > Best regards, > Hubert From revirii at googlemail.com Tue Feb 19 13:39:12 2019 From: revirii at googlemail.com (Hu Bert) Date: Tue, 19 Feb 2019 14:39:12 +0100 Subject: [Gluster-users] gluster 5.3: file or directory not read-/writeable, although it exists - cache? In-Reply-To: References: Message-ID: No, that conclusion was too early. Error happens on all clients, but the directories are different, e.g.: dir1 fails on client1, but works on all others dir2 fails on client2, but works on all others etc. Interesting: when doing a 'ls' on the dir and TAB->autocomplete (missing the last number), the directory appears: ls /data/repository/shared/public/images/370/435/3704359 37043592/ 37043593/ 37043595/ 37043597/ 37043598/ 37043599/ and the can be opened and cd'ed into. Content appears: ls -lah /shared/public/images/370/435/37043597/ total 1.6M drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 13:52 . drwxr-xr-x 69 tomcat8 tomcat8 4.0K Feb 17 21:42 .. -rw-r--r-- 1 tomcat8 tomcat8 7.0K Feb 19 13:52 100x100f.jpg -rw-r--r-- 1 tomcat8 tomcat8 235K Feb 19 13:23 1080x610r.jpg -rw-r--r-- 1 tomcat8 tomcat8 96K Feb 18 10:46 1200x1200s.jpg -rw-r--r-- 1 tomcat8 tomcat8 14K Feb 17 23:03 150x150f.jpg [...] Some error message in /var/log/glusterfs/data-repository-shared-public.log on one of the clients saying that mkdir fails because the directory exists: [2019-02-19 10:39:44.673279] W [fuse-bridge.c:582:fuse_entry_cbk] 0-glusterfs-fuse: 69757484: MKDIR() /images/370/435/37043597 => -1 (File exists) [2019-02-19 11:14:58.625014] W [fuse-bridge.c:582:fuse_entry_cbk] 0-glusterfs-fuse: 69977210: MKDIR() /images/370/435/37043597 => -1 (File exists) [2019-02-19 11:42:15.626959] W [fuse-bridge.c:582:fuse_entry_cbk] 0-glusterfs-fuse: 70148412: MKDIR() /images/370/435/37043597 => -1 (File exists) [2019-02-19 12:26:01.065483] W [fuse-bridge.c:582:fuse_entry_cbk] 0-glusterfs-fuse: 70406931: MKDIR() /images/370/435/37043597 => -1 (File exists) The directory exists -> warning is OK, but why doesn't it appear first? Am Di., 19. Feb. 2019 um 14:10 Uhr schrieb Hu Bert : > > a little update: it seems to be only one client. Whenever there's a > "no such file or directory" error in the logs, the directory can be > read/opened on all the other clients. very strange... > > Nothing in glusterfs logs besides the zillion "dict is NULL [Invalid > argument]" warnings. Must be something on the client itself. > > Am Di., 19. Feb. 2019 um 10:47 Uhr schrieb Hu Bert : > > > > Hello @ll, > > > > one of our backend developers told me that, in the tomcat logs, he > > sees errors that directories on a glusterfs mount aren't readable. 
> > Within tomcat the errors look like this: > > > > 2019-02-19 07:39:27,124 WARN Path > > /data/repository/shared/public/staticmap/370/626 is existed but it is > > not directory > > java.nio.file.FileAlreadyExistsException: > > /data/repository/shared/public/staticmap/370/626 > > > > But the basic directory does exist, has been created on 2019-02-18 > > (and is readable on other clients): > > > > ls -lah /data/repository/shared/public/staticmap/370/626/ > > total 36K > > drwxr-xr-x 9 tomcat8 tomcat8 4.0K Feb 18 12:15 . > > drwxr-xr-x 522 tomcat8 tomcat8 4.0K Feb 19 10:29 .. > > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:45 37062632 > > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 09:29 37062647 > > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 12:15 37062663 > > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:18 37062668 > > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:36 37062681 > > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 16:53 37062682 > > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 08:19 37062688 > > > > gluster v5.3, debian stretch. > > gluster volume info: https://pastebin.com/UBVWSUex > > gluster volume status: https://pastebin.com/3guxFq5m > > > > mount options on client: > > gluster1:/workdata /data/repository/shared/public glusterfs > > defaults,_netdev,lru-limit=0,backup-volfile-servers=gluster2:gluster3 > > 0 0 > > > > brick mount options: > > /dev/md/4 /gluster/md4 xfs inode64,noatime,nodiratime 0 0 > > > > Hmm... problem with mount options? Or is some cache involved? > > > > > > Best regards, > > Hubert From nbalacha at redhat.com Wed Feb 20 05:12:14 2019 From: nbalacha at redhat.com (Nithya Balachandran) Date: Wed, 20 Feb 2019 10:42:14 +0530 Subject: [Gluster-users] gluster 5.3: file or directory not read-/writeable, although it exists - cache? In-Reply-To: References: Message-ID: On Tue, 19 Feb 2019 at 15:18, Hu Bert wrote: > Hello @ll, > > one of our backend developers told me that, in the tomcat logs, he > sees errors that directories on a glusterfs mount aren't readable. > Within tomcat the errors look like this: > > 2019-02-19 07:39:27,124 WARN Path > /data/repository/shared/public/staticmap/370/626 is existed but it is > not directory > java.nio.file.FileAlreadyExistsException: > /data/repository/shared/public/staticmap/370/626 > Do you know which operation failed here? regards, Nithya > > But the basic directory does exist, has been created on 2019-02-18 > (and is readable on other clients): > > ls -lah /data/repository/shared/public/staticmap/370/626/ > total 36K > drwxr-xr-x 9 tomcat8 tomcat8 4.0K Feb 18 12:15 . > drwxr-xr-x 522 tomcat8 tomcat8 4.0K Feb 19 10:29 .. > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:45 37062632 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 09:29 37062647 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 12:15 37062663 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:18 37062668 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:36 37062681 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 16:53 37062682 > drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 08:19 37062688 > > gluster v5.3, debian stretch. > gluster volume info: https://pastebin.com/UBVWSUex > gluster volume status: https://pastebin.com/3guxFq5m > > mount options on client: > gluster1:/workdata /data/repository/shared/public glusterfs > defaults,_netdev,lru-limit=0,backup-volfile-servers=gluster2:gluster3 > 0 0 > > brick mount options: > /dev/md/4 /gluster/md4 xfs inode64,noatime,nodiratime 0 0 > > Hmm... problem with mount options? Or is some cache involved? 
> > > Best regards, > Hubert > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > -------------- next part -------------- An HTML attachment was scrubbed... URL: From archon810 at gmail.com Wed Feb 20 05:36:34 2019 From: archon810 at gmail.com (Artem Russakovskii) Date: Tue, 19 Feb 2019 21:36:34 -0800 Subject: [Gluster-users] Message repeated over and over after upgrade from 4.1 to 5.3: W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] In-Reply-To: References: Message-ID: Hi Nithya, Unfortunately, I just had another crash on the same server, with performance.write-behind still set to off. I'll email the core file privately. [2019-02-19 19:50:39.511743] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7f9598991329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7f9598ba2af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7f95a137d218] ) 2-dict: dict is NULL [Invalid argument] The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 95 times between [2019-02-19 19:49:07.655620] and [2019-02-19 19:50:39.499284] The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-_data3-replicate-0: selecting local read_child _data3-client-3" repeated 56 times between [2019-02-19 19:49:07.602370] and [2019-02-19 19:50:42.912766] pending frames: frame : type(1) op(LOOKUP) frame : type(0) op(0) patchset: git://git.gluster.org/glusterfs.git signal received: 6 time of crash: 2019-02-19 19:50:43 configuration details: argp 1 backtrace 1 dlfcn 1 libpthread 1 llistxattr 1 setfsid 1 spinlock 1 epoll.h 1 xattr.h 1 st_atim.tv_nsec 1 package-string: glusterfs 5.3 /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f95a138864c] /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f95a1392cb6] /lib64/libc.so.6(+0x36160)[0x7f95a054f160] /lib64/libc.so.6(gsignal+0x110)[0x7f95a054f0e0] /lib64/libc.so.6(abort+0x151)[0x7f95a05506c1] /lib64/libc.so.6(+0x2e6fa)[0x7f95a05476fa] /lib64/libc.so.6(+0x2e772)[0x7f95a0547772] /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f95a08dd0b8] /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f95994f0c9d] /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f9599503ba1] /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f9599788f3f] /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f95a1153820] /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f95a1153b6f] /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f95a1150063] /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f959aea00b2] /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f95a13e64c3] /lib64/libpthread.so.0(+0x7559)[0x7f95a08da559] /lib64/libc.so.6(clone+0x3f)[0x7f95a061181f] --------- [2019-02-19 19:51:34.425106] I [MSGID: 100030] [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 --process-name fuse --volfile-server=localhost --volfile-id=/_data3 /mnt/_data3) [2019-02-19 19:51:34.435206] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-02-19 
19:51:34.450272] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 2 [2019-02-19 19:51:34.450394] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 4 [2019-02-19 19:51:34.450488] I [MSGID: 101190] [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread with index 3 Sincerely, Artem -- Founder, Android Police , APK Mirror , Illogical Robot LLC beerpla.net | +ArtemRussakovskii | @ArtemR On Tue, Feb 12, 2019 at 12:38 AM Nithya Balachandran wrote: > > Not yet but we are discussing an interim release. It is going to take a > couple of days to review the fixes so not before then. We will update on > the list with dates once we decide. > > > On Tue, 12 Feb 2019 at 11:46, Artem Russakovskii > wrote: > >> Awesome. But is there a release schedule and an ETA for when these will >> be out in the repos? >> >> On Mon, Feb 11, 2019, 9:34 PM Raghavendra Gowdappa >> wrote: >> >>> >>> >>> On Tue, Feb 12, 2019 at 10:24 AM Artem Russakovskii >>> wrote: >>> >>>> Great job identifying the issue! >>>> >>>> Any ETA on the next release with the logging and crash fixes in it? >>>> >>> >>> I've marked write-behind corruption as a blocker for release-6. Logging >>> fixes are already in codebase. >>> >>> >>>> On Mon, Feb 11, 2019, 7:19 PM Raghavendra Gowdappa >>>> wrote: >>>> >>>>> >>>>> >>>>> On Mon, Feb 11, 2019 at 3:49 PM Jo?o Ba?to < >>>>> joao.bauto at neuro.fchampalimaud.org> wrote: >>>>> >>>>>> Although I don't have these error messages, I'm having fuse crashes >>>>>> as frequent as you. I have disabled write-behind and the mount has been >>>>>> running over the weekend with heavy usage and no issues. >>>>>> >>>>> >>>>> The issue you are facing will likely be fixed by patch [1]. Me, Xavi >>>>> and Nithya were able to identify the corruption in write-behind. >>>>> >>>>> [1] https://review.gluster.org/22189 >>>>> >>>>> >>>>>> I can provide coredumps before disabling write-behind if needed. I >>>>>> opened a BZ report >>>>>> with the >>>>>> crashes that I was having. >>>>>> >>>>>> *Jo?o Ba?to* >>>>>> --------------- >>>>>> >>>>>> *Scientific Computing and Software Platform* >>>>>> Champalimaud Research >>>>>> Champalimaud Center for the Unknown >>>>>> Av. Bras?lia, Doca de Pedrou?os >>>>>> 1400-038 Lisbon, Portugal >>>>>> fchampalimaud.org >>>>>> >>>>>> >>>>>> Artem Russakovskii escreveu no dia s?bado, >>>>>> 9/02/2019 ?(s) 22:18: >>>>>> >>>>>>> Alright. I've enabled core-dumping (hopefully), so now I'm waiting >>>>>>> for the next crash to see if it dumps a core for you guys to remotely debug. >>>>>>> >>>>>>> Then I can consider setting performance.write-behind to off and >>>>>>> monitoring for further crashes. >>>>>>> >>>>>>> Sincerely, >>>>>>> Artem >>>>>>> >>>>>>> -- >>>>>>> Founder, Android Police , APK Mirror >>>>>>> , Illogical Robot LLC >>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>> | @ArtemR >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Fri, Feb 8, 2019 at 7:22 PM Raghavendra Gowdappa < >>>>>>> rgowdapp at redhat.com> wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Feb 9, 2019 at 12:53 AM Artem Russakovskii < >>>>>>>> archon810 at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Nithya, >>>>>>>>> >>>>>>>>> I can try to disable write-behind as long as it doesn't heavily >>>>>>>>> impact performance for us. Which option is it exactly? I don't see it set >>>>>>>>> in my list of changed volume variables that I sent you guys earlier. 
>>>>>>>>> >>>>>>>> >>>>>>>> The option is performance.write-behind >>>>>>>> >>>>>>>> >>>>>>>>> Sincerely, >>>>>>>>> Artem >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Founder, Android Police , APK Mirror >>>>>>>>> , Illogical Robot LLC >>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>> | @ArtemR >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Feb 8, 2019 at 4:57 AM Nithya Balachandran < >>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Artem, >>>>>>>>>> >>>>>>>>>> We have found the cause of one crash. Unfortunately we have not >>>>>>>>>> managed to reproduce the one you reported so we don't know if it is the >>>>>>>>>> same cause. >>>>>>>>>> >>>>>>>>>> Can you disable write-behind on the volume and let us know if it >>>>>>>>>> solves the problem? If yes, it is likely to be the same issue. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> regards, >>>>>>>>>> Nithya >>>>>>>>>> >>>>>>>>>> On Fri, 8 Feb 2019 at 06:51, Artem Russakovskii < >>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Sorry to disappoint, but the crash just happened again, so >>>>>>>>>>> lru-limit=0 didn't help. >>>>>>>>>>> >>>>>>>>>>> Here's the snippet of the crash and the subsequent remount by >>>>>>>>>>> monit. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> [2019-02-08 01:13:05.854391] W [dict.c:761:dict_ref] >>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>> [0x7f4402b99329] >>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>> [0x7f4402daaaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>> [0x7f440b6b5218] ) 0-dict: dict is NULL [In >>>>>>>>>>> valid argument] >>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-_data1-replicate-0: >>>>>>>>>>> selecting local read_child _data1-client-3" repeated 39 times between >>>>>>>>>>> [2019-02-08 01:11:18.043286] and [2019-02-08 01:13:07.915604] >>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>> handler" repeated 515 times between [2019-02-08 01:11:17.932515] and >>>>>>>>>>> [2019-02-08 01:13:09.311554] >>>>>>>>>>> pending frames: >>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>> signal received: 6 >>>>>>>>>>> time of crash: >>>>>>>>>>> 2019-02-08 01:13:09 >>>>>>>>>>> configuration details: >>>>>>>>>>> argp 1 >>>>>>>>>>> backtrace 1 >>>>>>>>>>> dlfcn 1 >>>>>>>>>>> libpthread 1 >>>>>>>>>>> llistxattr 1 >>>>>>>>>>> setfsid 1 >>>>>>>>>>> spinlock 1 >>>>>>>>>>> epoll.h 1 >>>>>>>>>>> xattr.h 1 >>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7f440b6c064c] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7f440b6cacb6] >>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7f440a887160] >>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7f440a8870e0] >>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7f440a8886c1] >>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7f440a87f6fa] >>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7f440a87f772] >>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7f440ac150b8] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7f44036f8c9d] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7f440370bba1] >>>>>>>>>>> >>>>>>>>>>> 
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7f4403990f3f] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7f440b48b820] >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7f440b48bb6f] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f440b488063] >>>>>>>>>>> >>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7f44050a80b2] >>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7f440b71e4c3] >>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7f440ac12559] >>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7f440a94981f] >>>>>>>>>>> --------- >>>>>>>>>>> [2019-02-08 01:13:35.628478] I [MSGID: 100030] >>>>>>>>>>> [glusterfsd.c:2715:main] 0-/usr/sbin/glusterfs: Started running >>>>>>>>>>> /usr/sbin/glusterfs version 5.3 (args: /usr/sbin/glusterfs --lru-limit=0 >>>>>>>>>>> --process-name fuse --volfile-server=localhost --volfile-id=/_data1 >>>>>>>>>>> /mnt/_data1) >>>>>>>>>>> [2019-02-08 01:13:35.637830] I [MSGID: 101190] >>>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>>> with index 1 >>>>>>>>>>> [2019-02-08 01:13:35.651405] I [MSGID: 101190] >>>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>>> with index 2 >>>>>>>>>>> [2019-02-08 01:13:35.651628] I [MSGID: 101190] >>>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>>> with index 3 >>>>>>>>>>> [2019-02-08 01:13:35.651747] I [MSGID: 101190] >>>>>>>>>>> [event-epoll.c:622:event_dispatch_epoll_worker] 0-epoll: Started thread >>>>>>>>>>> with index 4 >>>>>>>>>>> [2019-02-08 01:13:35.652575] I [MSGID: 114020] >>>>>>>>>>> [client.c:2354:notify] 0-_data1-client-0: parent translators are >>>>>>>>>>> ready, attempting connect on transport >>>>>>>>>>> [2019-02-08 01:13:35.652978] I [MSGID: 114020] >>>>>>>>>>> [client.c:2354:notify] 0-_data1-client-1: parent translators are >>>>>>>>>>> ready, attempting connect on transport >>>>>>>>>>> [2019-02-08 01:13:35.655197] I [MSGID: 114020] >>>>>>>>>>> [client.c:2354:notify] 0-_data1-client-2: parent translators are >>>>>>>>>>> ready, attempting connect on transport >>>>>>>>>>> [2019-02-08 01:13:35.655497] I [MSGID: 114020] >>>>>>>>>>> [client.c:2354:notify] 0-_data1-client-3: parent translators are >>>>>>>>>>> ready, attempting connect on transport >>>>>>>>>>> [2019-02-08 01:13:35.655527] I >>>>>>>>>>> [rpc-clnt.c:2042:rpc_clnt_reconfig] 0-_data1-client-0: changing port >>>>>>>>>>> to 49153 (from 0) >>>>>>>>>>> Final graph: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Sincerely, >>>>>>>>>>> Artem >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>> | @ArtemR >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Feb 7, 2019 at 1:28 PM Artem Russakovskii < >>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> I've added the lru-limit=0 parameter to the mounts, and I see >>>>>>>>>>>> it's taken effect correctly: >>>>>>>>>>>> "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse >>>>>>>>>>>> --volfile-server=localhost --volfile-id=/ /mnt/" >>>>>>>>>>>> >>>>>>>>>>>> Let's see if it stops crashing or not. 
>>>>>>>>>>>> >>>>>>>>>>>> Sincerely, >>>>>>>>>>>> Artem >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>> | @ArtemR >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii < >>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Nithya, >>>>>>>>>>>>> >>>>>>>>>>>>> Indeed, I upgraded from 4.1 to 5.3, at which point I started >>>>>>>>>>>>> seeing crashes, and no further releases have been made yet. >>>>>>>>>>>>> >>>>>>>>>>>>> volume info: >>>>>>>>>>>>> Type: Replicate >>>>>>>>>>>>> Volume ID: ****SNIP**** >>>>>>>>>>>>> Status: Started >>>>>>>>>>>>> Snapshot Count: 0 >>>>>>>>>>>>> Number of Bricks: 1 x 4 = 4 >>>>>>>>>>>>> Transport-type: tcp >>>>>>>>>>>>> Bricks: >>>>>>>>>>>>> Brick1: ****SNIP**** >>>>>>>>>>>>> Brick2: ****SNIP**** >>>>>>>>>>>>> Brick3: ****SNIP**** >>>>>>>>>>>>> Brick4: ****SNIP**** >>>>>>>>>>>>> Options Reconfigured: >>>>>>>>>>>>> cluster.quorum-count: 1 >>>>>>>>>>>>> cluster.quorum-type: fixed >>>>>>>>>>>>> network.ping-timeout: 5 >>>>>>>>>>>>> network.remote-dio: enable >>>>>>>>>>>>> performance.rda-cache-limit: 256MB >>>>>>>>>>>>> performance.readdir-ahead: on >>>>>>>>>>>>> performance.parallel-readdir: on >>>>>>>>>>>>> network.inode-lru-limit: 500000 >>>>>>>>>>>>> performance.md-cache-timeout: 600 >>>>>>>>>>>>> performance.cache-invalidation: on >>>>>>>>>>>>> performance.stat-prefetch: on >>>>>>>>>>>>> features.cache-invalidation-timeout: 600 >>>>>>>>>>>>> features.cache-invalidation: on >>>>>>>>>>>>> cluster.readdir-optimize: on >>>>>>>>>>>>> performance.io-thread-count: 32 >>>>>>>>>>>>> server.event-threads: 4 >>>>>>>>>>>>> client.event-threads: 4 >>>>>>>>>>>>> performance.read-ahead: off >>>>>>>>>>>>> cluster.lookup-optimize: on >>>>>>>>>>>>> performance.cache-size: 1GB >>>>>>>>>>>>> cluster.self-heal-daemon: enable >>>>>>>>>>>>> transport.address-family: inet >>>>>>>>>>>>> nfs.disable: on >>>>>>>>>>>>> performance.client-io-threads: on >>>>>>>>>>>>> cluster.granular-entry-heal: enable >>>>>>>>>>>>> cluster.data-self-heal-algorithm: full >>>>>>>>>>>>> >>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>> Artem >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran < >>>>>>>>>>>>> nbalacha at redhat.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>> >>>>>>>>>>>>>> Do you still see the crashes with 5.3? If yes, please try >>>>>>>>>>>>>> mount the volume using the mount option lru-limit=0 and see if that helps. >>>>>>>>>>>>>> We are looking into the crashes and will update when have a fix. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Also, please provide the gluster volume info for the volume >>>>>>>>>>>>>> in question. >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> regards, >>>>>>>>>>>>>> Nithya >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii < >>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The fuse crash happened two more times, but this time monit >>>>>>>>>>>>>>> helped recover within 1 minute, so it's a great workaround for now. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> What's odd is that the crashes are only happening on one of >>>>>>>>>>>>>>> 4 servers, and I don't know why. 
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii < >>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The fuse crash happened again yesterday, to another volume. >>>>>>>>>>>>>>>> Are there any mount options that could help mitigate this? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In the meantime, I set up a monit ( >>>>>>>>>>>>>>>> https://mmonit.com/monit/) task to watch and restart the >>>>>>>>>>>>>>>> mount, which works and recovers the mount point within a minute. Not ideal, >>>>>>>>>>>>>>>> but a temporary workaround. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> By the way, the way to reproduce this "Transport endpoint >>>>>>>>>>>>>>>> is not connected" condition for testing purposes is to kill -9 the right >>>>>>>>>>>>>>>> "glusterfs --process-name fuse" process. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> monit check: >>>>>>>>>>>>>>>> check filesystem glusterfs_data1 with path >>>>>>>>>>>>>>>> /mnt/glusterfs_data1 >>>>>>>>>>>>>>>> start program = "/bin/mount /mnt/glusterfs_data1" >>>>>>>>>>>>>>>> stop program = "/bin/umount /mnt/glusterfs_data1" >>>>>>>>>>>>>>>> if space usage > 90% for 5 times within 15 cycles >>>>>>>>>>>>>>>> then alert else if succeeded for 10 cycles then alert >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> stack trace: >>>>>>>>>>>>>>>> [2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> [2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>> [0x7fa0249e4329] >>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>> [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>> [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch >>>>>>>>>>>>>>>> handler" repeated 26 times between [2019-02-01 23:21:20.857333] and >>>>>>>>>>>>>>>> [2019-02-01 23:21:56.164427] >>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: >>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-3" repeated 27 times between >>>>>>>>>>>>>>>> [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036] >>>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>>> frame : type(1) op(LOOKUP) >>>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>>> 2019-02-01 23:22:03 >>>>>>>>>>>>>>>> configuration details: >>>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>>> setfsid 1 
>>>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fa02c12d160] >>>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0] >>>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa] >>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fa02c125772] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f] >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820] >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2] >>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3] >>>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559] >>>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii < >>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The first (and so far only) crash happened at 2am the next >>>>>>>>>>>>>>>>> day after we upgraded, on only one of four servers and only to one of two >>>>>>>>>>>>>>>>> mounts. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I have no idea what caused it, but yeah, we do have a >>>>>>>>>>>>>>>>> pretty busy site (apkmirror.com), and it caused a >>>>>>>>>>>>>>>>> disruption for any uploads or downloads from that server until I woke up >>>>>>>>>>>>>>>>> and fixed the mount. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I wish I could be more helpful but all I have is that >>>>>>>>>>>>>>>>> stack trace. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I'm glad it's a blocker and will hopefully be resolved >>>>>>>>>>>>>>>>> soon. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan < >>>>>>>>>>>>>>>>> atumball at redhat.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi Artem, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Opened >>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, >>>>>>>>>>>>>>>>>> as a clone of other bugs where recent discussions happened), and marked it >>>>>>>>>>>>>>>>>> as a blocker for glusterfs-5.4 release. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> We already have fixes for log flooding - >>>>>>>>>>>>>>>>>> https://review.gluster.org/22128, and are the process of >>>>>>>>>>>>>>>>>> identifying and fixing the issue seen with crash. 
>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Can you please tell if the crashes happened as soon as >>>>>>>>>>>>>>>>>> upgrade ? or was there any particular pattern you observed before the crash. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -Amar >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii < >>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Within 24 hours after updating from rock solid 4.1 to >>>>>>>>>>>>>>>>>>> 5.3, I already got a crash which others have mentioned in >>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and >>>>>>>>>>>>>>>>>>> had to unmount, kill gluster, and remount: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>> [0x7fcccafcd329] >>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>> [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>> [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-3" repeated 5 times between >>>>>>>>>>>>>>>>>>> [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061] >>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>> handler" repeated 72 times between [2019-01-31 09:37:53.746741] and >>>>>>>>>>>>>>>>>>> [2019-01-31 09:38:04.696993] >>>>>>>>>>>>>>>>>>> pending frames: >>>>>>>>>>>>>>>>>>> frame : type(1) op(READ) >>>>>>>>>>>>>>>>>>> frame : type(1) op(OPEN) >>>>>>>>>>>>>>>>>>> frame : type(0) op(0) >>>>>>>>>>>>>>>>>>> patchset: git://git.gluster.org/glusterfs.git >>>>>>>>>>>>>>>>>>> signal received: 6 >>>>>>>>>>>>>>>>>>> time of crash: >>>>>>>>>>>>>>>>>>> 2019-01-31 09:38:04 >>>>>>>>>>>>>>>>>>> configuration details: 
>>>>>>>>>>>>>>>>>>> argp 1 >>>>>>>>>>>>>>>>>>> backtrace 1 >>>>>>>>>>>>>>>>>>> dlfcn 1 >>>>>>>>>>>>>>>>>>> libpthread 1 >>>>>>>>>>>>>>>>>>> llistxattr 1 >>>>>>>>>>>>>>>>>>> setfsid 1 >>>>>>>>>>>>>>>>>>> spinlock 1 >>>>>>>>>>>>>>>>>>> epoll.h 1 >>>>>>>>>>>>>>>>>>> xattr.h 1 >>>>>>>>>>>>>>>>>>> st_atim.tv_nsec 1 >>>>>>>>>>>>>>>>>>> package-string: glusterfs 5.3 >>>>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6] >>>>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x36160)[0x7fccd622d160] >>>>>>>>>>>>>>>>>>> /lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0] >>>>>>>>>>>>>>>>>>> /lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1] >>>>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa] >>>>>>>>>>>>>>>>>>> /lib64/libc.so.6(+0x2e772)[0x7fccd6225772] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> /lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778] >>>>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820] >>>>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063] >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> /usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2] >>>>>>>>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3] >>>>>>>>>>>>>>>>>>> /lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559] >>>>>>>>>>>>>>>>>>> /lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f] >>>>>>>>>>>>>>>>>>> --------- >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Do the pending patches fix the crash or only the >>>>>>>>>>>>>>>>>>> repeated warnings? I'm running glusterfs on OpenSUSE 15.0 installed via >>>>>>>>>>>>>>>>>>> http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, >>>>>>>>>>>>>>>>>>> not too sure how to make it core dump. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> If it's not fixed by the patches above, has anyone >>>>>>>>>>>>>>>>>>> already opened a ticket for the crashes that I can join and monitor? This >>>>>>>>>>>>>>>>>>> is going to create a massive problem for us since production systems are >>>>>>>>>>>>>>>>>>> crashing. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>> Founder, Android Police , APK >>>>>>>>>>>>>>>>>>> Mirror , Illogical Robot LLC >>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa < >>>>>>>>>>>>>>>>>>> rgowdapp at redhat.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii < >>>>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Also, not sure if related or not, but I got a ton of >>>>>>>>>>>>>>>>>>>>> these "Failed to dispatch handler" in my logs as well. Many people have >>>>>>>>>>>>>>>>>>>>> been commenting about this issue here >>>>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1651246. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> https://review.gluster.org/#/c/glusterfs/+/22046/ >>>>>>>>>>>>>>>>>>>> addresses this. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>>>> handler" repeated 413 times between [2019-01-30 20:36:23.881090] and >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.015593] >>>>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0" repeated 42 times between >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306] >>>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>>>> The message "I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0" repeated 50 times between >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789] >>>>>>>>>>>>>>>>>>>>>> The message "E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>>>> handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:20.546355] >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:21.492319] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: >>>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data1-client-0 >>>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data3.log <== >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.349689] I [MSGID: 108031] >>>>>>>>>>>>>>>>>>>>>> [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: >>>>>>>>>>>>>>>>>>>>>> selecting local read_child SITE_data3-client-0 >>>>>>>>>>>>>>>>>>>>>> ==> mnt-SITE_data1.log <== >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:38:22.762941] E [MSGID: 101191] >>>>>>>>>>>>>>>>>>>>>> [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch >>>>>>>>>>>>>>>>>>>>>> handler >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I'm hoping raising the issue here on the mailing list >>>>>>>>>>>>>>>>>>>>> may bring some additional eyeballs and get them both fixed. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Thanks. 
>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>> Founder, Android Police >>>>>>>>>>>>>>>>>>>>> , APK Mirror , Illogical >>>>>>>>>>>>>>>>>>>>> Robot LLC >>>>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>>>> | @ArtemR >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii < >>>>>>>>>>>>>>>>>>>>> archon810 at gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> I found a similar issue here: >>>>>>>>>>>>>>>>>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1313567. >>>>>>>>>>>>>>>>>>>>>> There's a comment from 3 days ago from someone else with 5.3 who started >>>>>>>>>>>>>>>>>>>>>> seeing the spam. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Here's the command that repeats over and over: >>>>>>>>>>>>>>>>>>>>>> [2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] >>>>>>>>>>>>>>>>>>>>>> (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) >>>>>>>>>>>>>>>>>>>>>> [0x7fd966fcd329] >>>>>>>>>>>>>>>>>>>>>> -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) >>>>>>>>>>>>>>>>>>>>>> [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) >>>>>>>>>>>>>>>>>>>>>> [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument] >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> +Milind Changire Can you check >>>>>>>>>>>>>>>>>>>> why this message is logged and send a fix? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Is there any fix for this issue? >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Thanks. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> Sincerely, >>>>>>>>>>>>>>>>>>>>>> Artem >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>>>>>> Founder, Android Police >>>>>>>>>>>>>>>>>>>>>> , APK Mirror >>>>>>>>>>>>>>>>>>>>>> , Illogical Robot LLC >>>>>>>>>>>>>>>>>>>>>> beerpla.net | +ArtemRussakovskii >>>>>>>>>>>>>>>>>>>>>> | >>>>>>>>>>>>>>>>>>>>>> @ArtemR >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>>>> Amar Tumballi (amarts) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>>> Gluster-users mailing list >>>>>>>>>>>>>>> Gluster-users at gluster.org >>>>>>>>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>> Gluster-users mailing list >>>>>>>>> Gluster-users at gluster.org >>>>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>> Gluster-users mailing list >>>>>>> Gluster-users at gluster.org >>>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>>> >>>>>> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> 
https://lists.gluster.org/mailman/listinfo/gluster-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From revirii at googlemail.com Wed Feb 20 08:47:30 2019 From: revirii at googlemail.com (Hu Bert) Date: Wed, 20 Feb 2019 09:47:30 +0100 Subject: [Gluster-users] gluster 5.3: file or directory not read-/writeable, although it exists - cache? In-Reply-To: References: Message-ID: Hi Nithya, apologies for the long mail... the backend developer told me that he simply uses standard java file utils; test if a directory exists - if not throw exception. A little explanation to our setup: our frontend requests a few files, and via round robin they get distributed to 3 tomcats, and they create the directories and files in the glusterfs volume. It's possible that one of the tomcats creates a directory and writes a file, and then another tomcat wants to write another file but then can't - and throws such a exception. After comparing the timestamps in tomcat and in the filesystem there are 2 cases: a directory already exists a couple of days or it was freshly created. Case 1: directory was freshly created. Timestamps say that the directory and one file were created at 07:05:26, and at 07:05:27 there's an error message in tomcat: 2019-02-20 07:05:27,399 WARN Path /data/repository/shared/public/staticmap/271/362/2713628 is existed but it is not directory [[fetch data pool thread 6 at de.alpstein.core.io.FileUtils:747]] java.nio.file.FileAlreadyExistsException: /data/repository/shared/public/staticmap/271/362/2713628 ls -lah --full-time /data/repository/shared/public/staticmap/271/362/2713628/ total 22K drwxr-xr-x 2 tomcat8 tomcat8 4.0K 2019-02-20 07:05:26.420007137 +0100 . drwxr-xr-x 7 tomcat8 tomcat8 4.0K 2019-02-20 07:05:26.411096237 +0100 .. -rw-r--r-- 1 tomcat8 tomcat8 14K 2019-02-20 07:05:26.424007285 +0100 170x100_ed82ac740fe8b9bbaac9ca77aa47f573.jpeg Case 2: directory is already a couple of days old. I just saw an error message in one of the tomcats i immediately checked the 3 paths on the 3 servers: server1: ls /data/repository/shared/public/staticmap/369/217/36921711/ works server2: ls /data/repository/shared/public/staticmap/369/217/36921711/ works server3: ls /data/repository/shared/public/staticmap/369/217/36921711/ ls: cannot access '/data/repository/shared/public/staticmap/369/217/36921711/': No such file or directory Error message in tomcat: 2019-02-20 08:49:28,762 WARN Path /data/repository/shared/public/staticmap/369/217/36921711 is existed but it is not directory [[fetch data pool thread 43 at de.alpstein.core.io.FileUtils:747]] java.nio.file.FileAlreadyExistsException: /data/repository/shared/public/staticmap/369/217/36921711 Timestamps of directory (is already ~7 days old): ls -lah --full-time /data/repository/shared/public/staticmap/369/217/36921711/ total 197K drwxr-xr-x 2 tomcat8 tomcat8 4.0K 2019-02-13 12:02:58.164004209 +0100 . drwxr-xr-x 6 tomcat8 tomcat8 4.0K 2019-02-13 13:36:41.159642935 +0100 .. -rw-r--r-- 1 tomcat8 tomcat8 123K 2019-02-13 12:02:58.172004504 +0100 404x475_d473f1d60fc6842dddc7a5460672722b.jpeg -rw-r--r-- 1 tomcat8 tomcat8 66K 2019-02-13 11:58:50.218874380 +0100 420x237_1a7913cc18f4ec142ebc7c578856fb66.jpeg In both cases the directory itself is OK on the other 2 servers, there's an error message on only one of them, including the 'ls $dir -> no such file or directory' error. 
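(One way to check whether the affected client is serving a stale, negative lookup from its cache rather than asking the bricks — a diagnostic sketch only; the brick path below is an assumption based on the brick mount line quoted earlier:

# on the client that reports ENOENT: force a fresh lookup of the full path
stat /data/repository/shared/public/staticmap/369/217/36921711

# on each gluster server: confirm the directory really exists on the brick
# (brick directory under /gluster/md4 is an assumption)
ls -ld /gluster/md4/workdata/staticmap/369/217/36921711

# and inspect its gfid/layout xattrs to rule out a missing or inconsistent entry on one brick
getfattr -d -m . -e hex /gluster/md4/workdata/staticmap/369/217/36921711

# to test whether the FUSE entry/attribute cache is involved, try a separate test mount
# with caching disabled (mount point is an example)
mount -t glusterfs -o entry-timeout=0,attribute-timeout=0 gluster1:/workdata /mnt/gluster-test

If the path resolves on the bricks and on a cache-disabled test mount but not on the regular mount, that points at a stale cached lookup on that client.)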
Interestingly, in both cases, the error gets fixed when either doing umount/mount of glusterfs volume or doing this on the server with the error message: ls /data/repository/shared/public/staticmap/369/217/36921711 -> TAB -> wait 1-2 seconds -> last / appears -> Enter -> directory shows up. First i thought that might be some caching problem, but that seems not quite probable with a directory that's 7 days old. Regards, Hubert Am Mi., 20. Feb. 2019 um 06:12 Uhr schrieb Nithya Balachandran : > > > > On Tue, 19 Feb 2019 at 15:18, Hu Bert wrote: >> >> Hello @ll, >> >> one of our backend developers told me that, in the tomcat logs, he >> sees errors that directories on a glusterfs mount aren't readable. >> Within tomcat the errors look like this: >> >> 2019-02-19 07:39:27,124 WARN Path >> /data/repository/shared/public/staticmap/370/626 is existed but it is >> not directory >> java.nio.file.FileAlreadyExistsException: >> /data/repository/shared/public/staticmap/370/626 > > > Do you know which operation failed here? > regards, > Nithya >> >> >> But the basic directory does exist, has been created on 2019-02-18 >> (and is readable on other clients): >> >> ls -lah /data/repository/shared/public/staticmap/370/626/ >> total 36K >> drwxr-xr-x 9 tomcat8 tomcat8 4.0K Feb 18 12:15 . >> drwxr-xr-x 522 tomcat8 tomcat8 4.0K Feb 19 10:29 .. >> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:45 37062632 >> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 09:29 37062647 >> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 12:15 37062663 >> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:18 37062668 >> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 11:36 37062681 >> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 18 16:53 37062682 >> drwxr-xr-x 2 tomcat8 tomcat8 4.0K Feb 19 08:19 37062688 >> >> gluster v5.3, debian stretch. >> gluster volume info: https://pastebin.com/UBVWSUex >> gluster volume status: https://pastebin.com/3guxFq5m >> >> mount options on client: >> gluster1:/workdata /data/repository/shared/public glusterfs >> defaults,_netdev,lru-limit=0,backup-volfile-servers=gluster2:gluster3 >> 0 0 >> >> brick mount options: >> /dev/md/4 /gluster/md4 xfs inode64,noatime,nodiratime 0 0 >> >> Hmm... problem with mount options? Or is some cache involved? >> >> >> Best regards, >> Hubert >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users From atumball at redhat.com Wed Feb 20 13:28:08 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Wed, 20 Feb 2019 18:58:08 +0530 Subject: [Gluster-users] GlusterFS Scale In-Reply-To: References: Message-ID: On Mon, Feb 18, 2019 at 11:23 PM Lindolfo Meira wrote: > We're running some benchmarks on a striped glusterfs volume. > > Hi Lindolfo, We are not supporting Stripe anymore, and planning to remove it from build too by glusterfs-6.0 (ie, next release). See if you can use 'Shard' for the usecase. > We have 6 identical servers acting as bricks. Measured link speed between > these servers is 3.36GB/s. Link speed between clients of the parallel file > system and its servers is also 3.36GB/s. So we're expecting this system to > have a write performance of around 20.16GB/s (6 times 3.36GB/s) minus some > write overhead. > > If we write to the system from a single client, we manage to write at > around 3.36GB/s. That's okay, because we're limited by the max throughput > of that client's network adapter. 
But when we account for that and write > from 6 or more clients, we can never get past 11GB/s. Is that right? Is > this really the overhead to be expected? We'd appreciate any inputs. > > Lame question: Are we getting more than 11GB/s from disks ? Please collect `gluster volume profile gfs0 info`, that can give more information. > Output of gluster volume info: > > Volume Name: gfs0 > Type: Stripe > Volume ID: 2ca3dd45-6209-43ff-a164-7f2694097c64 > Status: Started > Snapshot Count: 0 > Number of Bricks: 1 x 6 = 6 > Transport-type: tcp > Bricks: > Brick1: pfs01-ib:/mnt/data > Brick2: pfs02-ib:/mnt/data > Brick3: pfs03-ib:/mnt/data > Brick4: pfs04-ib:/mnt/data > Brick5: pfs05-ib:/mnt/data > Brick6: pfs06-ib:/mnt/data > Options Reconfigured: > cluster.stripe-block-size: 128KB > performance.cache-size: 32MB > performance.write-behind-window-size: 1MB > performance.strict-write-ordering: off > performance.strict-o-direct: off > performance.stat-prefetch: off > server.event-threads: 4 > client.event-threads: 2 > performance.io-thread-count: 16 > transport.address-family: inet > nfs.disable: on > cluster.localtime-logging: enable > > > -------------- next part -------------- An HTML attachment was scrubbed... URL:
From meira at cesup.ufrgs.br Wed Feb 20 14:56:24 2019 From: meira at cesup.ufrgs.br (Lindolfo Meira) Date: Wed, 20 Feb 2019 11:56:24 -0300 (-03) Subject: [Gluster-users] GlusterFS Scale In-Reply-To: References: Message-ID:
Hi Amar. Thanks for taking the time. Yes, I know striping has been deprecated, but we have a specific interest in testing striped volumes at the moment. We're trying to establish a performance comparison (stripe vs shard). Each brick contributes a single hardware-controlled RAID-6 volume with 16 disks. Running the benchmark with exactly the same parameters directly on these volumes we manage to get around 13GB/s (way above the 3.36GB/s the network adapters are able to achieve). We're benchmarking this 6-brick gluster system using the MPIIO API of IOR. The output of the profiling follows attached. It refers to an IOR write test, iterated 8 times. Again, max I got was 11GB/s. IOR parameters used were -B -E -F -q -w -k -z -m -i=8 -t=2m -b=1g -a=MPIIO. Cheers, Lindolfo Meira, MSc Diretor Geral, Centro Nacional de Supercomputação Universidade Federal do Rio Grande do Sul +55 (51) 3308-3139
On Wed, 20 Feb 2019, Amar Tumballi Suryanarayan wrote: > On Mon, Feb 18, 2019 at 11:23 PM Lindolfo Meira > wrote: > > > We're running some benchmarks on a striped glusterfs volume. > > > > > Hi Lindolfo, > > We are not supporting Stripe anymore, and planning to remove it from build > too by glusterfs-6.0 (ie, next release). See if you can use 'Shard' for the > usecase. > > > > We have 6 identical servers acting as bricks. Measured link speed between > > these servers is 3.36GB/s. Link speed between clients of the parallel file > > system and its servers is also 3.36GB/s. So we're expecting this system to > > have a write performance of around 20.16GB/s (6 times 3.36GB/s) minus some > > write overhead. > > > > If we write to the system from a single client, we manage to write at > > around 3.36GB/s. That's okay, because we're limited by the max throughput > > of that client's network adapter. But when we account for that and write > > from 6 or more clients, we can never get past 11GB/s. Is that right? Is > > this really the overhead to be expected? We'd appreciate any inputs. > > > > > Lame question: Are we getting more than 11GB/s from disks ?
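A sketch of how the numbers being asked for here are usually gathered, not taken from the thread itself: the volume name gfs0 comes from the volume info above, the IOR flags simply repeat the ones Lindolfo lists, and the mount point, rank count and hostfile are placeholders.

  gluster volume profile gfs0 start
  mpirun -np 48 --hostfile ./hosts ior -B -E -F -q -w -k -z -m -i=8 -t=2m -b=1g -a=MPIIO -o /mnt/gfs0/ior.dat
  gluster volume profile gfs0 info > profile-after-ior.txt   # this is the output Amar asks for
  gluster volume profile gfs0 stop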
> > Please collect `gluster volume profile gfs0 info`, that can give more > information. > > > > Output of gluster volume info: > > > > Volume Name: gfs0 > > Type: Stripe > > Volume ID: 2ca3dd45-6209-43ff-a164-7f2694097c64 > > Status: Started > > Snapshot Count: 0 > > Number of Bricks: 1 x 6 = 6 > > Transport-type: tcp > > Bricks: > > Brick1: pfs01-ib:/mnt/data > > Brick2: pfs02-ib:/mnt/data > > Brick3: pfs03-ib:/mnt/data > > Brick4: pfs04-ib:/mnt/data > > Brick5: pfs05-ib:/mnt/data > > Brick6: pfs06-ib:/mnt/data > > Options Reconfigured: > > cluster.stripe-block-size: 128KB > > performance.cache-size: 32MB > > performance.write-behind-window-size: 1MB > > performance.strict-write-ordering: off > > performance.strict-o-direct: off > > performance.stat-prefetch: off > > server.event-threads: 4 > > client.event-threads: 2 > > performance.io-thread-count: 16 > > transport.address-family: inet > > nfs.disable: on > > cluster.localtime-logging: enable > > > > > > > -------------- next part -------------- Brick: pfs01-ib:/mnt/data ------------------------- Cumulative Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131136 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.05 51.18 us 20.63 us 256.30 us 192 LK 0.07 93.88 us 50.88 us 362.80 us 163 SETATTR 0.09 107.73 us 43.98 us 240.06 us 163 SETXATTR 0.11 40.00 us 12.66 us 217.19 us 576 FLUSH 0.14 41.94 us 15.70 us 184.47 us 669 STAT 0.14 98.87 us 50.81 us 309.65 us 288 OPEN 0.15 164.14 us 81.69 us 370.35 us 192 UNLINK 0.20 47.53 us 19.10 us 272.89 us 867 STATFS 0.40 286.72 us 197.48 us 521.78 us 288 CREATE 0.61 99.54 us 39.43 us 436.23 us 1252 LOOKUP 7.17 4521.94 us 23.03 us 28429.53 us 326 INODELK 90.87 142.49 us 67.03 us 13285.58 us 131136 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17188257792 bytes Interval 0 Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131136 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.05 51.18 us 20.63 us 256.30 us 192 LK 0.07 93.88 us 50.88 us 362.80 us 163 SETATTR 0.09 107.73 us 43.98 us 240.06 us 163 SETXATTR 0.11 40.00 us 12.66 us 217.19 us 576 FLUSH 0.14 41.94 us 15.70 us 184.47 us 669 STAT 0.14 98.87 us 50.81 us 309.65 us 288 OPEN 0.15 164.14 us 81.69 us 370.35 us 192 UNLINK 0.20 47.53 us 19.10 us 272.89 us 867 STATFS 0.40 286.72 us 197.48 us 521.78 us 288 CREATE 0.61 99.54 us 39.43 us 436.23 us 1252 LOOKUP 7.17 4521.94 us 23.03 us 28429.53 us 326 INODELK 90.87 142.49 us 67.03 us 13285.58 us 131136 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17188257792 bytes Brick: pfs04-ib:/mnt/data ------------------------- Cumulative Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.05 57.99 us 19.80 us 224.35 us 192 LK 0.08 114.89 us 46.33 us 378.40 us 163 SETATTR 0.10 43.42 us 11.99 us 1391.00 us 576 FLUSH 0.12 44.41 us 16.43 us 249.75 us 669 STAT 0.13 104.06 us 54.86 us 1862.77 us 288 OPEN 0.16 45.24 us 20.35 us 246.34 us 867 STATFS 0.64 122.02 us 42.30 us 25803.78 us 1252 LOOKUP 1.05 1309.44 us 85.68 us 44130.35 us 192 UNLINK 3.46 2868.13 us 199.93 us 212895.24 us 288 CREATE 94.20 171.45 us 67.48 us 21658.19 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes Interval 0 Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.05 57.99 us 19.80 us 224.35 us 192 LK 0.08 114.89 us 46.33 us 378.40 us 163 SETATTR 0.10 43.42 us 11.99 us 1391.00 us 576 FLUSH 0.12 44.41 us 16.43 us 249.75 us 669 STAT 0.13 104.06 us 54.86 us 1862.77 us 288 OPEN 0.16 45.24 us 20.35 us 246.34 us 867 STATFS 0.64 122.09 us 42.30 us 25803.78 us 1252 LOOKUP 1.05 1309.44 us 85.68 us 44130.35 us 192 UNLINK 3.46 2868.13 us 199.93 us 212895.24 us 288 CREATE 94.20 171.41 us 67.48 us 21658.19 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes Brick: pfs05-ib:/mnt/data ------------------------- Cumulative Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.04 55.49 us 15.39 us 301.38 us 192 LK 0.07 106.98 us 48.60 us 398.94 us 163 SETATTR 0.09 37.23 us 11.51 us 284.55 us 576 FLUSH 0.11 94.69 us 55.42 us 260.94 us 288 OPEN 0.11 41.13 us 16.07 us 203.31 us 669 STAT 0.16 45.45 us 19.89 us 270.21 us 867 STATFS 0.50 98.73 us 43.35 us 360.68 us 1252 LOOKUP 3.78 4820.73 us 85.53 us 343928.29 us 192 UNLINK 8.03 6823.51 us 204.49 us 344265.24 us 288 CREATE 87.10 162.75 us 67.78 us 20562.18 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes Interval 0 Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.04 55.49 us 15.39 us 301.38 us 192 LK 0.07 106.98 us 48.60 us 398.94 us 163 SETATTR 0.09 37.23 us 11.51 us 284.55 us 576 FLUSH 0.11 94.69 us 55.42 us 260.94 us 288 OPEN 0.11 41.13 us 16.07 us 203.31 us 669 STAT 0.16 45.45 us 19.89 us 270.21 us 867 STATFS 0.50 98.73 us 43.35 us 360.68 us 1252 LOOKUP 3.78 4820.73 us 85.53 us 343928.29 us 192 UNLINK 8.03 6823.51 us 204.49 us 344265.24 us 288 CREATE 87.10 162.73 us 67.78 us 20562.18 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes Brick: pfs02-ib:/mnt/data ------------------------- Cumulative Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131136 %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.06 65.96 us 17.65 us 2289.39 us 192 LK 0.08 100.75 us 50.54 us 323.66 us 163 SETATTR 0.10 36.51 us 12.29 us 250.04 us 576 FLUSH 0.13 90.32 us 50.30 us 226.21 us 288 OPEN 0.16 47.30 us 15.82 us 2281.57 us 669 STAT 0.20 46.76 us 18.38 us 2912.04 us 867 STATFS 0.61 97.93 us 40.99 us 780.38 us 1252 LOOKUP 1.56 1639.43 us 82.56 us 196463.69 us 192 UNLINK 5.95 4164.89 us 187.42 us 195441.47 us 288 CREATE 91.14 140.09 us 66.06 us 16004.89 us 131136 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17188257792 bytes Interval 0 Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131136 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.06 65.96 us 17.65 us 2289.39 us 192 LK 0.08 100.75 us 50.54 us 323.66 us 163 SETATTR 0.10 36.51 us 12.29 us 250.04 us 576 FLUSH 0.13 90.32 us 50.30 us 226.21 us 288 OPEN 0.16 47.30 us 15.82 us 2281.57 us 669 STAT 0.20 46.76 us 18.38 us 2912.04 us 867 STATFS 0.61 97.93 us 40.99 us 780.38 us 1252 LOOKUP 1.56 1639.43 us 82.56 us 196463.69 us 192 UNLINK 5.95 4164.89 us 187.42 us 195441.47 us 288 CREATE 91.14 140.07 us 66.06 us 16004.89 us 131136 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17188257792 bytes Brick: pfs06-ib:/mnt/data ------------------------- Cumulative Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.06 72.23 us 19.21 us 3624.32 us 192 LK 0.08 107.83 us 52.61 us 314.04 us 163 SETATTR 0.09 35.56 us 12.99 us 282.80 us 576 FLUSH 0.12 91.34 us 50.59 us 234.47 us 288 OPEN 0.13 44.96 us 16.94 us 3535.79 us 669 STAT 0.18 45.68 us 19.13 us 321.56 us 867 STATFS 0.61 110.06 us 41.37 us 15742.64 us 1252 LOOKUP 1.40 1649.38 us 87.82 us 64877.34 us 192 UNLINK 3.19 2498.30 us 195.74 us 111136.27 us 288 CREATE 94.14 162.01 us 68.96 us 28557.12 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes Interval 0 Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.06 72.54 us 19.21 us 3624.32 us 192 LK 0.08 107.83 us 52.61 us 314.04 us 163 SETATTR 0.09 35.56 us 12.99 us 282.80 us 576 FLUSH 0.12 91.34 us 50.59 us 234.47 us 288 OPEN 0.13 44.96 us 16.94 us 3535.79 us 669 STAT 0.18 45.68 us 19.13 us 321.56 us 867 STATFS 0.61 110.07 us 41.37 us 15742.64 us 1252 LOOKUP 1.40 1649.38 us 87.82 us 64877.34 us 192 UNLINK 3.19 2498.30 us 195.74 us 111136.27 us 288 CREATE 94.14 161.96 us 68.96 us 28557.12 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes Brick: pfs03-ib:/mnt/data ------------------------- Cumulative Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. 
of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.05 57.70 us 19.13 us 345.61 us 192 LK 0.07 104.82 us 49.85 us 500.99 us 163 SETATTR 0.09 36.39 us 10.25 us 406.40 us 576 FLUSH 0.12 41.01 us 13.58 us 177.65 us 669 STAT 0.12 98.04 us 48.32 us 312.39 us 288 OPEN 0.17 46.52 us 18.59 us 186.27 us 867 STATFS 0.53 99.96 us 39.17 us 393.71 us 1252 LOOKUP 2.09 2566.82 us 84.69 us 136604.30 us 192 UNLINK 7.48 6129.64 us 202.67 us 198818.41 us 288 CREATE 89.28 160.74 us 67.20 us 47777.54 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes Interval 0 Stats: Block Size: 131072b+ No. of Reads: 0 No. of Writes: 131040 %-latency Avg-latency Min-Latency Max-Latency No. of calls Fop --------- ----------- ----------- ----------- ------------ ---- 0.00 0.00 us 0.00 us 0.00 us 192 FORGET 0.00 0.00 us 0.00 us 0.00 us 576 RELEASE 0.05 57.70 us 19.13 us 345.61 us 192 LK 0.07 104.82 us 49.85 us 500.99 us 163 SETATTR 0.09 36.39 us 10.25 us 406.40 us 576 FLUSH 0.12 41.01 us 13.58 us 177.65 us 669 STAT 0.12 98.04 us 48.32 us 312.39 us 288 OPEN 0.17 46.52 us 18.59 us 186.27 us 867 STATFS 0.53 99.96 us 39.17 us 393.71 us 1252 LOOKUP 2.09 2566.82 us 84.69 us 136604.30 us 192 UNLINK 7.48 6129.64 us 202.67 us 198818.41 us 288 CREATE 89.28 160.72 us 67.20 us 47777.54 us 131040 WRITE Duration: 375 seconds Data Read: 0 bytes Data Written: 17175674880 bytes From subbarao at computer.org Thu Feb 21 02:49:32 2019 From: subbarao at computer.org (Kartik Subbarao) Date: Wed, 20 Feb 2019 21:49:32 -0500 Subject: [Gluster-users] glusterfsd Ubuntu 18.04 high iowait issues Message-ID: We're running gluster on two hypervisors running Ubuntu. When we upgraded from Ubuntu 14.04 to 18.04, it upgraded gluster from 3.4.2 to 3.13.2. As soon as we upgraded and since then, we've been seeing substantially higher iowait on the system, as measured by top and iotop, and iotop indicates that glusterfsd is the culprit. For some reason, glusterfsd is doing more disk reads and/or those reads are being held up up at a greater rate. The guest VMs are also seeing more iowait -- their images are hosted on the gluster volume. This is causing inconsistent responsiveness from the services hosted on the VMs. I'm looking for any recommendations on how to troubleshoot and/or resolve this problem. We have other sites that are still running 14.04, so I can compare/contrast any configuration parameters and performance. The block scheduler on 14.04 was set to deadline and 18.04 was set to cfq. But changing the 18.04 scheduler to deadline didn't make any difference. I was wondering whether glusterfsd on 18.04 isn't caching as much as it should. We tried increasing performance.cache-size substantially but that didn't make any difference. Another option we're considering but haven't tried yet is upgrading to gluster 5.3 by back-porting the package from Ubuntu 19.04 to 18.04. Does anyone think this might help? Is there any particular debug logging we could set up or other commands we could run to troubleshoot this better? Any thoughts, suggestions, ideas would be greatly appreciated. Thanks, ??? 
-Kartik From atumball at redhat.com Thu Feb 21 05:18:44 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Thu, 21 Feb 2019 10:48:44 +0530 Subject: [Gluster-users] glusterfsd Ubuntu 18.04 high iowait issues In-Reply-To: References: Message-ID: If you have both systems to get some idea, can you get the `gluster profile info' output? That helps a bit to understand the issue. On Thu, Feb 21, 2019 at 8:20 AM Kartik Subbarao wrote: > We're running gluster on two hypervisors running Ubuntu. When we > upgraded from Ubuntu 14.04 to 18.04, it upgraded gluster from 3.4.2 to > 3.13.2. As soon as we upgraded and since then, we've been seeing > substantially higher iowait on the system, as measured by top and iotop, > and iotop indicates that glusterfsd is the culprit. For some reason, > glusterfsd is doing more disk reads and/or those reads are being held up > up at a greater rate. The guest VMs are also seeing more iowait -- their > images are hosted on the gluster volume. This is causing inconsistent > responsiveness from the services hosted on the VMs. > > I'm looking for any recommendations on how to troubleshoot and/or > resolve this problem. We have other sites that are still running 14.04, > so I can compare/contrast any configuration parameters and performance. > > The block scheduler on 14.04 was set to deadline and 18.04 was set to > cfq. But changing the 18.04 scheduler to deadline didn't make any > difference. > > I was wondering whether glusterfsd on 18.04 isn't caching as much as it > should. We tried increasing performance.cache-size substantially but > that didn't make any difference. > > Another option we're considering but haven't tried yet is upgrading to > gluster 5.3 by back-porting the package from Ubuntu 19.04 to 18.04. Does > anyone think this might help? > > Is there any particular debug logging we could set up or other commands > we could run to troubleshoot this better? Any thoughts, suggestions, > ideas would be greatly appreciated. > > Thanks, > > -Kartik > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From subbarao at computer.org Thu Feb 21 16:34:04 2019 From: subbarao at computer.org (Kartik Subbarao) Date: Thu, 21 Feb 2019 11:34:04 -0500 Subject: [Gluster-users] glusterfsd Ubuntu 18.04 high iowait issues In-Reply-To: References: Message-ID: <13d40e22-4fd2-0a8d-9606-d536afb12d52@computer.org> Here are three profile reports from 60-second intervals: Ubuntu 18.04 system with low load: https://pastebin.com/XzgmjeuJ Ubuntu 14.04 system with low load: https://pastebin.com/5BEHDFwq Ubuntu 14.04 system with high load: https://pastebin.com/CFSWW4qn Each of these systems is "gluster1" in the report. In each cluster, there are two bricks, gluster1:/md3/gluster and gluster2:/md3/gluster. The systems are identical hardware-wise (I noticed this morning that the 18.04 upgrade applied a powersave governor to the CPU. I changed it to the performance governor before running the profile, but that doesn't seem to have changed the iowait behavior or the profile report appreciably). What jumps out at me for the 18.04 systems is: 1) The excessively high average latency of the FINODELK operations on the *local* brick (i.e. gluster1:/md3/gluster). 
The latency is far lower for these FINODELK operations against the other node's brick (gluster2:/md3/gluster). This is puzzling to me. 2) Almost double higher average latency for FSYNC operations against both the gluster1 and gluster2 bricks. On the 14.04 systems, the number of FINODELK operations performed during the 60-second interval is much lower (even on the highload system). And the latencies are lower. Regards, ??? -Kartik On 2/21/19 12:18 AM, Amar Tumballi Suryanarayan wrote: > If you have both systems to get some idea, can you get the `gluster > profile info' output? That helps a bit to understand the issue. > > > On Thu, Feb 21, 2019 at 8:20 AM Kartik Subbarao > wrote: > > We're running gluster on two hypervisors running Ubuntu. When we > upgraded from Ubuntu 14.04 to 18.04, it upgraded gluster from > 3.4.2 to > 3.13.2. As soon as we upgraded and since then, we've been seeing > substantially higher iowait on the system, as measured by top and > iotop, > and iotop indicates that glusterfsd is the culprit. For some reason, > glusterfsd is doing more disk reads and/or those reads are being > held up > up at a greater rate. The guest VMs are also seeing more iowait -- > their > images are hosted on the gluster volume. This is causing inconsistent > responsiveness from the services hosted on the VMs. > > I'm looking for any recommendations on how to troubleshoot and/or > resolve this problem. We have other sites that are still running > 14.04, > so I can compare/contrast any configuration parameters and > performance. > > The block scheduler on 14.04 was set to deadline and 18.04 was set to > cfq. But changing the 18.04 scheduler to deadline didn't make any > difference. > > I was wondering whether glusterfsd on 18.04 isn't caching as much > as it > should. We tried increasing performance.cache-size substantially but > that didn't make any difference. > > Another option we're considering but haven't tried yet is > upgrading to > gluster 5.3 by back-porting the package from Ubuntu 19.04 to > 18.04. Does > anyone think this might help? > > Is there any particular debug logging we could set up or other > commands > we could run to troubleshoot this better? Any thoughts, suggestions, > ideas would be greatly appreciated. > > Thanks, > > ???? -Kartik > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ajays20078 at gmail.com Wed Feb 20 10:15:47 2019 From: ajays20078 at gmail.com (ajay s) Date: Wed, 20 Feb 2019 11:15:47 +0100 Subject: [Gluster-users] Gluster geo replication failing to sealheal with the below errors Message-ID: An HTML attachment was scrubbed... URL: From subbarao at computer.org Sat Feb 23 00:22:49 2019 From: subbarao at computer.org (Kartik Subbarao) Date: Fri, 22 Feb 2019 19:22:49 -0500 Subject: [Gluster-users] glusterfsd Ubuntu 18.04 high iowait issues In-Reply-To: <13d40e22-4fd2-0a8d-9606-d536afb12d52@computer.org> References: <13d40e22-4fd2-0a8d-9606-d536afb12d52@computer.org> Message-ID: <17c220dc-9d0d-588e-e8d8-ea5438b6e967@computer.org> I have some good news -- upgrading to gluster 5.3 resolved the issue :-) Regards, ??? 
-Kartik On 2/21/2019 11:34 AM, Kartik Subbarao wrote: > Here are three profile reports from 60-second intervals: > > Ubuntu 18.04 system with low load: > https://pastebin.com/XzgmjeuJ > > Ubuntu 14.04 system with low load: > https://pastebin.com/5BEHDFwq > > Ubuntu 14.04 system with high load: > https://pastebin.com/CFSWW4qn > > Each of these systems is "gluster1" in the report. In each cluster, > there are two bricks, gluster1:/md3/gluster and gluster2:/md3/gluster. > The systems are identical hardware-wise (I noticed this morning that > the 18.04 upgrade applied a powersave governor to the CPU. I changed > it to the performance governor before running the profile, but that > doesn't seem to have changed the iowait behavior or the profile report > appreciably). > > What jumps out at me for the 18.04 systems is: > > 1) The excessively high average latency of the FINODELK operations on > the *local* brick (i.e. gluster1:/md3/gluster). The latency is far > lower for these FINODELK operations against the other node's brick > (gluster2:/md3/gluster). This is puzzling to me. > 2) Almost double higher average latency for FSYNC operations against > both the gluster1 and gluster2 bricks. > > On the 14.04 systems, the number of FINODELK operations performed > during the 60-second interval is much lower (even on the highload > system). And the latencies are lower. > > Regards, > > ??? -Kartik > > On 2/21/19 12:18 AM, Amar Tumballi Suryanarayan wrote: >> If you have both systems to get some idea, can you get the `gluster >> profile info' output? That helps a bit to understand the issue. >> >> >> On Thu, Feb 21, 2019 at 8:20 AM Kartik Subbarao >> > wrote: >> >> We're running gluster on two hypervisors running Ubuntu. When we >> upgraded from Ubuntu 14.04 to 18.04, it upgraded gluster from >> 3.4.2 to >> 3.13.2. As soon as we upgraded and since then, we've been seeing >> substantially higher iowait on the system, as measured by top and >> iotop, >> and iotop indicates that glusterfsd is the culprit. For some reason, >> glusterfsd is doing more disk reads and/or those reads are being >> held up >> up at a greater rate. The guest VMs are also seeing more iowait >> -- their >> images are hosted on the gluster volume. This is causing >> inconsistent >> responsiveness from the services hosted on the VMs. >> >> I'm looking for any recommendations on how to troubleshoot and/or >> resolve this problem. We have other sites that are still running >> 14.04, >> so I can compare/contrast any configuration parameters and >> performance. >> >> The block scheduler on 14.04 was set to deadline and 18.04 was >> set to >> cfq. But changing the 18.04 scheduler to deadline didn't make any >> difference. >> >> I was wondering whether glusterfsd on 18.04 isn't caching as much >> as it >> should. We tried increasing performance.cache-size substantially but >> that didn't make any difference. >> >> Another option we're considering but haven't tried yet is >> upgrading to >> gluster 5.3 by back-porting the package from Ubuntu 19.04 to >> 18.04. Does >> anyone think this might help? >> >> Is there any particular debug logging we could set up or other >> commands >> we could run to troubleshoot this better? Any thoughts, suggestions, >> ideas would be greatly appreciated. >> >> Thanks, >> >> ???? 
-Kartik >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> -- >> Amar Tumballi (amarts) > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rightkicktech at gmail.com Sat Feb 23 15:54:42 2019 From: rightkicktech at gmail.com (Alex K) Date: Sat, 23 Feb 2019 17:54:42 +0200 Subject: [Gluster-users] Gluster and bonding Message-ID: Hi all, I have a replica 3 setup where each server was configured with a dual interfaces in mode 6 bonding. All cables were connected to one common network switch. To add redundancy to the switch, and avoid being a single point of failure, I connected each second cable of each server to a second switch. This turned out to not function as gluster was refusing to start the volume logging "transport endpoint is disconnected" although all nodes were able to reach each other (ping) in the storage network. I switched the mode to mode 1 (active/passive) and initially it worked but following a reboot of all cluster same issue appeared. Gluster is not starting the volumes. Isn't active/passive supposed to work like that? Can one have such redundant network setup or are there any other recommended approaches? Thanx, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From dm at belkam.com Mon Feb 25 04:41:38 2019 From: dm at belkam.com (Dmitry Melekhov) Date: Mon, 25 Feb 2019 08:41:38 +0400 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: 23.02.2019 19:54, Alex K ?????: > Hi all, > > I have a replica 3 setup where each server was configured with a dual > interfaces in mode 6 bonding. All cables were connected to one common > network switch. > > To add redundancy to the switch, and avoid being a single point of > failure, I connected each second cable of each server to a second > switch. This turned out to not function as gluster was refusing to > start the volume logging "transport endpoint is disconnected" although > all nodes were able to reach each other (ping) in the storage network. > I switched the mode to mode 1 (active/passive) and initially it worked > but following a reboot of all cluster same issue appeared. Gluster is > not starting the volumes. > > Isn't active/passive supposed to work like that? Can one have such > redundant network setup or are there any other recommended approaches? > Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no doubt, best way. > Thanx, > Alex > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorick at netbulae.eu Mon Feb 25 09:27:11 2019 From: jorick at netbulae.eu (Jorick Astrego) Date: Mon, 25 Feb 2019 10:27:11 +0100 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: Hi, We use bonding mode 6 (balance-alb) for GlusterFS traffic https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 Preferred bonding mode for Red Hat Gluster Storage client is mode 6 (balance-alb), this allows client to transmit writes in parallel on separate NICs much of the time. 
Regards, Jorick Astrego On 2/25/19 5:41 AM, Dmitry Melekhov wrote: > 23.02.2019 19:54, Alex K ?????: >> Hi all, >> >> I have a replica 3 setup where each server was configured with a dual >> interfaces in mode 6 bonding. All cables were connected to one common >> network switch. >> >> To add redundancy to the switch, and avoid being a single point of >> failure, I connected each second cable of each server to a second >> switch. This turned out to not function as gluster was refusing to >> start the volume logging "transport endpoint is disconnected" >> although all nodes were able to reach each other (ping) in the >> storage network. I switched the mode to mode 1 (active/passive) and >> initially it worked but following a reboot of all cluster same issue >> appeared. Gluster is not starting the volumes. >> >> Isn't active/passive supposed to work like that? Can one have such >> redundant network setup or are there any other recommended approaches? >> > > Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no > doubt, best way. > > >> Thanx, >> Alex >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users Met vriendelijke groet, With kind regards, Jorick Astrego Netbulae Virtualization Experts ---------------- Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 ---------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From rightkicktech at gmail.com Mon Feb 25 10:43:24 2019 From: rightkicktech at gmail.com (Alex K) Date: Mon, 25 Feb 2019 12:43:24 +0200 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: Hi All, I was asking if it is possible to have the two separate cables connected to two different physical switched. When trying mode6 or mode1 in this setup gluster was refusing to start the volumes, giving me "transport endpoint is not connected". server1: cable1 ---------------- switch1 --------------------- server2: cable1 | server1: cable2 ---------------- switch2 --------------------- server2: cable2 Both switches are connected with each other also. This is done to achieve redundancy for the switches. When disconnecting cable2 from both servers, then gluster was happy. What could be the problem? Thanx, Alex On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego wrote: > Hi, > > We use bonding mode 6 (balance-alb) for GlusterFS traffic > > > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 > > Preferred bonding mode for Red Hat Gluster Storage client is mode 6 > (balance-alb), this allows client to transmit writes in parallel on > separate NICs much of the time. > > Regards, > > Jorick Astrego > On 2/25/19 5:41 AM, Dmitry Melekhov wrote: > > 23.02.2019 19:54, Alex K ?????: > > Hi all, > > I have a replica 3 setup where each server was configured with a dual > interfaces in mode 6 bonding. All cables were connected to one common > network switch. > > To add redundancy to the switch, and avoid being a single point of > failure, I connected each second cable of each server to a second switch. 
> This turned out to not function as gluster was refusing to start the volume > logging "transport endpoint is disconnected" although all nodes were able > to reach each other (ping) in the storage network. I switched the mode to > mode 1 (active/passive) and initially it worked but following a reboot of > all cluster same issue appeared. Gluster is not starting the volumes. > > Isn't active/passive supposed to work like that? Can one have such > redundant network setup or are there any other recommended approaches? > > > Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no > doubt, best way. > > > Thanx, > Alex > > _______________________________________________ > Gluster-users mailing listGluster-users at gluster.orghttps://lists.gluster.org/mailman/listinfo/gluster-users > > > > _______________________________________________ > Gluster-users mailing listGluster-users at gluster.orghttps://lists.gluster.org/mailman/listinfo/gluster-users > > > > > > Met vriendelijke groet, With kind regards, > > Jorick Astrego > > *Netbulae Virtualization Experts * > ------------------------------ > Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 > Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 > ------------------------------ > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From snowmailer at gmail.com Mon Feb 25 12:16:19 2019 From: snowmailer at gmail.com (Martin Toth) Date: Mon, 25 Feb 2019 13:16:19 +0100 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: Hi Alex, you have to use bond mode 4 (LACP - 802.3ad) in order to achieve redundancy of cables/ports/switches. I suppose this is what you want. BR, Martin > On 25 Feb 2019, at 11:43, Alex K wrote: > > Hi All, > > I was asking if it is possible to have the two separate cables connected to two different physical switched. When trying mode6 or mode1 in this setup gluster was refusing to start the volumes, giving me "transport endpoint is not connected". > > server1: cable1 ---------------- switch1 --------------------- server2: cable1 > | > server1: cable2 ---------------- switch2 --------------------- server2: cable2 > > Both switches are connected with each other also. This is done to achieve redundancy for the switches. > When disconnecting cable2 from both servers, then gluster was happy. > What could be the problem? > > Thanx, > Alex > > > On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego > wrote: > Hi, > > We use bonding mode 6 (balance-alb) for GlusterFS traffic > > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 > Preferred bonding mode for Red Hat Gluster Storage client is mode 6 (balance-alb), this allows client to transmit writes in parallel on separate NICs much of the time. > Regards, > > Jorick Astrego > On 2/25/19 5:41 AM, Dmitry Melekhov wrote: >> 23.02.2019 19:54, Alex K ?????: >>> Hi all, >>> >>> I have a replica 3 setup where each server was configured with a dual interfaces in mode 6 bonding. All cables were connected to one common network switch. >>> >>> To add redundancy to the switch, and avoid being a single point of failure, I connected each second cable of each server to a second switch. 
This turned out to not function as gluster was refusing to start the volume logging "transport endpoint is disconnected" although all nodes were able to reach each other (ping) in the storage network. I switched the mode to mode 1 (active/passive) and initially it worked but following a reboot of all cluster same issue appeared. Gluster is not starting the volumes. >>> >>> Isn't active/passive supposed to work like that? Can one have such redundant network setup or are there any other recommended approaches? >>> >> >> Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no doubt, best way. >> >> >>> Thanx, >>> Alex >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > Met vriendelijke groet, With kind regards, > > Jorick Astrego > > Netbulae Virtualization Experts > Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 > Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From jim.kinney at gmail.com Mon Feb 25 12:17:30 2019 From: jim.kinney at gmail.com (Jim Kinney) Date: Mon, 25 Feb 2019 07:17:30 -0500 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: <92ABE315-CA69-4757-A9B3-3060512E08A8@gmail.com> Unless the link between the two switches is set as a dedicated management link, won't that link create a problem? On the dual switch setup I have, there's a dedicated connection that handles inter-switch data. I'm not using bonding or teaming at the servers as I have 40Gb ethernet nics. Gluster is fine across this. On February 25, 2019 5:43:24 AM EST, Alex K wrote: >Hi All, > >I was asking if it is possible to have the two separate cables >connected to >two different physical switched. When trying mode6 or mode1 in this >setup >gluster was refusing to start the volumes, giving me "transport >endpoint is >not connected". > >server1: cable1 ---------------- switch1 --------------------- server2: >cable1 > | >server1: cable2 ---------------- switch2 --------------------- server2: >cable2 > >Both switches are connected with each other also. This is done to >achieve >redundancy for the switches. >When disconnecting cable2 from both servers, then gluster was happy. >What could be the problem? > >Thanx, >Alex > > >On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego >wrote: > >> Hi, >> >> We use bonding mode 6 (balance-alb) for GlusterFS traffic >> >> >> >https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 >> >> Preferred bonding mode for Red Hat Gluster Storage client is mode 6 >> (balance-alb), this allows client to transmit writes in parallel on >> separate NICs much of the time. 
>> >> Regards, >> >> Jorick Astrego >> On 2/25/19 5:41 AM, Dmitry Melekhov wrote: >> >> 23.02.2019 19:54, Alex K ?????: >> >> Hi all, >> >> I have a replica 3 setup where each server was configured with a dual >> interfaces in mode 6 bonding. All cables were connected to one common >> network switch. >> >> To add redundancy to the switch, and avoid being a single point of >> failure, I connected each second cable of each server to a second >switch. >> This turned out to not function as gluster was refusing to start the >volume >> logging "transport endpoint is disconnected" although all nodes were >able >> to reach each other (ping) in the storage network. I switched the >mode to >> mode 1 (active/passive) and initially it worked but following a >reboot of >> all cluster same issue appeared. Gluster is not starting the volumes. >> >> Isn't active/passive supposed to work like that? Can one have such >> redundant network setup or are there any other recommended >approaches? >> >> >> Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no >> doubt, best way. >> >> >> Thanx, >> Alex >> >> _______________________________________________ >> Gluster-users mailing >listGluster-users at gluster.orghttps://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> _______________________________________________ >> Gluster-users mailing >listGluster-users at gluster.orghttps://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> >> >> Met vriendelijke groet, With kind regards, >> >> Jorick Astrego >> >> *Netbulae Virtualization Experts * >> ------------------------------ >> Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 >> Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW >NL821234584B01 >> ------------------------------ >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users -- Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity. -------------- next part -------------- An HTML attachment was scrubbed... URL: From dm at belkam.com Mon Feb 25 12:30:05 2019 From: dm at belkam.com (Dmitry Melekhov) Date: Mon, 25 Feb 2019 16:30:05 +0400 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: <47420d75-88c2-1595-7e2c-f188ac6ea05d@belkam.com> 25.02.2019 14:43, Alex K ?????: > Hi All, > > I was asking if it is possible to have the two separate cables > connected to two different physical switched. Yes, if these switches are in cluster, we use comware switches, so we use IRF, I guess cisco has lacp support on several switches in nexus.. > When trying mode6 or mode1 in this setup gluster was refusing to start > the volumes, giving me "transport endpoint is not connected". > > server1: cable1 ---------------- switch1 --------------------- > server2: cable1 > ?????????????????????????????? ? ? ? ? ? ?? | > server1: cable2 ---------------- switch2 --------------------- > server2: cable2 > > Both switches are connected with each other also. This is done to > achieve redundancy for the switches. > When disconnecting cable2 from both servers, then gluster was happy. > What could be the?problem? If you need just redundancy, may be you can use STP? combine port in bridge. Never tried this though, don't know how good is STP support in linux bridge... btw, I don't think this is gluster problem, I think you have to ask in sort of linux networking list. 
From jorick at netbulae.eu Mon Feb 25 12:44:04 2019 From: jorick at netbulae.eu (Jorick Astrego) Date: Mon, 25 Feb 2019 13:44:04 +0100 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: Hi, Well no, mode 5 and mode 6 also have fault tollerance and don't need any special switch config. Quick google search: https://serverfault.com/questions/734246/does-balance-alb-and-balance-tlb-support-fault-tolerance Bonding Mode 5 (balance-tlb) works by looking at all the devices in the bond, and sending out the slave with the least current traffic load. Traffic is only received by one slave (the "primary slave"). If a slave is lost, that slave is not considered for transmission, so this mode is fault-tolerant. Bonding Mode 6 (balance-alb) works as above, except incoming ARP requests are intercepted by the bonding driver, and the bonding driver generates ARP replies so that external hosts are tricked into sending their traffic into one of the other bonding slaves instead of the primary slave. If many hosts in the same broadcast domain contact the bond, then traffic should balance roughly evenly into all slaves. If a slave is lost in Mode 6, then it may take some time for a remote host to time out its ARP table entry and send a new ARP request. A TCP or SCTP retransmission tents to lead into ARP request fairly quickly, but a UDP datagram does not, and will rely on the usual ARP table refresh. So Mode 6 /is/ fault tolerant, but convergence on slave loss may take some time depending on the Layer 4 protocol used. If you are worried about fast fault tolerance, then consider using Mode 4 (802.3ad aka LACP) which negotiates link aggregation between the bond and the switch, and constantly updates the link status between the aggregation partners. Mode 4 also has configurable load balance hashing so is better for in-order delivery of TCP streams compared to Mode 5 or Mode 6. https://wiki.linuxfoundation.org/networking/bonding * *balance-tlb or 5* Adaptive transmit load balancing: channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. *If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave.* o Prerequisite: 1. Ethtool support in the base drivers for retrieving the speed of each slave. * *balance-alb or 6?* Adaptive load balancing: *includes balance-tlb plus receive load balancing* (rlb) for IPV4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. o The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server. o Receive traffic from connections created by the server is also balanced. When the local system sends an ARP Request the bonding driver copies and saves the peer's IP information from the ARP packet. o When the ARP Reply arrives from the peer, its hardware address is retrieved and the bonding driver initiates an ARP reply to this peer assigning it to one of the slaves in the bond. o A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast it uses the hardware address of the bond. 
Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave. This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed. Receive traffic is also redistributed when a new slave is added to the bond and when an inactive slave is re-activated. The receive load is distributed sequentially (round robin) among the group of highest speed slaves in the bond. o When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all active slaves in the bond by initiating ARP Replies with the selected mac address to each of the clients. The updelay parameter (detailed below) must be set to a value equal or greater than the switch's forwarding delay so that the ARP Replies sent to the peers will not be blocked by the switch. On 2/25/19 1:16 PM, Martin Toth wrote: > Hi Alex, > > you have to use bond mode 4 (LACP - 802.3ad) in order to achieve > redundancy of cables/ports/switches. I suppose this is what you want. > > BR, > Martin > >> On 25 Feb 2019, at 11:43, Alex K > > wrote: >> >> Hi All, >> >> I was asking if it is possible to have the two separate cables >> connected to two different physical switched. When trying mode6 or >> mode1 in this setup gluster was refusing to start the volumes, giving >> me "transport endpoint is not connected". >> >> server1: cable1 ---------------- switch1 --------------------- >> server2: cable1 >> ?????????????????????????????? ? ? ? ? ? ?? | >> server1: cable2 ---------------- switch2 --------------------- >> server2: cable2 >> >> Both switches are connected with each other also. This is done to >> achieve redundancy for the switches. >> When disconnecting cable2 from both servers, then gluster was happy. >> What could be the?problem? >> >> Thanx, >> Alex >> >> >> On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego > > wrote: >> >> Hi, >> >> We use bonding mode 6 (balance-alb) for GlusterFS traffic >> >> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 >> >> Preferred bonding mode for Red Hat Gluster Storage client is >> mode 6 (balance-alb), this allows client to transmit writes >> in parallel on separate NICs much of the time. >> >> Regards, >> >> Jorick Astrego >> >> On 2/25/19 5:41 AM, Dmitry Melekhov wrote: >>> 23.02.2019 19:54, Alex K ?????: >>>> Hi all, >>>> >>>> I have a replica 3 setup where each server was configured with >>>> a dual interfaces in mode 6 bonding. All cables were connected >>>> to one common network switch. >>>> >>>> To add redundancy to the switch, and avoid being a single point >>>> of failure, I connected each second cable of each server to a >>>> second switch. This turned out to not function as gluster was >>>> refusing to start the volume logging "transport endpoint is >>>> disconnected" although all nodes were able to reach each other >>>> (ping) in the storage network. I switched the mode to mode 1 >>>> (active/passive) and initially it worked but following a reboot >>>> of all cluster same issue appeared. Gluster is not starting the >>>> volumes. >>>> >>>> Isn't active/passive supposed to work like that? Can one have >>>> such redundant network setup or are there any other recommended >>>> approaches? >>>> >>> >>> Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it >>> is, no doubt, best way. 
>>> >>> >>>> Thanx, >>>> Alex >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> >> Met vriendelijke groet, With kind regards, >> >> Jorick Astrego >> * >> Netbulae Virtualization Experts * >> ------------------------------------------------------------------------ >> Tel: 053 20 30 270 info at netbulae.eu >> Staalsteden 4-3A KvK 08198180 >> Fax: 053 20 30 271 www.netbulae.eu >> 7547 TA Enschede BTW NL821234584B01 >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users Met vriendelijke groet, With kind regards, Jorick Astrego Netbulae Virtualization Experts ---------------- Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 ---------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From snowmailer at gmail.com Mon Feb 25 13:22:05 2019 From: snowmailer at gmail.com (Martin Toth) Date: Mon, 25 Feb 2019 14:22:05 +0100 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: <9B82193A-15EF-43F6-882B-BA8E7862770A@gmail.com> How long does it take to your devices (using mode 5 or 6, ALB is prefered for GlusterFS) to take-over the MAC? This can result in your error - "transport endpoint is not connected? - there are some timeouts within gluster set by default. I am using LACP and it works without any problem. Can you share your mode 5 / 6 configuration ? Thanks. Martin > On 25 Feb 2019, at 13:44, Jorick Astrego wrote: > > Hi, > > Well no, mode 5 and mode 6 also have fault tollerance and don't need any special switch config. > > Quick google search: > > https://serverfault.com/questions/734246/does-balance-alb-and-balance-tlb-support-fault-tolerance > Bonding Mode 5 (balance-tlb) works by looking at all the devices in the bond, and sending out the slave with the least current traffic load. Traffic is only received by one slave (the "primary slave"). If a slave is lost, that slave is not considered for transmission, so this mode is fault-tolerant. > > Bonding Mode 6 (balance-alb) works as above, except incoming ARP requests are intercepted by the bonding driver, and the bonding driver generates ARP replies so that external hosts are tricked into sending their traffic into one of the other bonding slaves instead of the primary slave. If many hosts in the same broadcast domain contact the bond, then traffic should balance roughly evenly into all slaves. > > If a slave is lost in Mode 6, then it may take some time for a remote host to time out its ARP table entry and send a new ARP request. 
A TCP or SCTP retransmission tents to lead into ARP request fairly quickly, but a UDP datagram does not, and will rely on the usual ARP table refresh. So Mode 6 is fault tolerant, but convergence on slave loss may take some time depending on the Layer 4 protocol used. > > If you are worried about fast fault tolerance, then consider using Mode 4 (802.3ad aka LACP) which negotiates link aggregation between the bond and the switch, and constantly updates the link status between the aggregation partners. Mode 4 also has configurable load balance hashing so is better for in-order delivery of TCP streams compared to Mode 5 or Mode 6. > > https://wiki.linuxfoundation.org/networking/bonding > balance-tlb or 5 > Adaptive transmit load balancing: channel bonding that does not require any special switch support. The outgoing traffic is distributed according to the current load (computed relative to the speed) on each slave. Incoming traffic is received by the current slave. If the receiving slave fails, another slave takes over the MAC address of the failed receiving slave. > Prerequisite: > Ethtool support in the base drivers for retrieving the speed of each slave. > balance-alb or 6 > Adaptive load balancing: includes balance-tlb plus receive load balancing (rlb) for IPV4 traffic, and does not require any special switch support. The receive load balancing is achieved by ARP negotiation. > The bonding driver intercepts the ARP Replies sent by the local system on their way out and overwrites the source hardware address with the unique hardware address of one of the slaves in the bond such that different peers use different hardware addresses for the server. > Receive traffic from connections created by the server is also balanced. When the local system sends an ARP Request the bonding driver copies and saves the peer's IP information from the ARP packet. > When the ARP Reply arrives from the peer, its hardware address is retrieved and the bonding driver initiates an ARP reply to this peer assigning it to one of the slaves in the bond. > A problematic outcome of using ARP negotiation for balancing is that each time that an ARP request is broadcast it uses the hardware address of the bond. Hence, peers learn the hardware address of the bond and the balancing of receive traffic collapses to the current slave. This is handled by sending updates (ARP Replies) to all the peers with their individually assigned hardware address such that the traffic is redistributed. Receive traffic is also redistributed when a new slave is added to the bond and when an inactive slave is re-activated. The receive load is distributed sequentially (round robin) among the group of highest speed slaves in the bond. > When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all active slaves in the bond by initiating ARP Replies with the selected mac address to each of the clients. The updelay parameter (detailed below) must be set to a value equal or greater than the switch's forwarding delay so that the ARP Replies sent to the peers will not be blocked by the switch. > On 2/25/19 1:16 PM, Martin Toth wrote: >> Hi Alex, >> >> you have to use bond mode 4 (LACP - 802.3ad) in order to achieve redundancy of cables/ports/switches. I suppose this is what you want. >> >> BR, >> Martin >> >>> On 25 Feb 2019, at 11:43, Alex K > wrote: >>> >>> Hi All, >>> >>> I was asking if it is possible to have the two separate cables connected to two different physical switched. 
When trying mode6 or mode1 in this setup gluster was refusing to start the volumes, giving me "transport endpoint is not connected". >>> >>> server1: cable1 ---------------- switch1 --------------------- server2: cable1 >>> | >>> server1: cable2 ---------------- switch2 --------------------- server2: cable2 >>> >>> Both switches are connected with each other also. This is done to achieve redundancy for the switches. >>> When disconnecting cable2 from both servers, then gluster was happy. >>> What could be the problem? >>> >>> Thanx, >>> Alex >>> >>> >>> On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego > wrote: >>> Hi, >>> >>> We use bonding mode 6 (balance-alb) for GlusterFS traffic >>> >>> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 >>> Preferred bonding mode for Red Hat Gluster Storage client is mode 6 (balance-alb), this allows client to transmit writes in parallel on separate NICs much of the time. >>> Regards, >>> >>> Jorick Astrego >>> On 2/25/19 5:41 AM, Dmitry Melekhov wrote: >>>> 23.02.2019 19:54, Alex K ?????: >>>>> Hi all, >>>>> >>>>> I have a replica 3 setup where each server was configured with a dual interfaces in mode 6 bonding. All cables were connected to one common network switch. >>>>> >>>>> To add redundancy to the switch, and avoid being a single point of failure, I connected each second cable of each server to a second switch. This turned out to not function as gluster was refusing to start the volume logging "transport endpoint is disconnected" although all nodes were able to reach each other (ping) in the storage network. I switched the mode to mode 1 (active/passive) and initially it worked but following a reboot of all cluster same issue appeared. Gluster is not starting the volumes. >>>>> >>>>> Isn't active/passive supposed to work like that? Can one have such redundant network setup or are there any other recommended approaches? >>>>> >>>> >>>> Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no doubt, best way. 
>>>> >>>> >>>>> Thanx, >>>>> Alex >>>>> >>>>> >>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >>> >>> Met vriendelijke groet, With kind regards, >>> >>> Jorick Astrego >>> >>> Netbulae Virtualization Experts >>> Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 >>> Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > Met vriendelijke groet, With kind regards, > > Jorick Astrego > > Netbulae Virtualization Experts > Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 > Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From jorick at netbulae.eu Mon Feb 25 14:24:09 2019 From: jorick at netbulae.eu (Jorick Astrego) Date: Mon, 25 Feb 2019 15:24:09 +0100 Subject: [Gluster-users] Gluster and bonding In-Reply-To: <9B82193A-15EF-43F6-882B-BA8E7862770A@gmail.com> References: <9B82193A-15EF-43F6-882B-BA8E7862770A@gmail.com> Message-ID: Hi, Have not measured it as we have been running this way for years now and haven't experienced any problems with "transport endpoint is not connected? with this setup. We used the default options "BONDING_OPTS='mode=6 miimon=100'" |miimon=/time_in_milliseconds/ | Specifies (in milliseconds) how often MII link monitoring occurs. This is useful if high availability is required because MII is used to verify that the NIC is active. On 2/25/19 2:22 PM, Martin Toth wrote: > How long does it take to your devices (using mode 5 or 6, ALB is > prefered for GlusterFS) to take-over the MAC? This can result in your > error -??"transport endpoint is not connected? - there are some > timeouts within gluster set by default. > I am using LACP and it works without any problem. Can you share your > mode 5 / 6 configuration ? > > Thanks. > Martin > >> On 25 Feb 2019, at 13:44, Jorick Astrego > > wrote: >> >> Hi, >> >> Well no, mode 5 and mode 6 also have fault tollerance and don't need >> any special switch config. >> >> Quick google search: >> >> https://serverfault.com/questions/734246/does-balance-alb-and-balance-tlb-support-fault-tolerance >> >> Bonding Mode 5 (balance-tlb) works by looking at all the devices >> in the bond, and sending out the slave with the least current >> traffic load. Traffic is only received by one slave (the "primary >> slave"). If a slave is lost, that slave is not considered for >> transmission, so this mode is fault-tolerant. 
>> >> Bonding Mode 6 (balance-alb) works as above, except incoming ARP >> requests are intercepted by the bonding driver, and the bonding >> driver generates ARP replies so that external hosts are tricked >> into sending their traffic into one of the other bonding slaves >> instead of the primary slave. If many hosts in the same broadcast >> domain contact the bond, then traffic should balance roughly >> evenly into all slaves. >> >> If a slave is lost in Mode 6, then it may take some time for a >> remote host to time out its ARP table entry and send a new ARP >> request. A TCP or SCTP retransmission tents to lead into ARP >> request fairly quickly, but a UDP datagram does not, and will >> rely on the usual ARP table refresh. So Mode 6 /is/ fault >> tolerant, but convergence on slave loss may take some time >> depending on the Layer 4 protocol used. >> >> If you are worried about fast fault tolerance, then consider >> using Mode 4 (802.3ad aka LACP) which negotiates link aggregation >> between the bond and the switch, and constantly updates the link >> status between the aggregation partners. Mode 4 also has >> configurable load balance hashing so is better for in-order >> delivery of TCP streams compared to Mode 5 or Mode 6. >> >> https://wiki.linuxfoundation.org/networking/bonding >> >> * >> *balance-tlb or 5* >> Adaptive transmit load balancing: channel bonding that does not >> require any special switch support. The outgoing traffic is >> distributed according to the current load (computed relative to >> the speed) on each slave. Incoming traffic is received by the >> current slave. *If the receiving slave fails, another slave takes >> over the MAC address of the failed receiving slave.* >> o >> Prerequisite: >> 1. >> Ethtool support in the base drivers for retrieving the >> speed of each slave. >> * >> *balance-alb or 6?* >> Adaptive load balancing: *includes balance-tlb plus receive load >> balancing* (rlb) for IPV4 traffic, and does not require any >> special switch support. The receive load balancing is achieved by >> ARP negotiation. >> o >> The bonding driver intercepts the ARP Replies sent by the >> local system on their way out and overwrites the source >> hardware address with the unique hardware address of one of >> the slaves in the bond such that different peers use >> different hardware addresses for the server. >> o >> Receive traffic from connections created by the server is >> also balanced. When the local system sends an ARP Request the >> bonding driver copies and saves the peer's IP information >> from the ARP packet. >> o >> When the ARP Reply arrives from the peer, its hardware >> address is retrieved and the bonding driver initiates an ARP >> reply to this peer assigning it to one of the slaves in the bond. >> o >> A problematic outcome of using ARP negotiation for balancing >> is that each time that an ARP request is broadcast it uses >> the hardware address of the bond. Hence, peers learn the >> hardware address of the bond and the balancing of receive >> traffic collapses to the current slave. This is handled by >> sending updates (ARP Replies) to all the peers with their >> individually assigned hardware address such that the traffic >> is redistributed. Receive traffic is also redistributed when >> a new slave is added to the bond and when an inactive slave >> is re-activated. The receive load is distributed sequentially >> (round robin) among the group of highest speed slaves in the >> bond. 
>> o >> When a link is reconnected or a new slave joins the bond the >> receive traffic is redistributed among all active slaves in >> the bond by initiating ARP Replies with the selected mac >> address to each of the clients. The updelay parameter >> (detailed below) must be set to a value equal or greater than >> the switch's forwarding delay so that the ARP Replies sent to >> the peers will not be blocked by the switch. >> >> On 2/25/19 1:16 PM, Martin Toth wrote: >>> Hi Alex, >>> >>> you have to use bond mode 4 (LACP - 802.3ad) in order to achieve >>> redundancy of cables/ports/switches. I suppose this is what you want. >>> >>> BR, >>> Martin >>> >>>> On 25 Feb 2019, at 11:43, Alex K >>> > wrote: >>>> >>>> Hi All, >>>> >>>> I was asking if it is possible to have the two separate cables >>>> connected to two different physical switched. When trying mode6 or >>>> mode1 in this setup gluster was refusing to start the volumes, >>>> giving me "transport endpoint is not connected". >>>> >>>> server1: cable1 ---------------- switch1 --------------------- >>>> server2: cable1 >>>> ?????????????????????????????? ? ? ? ? ? ?? | >>>> server1: cable2 ---------------- switch2 --------------------- >>>> server2: cable2 >>>> >>>> Both switches are connected with each other also. This is done to >>>> achieve redundancy for the switches. >>>> When disconnecting cable2 from both servers, then gluster was happy. >>>> What could be the?problem? >>>> >>>> Thanx, >>>> Alex >>>> >>>> >>>> On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego >>> > wrote: >>>> >>>> Hi, >>>> >>>> We use bonding mode 6 (balance-alb) for GlusterFS traffic >>>> >>>> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 >>>> >>>> Preferred bonding mode for Red Hat Gluster Storage client >>>> is mode 6 (balance-alb), this allows client to transmit >>>> writes in parallel on separate NICs much of the time. >>>> >>>> Regards, >>>> >>>> Jorick Astrego >>>> >>>> On 2/25/19 5:41 AM, Dmitry Melekhov wrote: >>>>> 23.02.2019 19:54, Alex K ?????: >>>>>> Hi all, >>>>>> >>>>>> I have a replica 3 setup where each server was configured >>>>>> with a dual interfaces in mode 6 bonding. All cables were >>>>>> connected to one common network switch. >>>>>> >>>>>> To add redundancy to the switch, and avoid being a single >>>>>> point of failure, I connected each second cable of each >>>>>> server to a second switch. This turned out to not function as >>>>>> gluster was refusing to start the volume logging "transport >>>>>> endpoint is disconnected" although all nodes were able to >>>>>> reach each other (ping) in the storage network. I switched >>>>>> the mode to mode 1 (active/passive) and initially it worked >>>>>> but following a reboot of all cluster same issue appeared. >>>>>> Gluster is not starting the volumes. >>>>>> >>>>>> Isn't active/passive supposed to work like that? Can one have >>>>>> such redundant network setup or are there any other >>>>>> recommended approaches? >>>>>> >>>>> >>>>> Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it >>>>> is, no doubt, best way. 
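To put an actual number on the question of how long the MAC take-over takes,
the bonding driver exposes its state under /proc, so one rough way to watch a
failover while pulling a cable is sketched below (bond0 is just the assumed
bond name; teamd users would look at teamdctl instead):

    # show mode, miimon/updelay values and per-slave link status
    cat /proc/net/bonding/bond0

    # watch the active slave and MII status change during the failover
    watch -n1 'grep -E "Active Slave|MII Status|Slave Interface" /proc/net/bonding/bond0'

    # from a peer, time how long traffic is actually lost
    ping -i 0.2 <server-storage-ip>

    # for a teamd-based mode 4 setup, the equivalent view is
    teamdctl team0 state

Comparing the observed gap against gluster's network.ping-timeout (42 seconds
by default) shows whether a failover of this length could plausibly trigger
the "transport endpoint is not connected" errors seen in this thread.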
>>>>> >>>>> >>>>>> Thanx, >>>>>> Alex >>>>>> >>>>>> _______________________________________________ >>>>>> Gluster-users mailing list >>>>>> Gluster-users at gluster.org >>>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Gluster-users mailing list >>>>> Gluster-users at gluster.org >>>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> >>>> >>>> >>>> Met vriendelijke groet, With kind regards, >>>> >>>> Jorick Astrego >>>> * >>>> Netbulae Virtualization Experts * >>>> ------------------------------------------------------------------------ >>>> Tel: 053 20 30 270 info at netbulae.eu >>>> Staalsteden 4-3A KvK 08198180 >>>> Fax: 053 20 30 271 www.netbulae.eu >>>> 7547 TA Enschede BTW NL821234584B01 >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>>> >>>> _______________________________________________ >>>> Gluster-users mailing list >>>> Gluster-users at gluster.org >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> >> Met vriendelijke groet, With kind regards, >> >> Jorick Astrego >> * >> Netbulae Virtualization Experts * >> ------------------------------------------------------------------------ >> Tel: 053 20 30 270 info at netbulae.eu >> Staalsteden 4-3A KvK 08198180 >> Fax: 053 20 30 271 www.netbulae.eu 7547 TA >> Enschede BTW NL821234584B01 >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > Met vriendelijke groet, With kind regards, Jorick Astrego Netbulae Virtualization Experts ---------------- Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 ---------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From srangana at redhat.com Mon Feb 25 15:10:21 2019 From: srangana at redhat.com (Shyam Ranganathan) Date: Mon, 25 Feb 2019 10:10:21 -0500 Subject: [Gluster-users] [Gluster-Maintainers] glusterfs-6.0rc0 released In-Reply-To: References: <430948742.3.1550808940070.JavaMail.jenkins@jenkins-el7.rht.gluster.org> Message-ID: Hi, Release-6 RC0 packages are built (see mail below). This is a good time to start testing the release bits, and reporting any issues on bugzilla. Do post on the lists any testing done and feedback from the same. We have about 2 weeks to GA of release-6 barring any major blockers uncovered during the test phase. Please take this time to help make the release effective, by testing the same. Thanks, Shyam NOTE: CentOS StorageSIG packages for the same are still pending and should be available in due course. On 2/23/19 9:41 AM, Kaleb Keithley wrote: > > GlusterFS 6.0rc0 is built in Fedora 30 and Fedora 31/rawhide. 
> Packages for Fedora 29, RHEL 8, RHEL 7, and RHEL 6* and Debian 9/stretch
> and Debian 10/buster are at
> https://download.gluster.org/pub/gluster/glusterfs/qa-releases/6.0rc0/
>
> Packages are signed. The public key is at
> https://download.gluster.org/pub/gluster/glusterfs/6/rsa.pub
>
> * RHEL 6 is client-side only. Fedora 29, RHEL 7, and RHEL 6 RPMs are
> Fedora Koji scratch builds. RHEL 7 and RHEL 6 RPMs are provided here for
> convenience only, and are independent of the RPMs in the CentOS Storage SIG.

From bb at kernelpanic.ru  Mon Feb 25 16:48:51 2019
From: bb at kernelpanic.ru (Boris Zhmurov)
Date: Mon, 25 Feb 2019 16:48:51 +0000
Subject: [Gluster-users] Gluster and bonding
In-Reply-To:
References: <9B82193A-15EF-43F6-882B-BA8E7862770A@gmail.com>
Message-ID: <063ec31e-69c0-295b-8a63-89e56073192b@kernelpanic.ru>

On 25/02/2019 14:24, Jorick Astrego wrote:
>
> Hi,
>
> Have not measured it as we have been running this way for years now
> and haven't experienced any problems with "transport endpoint is not
> connected" with this setup.
>

Hello,

Jorick, how often (during those years) did your NICs break?

--
Kind regards,
Boris Zhmurov
mailto: bb at kernelpanic.ru

From alvin at netvel.net  Mon Feb 25 17:01:52 2019
From: alvin at netvel.net (Alvin Starr)
Date: Mon, 25 Feb 2019 12:01:52 -0500
Subject: [Gluster-users] Gluster and bonding
In-Reply-To: <063ec31e-69c0-295b-8a63-89e56073192b@kernelpanic.ru>
References: <9B82193A-15EF-43F6-882B-BA8E7862770A@gmail.com>
 <063ec31e-69c0-295b-8a63-89e56073192b@kernelpanic.ru>
Message-ID: <61d22968-576c-cc32-e351-fa48e862ec8a@netvel.net>

On 2/25/19 11:48 AM, Boris Zhmurov wrote:
> On 25/02/2019 14:24, Jorick Astrego wrote:
>>
>> Hi,
>>
>> Have not measured it as we have been running this way for years now
>> and haven't experienced any problems with "transport endpoint is not
>> connected" with this setup.
>>
>
> Hello,
>
> Jorick, how often (during those years) did your NICs break?
>

Over the years (30) I have had problems with bad ports on switches, with
some manufacturers being worse than others.

--
Alvin Starr                   ||   land:  (905)513-7688
Netvel Inc.                   ||   Cell:  (416)806-0133
alvin at netvel.net              ||

From atumball at redhat.com  Mon Feb 25 18:11:34 2019
From: atumball at redhat.com (Amar Tumballi Suryanarayan)
Date: Mon, 25 Feb 2019 23:41:34 +0530
Subject: [Gluster-users] GlusterFS - 6.0RC - Test days (27th, 28th Feb)
Message-ID:

Hi all,

We are calling out our users and developers to contribute to validating the
"glusterfs-6.0rc" build in their use case, especially for upgrade, stability,
and performance.

Some of the key highlights of the release are listed in the release-notes
draft. Please note that some features are being dropped from this release, so
making sure your setup is not going to be affected is critical. Also, the
default lru-limit option in the fuse mount for inodes should help to control
the memory usage of client processes. All the more reason to give it a shot
in your test setup.

If you are a developer using the gfapi interface to integrate with other
projects, there are also some signature changes, so please make sure your
project works with the latest release. Or, even if you are using a project
which depends on gfapi, report any error seen with the new RPMs. We will help
fix it.

As part of the test days, we want to focus on testing the latest upcoming
release, i.e. GlusterFS-6, and one or more Gluster volunteers will be on the
#gluster channel on freenode to assist. 
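Since a couple of the points below ask for "profile info" and "statedump"
data, here is roughly how that data can be collected while you test; the
volume name "testvol" and the output paths are only placeholders:

    # collect fop latencies before and after the upgrade for comparison
    gluster volume profile testvol start
    # ... run your usual workload ...
    gluster volume profile testvol info > profile-6.0rc.txt
    gluster volume profile testvol stop

    # brick-side statedumps (written on the servers, usually under /var/run/gluster)
    gluster volume statedump testvol

    # for a fuse client, SIGUSR1 makes the mount process write its own statedump;
    # adjust the pgrep pattern so it matches only your client mount process
    kill -USR1 $(pgrep -f 'glusterfs.*testvol')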
Some of the key things we are looking as bug reports are: - See if upgrade from your current version to 6.0rc is smooth, and works as documented. - Report bugs in process, or in documentation if you find mismatch. - Functionality is all as expected for your usecase. - No issues with actual application you would run on production etc. - Performance has not degraded in your usecase. - While we have added some performance options to the code, not all of them are turned on, as they have to be done based on usecases. - Make sure the default setup is at least same as your current version - Try out few options mentioned in release notes (especially, --auto-invalidation=no) and see if it helps performance. - While doing all the above, check below: - see if the log files are making sense, and not flooding with some ?for developer only? type of messages. - get ?profile info? output from old and now, and see if there is anything which is out of normal expectation. Check with us on the numbers. - get a ?statedump? when there are some issues. Try to make sense of it, and raise a bug if you don?t understand it completely. Process expected on test days. - We have a tracker bug [0] - We will attach all the ?blocker? bugs to this bug. - Use this link to report bugs, so that we have more metadata around given bugzilla. - Click Here [1] - The test cases which are to be tested are listed here in this sheet [2], please add, update, and keep it up-to-date to reduce duplicate efforts. Lets together make this release a success. Also check if we covered some of the open issues from Weekly untriaged bugs [3] For details on build and RPMs check this email [4] Finally, the dates :-) - Wednesday - Feb 27th, and - Thursday - Feb 28th Note that our goal is to identify as many issues as possible in upgrade and stability scenarios, and if any blockers are found, want to make sure we release with the fix for same. So each of you, Gluster users, feel comfortable to upgrade to 6.0 version. Regards, Gluster Ants. -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From cboubacar at gmail.com Sat Feb 23 22:28:59 2019 From: cboubacar at gmail.com (Boubacar Cisse) Date: Sat, 23 Feb 2019 16:28:59 -0600 Subject: [Gluster-users] Geo-Replication in "FAULTY" state after files are added to master volume: gsyncd worker crashed in syncdutils with "OSError: [Errno 22] Invalid argument Message-ID: Hello all, I having trouble making gluster geo-replication on Ubuntu 18.04 (Bionic). Gluster version is 5.3. I'm able to successfully create the geo-replication session but status goes from "Initializing" to "Faulty" in a loop after session is started. I've created a bug report with all the necessary information at https://bugzilla.redhat.com/show_bug.cgi?id=1680324 Any assistance/tips fixing this issue will be greatly appreciated. 5/ Log entries [MASTER SERVER GEO REP LOG] root at media01:/var/log/glusterfs/geo-replication/gfs1_media03_gfs1# cat gsyncd.log [2019-02-23 21:36:43.851184] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Initializing... 
[2019-02-23 21:36:43.851489] I [monitor(monitor):157:monitor] Monitor: starting gsyncd worker brick=/gfs1-data/brick slave_node=media03 [2019-02-23 21:36:43.856857] D [monitor(monitor):228:monitor] Monitor: Worker would mount volume privately [2019-02-23 21:36:43.895652] I [gsyncd(agent /gfs1-data/brick):308:main] : Using session config file path=/var/lib/glusterd/geo-replication/gfs1_media03_gfs1/gsyncd.conf [2019-02-23 21:36:43.896118] D [subcmds(agent /gfs1-data/brick):103:subcmd_agent] : RPC FD rpc_fd='8,11,10,9' [2019-02-23 21:36:43.896435] I [changelogagent(agent /gfs1-data/brick):72:__init__] ChangelogAgent: Agent listining... [2019-02-23 21:36:43.897432] I [gsyncd(worker /gfs1-data/brick):308:main] : Using session config file path=/var/lib/glusterd/geo-replication/gfs1_media03_gfs1/gsyncd.conf [2019-02-23 21:36:43.904604] I [resource(worker /gfs1-data/brick):1366:connect_remote] SSH: Initializing SSH connection between master and slave... [2019-02-23 21:36:43.905631] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957803.9055686 __repce_version__() ... [2019-02-23 21:36:45.751853] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957803.9055686 __repce_version__ -> 1.0 [2019-02-23 21:36:45.752202] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957805.7521348 version() ... [2019-02-23 21:36:45.785690] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957805.7521348 version -> 1.0 [2019-02-23 21:36:45.786081] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957805.7860181 pid() ... [2019-02-23 21:36:45.820014] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957805.7860181 pid -> 24141 [2019-02-23 21:36:45.820337] I [resource(worker /gfs1-data/brick):1413:connect_remote] SSH: SSH connection between master and slave established. duration=1.9156 [2019-02-23 21:36:45.820520] I [resource(worker /gfs1-data/brick):1085:connect] GLUSTER: Mounting gluster volume locally... [2019-02-23 21:36:45.837300] D [resource(worker /gfs1-data/brick):859:inhibit] DirectMounter: auxiliary glusterfs mount in place [2019-02-23 21:36:46.843754] D [resource(worker /gfs1-data/brick):933:inhibit] DirectMounter: auxiliary glusterfs mount prepared [2019-02-23 21:36:46.844113] I [resource(worker /gfs1-data/brick):1108:connect] GLUSTER: Mounted gluster volume duration=1.0234 [2019-02-23 21:36:46.844283] I [subcmds(worker /gfs1-data/brick):80:subcmd_worker] : Worker spawn successful. Acknowledging back to monitor [2019-02-23 21:36:46.844623] D [master(worker /gfs1-data/brick):101:gmaster_builder] : setting up change detection mode mode=xsync [2019-02-23 21:36:46.844768] D [monitor(monitor):271:monitor] Monitor: worker(/gfs1-data/brick) connected [2019-02-23 21:36:46.846079] D [master(worker /gfs1-data/brick):101:gmaster_builder] : setting up change detection mode mode=changelog [2019-02-23 21:36:46.847300] D [master(worker /gfs1-data/brick):101:gmaster_builder] : setting up change detection mode mode=changeloghistory [2019-02-23 21:36:46.884938] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957806.8848307 version() ... 
[2019-02-23 21:36:46.885751] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957806.8848307 version -> 1.0 [2019-02-23 21:36:46.886019] D [master(worker /gfs1-data/brick):774:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick [2019-02-23 21:36:46.886212] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957806.8861625 init() ... [2019-02-23 21:36:46.892709] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957806.8861625 init -> None [2019-02-23 21:36:46.892794] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957806.892774 register('/gfs1-data/brick', '/var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick', '/var/log/glusterfs/geo-replication/gfs1_media03_gfs1/changes-gfs1-data-brick.log', 8, 5) ... [2019-02-23 21:36:48.896220] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957806.892774 register -> None [2019-02-23 21:36:48.896590] D [master(worker /gfs1-data/brick):774:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick [2019-02-23 21:36:48.896823] D [master(worker /gfs1-data/brick):774:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick [2019-02-23 21:36:48.897012] D [master(worker /gfs1-data/brick):774:setup_working_dir] _GMaster: changelog working dir /var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick [2019-02-23 21:36:48.897159] I [master(worker /gfs1-data/brick):1603:register] _GMaster: Working dir path=/var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick [2019-02-23 21:36:48.897512] I [resource(worker /gfs1-data/brick):1271:service_loop] GLUSTER: Register time time=1550957808 [2019-02-23 21:36:48.898130] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140322604570368:1550957808.898032 keep_alive(None,) ... [2019-02-23 21:36:48.907820] D [master(worker /gfs1-data/brick):536:crawlwrap] _GMaster: primary master with volume id f720f1cb-16de-47a4-b1da-49d348736b53 ... [2019-02-23 21:36:48.932170] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140322604570368:1550957808.898032 keep_alive -> 1 [2019-02-23 21:36:49.77565] I [gsyncdstatus(worker /gfs1-data/brick):281:set_active] GeorepStatus: Worker Status Change status=Active [2019-02-23 21:36:49.201132] I [gsyncdstatus(worker /gfs1-data/brick):253:set_worker_crawl_status] GeorepStatus: Crawl Status Change status=History Crawl [2019-02-23 21:36:49.201822] I [master(worker /gfs1-data/brick):1517:crawl] _GMaster: starting history crawl turns=1 stime=(1550858209, 637241) etime=1550957809 entry_stime=None [2019-02-23 21:36:49.202147] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957809.202051 history('/gfs1-data/brick/.glusterfs/changelogs', 1550858209, 1550957809, 3) ... [2019-02-23 21:36:49.203344] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957809.202051 history -> (0, 1550957807) [2019-02-23 21:36:49.203582] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957809.2035315 history_scan() ... 
[2019-02-23 21:36:49.204280] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957809.2035315 history_scan -> 1 [2019-02-23 21:36:49.204572] D [repce(worker /gfs1-data/brick):196:push] RepceClient: call 22733:140323447641920:1550957809.2045026 history_getchanges() ... [2019-02-23 21:36:49.205424] D [repce(worker /gfs1-data/brick):216:__call__] RepceClient: call 22733:140323447641920:1550957809.2045026 history_getchanges -> ['/var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick/.history/.processing/CHANGELOG.1550858215'] [2019-02-23 21:36:49.205678] I [master(worker /gfs1-data/brick):1546:crawl] _GMaster: slave's time stime=(1550858209, 637241) [2019-02-23 21:36:49.205953] D [master(worker /gfs1-data/brick):1454:changelogs_batch_process] _GMaster: processing changes batch=['/var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick/.history/.processing/CHANGELOG.1550858215'] [2019-02-23 21:36:49.206196] D [master(worker /gfs1-data/brick):1289:process] _GMaster: processing change changelog=/var/lib/misc/gluster/gsyncd/gfs1_media03_gfs1/gfs1-data-brick/.history/.processing/CHANGELOG.1550858215 [2019-02-23 21:36:49.206844] D [master(worker /gfs1-data/brick):1170:process_change] _GMaster: entries: [] [2019-02-23 21:36:49.295979] E [syncdutils(worker /gfs1-data/brick):338:log_raise_exception] : FAIL: Traceback (most recent call last): File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py", line 322, in main func(args) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/subcmds.py", line 82, in subcmd_worker local.service_loop(remote) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/resource.py", line 1277, in service_loop g3.crawlwrap(oneshot=True) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 599, in crawlwrap self.crawl() File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1555, in crawl self.changelogs_batch_process(changes) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1455, in changelogs_batch_process self.process(batch) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1290, in process self.process_change(change, done, retry) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/master.py", line 1229, in process_change st = lstat(go[0]) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 564, in lstat return errno_wrap(os.lstat, [e], [ENOENT], [ESTALE, EBUSY]) File "/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/syncdutils.py", line 546, in errno_wrap return call(*arg) OSError: [Errno 22] Invalid argument: '.gfid/00000000-0000-0000-0000-000000000001' [2019-02-23 21:36:49.323695] I [repce(agent /gfs1-data/brick):97:service_loop] RepceServer: terminating on reaching EOF. 
[2019-02-23 21:36:49.849243] I [monitor(monitor):278:monitor] Monitor: worker died in startup phase brick=/gfs1-data/brick [2019-02-23 21:36:49.896026] I [gsyncdstatus(monitor):248:set_worker_status] GeorepStatus: Worker Status Change status=Faulty [SLAVE SERVER GEO REP LOG] root at media03:/var/log/glusterfs/geo-replication-slaves/gfs1_media03_gfs1# cat gsyncd.log [2019-02-23 21:39:10.407784] W [gsyncd(slave media01/gfs1-data/brick):304:main] : Session config file not exists, using the default config path=/var/lib/glusterd/geo-replication/gfs1_media03_gfs1/gsyncd.conf [2019-02-23 21:39:10.414549] I [resource(slave media01/gfs1-data/brick):1085:connect] GLUSTER: Mounting gluster volume locally... [2019-02-23 21:39:10.472665] D [resource(slave media01/gfs1-data/brick):859:inhibit] MountbrokerMounter: auxiliary glusterfs mount in place [2019-02-23 21:39:11.555885] D [resource(slave media01/gfs1-data/brick):926:inhibit] MountbrokerMounter: Lazy umount done: /var/mountbroker-root/mb_hive/mntBkK4D5 [2019-02-23 21:39:11.556459] D [resource(slave media01/gfs1-data/brick):933:inhibit] MountbrokerMounter: auxiliary glusterfs mount prepared [2019-02-23 21:39:11.556585] I [resource(slave media01/gfs1-data/brick):1108:connect] GLUSTER: Mounted gluster volume duration=1.1420 [2019-02-23 21:39:11.556830] I [resource(slave media01/gfs1-data/brick):1135:service_loop] GLUSTER: slave listening [2019-02-23 21:39:15.55945] I [repce(slave media01/gfs1-data/brick):97:service_loop] RepceServer: terminating on reaching EOF. 6/ OS and Gluster Info [MASTER OS INFO] root at media01:/var/run/gluster# lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 18.04.2 LTS Release: 18.04 Codename: bionic [SLAVE OS INFO] root at media03:~# lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 18.04.2 LTS Release: 18.04 Codename: bionic [MASTER GLUSTER VERSION] root at media01:/var/run/gluster# glusterfs --version glusterfs 5.3 Repository revision: git://git.gluster.org/glusterfs.git Copyright (c) 2006-2016 Red Hat, Inc. GlusterFS comes with ABSOLUTELY NO WARRANTY. It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation. [SLAVE GLUSTER VERSION] root at media03:~# glusterfs --version glusterfs 5.3 Repository revision: git://git.gluster.org/glusterfs.git Copyright (c) 2006-2016 Red Hat, Inc. GlusterFS comes with ABSOLUTELY NO WARRANTY. It is licensed to you under your choice of the GNU Lesser General Public License, version 3 or any later version (LGPLv3 or later), or the GNU General Public License, version 2 (GPLv2), in all cases as published by the Free Software Foundation. 
7/ Master and Slave Servers Config [MASTER /etc/glusterfs/glusterd.vol] root at media01:/var/run/gluster# cat /etc/glusterfs/glusterd.vol volume management type mgmt/glusterd option working-directory /var/lib/glusterd option transport-type socket,rdma option transport.socket.keepalive-time 10 option transport.socket.keepalive-interval 2 option transport.socket.read-fail-log off option ping-timeout 0 option event-threads 1 option rpc-auth-allow-insecure on # option lock-timer 180 # option transport.address-family inet6 # option base-port 49152 # option max-port 65535 end-volume [SLAVE /etc/glusterfs/glusterd.vol] root at media03:~# cat /etc/glusterfs/glusterd.vol volume management type mgmt/glusterd option working-directory /var/lib/glusterd option transport-type socket,rdma option transport.socket.keepalive-time 10 option transport.socket.keepalive-interval 2 option transport.socket.read-fail-log off option ping-timeout 0 option event-threads 1 option mountbroker-root /var/mountbroker-root option geo-replication-log-group geo-group option mountbroker-geo-replication.geo-user gfs2,gfs1 option rpc-auth-allow-insecure on # option lock-timer 180 # option transport.address-family inet6 # option base-port 49152 # option max-port 65535 [MASTER /etc/glusterfs/gsyncd.conf] root at media01:/var/run/gluster# cat /etc/glusterfs/gsyncd.conf [__meta__] version = 4.0 [master-bricks] configurable=false [slave-bricks] configurable=false [master-volume-id] configurable=false [slave-volume-id] configurable=false [master-replica-count] configurable=false type=int value=1 [master-disperse-count] configurable=false type=int value=1 [glusterd-workdir] value = /var/lib/glusterd [gluster-logdir] value = /var/log/glusterfs [gluster-rundir] value = /var/run/gluster [gsyncd-miscdir] value = /var/lib/misc/gluster/gsyncd [stime-xattr-prefix] value= [checkpoint] value=0 help=Set Checkpoint validation=unixtime type=int [gluster-cli-options] value= help=Gluster CLI Options [pid-file] value=${gluster_rundir}/gsyncd-${master}-${primary_slave_host}-${slavevol}.pid configurable=false template = true help=PID file path [state-file] value=${glusterd_workdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/monitor.status configurable=false template=true help=Status File path [georep-session-working-dir] value=${glusterd_workdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/ template=true help=Session Working directory configurable=false [access-mount] value=false type=bool validation=bool help=Do not lazy unmount the master volume. This allows admin to access the mount for debugging. [slave-access-mount] value=false type=bool validation=bool help=Do not lazy unmount the slave volume. This allows admin to access the mount for debugging. [isolated-slaves] value= help=List of Slave nodes which are isolated [changelog-batch-size] # Max size of Changelogs to process per batch, Changelogs Processing is # not limited by the number of changelogs but instead based on # size of the changelog file, One sample changelog file size was 145408 # with ~1000 CREATE and ~1000 DATA. 5 such files in one batch is 727040 # If geo-rep worker crashes while processing a batch, it has to retry only # that batch since stime will get updated after each batch. value=727040 help=Max size of Changelogs to process per batch. type=int [slave-timeout] value=120 type=int help=Timeout in seconds for Slave Gsyncd. If no activity from master for this timeout, Slave gsyncd will be disconnected. Set Timeout to zero to skip this check. 
[connection-timeout] value=60 type=int help=Timeout for mounts [replica-failover-interval] value=1 type=int help=Minimum time interval in seconds for passive worker to become Active [changelog-archive-format] value=%%Y%%m help=Processed changelogs will be archived in working directory. Pattern for archive file [use-meta-volume] value=false type=bool help=Use this to set Active Passive mode to meta-volume. [meta-volume-mnt] value=/var/run/gluster/shared_storage help=Meta Volume or Shared Volume mount path [allow-network] value= [change-interval] value=5 type=int [use-tarssh] value=false type=bool help=Use sync-mode as tarssh [remote-gsyncd] value=/usr/lib/x86_64-linux-gnu/glusterfs/gsyncd help=If SSH keys are not secured with gsyncd prefix then use this configuration to set the actual path of gsyncd(Usually /usr/libexec/glusterfs/gsyncd) [gluster-command-dir] value=/usr/sbin help=Directory where Gluster binaries exist on master [slave-gluster-command-dir] value=/usr/sbin help=Directory where Gluster binaries exist on slave [gluster-params] value = aux-gfid-mount acl help=Parameters for Gluster Geo-rep mount in Master [slave-gluster-params] value = aux-gfid-mount acl help=Parameters for Gluster Geo-rep mount in Slave [ignore-deletes] value = false type=bool help=Do not sync deletes in Slave [special-sync-mode] # tunables for failover/failback mechanism: # None - gsyncd behaves as normal # blind - gsyncd works with xtime pairs to identify # candidates for synchronization # wrapup - same as normal mode but does not assign # xtimes to orphaned files # see crawl() for usage of the above tunables value = help= [gfid-conflict-resolution] value = true validation=bool type=bool help=Disables automatic gfid conflict resolution while syncing [working-dir] value = ${gsyncd_miscdir}/${master}_${primary_slave_host}_${slavevol}/ template=true configurable=false help=Working directory for storing Changelogs [change-detector] value=changelog help=Change detector validation=choice allowed_values=changelog,xsync [cli-log-file] value=${gluster_logdir}/geo-replication/cli.log template=true configurable=false [cli-log-level] value=DEBUG help=Set CLI Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [log-file] value=${gluster_logdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/gsyncd.log configurable=false template=true [changelog-log-file] value=${gluster_logdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/changes-${local_id}.log configurable=false template=true [gluster-log-file] value=${gluster_logdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/mnt-${local_id}.log template=true configurable=false [slave-log-file] value=${gluster_logdir}/geo-replication-slaves/${master}_${primary_slave_host}_${slavevol}/gsyncd.log template=true configurable=false [slave-gluster-log-file] value=${gluster_logdir}/geo-replication-slaves/${master}_${primary_slave_host}_${slavevol}/mnt-${master_node}-${master_brick_id}.log template=true configurable=false [slave-gluster-log-file-mbr] value=${gluster_logdir}/geo-replication-slaves/${master}_${primary_slave_host}_${slavevol}/mnt-mbr-${master_node}-${master_brick_id}.log template=true configurable=false [log-level] value=DEBUG help=Set Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [gluster-log-level] value=DEBUG help=Set Gluster mount Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [changelog-log-level] value=DEBUG help=Set Changelog Log Level validation=choice 
allowed_values=ERROR,INFO,WARNING,DEBUG [slave-log-level] value=DEBUG help=Set Slave Gsyncd Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [slave-gluster-log-level] value=DEBUG help=Set Slave Gluster mount Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [ssh-port] value=2202 validation=int help=Set SSH port type=int [ssh-command] value=ssh help=Set ssh binary path validation=execpath [tar-command] value=tar help=Set tar command path validation=execpath [ssh-options] value = -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i ${glusterd_workdir}/geo-replication/secret.pem template=true [ssh-options-tar] value = -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i ${glusterd_workdir}/geo-replication/tar_ssh.pem template=true [gluster-command] value=gluster help=Set gluster binary path validation=execpath [sync-jobs] value=3 help=Number of Syncer jobs validation=minmax min=1 max=100 type=int [rsync-command] value=rsync help=Set rsync command path validation=execpath [rsync-options] value= [rsync-ssh-options] value= [rsync-opt-ignore-missing-args] value=true type=bool [rsync-opt-existing] value=true type=bool [log-rsync-performance] value=false help=Log Rsync performance validation=bool type=bool [use-rsync-xattrs] value=false type=bool [sync-xattrs] value=true type=bool [sync-acls] value=true type=bool [max-rsync-retries] value=10 type=int [state_socket_unencoded] # Unused, For backward compatibility value= [SLAVE /etc/glusterfs/gsyncd.conf] root at media03:~# cat /etc/glusterfs/gsyncd.conf [__meta__] version = 4.0 [master-bricks] configurable=false [slave-bricks] configurable=false [master-volume-id] configurable=false [slave-volume-id] configurable=false [master-replica-count] configurable=false type=int value=1 [master-disperse-count] configurable=false type=int value=1 [glusterd-workdir] value = /var/lib/glusterd [gluster-logdir] value = /var/log/glusterfs [gluster-rundir] value = /var/run/gluster [gsyncd-miscdir] value = /var/lib/misc/gluster/gsyncd [stime-xattr-prefix] value= [checkpoint] value=0 help=Set Checkpoint validation=unixtime type=int [gluster-cli-options] value= help=Gluster CLI Options [pid-file] value=${gluster_rundir}/gsyncd-${master}-${primary_slave_host}-${slavevol}.pid configurable=false template = true help=PID file path [state-file] value=${glusterd_workdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/monitor.status configurable=false template=true help=Status File path [georep-session-working-dir] value=${glusterd_workdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/ template=true help=Session Working directory configurable=false [access-mount] value=false type=bool validation=bool help=Do not lazy unmount the master volume. This allows admin to access the mount for debugging. [slave-access-mount] value=false type=bool validation=bool help=Do not lazy unmount the slave volume. This allows admin to access the mount for debugging. [isolated-slaves] value= help=List of Slave nodes which are isolated [changelog-batch-size] # Max size of Changelogs to process per batch, Changelogs Processing is # not limited by the number of changelogs but instead based on # size of the changelog file, One sample changelog file size was 145408 # with ~1000 CREATE and ~1000 DATA. 5 such files in one batch is 727040 # If geo-rep worker crashes while processing a batch, it has to retry only # that batch since stime will get updated after each batch. 
value=727040 help=Max size of Changelogs to process per batch. type=int [slave-timeout] value=120 type=int help=Timeout in seconds for Slave Gsyncd. If no activity from master for this timeout, Slave gsyncd will be disconnected. Set Timeout to zero to skip this check. [connection-timeout] value=60 type=int help=Timeout for mounts [replica-failover-interval] value=1 type=int help=Minimum time interval in seconds for passive worker to become Active [changelog-archive-format] value=%%Y%%m help=Processed changelogs will be archived in working directory. Pattern for archive file [use-meta-volume] value=false type=bool help=Use this to set Active Passive mode to meta-volume. [meta-volume-mnt] value=/var/run/gluster/shared_storage help=Meta Volume or Shared Volume mount path [allow-network] value= [change-interval] value=5 type=int [use-tarssh] value=false type=bool help=Use sync-mode as tarssh [remote-gsyncd] value=/usr/lib/x86_64-linux-gnu/glusterfs/gsyncd help=If SSH keys are not secured with gsyncd prefix then use this configuration to set the actual path of gsyncd(Usually /usr/libexec/glusterfs/gsyncd) [gluster-command-dir] value=/usr/sbin help=Directory where Gluster binaries exist on master [slave-gluster-command-dir] value=/usr/sbin help=Directory where Gluster binaries exist on slave [gluster-params] value = aux-gfid-mount acl help=Parameters for Gluster Geo-rep mount in Master [slave-gluster-params] value = aux-gfid-mount acl help=Parameters for Gluster Geo-rep mount in Slave [ignore-deletes] value = false type=bool help=Do not sync deletes in Slave [special-sync-mode] # tunables for failover/failback mechanism: # None - gsyncd behaves as normal # blind - gsyncd works with xtime pairs to identify # candidates for synchronization # wrapup - same as normal mode but does not assign # xtimes to orphaned files # see crawl() for usage of the above tunables value = help= [gfid-conflict-resolution] value = true validation=bool type=bool help=Disables automatic gfid conflict resolution while syncing [working-dir] value = ${gsyncd_miscdir}/${master}_${primary_slave_host}_${slavevol}/ template=true configurable=false help=Working directory for storing Changelogs [change-detector] value=changelog help=Change detector validation=choice allowed_values=changelog,xsync [cli-log-file] value=${gluster_logdir}/geo-replication/cli.log template=true configurable=false [cli-log-level] value=DEBUG help=Set CLI Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [log-file] value=${gluster_logdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/gsyncd.log configurable=false template=true [changelog-log-file] value=${gluster_logdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/changes-${local_id}.log configurable=false template=true [gluster-log-file] value=${gluster_logdir}/geo-replication/${master}_${primary_slave_host}_${slavevol}/mnt-${local_id}.log template=true configurable=false [slave-log-file] value=${gluster_logdir}/geo-replication-slaves/${master}_${primary_slave_host}_${slavevol}/gsyncd.log template=true configurable=false [slave-gluster-log-file] value=${gluster_logdir}/geo-replication-slaves/${master}_${primary_slave_host}_${slavevol}/mnt-${master_node}-${master_brick_id}.log template=true configurable=false [slave-gluster-log-file-mbr] value=${gluster_logdir}/geo-replication-slaves/${master}_${primary_slave_host}_${slavevol}/mnt-mbr-${master_node}-${master_brick_id}.log template=true configurable=false [log-level] value=DEBUG help=Set Log Level 
validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [gluster-log-level] value=DEBUG help=Set Gluster mount Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [changelog-log-level] value=DEBUG help=Set Changelog Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [slave-log-level] value=DEBUG help=Set Slave Gsyncd Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [slave-gluster-log-level] value=DEBUG help=Set Slave Gluster mount Log Level validation=choice allowed_values=ERROR,INFO,WARNING,DEBUG [ssh-port] value=2202 validation=int help=Set SSH port type=int [ssh-command] value=ssh help=Set ssh binary path validation=execpath [tar-command] value=tar help=Set tar command path validation=execpath [ssh-options] value = -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i ${glusterd_workdir}/geo-replication/secret.pem template=true [ssh-options-tar] value = -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i ${glusterd_workdir}/geo-replication/tar_ssh.pem template=true [gluster-command] value=gluster help=Set gluster binary path validation=execpath [sync-jobs] value=3 help=Number of Syncer jobs validation=minmax min=1 max=100 type=int [rsync-command] value=rsync help=Set rsync command path validation=execpath [rsync-options] value= [rsync-ssh-options] value= [rsync-opt-ignore-missing-args] value=true type=bool [rsync-opt-existing] value=true type=bool [log-rsync-performance] value=false help=Log Rsync performance validation=bool type=bool [use-rsync-xattrs] value=false type=bool [sync-xattrs] value=true type=bool [sync-acls] value=true type=bool [max-rsync-retries] value=10 type=int [state_socket_unencoded] # Unused, For backward compatibility value= 8/ Master volume status root at media01:/var/run/gluster# gluster volume status Status of volume: gfs1 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick media01:/gfs1-data/brick 49153 0 Y 8366 Brick media02:/gfs1-data/brick 49153 0 Y 5560 Self-heal Daemon on localhost N/A N/A Y 9170 Bitrot Daemon on localhost N/A N/A Y 9186 Scrubber Daemon on localhost N/A N/A Y 9212 Self-heal Daemon on media02 N/A N/A Y 6034 Bitrot Daemon on media02 N/A N/A Y 6050 Scrubber Daemon on media02 N/A N/A Y 6076 Task Status of Volume gfs1 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: gfs2 Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick media01:/gfs2-data/brick 49154 0 Y 8460 Brick media02:/gfs2-data/brick 49154 0 Y 5650 Self-heal Daemon on localhost N/A N/A Y 9170 Bitrot Daemon on localhost N/A N/A Y 9186 Scrubber Daemon on localhost N/A N/A Y 9212 Self-heal Daemon on media02 N/A N/A Y 6034 Bitrot Daemon on media02 N/A N/A Y 6050 Scrubber Daemon on media02 N/A N/A Y 6076 Task Status of Volume gfs2 ------------------------------------------------------------------------------ There are no active volume tasks Status of volume: gluster_shared_storage Gluster process TCP Port RDMA Port Online Pid ------------------------------------------------------------------------------ Brick media02:/var/lib/glusterd/ss_brick 49152 0 Y 2767 Brick media01:/var/lib/glusterd/ss_brick 49152 0 Y 3288 Self-heal Daemon on localhost N/A N/A Y 9170 Self-heal Daemon on media02 N/A N/A Y 6034 Task Status of Volume gluster_shared_storage 
------------------------------------------------------------------------------ There are no active volume tasks 9/ Master gluster config root at media01:/var/run/gluster# gluster volume info Volume Name: gfs1 Type: Replicate Volume ID: f720f1cb-16de-47a4-b1da-49d348736b53 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: media01:/gfs1-data/brick Brick2: media02:/gfs1-data/brick Options Reconfigured: geo-replication.ignore-pid-check: on diagnostics.count-fop-hits: on diagnostics.latency-measurement: on changelog.changelog: on geo-replication.indexing: on encryption.data-key-size: 512 encryption.master-key: /var/lib/glusterd/vols/gfs1/gfs1-encryption.key performance.open-behind: off performance.write-behind: off performance.quick-read: off features.encryption: on server.ssl: on client.ssl: on performance.client-io-threads: off nfs.disable: on transport.address-family: inet features.utime: on performance.ctime-invalidation: on cluster.lookup-optimize: on cluster.self-heal-daemon: on server.allow-insecure: on cluster.ensure-durability: on cluster.nufa: enable auth.allow: * auth.ssl-allow: * features.bitrot: on features.scrub: Active cluster.enable-shared-storage: enable Volume Name: gfs2 Type: Replicate Volume ID: 3b506d7f-26cc-47e1-85f0-5e4047b3a526 Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: media01:/gfs2-data/brick Brick2: media02:/gfs2-data/brick Options Reconfigured: geo-replication.ignore-pid-check: on diagnostics.count-fop-hits: on diagnostics.latency-measurement: on changelog.changelog: on geo-replication.indexing: on encryption.data-key-size: 512 encryption.master-key: /var/lib/glusterd/vols/gfs2/gfs2-encryption.key performance.open-behind: off performance.write-behind: off performance.quick-read: off features.encryption: on server.ssl: on client.ssl: on performance.client-io-threads: off nfs.disable: on transport.address-family: inet features.utime: on performance.ctime-invalidation: on cluster.lookup-optimize: on cluster.self-heal-daemon: on server.allow-insecure: on cluster.ensure-durability: on cluster.nufa: enable auth.allow: * auth.ssl-allow: * features.bitrot: on features.scrub: Active cluster.enable-shared-storage: enable Volume Name: gluster_shared_storage Type: Replicate Volume ID: 1aa8c5c9-a950-490a-8e7f-486d06fe68fa Status: Started Snapshot Count: 0 Number of Bricks: 1 x 2 = 2 Transport-type: tcp Bricks: Brick1: media02:/var/lib/glusterd/ss_brick Brick2: media01:/var/lib/glusterd/ss_brick Options Reconfigured: transport.address-family: inet nfs.disable: on performance.client-io-threads: off cluster.enable-shared-storage: enable 10/ Slave gluster config root at media03:~# gluster volume info Volume Name: gfs1 Type: Distribute Volume ID: 45f73890-72f2-48a7-84e5-3bc87d995b62 Status: Started Snapshot Count: 0 Number of Bricks: 1 Transport-type: tcp Bricks: Brick1: media03:/gfs1-data/brick Options Reconfigured: features.scrub: Active features.bitrot: on auth.ssl-allow: * auth.allow: * cluster.nufa: enable cluster.ensure-durability: on server.allow-insecure: on cluster.lookup-optimize: on performance.ctime-invalidation: on features.utime: on transport.address-family: inet nfs.disable: on client.ssl: on server.ssl: on features.encryption: on performance.quick-read: off performance.write-behind: off performance.open-behind: off encryption.master-key: /var/lib/glusterd/vols/gfs1/gfs1-encryption.key encryption.data-key-size: 512 geo-replication.indexing: on 
diagnostics.latency-measurement: on diagnostics.count-fop-hits: on features.shard: disable Volume Name: gfs2 Type: Distribute Volume ID: 98f4619a-c0c8-4fa0-b467-98ada511375a Status: Started Snapshot Count: 0 Number of Bricks: 1 Transport-type: tcp Bricks: Brick1: media03:/gfs2-data/brick Options Reconfigured: features.scrub: Active features.bitrot: on auth.ssl-allow: * auth.allow: * cluster.nufa: enable cluster.ensure-durability: on server.allow-insecure: on cluster.lookup-optimize: on performance.ctime-invalidation: on features.utime: on transport.address-family: inet nfs.disable: on client.ssl: on server.ssl: on features.encryption: on performance.quick-read: off performance.write-behind: off performance.open-behind: off encryption.master-key: /var/lib/glusterd/vols/gfs2/gfs2-encryption.key encryption.data-key-size: 512 geo-replication.indexing: on diagnostics.latency-measurement: on diagnostics.count-fop-hits: on features.shard: disable -Boubacar Cisse -------------- next part -------------- An HTML attachment was scrubbed... URL: From amye at redhat.com Tue Feb 26 00:47:26 2019 From: amye at redhat.com (Amye Scavarda) Date: Mon, 25 Feb 2019 16:47:26 -0800 Subject: [Gluster-users] Code of Conduct Update Message-ID: We've updated the code of conduct for Gluster to be clearer; it's now based on the Contributor Covenant 1.4. (https://www.contributor-covenant.org/version/1/4/code-of-conduct.html) This is the same Code of Conduct that many other communities have adopted (https://www.contributor-covenant.org/adopters). You can find the code of conduct at https://www.gluster.org/legal-page/code-of-conduct/ Feel free to email the Technical Leadership Council (tlc@) with questions; our code of conduct is designed to make participation in the community easy. -- Amye Scavarda | amye at redhat.com | Gluster Community Lead From hunter86_bg at yahoo.com Tue Feb 26 04:53:35 2019 From: hunter86_bg at yahoo.com (Strahil) Date: Tue, 26 Feb 2019 06:53:35 +0200 Subject: [Gluster-users] Gluster and bonding Message-ID: Hi Alex, As per the following ( https://community.cisco.com/t5/switching/lacp-load-balancing-in-2-switches-part-of-3750-stack-switch/td-p/2268111 ) your switches need to be stacked in order to support LACP with your setup. Yet, I'm not sure if balance-alb will work with 2 separate switches - maybe some special configuration is needed ?!? As far as I know gluster can have multiple IPs matched to a single peer, but I'm not sure if having 2 separate networks will be used as active-backup or active-active. Someone more experienced should jump in. Best Regards, Strahil Nikolov On Feb 25, 2019 12:43, Alex K wrote: > > Hi All, > > I was asking if it is possible to have the two separate cables connected to two different physical switches. When trying mode6 or mode1 in this setup gluster was refusing to start the volumes, giving me "transport endpoint is not connected". > > server1: cable1 ---------------- switch1 --------------------- server2: cable1 > | > server1: cable2 ---------------- switch2 --------------------- server2: cable2 > > Both switches are connected with each other also. This is done to achieve redundancy for the switches. > When disconnecting cable2 from both servers, then gluster was happy. > What could be the problem? 
> > Thanx, > Alex > > > On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego wrote: >> >> Hi, >> >> We use bonding mode 6 (balance-alb) for GlusterFS traffic >> >> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 >>> >>> Preferred bonding mode for Red Hat Gluster Storage client is mode 6 (balance-alb), this allows client to transmit writes in parallel on separate NICs much of the time. >> >> Regards, >> >> Jorick Astrego >> >> On 2/25/19 5:41 AM, Dmitry Melekhov wrote: >>> >>> 23.02.2019 19:54, Alex K ?????: >>>> >>>> Hi all, >>>> >>>> I have a replica 3 setup where each server was configured with a dual interfaces in mode 6 bonding. All cables were connected to one common network switch. >>>> >>>> To add redundancy to the switch, and avoid being a single point of failure, I connected each second cable of each server to a second switch. This turned out to not function as gluster was refusing to start the volume logging "transport endpoint is disconnected" although all nodes were able to reach each other (ping) in the storage network. I switched the mode to mode 1 (active/passive) and initially it worked but following a reboot of all cluster same issue appeared. Gluster is not starting the volumes. >>>> >>>> Isn't active/passive supposed to work like that? Can one have such redundant network setup or are there any other recommended approaches? >>>> >>> >>> Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no doubt, best way. >>> >>> >>>> Thanx, >>>> Alex >>>> >>>> _______________________________________________ >>>> >>>> Gluster-users mailing list >>>> >>>> Gluster-users at gluster.org >>>> >>>> https://lists.gluster.org/mailman/listinfo/gluster-users >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From kdhananj at redhat.com Tue Feb 26 07:01:17 2019 From: kdhananj at redhat.com (Krutika Dhananjay) Date: Tue, 26 Feb 2019 12:31:17 +0530 Subject: [Gluster-users] [ovirt-users] Tracking down high writes in GlusterFS volume In-Reply-To: References: Message-ID: On Fri, Feb 15, 2019 at 12:30 AM Jayme wrote: > Running an oVirt 4.3 HCI 3-way replica cluster with SSD backed storage. > I've noticed that my SSD writes (smart Total_LBAs_Written) are quite high > on one particular drive. Specifically I've noticed one volume is much much > higher total bytes written than others (despite using less overall space). > Writes are higher on one particular volume? Or did one brick witness more writes than its two replicas within the same volume? Could you share the volume info output of the affected volume plus the name of the affected brick if at all the issue is with one single brick? Also, did you check if the volume was undergoing any heals (`gluster volume heal info`)? -Krutika My volume is writing over 1TB of data per day (by my manual calculation, > and with glusterfs profiling) and wearing my SSDs quickly, how can I best > determine which VM or process is at fault here? > > There are 5 low use VMs using the volume in question. I'm attempting to > track iostats on each of the vm's individually but so far I'm not seeing > anything obvious that would account for 1TB of writes per day that the > gluster volume is reporting. 
> _______________________________________________ > Users mailing list -- users at ovirt.org > To unsubscribe send an email to users-leave at ovirt.org > Privacy Statement: https://www.ovirt.org/site/privacy-policy/ > oVirt Code of Conduct: > https://www.ovirt.org/community/about/community-guidelines/ > List Archives: > https://lists.ovirt.org/archives/list/users at ovirt.org/message/OZHZXQS4GUPPJXOZSBTO6X5ZL6CATFXK/ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishpaliwal at gmail.com Tue Feb 26 12:17:05 2019 From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL) Date: Tue, 26 Feb 2019 17:47:05 +0530 Subject: [Gluster-users] Version uplift query Message-ID: Hi, Currently we are using Glusterfs 3.7.6 and thinking to switch on Glusterfs 4.1 or 5.0, when I see there are too much code changes between these version, could you please let us know, is there any compatibility issue when we uplift any of the new mentioned version? Regards Abhishek -------------- next part -------------- An HTML attachment was scrubbed... URL: From rightkicktech at gmail.com Tue Feb 26 14:14:54 2019 From: rightkicktech at gmail.com (Alex K) Date: Tue, 26 Feb 2019 16:14:54 +0200 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: Thank you to all for your suggestions. I came here since only gluster was having issues to start. Ping and other networking services were showing everything fine, so I guess there is sth at gluster that does not like what I tried to do. Unfortunately I have this system in production and I cannot experiment. It was a customer request to add redundancy to the switch and I went with what I assumed would work. I guess I have to have the switches stacked, but the current ones do not support this. They are just simple managed switches. Multiple IPs per peers could be a solution. I will search a little more and in case I have sth I will get back. On Tue, Feb 26, 2019 at 6:52 AM Strahil wrote: > Hi Alex, > > As per the following ( ttps:// > community.cisco.com/t5/switching/lacp-load-balancing-in-2-switches-part-of-3750-stack-switch/td-p/2268111 > ) your switches need to be stacked in order to support lacp with your setup. > Yet, I'm not sure if balance-alb will work with 2 separate switches - > maybe some special configuration is needed ?!? > As far as I know gluster can have multiple IPs matched to a single peer, > but I'm not sure if having 2 separate networks will be used as > active-backup or active-active. > > Someone more experienced should jump in. > > Best Regards, > Strahil Nikolov > On Feb 25, 2019 12:43, Alex K wrote: > > Hi All, > > I was asking if it is possible to have the two separate cables connected > to two different physical switched. When trying mode6 or mode1 in this > setup gluster was refusing to start the volumes, giving me "transport > endpoint is not connected". > > server1: cable1 ---------------- switch1 --------------------- server2: > cable1 > | > server1: cable2 ---------------- switch2 --------------------- server2: > cable2 > > Both switches are connected with each other also. This is done to achieve > redundancy for the switches. > When disconnecting cable2 from both servers, then gluster was happy. > What could be the problem? 
> > Thanx, > Alex > > > On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego > wrote: > > Hi, > > We use bonding mode 6 (balance-alb) for GlusterFS traffic > > > > https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 > > Preferred bonding mode for Red Hat Gluster Storage client is mode 6 > (balance-alb), this allows client to transmit writes in parallel on > separate NICs much of the time. > > Regards, > > Jorick Astrego > On 2/25/19 5:41 AM, Dmitry Melekhov wrote: > > 23.02.2019 19:54, Alex K ?????: > > Hi all, > > I have a replica 3 setup where each server was configured with a dual > interfaces in mode 6 bonding. All cables were connected to one common > network switch. > > To add redundancy to the switch, and avoid being a single point of > failure, I connected each second cable of each server to a second switch. > This turned out to not function as gluster was refusing to start the volume > logging "transport endpoint is disconnected" although all nodes were able > to reach each other (ping) in the storage network. I switched the mode to > mode 1 (active/passive) and initially it worked but following a reboot of > all cluster same issue appeared. Gluster is not starting the volumes. > > Isn't active/passive supposed to work like that? Can one have such > redundant network setup or are there any other recommended approaches? > > > Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no > doubt, best way. > > > Thanx, > Alex > > _______________________________________________ > Gluster-users mailing listGluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From vincent at epicenergy.ca Tue Feb 26 15:29:54 2019 From: vincent at epicenergy.ca (Vincent Royer) Date: Tue, 26 Feb 2019 07:29:54 -0800 Subject: [Gluster-users] Gluster and bonding In-Reply-To: References: Message-ID: I do what you're trying to do, with bonds going to two independent switches (not stacked) from each host. Works fine for the engine, vms network, migration, etc, but I don't use Gluster. Either cable can be pulled or either switch rebooted without disturbing Ovirt. Switches are connected to each other as you've described. I'd like to get to the bottom of why it's not working with Gluster, because I plan the same as you. On Tue, Feb 26, 2019, 6:15 AM Alex K wrote: > > Thank you to all for your suggestions. > > I came here since only gluster was having issues to start. Ping and other > networking services were showing everything fine, so I guess there is sth > at gluster that does not like what I tried to do. > Unfortunately I have this system in production and I cannot experiment. It > was a customer request to add redundancy to the switch and I went with what > I assumed would work. > I guess I have to have the switches stacked, but the current ones do not > support this. They are just simple managed switches. > > Multiple IPs per peers could be a solution. > I will search a little more and in case I have sth I will get back. > > On Tue, Feb 26, 2019 at 6:52 AM Strahil wrote: > >> Hi Alex, >> >> As per the following ( ttps:// >> community.cisco.com/t5/switching/lacp-load-balancing-in-2-switches-part-of-3750-stack-switch/td-p/2268111 >> ) your switches need to be stacked in order to support lacp with your setup. >> Yet, I'm not sure if balance-alb will work with 2 separate switches - >> maybe some special configuration is needed ?!? 
>> As far as I know gluster can have multiple IPs matched to a single peer, >> but I'm not sure if having 2 separate networks will be used as >> active-backup or active-active. >> >> Someone more experienced should jump in. >> >> Best Regards, >> Strahil Nikolov >> On Feb 25, 2019 12:43, Alex K wrote: >> >> Hi All, >> >> I was asking if it is possible to have the two separate cables connected >> to two different physical switched. When trying mode6 or mode1 in this >> setup gluster was refusing to start the volumes, giving me "transport >> endpoint is not connected". >> >> server1: cable1 ---------------- switch1 --------------------- server2: >> cable1 >> | >> server1: cable2 ---------------- switch2 --------------------- server2: >> cable2 >> >> Both switches are connected with each other also. This is done to achieve >> redundancy for the switches. >> When disconnecting cable2 from both servers, then gluster was happy. >> What could be the problem? >> >> Thanx, >> Alex >> >> >> On Mon, Feb 25, 2019 at 11:32 AM Jorick Astrego >> wrote: >> >> Hi, >> >> We use bonding mode 6 (balance-alb) for GlusterFS traffic >> >> >> >> https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.4/html/administration_guide/network4 >> >> Preferred bonding mode for Red Hat Gluster Storage client is mode 6 >> (balance-alb), this allows client to transmit writes in parallel on >> separate NICs much of the time. >> >> Regards, >> >> Jorick Astrego >> On 2/25/19 5:41 AM, Dmitry Melekhov wrote: >> >> 23.02.2019 19:54, Alex K ?????: >> >> Hi all, >> >> I have a replica 3 setup where each server was configured with a dual >> interfaces in mode 6 bonding. All cables were connected to one common >> network switch. >> >> To add redundancy to the switch, and avoid being a single point of >> failure, I connected each second cable of each server to a second switch. >> This turned out to not function as gluster was refusing to start the volume >> logging "transport endpoint is disconnected" although all nodes were able >> to reach each other (ping) in the storage network. I switched the mode to >> mode 1 (active/passive) and initially it worked but following a reboot of >> all cluster same issue appeared. Gluster is not starting the volumes. >> >> Isn't active/passive supposed to work like that? Can one have such >> redundant network setup or are there any other recommended approaches? >> >> >> Yes, we use lacp, I guess this is mode 4 ( we use teamd ), it is, no >> doubt, best way. >> >> >> Thanx, >> Alex >> >> _______________________________________________ >> Gluster-users mailing listGluster-users at gluster.org https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jorick at netbulae.eu Wed Feb 27 09:20:16 2019 From: jorick at netbulae.eu (Jorick Astrego) Date: Wed, 27 Feb 2019 10:20:16 +0100 Subject: [Gluster-users] Gluster and bonding In-Reply-To: <61d22968-576c-cc32-e351-fa48e862ec8a@netvel.net> References: <9B82193A-15EF-43F6-882B-BA8E7862770A@gmail.com> <063ec31e-69c0-295b-8a63-89e56073192b@kernelpanic.ru> <61d22968-576c-cc32-e351-fa48e862ec8a@netvel.net> Message-ID: <538f10fe-1d9c-26b5-d178-da08498fd0a3@netbulae.eu> On 2/25/19 6:01 PM, Alvin Starr wrote: > On 2/25/19 11:48 AM, Boris Zhmurov wrote: >> On 25/02/2019 14:24, Jorick Astrego wrote: >>> >>> Hi, >>> >>> Have not measured it as we have been running this way for years now >>> and haven't experienced any problems with "transport endpoint is not >>> connected? with this setup. >>> >> >> Hello, >> >> Jorick, how often (during those years) did your NICs break? >> > Over the years(30) I have had problems with bad ports on switches. > > With some manufactures? being worse than others. > > Hi, Have been doing 25 years of infra and I have seen really everything break and PDSS (People Doing Stupid Sh*t) The NIC's these days are excellent quality and we never had one break in 10 years. We do a lot of testing before we put it into production and we have had some other issues that have the same effect (switch faillure, someone pulling the wrong cable, LACP mis configuration). Actually we went from LACP with stacked switches to balance-alb. There were more configuration errors with LACP and we had stacked switches getting messed up. We now have separate L2 storage switches. And the GlusterFS developers think it's the best bonding mode for their application, so you don't have to take my word for it ;-) https://docs.gluster.org/en/latest/Administrator%20Guide/Network%20Configurations%20Techniques/ *best bonding mode for Gluster client is mode 6 (balance-alb)*, this allows client to transmit writes in parallel on separate NICs much of the time. A peak throughput of 750 MB/s on writes from a single client was observed with bonding mode 6 on 2 10-GbE NICs with jumbo frames. That's 1.5 GB/s of network traffic. another way to balance both transmit and receive traffic is bonding mode 4 (802.3ad) but this requires switch configuration (trunking commands) still another way to load balance is bonding mode 2 (balance-xor) with option "xmit_hash_policy=layer3+4". The bonding modes 6 and 2 will not improve single-connection throughput, but improve aggregate throughput across all connections. Regards, Jorick Astrego Met vriendelijke groet, With kind regards, Jorick Astrego Netbulae Virtualization Experts ---------------- Tel: 053 20 30 270 info at netbulae.eu Staalsteden 4-3A KvK 08198180 Fax: 053 20 30 271 www.netbulae.eu 7547 TA Enschede BTW NL821234584B01 ---------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishpaliwal at gmail.com Wed Feb 27 11:15:11 2019 From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL) Date: Wed, 27 Feb 2019 16:45:11 +0530 Subject: [Gluster-users] Version uplift query In-Reply-To: References: Message-ID: Hi, Could you please update on this and also let us know what is GlusterD2 (as it is under development in 5.0 release), so it is ok to uplift to 5.0? 
Regards, Abhishek On Tue, Feb 26, 2019 at 5:47 PM ABHISHEK PALIWAL wrote: > Hi, > > Currently we are using Glusterfs 3.7.6 and thinking to switch on Glusterfs > 4.1 or 5.0, when I see there are too much code changes between these > version, could you please let us know, is there any compatibility issue > when we uplift any of the new mentioned version? > > Regards > Abhishek > -- Regards Abhishek Paliwal -------------- next part -------------- An HTML attachment was scrubbed... URL: From atumball at redhat.com Wed Feb 27 15:11:25 2019 From: atumball at redhat.com (Amar Tumballi Suryanarayan) Date: Wed, 27 Feb 2019 20:41:25 +0530 Subject: [Gluster-users] Version uplift query In-Reply-To: References: Message-ID: GlusterD2 is not yet called out for standalone deployments. You can happily update to glusterfs-5.x (recommend you to wait for glusterfs-5.4 which is already tagged, and waiting for packages to be built). Regards, Amar On Wed, Feb 27, 2019 at 4:46 PM ABHISHEK PALIWAL wrote: > Hi, > > Could you please update on this and also let us know what is GlusterD2 > (as it is under development in 5.0 release), so it is ok to uplift to 5.0? > > Regards, > Abhishek > > On Tue, Feb 26, 2019 at 5:47 PM ABHISHEK PALIWAL > wrote: > >> Hi, >> >> Currently we are using Glusterfs 3.7.6 and thinking to switch on >> Glusterfs 4.1 or 5.0, when I see there are too much code changes between >> these version, could you please let us know, is there any compatibility >> issue when we uplift any of the new mentioned version? >> >> Regards >> Abhishek >> > > > -- > > > > > Regards > Abhishek Paliwal > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- Amar Tumballi (amarts) -------------- next part -------------- An HTML attachment was scrubbed... URL: From ingo at fischer-ka.de Wed Feb 27 15:24:32 2019 From: ingo at fischer-ka.de (Ingo Fischer) Date: Wed, 27 Feb 2019 16:24:32 +0100 Subject: [Gluster-users] Version uplift query In-Reply-To: References: Message-ID: Hi Amar, sorry to jump into this thread with an connected question. When installing via "apt-get" and so using debian packages and also systemd to start/stop glusterd is the online upgrade process from 3.x/4.x to 5.x still needed as described at https://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_4.1/ ? Especially because there is manual killall and such for processes handled by systemd in my case. Or is there an other upgrade guide or recommendations for use on ubuntu? Would systemctl stop glusterd, then using apt-get update with changes sources and a reboot be enough? Ingo Am 27.02.19 um 16:11 schrieb Amar Tumballi Suryanarayan: > GlusterD2 is not yet called out for standalone deployments. > > You can happily update to glusterfs-5.x (recommend you to wait for > glusterfs-5.4 which is already tagged, and waiting for packages to be > built). > > Regards, > Amar > > On Wed, Feb 27, 2019 at 4:46 PM ABHISHEK PALIWAL > > wrote: > > Hi, > > Could? you please update on this and also let us know what is > GlusterD2 (as it is under development in 5.0 release), so it is ok > to uplift to 5.0? 
> > Regards, > Abhishek > > On Tue, Feb 26, 2019 at 5:47 PM ABHISHEK PALIWAL > > wrote: > > Hi, > > Currently we are using Glusterfs 3.7.6 and thinking to switch on > Glusterfs 4.1 or 5.0, when I see there are too much code changes > between these version, could you please let us know, is there > any compatibility issue when we uplift any of the new mentioned > version? > > Regards > Abhishek > > > > -- > > > > > Regards > Abhishek Paliwal > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Amar Tumballi (amarts) > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users > From tmgreene364 at gmail.com Wed Feb 27 17:37:12 2019 From: tmgreene364 at gmail.com (Tami Greene) Date: Wed, 27 Feb 2019 12:37:12 -0500 Subject: [Gluster-users] Added bricks with wrong name and now need to remove them without destroying volume. Message-ID: Yes, I broke it. Now I need help fixing it. I have an existing Gluster Volume, spread over 16 bricks and 4 servers; 1.5P of space with 49% currently used. Added an additional 4 bricks and a server as we expect a large influx of data in the next 4 to 6 months. The system had been established by my predecessor, who is no longer here. First solo addition of bricks to gluster. Everything went smoothly until "gluster volume add-brick Volume newserver:/bricks/dataX/vol.name" (I don't have the exact response as I worked on this for almost 5 hours last night): unable to add-brick as "it is already mounted" or something to that effect. Double checked my instructions, the name of the bricks. Everything seemed correct. Tried to add again adding "force." Again, "unable to add-brick". Because of the keyword (in my mind) "mounted" in the error, I checked /etc/fstab, where the name of the mount point is simply /bricks/dataX. This convention was the same across all servers, so I thought I had discovered an error in my notes and changed the name to newserver:/bricks/dataX. Still had to use force, but the bricks were added. Restarted the gluster volume vol.name. No errors. Rebooted; but /vol.name did not mount on reboot as the /etc/fstab instructs. So I attempted to mount manually and discovered I had a big mess on my hands. "Transport endpoint not connected" in addition to other messages. Discovered an issue between certificates and the auth.ssl-allow list because of the hostname of the new server. I made the correction and /vol.name mounted. However, df -h indicated the 4 new bricks were not being seen, as 400T were missing from what should have been available. Thankfully, I could add something to vol.name on one machine and see it on another machine, and I wrongly assumed the volume was operational, even if the new bricks were not recognized. So I tried to correct the main issue by running "gluster volume remove vol.name newserver/bricks/dataX/", received a prompt that data will be migrated before the brick is removed, continue (or something to that), and I started the process, thinking this won't take long because there is no data. After 10 minutes and no apparent progress on the process, I did panic, thinking worst case scenario: it is writing zeros over my data. 
This morning, while all clients and servers can access /vol.name; not all of the data is present. I can find it under cluster, but users cannot reach it. I am, again, assume it is because of the 4 bricks that have been added, but aren't really a part of the volume because of their incorrect name. So ? how do I proceed from here. 1. Remove the 4 empty bricks from the volume without damaging data. 2. Correctly clear any metadata about these 4 bricks ONLY so they may be added correctly. If this doesn't restore the volume to full functionality, I'll write another post if I cannot find answer in the notes or on line. Tami-- -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmgreene364 at gmail.com Wed Feb 27 19:24:13 2019 From: tmgreene364 at gmail.com (Tami Greene) Date: Wed, 27 Feb 2019 14:24:13 -0500 Subject: [Gluster-users] Fwd: Added bricks with wrong name and now need to remove them without destroying volume. In-Reply-To: References: Message-ID: I sent this and realized I hadn't registered. My apologies for the duplication Subject: Added bricks with wrong name and now need to remove them without destroying volume. To: Yes, I broke it. Now I need help fixing it. I have an existing Gluster Volume, spread over 16 bricks and 4 servers; 1.5P space with 49% currently used . Added an additional 4 bricks and server as we expect large influx of data in the next 4 to 6 months. The system had been established by my predecessor, who is no longer here. First solo addition of bricks to gluster. Everything went smoothly until ?gluster volume add-brick Volume newserver:/bricks/dataX/vol.name" (I don?t have the exact response as I worked on this for almost 5 hours last night) Unable to add-brick as ?it is already mounted? or something to that affect. Double checked my instructions, the name of the bricks. Everything seemed correct. Tried to add again adding ?force.? Again, ?unable to add-brick? Because of the keyword (in my mind) ?mounted? in the error, I checked /etc/fstab, where the name of the mount point is simply /bricks/dataX. This convention was the same across all servers, so I thought I had discovered an error in my notes and changed the name to newserver:/bricks/dataX. Still had to use force, but the bricks were added. Restarted the gluster volume vol.name. No errors. Rebooted; but /vol.name did not mount on reboot as the /etc/fstab instructs. So I attempted to mount manually and discovered a had a big mess on my hands. ?Transport endpoint not connected? in addition to other messages. Discovered an issue between certificates and the auth.ssl-allow list because of the hostname of new server. I made correction and /vol.name mounted. However, df -h indicated the 4 new bricks were not being seen as 400T were missing from what should have been available. Thankfully, I could add something to vol.name on one machine and see it on another machine and I wrongly assumed the volume was operational, even if the new bricks were not recognized. So I tried to correct the main issue by, gluster volume remove vol.name newserver/bricks/dataX/ received prompt, data will be migrated before brick is removed continue (or something to that) and I started the process, think this won?t take long because there is no data. After 10 minutes and no apparent progress on the process, I did panic, thinking worse case scenario ? it is writing zeros over my data. 
Executed the stop command and there was still no progress, and I assume it was due to no data on the brick to be remove causing the program to hang. Found the process ID and killed it. This morning, while all clients and servers can access /vol.name; not all of the data is present. I can find it under cluster, but users cannot reach it. I am, again, assume it is because of the 4 bricks that have been added, but aren't really a part of the volume because of their incorrect name. So ? how do I proceed from here. 1. Remove the 4 empty bricks from the volume without damaging data. 2. Correctly clear any metadata about these 4 bricks ONLY so they may be added correctly. If this doesn't restore the volume to full functionality, I'll write another post if I cannot find answer in the notes or on line. Tami-- -- Tami -------------- next part -------------- An HTML attachment was scrubbed... URL: From jim.kinney at gmail.com Wed Feb 27 20:59:13 2019 From: jim.kinney at gmail.com (Jim Kinney) Date: Wed, 27 Feb 2019 15:59:13 -0500 Subject: [Gluster-users] Fwd: Added bricks with wrong name and now need to remove them without destroying volume. In-Reply-To: References: Message-ID: <45e6627e633e5acb1fb96fe6a457df1827187679.camel@gmail.com> Keep in mind that gluster is a metadata process. It doesn't really touch the actual volume files. The exception is the .glusterfs and .trashcan folders in the very top directory of the gluster volume. When you create a gluster volume from brick, it doesn't format the filesystem. It uses what's already there. So if you remove a volume and all it's bricks, you've not deleted data. That said, if you are using anything but replicated bricks, which is what I use exclusively for my needs, then reassembling them into a new volume with correct name might be tricky. By listing the bricks in the exact same order as they were listed when creating the wrong name volume when making the correct named volume, it should use the same method to put data on the drives as previously and not scramble anything. On Wed, 2019-02-27 at 14:24 -0500, Tami Greene wrote: > I sent this and realized I hadn't registered. My apologies for the > duplication > Subject: Added bricks with wrong name and now need to remove them > without destroying volume. > To: > > > > Yes, I broke it. Now I need help fixing it. > > I have an existing Gluster Volume, spread over 16 bricks and 4 > servers; 1.5P space with 49% currently used . Added an additional 4 > bricks and server as we expect large influx of data in the next 4 to > 6 months. The system had been established by my predecessor, who is > no longer here. > > First solo addition of bricks to gluster. > > Everything went smoothly until ?gluster volume add-brick Volume > newserver:/bricks/dataX/vol.name" > (I don?t have the exact response as I worked on this > for almost 5 hours last night) Unable to add-brick as ?it is already > mounted? or something to that affect. > Double checked my instructions, the name of the > bricks. Everything seemed correct. Tried to add again adding > ?force.? Again, ?unable to add-brick? > Because of the keyword (in my mind) ?mounted? in the > error, I checked /etc/fstab, where the name of the mount point is > simply /bricks/dataX. > This convention was the same across all servers, so I thought I had > discovered an error in my notes and changed the name to > newserver:/bricks/dataX. > Still had to use force, but the bricks were added. > Restarted the gluster volume vol.name. No errors. 
> Rebooted; but /vol.name did not mount on reboot as the /etc/fstab > instructs. So I attempted to mount manually and discovered a had a > big mess on my hands. > ?Transport endpoint not connected? in > addition to other messages. > Discovered an issue between certificates and the > auth.ssl-allow list because of the hostname of new server. I made > correction and /vol.name mounted. > However, df -h indicated the 4 new bricks were not > being seen as 400T were missing from what should have been available. > > Thankfully, I could add something to vol.name on one machine and see > it on another machine and I wrongly assumed the volume was > operational, even if the new bricks were not recognized. > So I tried to correct the main issue by, > gluster volume remove vol.name > newserver/bricks/dataX/ > received prompt, data will be migrated before brick > is removed continue (or something to that) and I started the process, > think this won?t take long because there is no data. > After 10 minutes and no apparent progress on the > process, I did panic, thinking worse case scenario ? it is writing > zeros over my data. > Executed the stop command and there was still no > progress, and I assume it was due to no data on the brick to be > remove causing the program to hang. > Found the process ID and killed it. > > > This morning, while all clients and servers can access /vol.name; not > all of the data is present. I can find it under cluster, but users > cannot reach it. I am, again, assume it is because of the 4 bricks > that have been added, but aren't really a part of the volume because > of their incorrect name. > > So ? how do I proceed from here. > > > 1. Remove the 4 empty bricks from the volume without damaging data. > 2. Correctly clear any metadata about these 4 bricks ONLY so they may > be added correctly. > > > If this doesn't restore the volume to full functionality, I'll write > another post if I cannot find answer in the notes or on line. > > Tami-- > > > > _______________________________________________Gluster-users mailing > listGluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -- James P. Kinney III Every time you stop a school, you will have to build a jail. What you gain at one end you lose at the other. It's like feeding a dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 Mark Twain http://heretothereideas.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From tmgreene364 at gmail.com Wed Feb 27 21:56:06 2019 From: tmgreene364 at gmail.com (Tami Greene) Date: Wed, 27 Feb 2019 16:56:06 -0500 Subject: [Gluster-users] Fwd: Added bricks with wrong name and now need to remove them without destroying volume. In-Reply-To: <45e6627e633e5acb1fb96fe6a457df1827187679.camel@gmail.com> References: <45e6627e633e5acb1fb96fe6a457df1827187679.camel@gmail.com> Message-ID: That makes sense. System is made of four data arrays with a hardware RAID 6 and then the distributed volume on top. I honestly don't know how that works, but the previous administrator said we had redundancy. I'm hoping there is a way to bypass the safeguard of migrating data when removing a brick from the volume, which in my beginner's mind, would be a straight-forward way of remedying the problem. Hopefully once the empty bricks are removed, the "missing" data will be visible again in the volume. On Wed, Feb 27, 2019 at 3:59 PM Jim Kinney wrote: > Keep in mind that gluster is a metadata process. 
It doesn't really touch > the actual volume files. The exception is the .glusterfs and .trashcan > folders in the very top directory of the gluster volume. > > When you create a gluster volume from brick, it doesn't format the > filesystem. It uses what's already there. > > So if you remove a volume and all it's bricks, you've not deleted data. > > That said, if you are using anything but replicated bricks, which is what > I use exclusively for my needs, then reassembling them into a new volume > with correct name might be tricky. By listing the bricks in the exact same > order as they were listed when creating the wrong name volume when making > the correct named volume, it should use the same method to put data on the > drives as previously and not scramble anything. > > On Wed, 2019-02-27 at 14:24 -0500, Tami Greene wrote: > > I sent this and realized I hadn't registered. My apologies for the > duplication > > Subject: Added bricks with wrong name and now need to remove them without > destroying volume. > To: > > > > Yes, I broke it. Now I need help fixing it. > > > > I have an existing Gluster Volume, spread over 16 bricks and 4 servers; > 1.5P space with 49% currently used . Added an additional 4 bricks and > server as we expect large influx of data in the next 4 to 6 months. The > system had been established by my predecessor, who is no longer here. > > > > First solo addition of bricks to gluster. > > > > Everything went smoothly until ?gluster volume add-brick Volume > newserver:/bricks/dataX/vol.name" > > (I don?t have the exact response as I worked on this for > almost 5 hours last night) Unable to add-brick as ?it is already mounted? > or something to that affect. > > Double checked my instructions, the name of the bricks. > Everything seemed correct. Tried to add again adding ?force.? Again, > ?unable to add-brick? > > Because of the keyword (in my mind) ?mounted? in the > error, I checked /etc/fstab, where the name of the mount point is simply > /bricks/dataX. > > This convention was the same across all servers, so I thought I had > discovered an error in my notes and changed the name to > newserver:/bricks/dataX. > > Still had to use force, but the bricks were added. > > Restarted the gluster volume vol.name. No errors. > > Rebooted; but /vol.name did not mount on reboot as the /etc/fstab > instructs. So I attempted to mount manually and discovered a had a big mess > on my hands. > > ?Transport endpoint not connected? in > addition to other messages. > > Discovered an issue between certificates and the > auth.ssl-allow list because of the hostname of new server. I made > correction and /vol.name mounted. > > However, df -h indicated the 4 new bricks were not being > seen as 400T were missing from what should have been available. > > > > Thankfully, I could add something to vol.name on one machine and see it > on another machine and I wrongly assumed the volume was operational, even > if the new bricks were not recognized. > > So I tried to correct the main issue by, > > gluster volume remove vol.name newserver/bricks/dataX/ > > received prompt, data will be migrated before brick is > removed continue (or something to that) and I started the process, think > this won?t take long because there is no data. > > After 10 minutes and no apparent progress on the process, > I did panic, thinking worse case scenario ? it is writing zeros over my > data. 
> > Executed the stop command and there was still no progress, > and I assume it was due to no data on the brick to be remove causing the > program to hang. > > Found the process ID and killed it. > > > This morning, while all clients and servers can access /vol.name; not all > of the data is present. I can find it under cluster, but users cannot > reach it. I am, again, assume it is because of the 4 bricks that have been > added, but aren't really a part of the volume because of their incorrect > name. > > > > So ? how do I proceed from here. > > > 1. Remove the 4 empty bricks from the volume without damaging data. > > 2. Correctly clear any metadata about these 4 bricks ONLY so they may be > added correctly. > > > If this doesn't restore the volume to full functionality, I'll write > another post if I cannot find answer in the notes or on line. > > > Tami-- > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-users > > -- > > James P. Kinney III Every time you stop a school, you will have to build a > jail. What you gain at one end you lose at the other. It's like feeding a > dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 Mark > Twain http://heretothereideas.blogspot.com/ > > -- Tami -------------- next part -------------- An HTML attachment was scrubbed... URL: From jim.kinney at gmail.com Wed Feb 27 22:08:16 2019 From: jim.kinney at gmail.com (Jim Kinney) Date: Wed, 27 Feb 2019 17:08:16 -0500 Subject: [Gluster-users] Fwd: Added bricks with wrong name and now need to remove them without destroying volume. In-Reply-To: References: <45e6627e633e5acb1fb96fe6a457df1827187679.camel@gmail.com> Message-ID: <0d79899c7a34939e64bff8e61ec29f2ac553f50b.camel@gmail.com> It sounds like new bricks were added and they mounted over the top of existing bricks. gluster volume status detail This will give the data you need to find where the real files are. You can look in those to see the data should be intact. Stopping the gluster volume is a good first step. Then as a safe guard you can unmount the filesystem that holds the data you want. Now remove the gluster volume(s) that are the problem - all if needed. Remount the real filesystem(s). Create new gluster volumes with correct names. On Wed, 2019-02-27 at 16:56 -0500, Tami Greene wrote: > That makes sense. System is made of four data arrays with a hardware > RAID 6 and then the distributed volume on top. I honestly don't know > how that works, but the previous administrator said we had > redundancy. I'm hoping there is a way to bypass the safeguard of > migrating data when removing a brick from the volume, which in my > beginner's mind, would be a straight-forward way of remedying the > problem. Hopefully once the empty bricks are removed, the "missing" > data will be visible again in the volume. > > On Wed, Feb 27, 2019 at 3:59 PM Jim Kinney > wrote: > > Keep in mind that gluster is a metadata process. It doesn't really > > touch the actual volume files. The exception is the .glusterfs and > > .trashcan folders in the very top directory of the gluster volume. > > > > When you create a gluster volume from brick, it doesn't format the > > filesystem. It uses what's already there. > > > > So if you remove a volume and all it's bricks, you've not deleted > > data. 
> > > > That said, if you are using anything but replicated bricks, which > > is what I use exclusively for my needs, then reassembling them into > > a new volume with correct name might be tricky. By listing the > > bricks in the exact same order as they were listed when creating > > the wrong name volume when making the correct named volume, it > > should use the same method to put data on the drives as previously > > and not scramble anything. > > > > On Wed, 2019-02-27 at 14:24 -0500, Tami Greene wrote: > > > I sent this and realized I hadn't registered. My apologies for > > > the duplication > > > Subject: Added bricks with wrong name and now need to remove them > > > without destroying volume. > > > To: > > > > > > > > > > > > Yes, I broke it. Now I need help fixing it. > > > > > > I have an existing Gluster Volume, spread over 16 bricks and 4 > > > servers; 1.5P space with 49% currently used . Added an > > > additional 4 bricks and server as we expect large influx of data > > > in the next 4 to 6 months. The system had been established by my > > > predecessor, who is no longer here. > > > > > > First solo addition of bricks to gluster. > > > > > > Everything went smoothly until ?gluster volume add-brick Volume > > > newserver:/bricks/dataX/vol.name" > > > (I don?t have the exact response as I worked on > > > this for almost 5 hours last night) Unable to add-brick as ?it is > > > already mounted? or something to that affect. > > > Double checked my instructions, the name of the > > > bricks. Everything seemed correct. Tried to add again adding > > > ?force.? Again, ?unable to add-brick? > > > Because of the keyword (in my mind) ?mounted? in > > > the error, I checked /etc/fstab, where the name of the mount > > > point is simply /bricks/dataX. > > > This convention was the same across all servers, so I thought I > > > had discovered an error in my notes and changed the name to > > > newserver:/bricks/dataX. > > > Still had to use force, but the bricks were added. > > > Restarted the gluster volume vol.name. No errors. > > > Rebooted; but /vol.name did not mount on reboot as the /etc/fstab > > > instructs. So I attempted to mount manually and discovered a had > > > a big mess on my hands. > > > ?Transport endpoint not > > > connected? in addition to other messages. > > > Discovered an issue between certificates and the > > > auth.ssl-allow list because of the hostname of new server. I > > > made correction and /vol.name mounted. > > > However, df -h indicated the 4 new bricks were > > > not being seen as 400T were missing from what should have been > > > available. > > > > > > Thankfully, I could add something to vol.name on one machine and > > > see it on another machine and I wrongly assumed the volume was > > > operational, even if the new bricks were not recognized. > > > So I tried to correct the main issue by, > > > gluster volume remove vol.name > > > newserver/bricks/dataX/ > > > received prompt, data will be migrated before > > > brick is removed continue (or something to that) and I started > > > the process, think this won?t take long because there is no data. > > > After 10 minutes and no apparent progress on the > > > process, I did panic, thinking worse case scenario ? it is > > > writing zeros over my data. > > > Executed the stop command and there was still no > > > progress, and I assume it was due to no data on the brick to be > > > remove causing the program to hang. > > > Found the process ID and killed it. 
> > > > > > > > > This morning, while all clients and servers can access /vol.name; > > > not all of the data is present. I can find it under cluster, but > > > users cannot reach it. I am, again, assume it is because of the > > > 4 bricks that have been added, but aren't really a part of the > > > volume because of their incorrect name. > > > > > > So ? how do I proceed from here. > > > > > > > > > 1. Remove the 4 empty bricks from the volume without damaging > > > data. > > > 2. Correctly clear any metadata about these 4 bricks ONLY so they > > > may be added correctly. > > > > > > > > > If this doesn't restore the volume to full functionality, I'll > > > write another post if I cannot find answer in the notes or on > > > line. > > > > > > Tami-- > > > > > > > > > > > > _______________________________________________Gluster-users > > > mailing listGluster-users at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-users > > -- > > James P. Kinney III > > Every time you stop a school, you will have to build a jail. What > > yougain at one end you lose at the other. It's like feeding a dog > > on hisown tail. It won't fatten the dog.- Speech 11/23/1900 Mark > > Twain > > http://heretothereideas.blogspot.com/ > > > > -- James P. Kinney III Every time you stop a school, you will have to build a jail. What you gain at one end you lose at the other. It's like feeding a dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 Mark Twain http://heretothereideas.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgurusid at redhat.com Thu Feb 28 02:46:15 2019 From: pgurusid at redhat.com (Poornima Gurusiddaiah) Date: Thu, 28 Feb 2019 08:16:15 +0530 Subject: [Gluster-users] Version uplift query In-Reply-To: References: Message-ID: On Wed, Feb 27, 2019, 11:52 PM Ingo Fischer wrote: > Hi Amar, > > sorry to jump into this thread with an connected question. > > When installing via "apt-get" and so using debian packages and also > systemd to start/stop glusterd is the online upgrade process from > 3.x/4.x to 5.x still needed as described at > https://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_4.1/ ? > > Especially because there is manual killall and such for processes > handled by systemd in my case. Or is there an other upgrade guide or > recommendations for use on ubuntu? > > Would systemctl stop glusterd, then using apt-get update with changes > sources and a reboot be enough? > I think you would still need to kill the process manually, AFAIK systemd only stops glusterd not the other Gluster processes like glusterfsd(bricks), heal process etc. Reboot of system is not required, if that's what you meant by reboot. Also you need follow all the other steps mentioned, for the cluster to work smoothly after upgrade. Especially the steps to perform heal are important. Regards, Poornima > Ingo > > Am 27.02.19 um 16:11 schrieb Amar Tumballi Suryanarayan: > > GlusterD2 is not yet called out for standalone deployments. > > > > You can happily update to glusterfs-5.x (recommend you to wait for > > glusterfs-5.4 which is already tagged, and waiting for packages to be > > built). > > > > Regards, > > Amar > > > > On Wed, Feb 27, 2019 at 4:46 PM ABHISHEK PALIWAL > > > wrote: > > > > Hi, > > > > Could you please update on this and also let us know what is > > GlusterD2 (as it is under development in 5.0 release), so it is ok > > to uplift to 5.0? 
> > > > Regards, > > Abhishek > > > > On Tue, Feb 26, 2019 at 5:47 PM ABHISHEK PALIWAL > > > wrote: > > > > Hi, > > > > Currently we are using Glusterfs 3.7.6 and thinking to switch on > > Glusterfs 4.1 or 5.0, when I see there are too much code changes > > between these version, could you please let us know, is there > > any compatibility issue when we uplift any of the new mentioned > > version? > > > > Regards > > Abhishek > > > > > > > > -- > > > > > > > > > > Regards > > Abhishek Paliwal > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > > > > > > > -- > > Amar Tumballi (amarts) > > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > https://lists.gluster.org/mailman/listinfo/gluster-users > > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From amudhan83 at gmail.com Thu Feb 28 06:10:36 2019 From: amudhan83 at gmail.com (Amudhan P) Date: Thu, 28 Feb 2019 11:40:36 +0530 Subject: [Gluster-users] Version uplift query In-Reply-To: References: Message-ID: Hi Poornima, Instead of killing process stopping volume followed by stopping service in nodes and update glusterfs. can't we follow the above step? regards Amudhan On Thu, Feb 28, 2019 at 8:16 AM Poornima Gurusiddaiah wrote: > > > On Wed, Feb 27, 2019, 11:52 PM Ingo Fischer wrote: > >> Hi Amar, >> >> sorry to jump into this thread with an connected question. >> >> When installing via "apt-get" and so using debian packages and also >> systemd to start/stop glusterd is the online upgrade process from >> 3.x/4.x to 5.x still needed as described at >> https://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_4.1/ ? >> >> Especially because there is manual killall and such for processes >> handled by systemd in my case. Or is there an other upgrade guide or >> recommendations for use on ubuntu? >> >> Would systemctl stop glusterd, then using apt-get update with changes >> sources and a reboot be enough? >> > > I think you would still need to kill the process manually, AFAIK systemd > only stops glusterd not the other Gluster processes like > glusterfsd(bricks), heal process etc. Reboot of system is not required, if > that's what you meant by reboot. Also you need follow all the other steps > mentioned, for the cluster to work smoothly after upgrade. Especially the > steps to perform heal are important. > > Regards, > Poornima > > >> Ingo >> >> Am 27.02.19 um 16:11 schrieb Amar Tumballi Suryanarayan: >> > GlusterD2 is not yet called out for standalone deployments. >> > >> > You can happily update to glusterfs-5.x (recommend you to wait for >> > glusterfs-5.4 which is already tagged, and waiting for packages to be >> > built). >> > >> > Regards, >> > Amar >> > >> > On Wed, Feb 27, 2019 at 4:46 PM ABHISHEK PALIWAL >> > > wrote: >> > >> > Hi, >> > >> > Could you please update on this and also let us know what is >> > GlusterD2 (as it is under development in 5.0 release), so it is ok >> > to uplift to 5.0? 
>> > >> > Regards, >> > Abhishek >> > >> > On Tue, Feb 26, 2019 at 5:47 PM ABHISHEK PALIWAL >> > > wrote: >> > >> > Hi, >> > >> > Currently we are using Glusterfs 3.7.6 and thinking to switch on >> > Glusterfs 4.1 or 5.0, when I see there are too much code changes >> > between these version, could you please let us know, is there >> > any compatibility issue when we uplift any of the new mentioned >> > version? >> > >> > Regards >> > Abhishek >> > >> > >> > >> > -- >> > >> > >> > >> > >> > Regards >> > Abhishek Paliwal >> > _______________________________________________ >> > Gluster-users mailing list >> > Gluster-users at gluster.org >> > https://lists.gluster.org/mailman/listinfo/gluster-users >> > >> > >> > >> > -- >> > Amar Tumballi (amarts) >> > >> > _______________________________________________ >> > Gluster-users mailing list >> > Gluster-users at gluster.org >> > https://lists.gluster.org/mailman/listinfo/gluster-users >> > >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From abhishpaliwal at gmail.com Thu Feb 28 07:10:51 2019 From: abhishpaliwal at gmail.com (ABHISHEK PALIWAL) Date: Thu, 28 Feb 2019 12:40:51 +0530 Subject: [Gluster-users] Version uplift query In-Reply-To: References: Message-ID: I am trying to build Gluster5.4 but getting below error at the time of configure conftest.c:11:28: fatal error: ac_nonexistent.h: No such file or directory Could you please help me what is the reason of the above error. Regards, Abhishek On Wed, Feb 27, 2019 at 8:42 PM Amar Tumballi Suryanarayan < atumball at redhat.com> wrote: > GlusterD2 is not yet called out for standalone deployments. > > You can happily update to glusterfs-5.x (recommend you to wait for > glusterfs-5.4 which is already tagged, and waiting for packages to be > built). > > Regards, > Amar > > On Wed, Feb 27, 2019 at 4:46 PM ABHISHEK PALIWAL > wrote: > >> Hi, >> >> Could you please update on this and also let us know what is GlusterD2 >> (as it is under development in 5.0 release), so it is ok to uplift to 5.0? >> >> Regards, >> Abhishek >> >> On Tue, Feb 26, 2019 at 5:47 PM ABHISHEK PALIWAL >> wrote: >> >>> Hi, >>> >>> Currently we are using Glusterfs 3.7.6 and thinking to switch on >>> Glusterfs 4.1 or 5.0, when I see there are too much code changes between >>> these version, could you please let us know, is there any compatibility >>> issue when we uplift any of the new mentioned version? >>> >>> Regards >>> Abhishek >>> >> >> >> -- >> >> >> >> >> Regards >> Abhishek Paliwal >> _______________________________________________ >> Gluster-users mailing list >> Gluster-users at gluster.org >> https://lists.gluster.org/mailman/listinfo/gluster-users > > > > -- > Amar Tumballi (amarts) > -- Regards Abhishek Paliwal -------------- next part -------------- An HTML attachment was scrubbed... URL: From mchangir at redhat.com Thu Feb 28 07:31:24 2019 From: mchangir at redhat.com (Milind Changire) Date: Thu, 28 Feb 2019 13:01:24 +0530 Subject: [Gluster-users] [Gluster-devel] Version uplift query In-Reply-To: References: Message-ID: you might want to check what build.log says ... especially at the very bottom Here's a hint from StackExhange . 
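For what it's worth, that particular message is often not the real failure: the ac_nonexistent.h probe is a check that configure runs on purpose, and it appears in config.log even on successful builds, so the actual cause is usually a different failed check recorded nearby. A rough way to dig it out, assuming you are configuring from the top of an unpacked glusterfs source tree (the paths and package names below are illustrative, not taken from this thread):

  grep -n "error:" config.log | head    # list the compiler errors configure recorded
  less config.log                       # locate the first failing check and its conftest snippet

If the failing check points at a missing build dependency (flex/bison, openssl, libuuid, libacl and similar -devel/-dev packages), installing it and re-running ./autogen.sh && ./configure is usually enough; in a cross-compilation environment, also verify that CC and CPP point at a working cross toolchain, since a broken preprocessor is another common way to end up staring at ac_nonexistent.h.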
On Thu, Feb 28, 2019 at 12:42 PM ABHISHEK PALIWAL wrote: > I am trying to build Gluster5.4 but getting below error at the time of > configure > > conftest.c:11:28: fatal error: ac_nonexistent.h: No such file or directory > > Could you please help me what is the reason of the above error. > > Regards, > Abhishek > > On Wed, Feb 27, 2019 at 8:42 PM Amar Tumballi Suryanarayan < > atumball at redhat.com> wrote: > >> GlusterD2 is not yet called out for standalone deployments. >> >> You can happily update to glusterfs-5.x (recommend you to wait for >> glusterfs-5.4 which is already tagged, and waiting for packages to be >> built). >> >> Regards, >> Amar >> >> On Wed, Feb 27, 2019 at 4:46 PM ABHISHEK PALIWAL >> wrote: >> >>> Hi, >>> >>> Could you please update on this and also let us know what is GlusterD2 >>> (as it is under development in 5.0 release), so it is ok to uplift to 5.0? >>> >>> Regards, >>> Abhishek >>> >>> On Tue, Feb 26, 2019 at 5:47 PM ABHISHEK PALIWAL < >>> abhishpaliwal at gmail.com> wrote: >>> >>>> Hi, >>>> >>>> Currently we are using Glusterfs 3.7.6 and thinking to switch on >>>> Glusterfs 4.1 or 5.0, when I see there are too much code changes between >>>> these version, could you please let us know, is there any compatibility >>>> issue when we uplift any of the new mentioned version? >>>> >>>> Regards >>>> Abhishek >>>> >>> >>> >>> -- >>> >>> >>> >>> >>> Regards >>> Abhishek Paliwal >>> _______________________________________________ >>> Gluster-users mailing list >>> Gluster-users at gluster.org >>> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> >> >> -- >> Amar Tumballi (amarts) >> > > > -- > > > > > Regards > Abhishek Paliwal > _______________________________________________ > Gluster-devel mailing list > Gluster-devel at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Milind -------------- next part -------------- An HTML attachment was scrubbed... URL: From peljasz at yahoo.co.uk Thu Feb 28 11:30:47 2019 From: peljasz at yahoo.co.uk (lejeczek) Date: Thu, 28 Feb 2019 11:30:47 +0000 Subject: [Gluster-users] Gluster off PyPy Message-ID: <8ef13123-ab2f-d614-4733-a1196c48c38b@yahoo.co.uk> hi everyone I'm hoping devel might be reading this, but if not - anybody tried glusterfs off PyPy? If yes and it works then what was/is the experience? many thanks, L. -------------- next part -------------- A non-text attachment was scrubbed... Name: pEpkey.asc Type: application/pgp-keys Size: 1757 bytes Desc: not available URL: From tmgreene364 at gmail.com Thu Feb 28 15:14:05 2019 From: tmgreene364 at gmail.com (Tami Greene) Date: Thu, 28 Feb 2019 10:14:05 -0500 Subject: [Gluster-users] Fwd: Added bricks with wrong name and now need to remove them without destroying volume. In-Reply-To: <0d79899c7a34939e64bff8e61ec29f2ac553f50b.camel@gmail.com> References: <45e6627e633e5acb1fb96fe6a457df1827187679.camel@gmail.com> <0d79899c7a34939e64bff8e61ec29f2ac553f50b.camel@gmail.com> Message-ID: I'm missing some information about how the cluster volume creates the metadata allowing it to see and find the data on the bricks. I've been told not to write anything to the bricks directly as the glusterfs cannot create the metadata and therefore the data doesn't exist in the cluster world. So, if I destroy the current gluster volume, leaving the data on the hardware RAID volume, correct the names of the new empty bricks, recreate the cluster volume, import bricks, how does the metadata get created so the new cluster volume can find and access the data? 
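To make sure I am picturing the recreate step correctly, the sequence I have in mind is roughly the following (the volume and brick names here are just placeholders for ours, and I have not run any of this yet):

gluster volume stop vol.name
gluster volume delete vol.name
# fix the mount points / fstab entries for the four new, still-empty bricks
gluster volume create vol.name server1:/bricks/data1/vol.name server2:/bricks/data1/vol.name ... force
gluster volume start vol.name

(listing all of the original bricks in their original order, plus the corrected new ones). I can also see that the existing bricks already carry gluster's own markers; for example, getfattr -d -m . -e hex /bricks/data1/vol.name shows a trusted.glusterfs.volume-id attribute, and each brick has a .glusterfs directory at its top level. I am guessing that is the metadata everyone tells me not to touch, and that the wrongly added empty bricks may need that attribute and directory cleared before they can be added again, but I am not certain.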
It seems like I would be laying the glusterfs on top on hardware and "hiding" the data. On Wed, Feb 27, 2019 at 5:08 PM Jim Kinney wrote: > It sounds like new bricks were added and they mounted over the top of > existing bricks. > > gluster volume status detail > > This will give the data you need to find where the real files are. You can > look in those to see the data should be intact. > > Stopping the gluster volume is a good first step. Then as a safe guard you > can unmount the filesystem that holds the data you want. Now remove the > gluster volume(s) that are the problem - all if needed. Remount the real > filesystem(s). Create new gluster volumes with correct names. > > On Wed, 2019-02-27 at 16:56 -0500, Tami Greene wrote: > > That makes sense. System is made of four data arrays with a hardware RAID > 6 and then the distributed volume on top. I honestly don't know how that > works, but the previous administrator said we had redundancy. I'm hoping > there is a way to bypass the safeguard of migrating data when removing a > brick from the volume, which in my beginner's mind, would be a > straight-forward way of remedying the problem. Hopefully once the empty > bricks are removed, the "missing" data will be visible again in the volume. > > On Wed, Feb 27, 2019 at 3:59 PM Jim Kinney wrote: > > Keep in mind that gluster is a metadata process. It doesn't really touch > the actual volume files. The exception is the .glusterfs and .trashcan > folders in the very top directory of the gluster volume. > > When you create a gluster volume from brick, it doesn't format the > filesystem. It uses what's already there. > > So if you remove a volume and all it's bricks, you've not deleted data. > > That said, if you are using anything but replicated bricks, which is what > I use exclusively for my needs, then reassembling them into a new volume > with correct name might be tricky. By listing the bricks in the exact same > order as they were listed when creating the wrong name volume when making > the correct named volume, it should use the same method to put data on the > drives as previously and not scramble anything. > > On Wed, 2019-02-27 at 14:24 -0500, Tami Greene wrote: > > I sent this and realized I hadn't registered. My apologies for the > duplication > > Subject: Added bricks with wrong name and now need to remove them without > destroying volume. > To: > > > > Yes, I broke it. Now I need help fixing it. > > > > I have an existing Gluster Volume, spread over 16 bricks and 4 servers; > 1.5P space with 49% currently used . Added an additional 4 bricks and > server as we expect large influx of data in the next 4 to 6 months. The > system had been established by my predecessor, who is no longer here. > > > > First solo addition of bricks to gluster. > > > > Everything went smoothly until ?gluster volume add-brick Volume > newserver:/bricks/dataX/vol.name" > > (I don?t have the exact response as I worked on this for > almost 5 hours last night) Unable to add-brick as ?it is already mounted? > or something to that affect. > > Double checked my instructions, the name of the bricks. > Everything seemed correct. Tried to add again adding ?force.? Again, > ?unable to add-brick? > > Because of the keyword (in my mind) ?mounted? in the > error, I checked /etc/fstab, where the name of the mount point is simply > /bricks/dataX. > > This convention was the same across all servers, so I thought I had > discovered an error in my notes and changed the name to > newserver:/bricks/dataX. 
> > Still had to use force, but the bricks were added. > > Restarted the gluster volume vol.name. No errors. > > Rebooted; but /vol.name did not mount on reboot as the /etc/fstab > instructs. So I attempted to mount manually and discovered a had a big mess > on my hands. > > ?Transport endpoint not connected? in > addition to other messages. > > Discovered an issue between certificates and the > auth.ssl-allow list because of the hostname of new server. I made > correction and /vol.name mounted. > > However, df -h indicated the 4 new bricks were not being > seen as 400T were missing from what should have been available. > > > > Thankfully, I could add something to vol.name on one machine and see it > on another machine and I wrongly assumed the volume was operational, even > if the new bricks were not recognized. > > So I tried to correct the main issue by, > > gluster volume remove vol.name newserver/bricks/dataX/ > > received prompt, data will be migrated before brick is > removed continue (or something to that) and I started the process, think > this won?t take long because there is no data. > > After 10 minutes and no apparent progress on the process, > I did panic, thinking worse case scenario ? it is writing zeros over my > data. > > Executed the stop command and there was still no progress, > and I assume it was due to no data on the brick to be remove causing the > program to hang. > > Found the process ID and killed it. > > > This morning, while all clients and servers can access /vol.name; not all > of the data is present. I can find it under cluster, but users cannot > reach it. I am, again, assume it is because of the 4 bricks that have been > added, but aren't really a part of the volume because of their incorrect > name. > > > > So ? how do I proceed from here. > > > 1. Remove the 4 empty bricks from the volume without damaging data. > > 2. Correctly clear any metadata about these 4 bricks ONLY so they may be > added correctly. > > > If this doesn't restore the volume to full functionality, I'll write > another post if I cannot find answer in the notes or on line. > > > Tami-- > > > _______________________________________________ > > Gluster-users mailing list > > Gluster-users at gluster.org > > > https://lists.gluster.org/mailman/listinfo/gluster-users > > -- > > > James P. Kinney III > > > Every time you stop a school, you will have to build a jail. What you > > gain at one end you lose at the other. It's like feeding a dog on his > > own tail. It won't fatten the dog. > > - Speech 11/23/1900 Mark Twain > > > http://heretothereideas.blogspot.com/ > > > > > -- > > James P. Kinney III Every time you stop a school, you will have to build a > jail. What you gain at one end you lose at the other. It's like feeding a > dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 Mark > Twain http://heretothereideas.blogspot.com/ > > -- Tami -------------- next part -------------- An HTML attachment was scrubbed... URL: From pgurusid at redhat.com Thu Feb 28 16:18:19 2019 From: pgurusid at redhat.com (Poornima Gurusiddaiah) Date: Thu, 28 Feb 2019 21:48:19 +0530 Subject: [Gluster-users] Fwd: Added bricks with wrong name and now need to remove them without destroying volume. 
In-Reply-To: References: <45e6627e633e5acb1fb96fe6a457df1827187679.camel@gmail.com> <0d79899c7a34939e64bff8e61ec29f2ac553f50b.camel@gmail.com> Message-ID: On Thu, Feb 28, 2019, 8:44 PM Tami Greene wrote: > I'm missing some information about how the cluster volume creates the > metadata allowing it to see and find the data on the bricks. I've been > told not to write anything to the bricks directly as the glusterfs cannot > create the metadata and therefore the data doesn't exist in the cluster > world. > > So, if I destroy the current gluster volume, leaving the data on the > hardware RAID volume, correct the names of the new empty bricks, recreate > the cluster volume, import bricks, how does the metadata get created so the > new cluster volume can find and access the data? It seems like I would be > laying the glusterfs on top on hardware and "hiding" the data. > I couldn't get all the details why it went wrong, but you can delete a Gluster volume and recreate it with the same bricks and the data should be accessible again AFAIK. Preferably create with the same volume name. Do not alter any data on the bricks, also make sure there is no valid data on the 4 bricks that were wrongly added by checking in the backend. +Atin, Sanju @Atin, Sanju, This should work right? Regards, Poornima > > On Wed, Feb 27, 2019 at 5:08 PM Jim Kinney wrote: > >> It sounds like new bricks were added and they mounted over the top of >> existing bricks. >> >> gluster volume status detail >> >> This will give the data you need to find where the real files are. You >> can look in those to see the data should be intact. >> >> Stopping the gluster volume is a good first step. Then as a safe guard >> you can unmount the filesystem that holds the data you want. Now remove the >> gluster volume(s) that are the problem - all if needed. Remount the real >> filesystem(s). Create new gluster volumes with correct names. >> >> On Wed, 2019-02-27 at 16:56 -0500, Tami Greene wrote: >> >> That makes sense. System is made of four data arrays with a hardware >> RAID 6 and then the distributed volume on top. I honestly don't know how >> that works, but the previous administrator said we had redundancy. I'm >> hoping there is a way to bypass the safeguard of migrating data when >> removing a brick from the volume, which in my beginner's mind, would be a >> straight-forward way of remedying the problem. Hopefully once the empty >> bricks are removed, the "missing" data will be visible again in the volume. >> >> On Wed, Feb 27, 2019 at 3:59 PM Jim Kinney wrote: >> >> Keep in mind that gluster is a metadata process. It doesn't really touch >> the actual volume files. The exception is the .glusterfs and .trashcan >> folders in the very top directory of the gluster volume. >> >> When you create a gluster volume from brick, it doesn't format the >> filesystem. It uses what's already there. >> >> So if you remove a volume and all it's bricks, you've not deleted data. >> >> That said, if you are using anything but replicated bricks, which is what >> I use exclusively for my needs, then reassembling them into a new volume >> with correct name might be tricky. By listing the bricks in the exact same >> order as they were listed when creating the wrong name volume when making >> the correct named volume, it should use the same method to put data on the >> drives as previously and not scramble anything. >> >> On Wed, 2019-02-27 at 14:24 -0500, Tami Greene wrote: >> >> I sent this and realized I hadn't registered. 
My apologies for the >> duplication >> >> Subject: Added bricks with wrong name and now need to remove them without >> destroying volume. >> To: >> >> >> >> Yes, I broke it. Now I need help fixing it. >> >> >> >> I have an existing Gluster Volume, spread over 16 bricks and 4 servers; >> 1.5P space with 49% currently used . Added an additional 4 bricks and >> server as we expect large influx of data in the next 4 to 6 months. The >> system had been established by my predecessor, who is no longer here. >> >> >> >> First solo addition of bricks to gluster. >> >> >> >> Everything went smoothly until ?gluster volume add-brick Volume >> newserver:/bricks/dataX/vol.name" >> >> (I don?t have the exact response as I worked on this for >> almost 5 hours last night) Unable to add-brick as ?it is already mounted? >> or something to that affect. >> >> Double checked my instructions, the name of the bricks. >> Everything seemed correct. Tried to add again adding ?force.? Again, >> ?unable to add-brick? >> >> Because of the keyword (in my mind) ?mounted? in the >> error, I checked /etc/fstab, where the name of the mount point is simply >> /bricks/dataX. >> >> This convention was the same across all servers, so I thought I had >> discovered an error in my notes and changed the name to >> newserver:/bricks/dataX. >> >> Still had to use force, but the bricks were added. >> >> Restarted the gluster volume vol.name. No errors. >> >> Rebooted; but /vol.name did not mount on reboot as the /etc/fstab >> instructs. So I attempted to mount manually and discovered a had a big mess >> on my hands. >> >> ?Transport endpoint not connected? in >> addition to other messages. >> >> Discovered an issue between certificates and the >> auth.ssl-allow list because of the hostname of new server. I made >> correction and /vol.name mounted. >> >> However, df -h indicated the 4 new bricks were not being >> seen as 400T were missing from what should have been available. >> >> >> >> Thankfully, I could add something to vol.name on one machine and see it >> on another machine and I wrongly assumed the volume was operational, even >> if the new bricks were not recognized. >> >> So I tried to correct the main issue by, >> >> gluster volume remove vol.name newserver/bricks/dataX/ >> >> received prompt, data will be migrated before brick is >> removed continue (or something to that) and I started the process, think >> this won?t take long because there is no data. >> >> After 10 minutes and no apparent progress on the process, >> I did panic, thinking worse case scenario ? it is writing zeros over my >> data. >> >> Executed the stop command and there was still no >> progress, and I assume it was due to no data on the brick to be remove >> causing the program to hang. >> >> Found the process ID and killed it. >> >> >> This morning, while all clients and servers can access /vol.name; not >> all of the data is present. I can find it under cluster, but users >> cannot reach it. I am, again, assume it is because of the 4 bricks that >> have been added, but aren't really a part of the volume because of their >> incorrect name. >> >> >> >> So ? how do I proceed from here. >> >> >> 1. Remove the 4 empty bricks from the volume without damaging data. >> >> 2. Correctly clear any metadata about these 4 bricks ONLY so they may be >> added correctly. >> >> >> If this doesn't restore the volume to full functionality, I'll write >> another post if I cannot find answer in the notes or on line. 
>> >> >> Tami-- >> >> >> _______________________________________________ >> >> Gluster-users mailing list >> >> Gluster-users at gluster.org >> >> >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -- >> >> >> James P. Kinney III >> >> >> Every time you stop a school, you will have to build a jail. What you >> >> gain at one end you lose at the other. It's like feeding a dog on his >> >> own tail. It won't fatten the dog. >> >> - Speech 11/23/1900 Mark Twain >> >> >> http://heretothereideas.blogspot.com/ >> >> >> >> >> -- >> >> James P. Kinney III Every time you stop a school, you will have to build >> a jail. What you gain at one end you lose at the other. It's like feeding a >> dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 Mark >> Twain http://heretothereideas.blogspot.com/ >> >> > > -- > Tami > _______________________________________________ > Gluster-users mailing list > Gluster-users at gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-users -------------- next part -------------- An HTML attachment was scrubbed... URL: From jim.kinney at gmail.com Thu Feb 28 19:18:32 2019 From: jim.kinney at gmail.com (Jim Kinney) Date: Thu, 28 Feb 2019 14:18:32 -0500 Subject: [Gluster-users] Fwd: Added bricks with wrong name and now need to remove them without destroying volume. In-Reply-To: References: <45e6627e633e5acb1fb96fe6a457df1827187679.camel@gmail.com> <0d79899c7a34939e64bff8e61ec29f2ac553f50b.camel@gmail.com> Message-ID: Orig file structure to share with gluster is /foo Volname is testvol Data exists in foo . You have 2 copies, one on machine a, another on b. When you create the testvol in gluster, it creates a folder /foo/.glusterfs and writes all gluster metadata there. There's config data written in gluster only space like var. When users write files into gluster volumes, gluster manages the writes to the actual filesystem in /foo on both a & b. It tracks writes in .glusterfs on both a & b. If you "un gluster", the user files in /foo on a & b are untouched. The /foo/.glusterfs folder is deleted on a &b. On February 28, 2019 10:14:05 AM EST, Tami Greene wrote: >I'm missing some information about how the cluster volume creates the >metadata allowing it to see and find the data on the bricks. I've been >told not to write anything to the bricks directly as the glusterfs >cannot >create the metadata and therefore the data doesn't exist in the cluster >world. > >So, if I destroy the current gluster volume, leaving the data on the >hardware RAID volume, correct the names of the new empty bricks, >recreate >the cluster volume, import bricks, how does the metadata get created so >the >new cluster volume can find and access the data? It seems like I would >be >laying the glusterfs on top on hardware and "hiding" the data. > > > >On Wed, Feb 27, 2019 at 5:08 PM Jim Kinney >wrote: > >> It sounds like new bricks were added and they mounted over the top of >> existing bricks. >> >> gluster volume status detail >> >> This will give the data you need to find where the real files are. >You can >> look in those to see the data should be intact. >> >> Stopping the gluster volume is a good first step. Then as a safe >guard you >> can unmount the filesystem that holds the data you want. Now remove >the >> gluster volume(s) that are the problem - all if needed. Remount the >real >> filesystem(s). Create new gluster volumes with correct names. >> >> On Wed, 2019-02-27 at 16:56 -0500, Tami Greene wrote: >> >> That makes sense. 
System is made of four data arrays with a hardware >RAID >> 6 and then the distributed volume on top. I honestly don't know how >that >> works, but the previous administrator said we had redundancy. I'm >hoping >> there is a way to bypass the safeguard of migrating data when >removing a >> brick from the volume, which in my beginner's mind, would be a >> straight-forward way of remedying the problem. Hopefully once the >empty >> bricks are removed, the "missing" data will be visible again in the >volume. >> >> On Wed, Feb 27, 2019 at 3:59 PM Jim Kinney >wrote: >> >> Keep in mind that gluster is a metadata process. It doesn't really >touch >> the actual volume files. The exception is the .glusterfs and >.trashcan >> folders in the very top directory of the gluster volume. >> >> When you create a gluster volume from brick, it doesn't format the >> filesystem. It uses what's already there. >> >> So if you remove a volume and all it's bricks, you've not deleted >data. >> >> That said, if you are using anything but replicated bricks, which is >what >> I use exclusively for my needs, then reassembling them into a new >volume >> with correct name might be tricky. By listing the bricks in the exact >same >> order as they were listed when creating the wrong name volume when >making >> the correct named volume, it should use the same method to put data >on the >> drives as previously and not scramble anything. >> >> On Wed, 2019-02-27 at 14:24 -0500, Tami Greene wrote: >> >> I sent this and realized I hadn't registered. My apologies for the >> duplication >> >> Subject: Added bricks with wrong name and now need to remove them >without >> destroying volume. >> To: >> >> >> >> Yes, I broke it. Now I need help fixing it. >> >> >> >> I have an existing Gluster Volume, spread over 16 bricks and 4 >servers; >> 1.5P space with 49% currently used . Added an additional 4 bricks >and >> server as we expect large influx of data in the next 4 to 6 months. >The >> system had been established by my predecessor, who is no longer here. >> >> >> >> First solo addition of bricks to gluster. >> >> >> >> Everything went smoothly until ?gluster volume add-brick Volume >> newserver:/bricks/dataX/vol.name" >> >> (I don?t have the exact response as I worked on this >for >> almost 5 hours last night) Unable to add-brick as ?it is already >mounted? >> or something to that affect. >> >> Double checked my instructions, the name of the >bricks. >> Everything seemed correct. Tried to add again adding ?force.? >Again, >> ?unable to add-brick? >> >> Because of the keyword (in my mind) ?mounted? in the >> error, I checked /etc/fstab, where the name of the mount point is >simply >> /bricks/dataX. >> >> This convention was the same across all servers, so I thought I had >> discovered an error in my notes and changed the name to >> newserver:/bricks/dataX. >> >> Still had to use force, but the bricks were added. >> >> Restarted the gluster volume vol.name. No errors. >> >> Rebooted; but /vol.name did not mount on reboot as the /etc/fstab >> instructs. So I attempted to mount manually and discovered a had a >big mess >> on my hands. >> >> ?Transport endpoint not connected? in >> addition to other messages. >> >> Discovered an issue between certificates and the >> auth.ssl-allow list because of the hostname of new server. I made >> correction and /vol.name mounted. >> >> However, df -h indicated the 4 new bricks were not >being >> seen as 400T were missing from what should have been available. 
>> >> >> >> Thankfully, I could add something to vol.name on one machine and see >it >> on another machine and I wrongly assumed the volume was operational, >even >> if the new bricks were not recognized. >> >> So I tried to correct the main issue by, >> >> gluster volume remove vol.name >newserver/bricks/dataX/ >> >> received prompt, data will be migrated before brick >is >> removed continue (or something to that) and I started the process, >think >> this won?t take long because there is no data. >> >> After 10 minutes and no apparent progress on the >process, >> I did panic, thinking worse case scenario ? it is writing zeros over >my >> data. >> >> Executed the stop command and there was still no >progress, >> and I assume it was due to no data on the brick to be remove causing >the >> program to hang. >> >> Found the process ID and killed it. >> >> >> This morning, while all clients and servers can access /vol.name; not >all >> of the data is present. I can find it under cluster, but users >cannot >> reach it. I am, again, assume it is because of the 4 bricks that >have been >> added, but aren't really a part of the volume because of their >incorrect >> name. >> >> >> >> So ? how do I proceed from here. >> >> >> 1. Remove the 4 empty bricks from the volume without damaging data. >> >> 2. Correctly clear any metadata about these 4 bricks ONLY so they may >be >> added correctly. >> >> >> If this doesn't restore the volume to full functionality, I'll write >> another post if I cannot find answer in the notes or on line. >> >> >> Tami-- >> >> >> _______________________________________________ >> >> Gluster-users mailing list >> >> Gluster-users at gluster.org >> >> >> https://lists.gluster.org/mailman/listinfo/gluster-users >> >> -- >> >> >> James P. Kinney III >> >> >> Every time you stop a school, you will have to build a jail. What you >> >> gain at one end you lose at the other. It's like feeding a dog on his >> >> own tail. It won't fatten the dog. >> >> - Speech 11/23/1900 Mark Twain >> >> >> http://heretothereideas.blogspot.com/ >> >> >> >> >> -- >> >> James P. Kinney III Every time you stop a school, you will have to >build a >> jail. What you gain at one end you lose at the other. It's like >feeding a >> dog on his own tail. It won't fatten the dog. - Speech 11/23/1900 >Mark >> Twain http://heretothereideas.blogspot.com/ >> >> > >-- >Tami -- Sent from my Android device with K-9 Mail. All tyopes are thumb related and reflect authenticity. -------------- next part -------------- An HTML attachment was scrubbed... URL: