[Bugs] [Bug 1674225] New: flooding of "dict is NULL" logging & crash of client process

bugzilla at redhat.com
Sun Feb 10 06:04:18 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1674225

            Bug ID: 1674225
           Summary: flooding of "dict is NULL" logging & crash of client
                    process
           Product: GlusterFS
           Version: mainline
            Status: NEW
         Component: core
          Keywords: Triaged, ZStream
          Assignee: bugs at gluster.org
          Reporter: rgowdapp at redhat.com
                CC: amgad.saleh at nokia.com, archon810 at gmail.com,
                    atumball at redhat.com, bugs at gluster.org, vpvainio at iki.fi
        Depends On: 1671603, 1313567
            Blocks: 1667103 (glusterfs-5.4)
  Target Milestone: ---
    Classification: Community



+++ This bug was initially created as a clone of Bug #1671603 +++

+++ This bug was initially created as a clone of Bug #1313567 +++

Description of problem:
following logs flood the log files

[2016-03-01 10:45:51.688339] W [dict.c:1282:dict_foreach_match]
(-->/usr/lib64/libglusterfs.so.0(dict_foreach_match+0x65) [0x7ff139e1e5d5]
-->/usr/lib64/glusterfs/3.7.8/xlator/features/index.so(+0x3950)
[0x7ff12de49950] -->/usr/lib64/libglusterfs.so.0(dict_foreach_match+0xe1)
[0x7ff139e1e651] ) 0-dict: dict|match|action is NULL [Invalid argument]

Version-Release number of selected component (if applicable):
glusterfs-3.7.8

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
I have seen the older bug
https://bugzilla.redhat.com/show_bug.cgi?id=1289893
but since I am using the latest version (3.7.8), that fix should already be
included. Could this be related to another part of index.c?

--- Additional comment from Red Hat Bugzilla Rules Engine on 2016-03-01
23:14:38 UTC ---

This bug is automatically being proposed for the current z-stream release of
Red Hat Gluster Storage 3 by setting the release flag 'rhgs-3.1.z' to '?'.

If this bug should be proposed for a different release, please manually change
the proposed release flag.

--- Additional comment from Nithya Balachandran on 2016-03-09 04:16:05 UTC ---

This looks like it refers to an upstream release (3.7.8). Changing the product
to reflect this.

--- Additional comment from evangelos on 2016-03-09 09:18:21 UTC ---

Is there any update on this?

Thank you very much!

--- Additional comment from Nithya Balachandran on 2016-03-09 10:43:09 UTC ---

Moving this to Anuradha who worked on the original patch.

--- Additional comment from evangelos on 2016-04-12 11:20:00 UTC ---

Is there any update on this?

--- Additional comment from Anuradha on 2016-06-23 10:01:53 UTC ---

Hi evangelos,

That fix was made in 3.7.5. You say you have seen the old issue. Did you
upgrade from 3.7.5 to 3.7.8 and are now seeing the problem, or was this volume
freshly created on 3.7.8?

As far as I know, all the fixes for "dict is NULL" in the index translator have
already been sent upstream. But there is a known issue when volfiles are not
updated during an upgrade. If you did upgrade the volume, could you please
provide the steps that you used to upgrade?

Also, could you verify one thing for me from the brick volfiles of your volume?
The brick volfiles are supposed to have the following lines:

volume test-index
    type features/index
    option xattrop-pending-watchlist trusted.afr.test-    <--------(1)
    option xattrop-dirty-watchlist trusted.afr.dirty      <--------(2)
    option index-base /export/test/brick2/.glusterfs/indices
    subvolumes test-barrier
end-volume

The two options mentioned above should exist. Otherwise you will see this
problem. You can find volfiles at /var/lib/glusterd/vols/<volname>.
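
A quick way to check is something like the sketch below (rough sketch only:
the volume name "test" and the brick-volfile filename pattern are placeholders
and may differ per setup, and only brick volfiles are expected to carry these
options):

  # grep every brick volfile of the volume for the two watchlist options
  for vf in /var/lib/glusterd/vols/test/test.*.vol; do
      echo "== $vf"
      grep -E 'xattrop-(pending|dirty)-watchlist' "$vf" \
          || echo "   watchlist options missing in this volfile"
  done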

Thanks.

--- Additional comment from evangelos on 2016-07-07 19:28:59 UTC ---

Hi Anuradha, 

In the meantime, due to various issues we had, we decided to downgrade to the
3.6 branch, so currently I do not have a 3.7 deployment up and running. But
thanks for the hint; I will keep it in mind for the future.
You can close this bug for now.

thank you

--- Additional comment from Anuradha on 2016-07-11 09:09:11 UTC ---

Hi Evangelos,

Thanks for the update.

Closing this bug as per comment #7.

Thanks,
Anuradha.

--- Additional comment from Emerson Gomes on 2019-01-27 15:42:59 UTC ---

This error is still reproducible in 5.3 when upgrading from a 3.x volume.
I had to recreate the volume from scratch in 5.3 and copy the data back in
order to avoid it.

--- Additional comment from Artem Russakovskii on 2019-01-30 20:23:44 UTC ---

I just started seeing this error after upgrading from 4.1 to 5.3.

[2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fd966fcd329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]

And it floods like crazy with these.

--- Additional comment from Emerson Gomes on 2019-01-30 20:33:12 UTC ---

I "solved" the issue after upgrading to 5.3 by creating a new volume and moving
all data to it.

Apparently something is missing on the volumes after upgrade.
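
rough sketch of that workaround (illustrative only: the volume name "newvol",
brick paths, mount points, and replica count are placeholders and must match
your actual layout):
  gluster volume create newvol replica 3 host1:/bricks/newvol host2:/bricks/newvol host3:/bricks/newvol
  gluster volume start newvol
  mount -t glusterfs localhost:/newvol /mnt/newvol
  rsync -aHAX /mnt/oldvol/ /mnt/newvol/   # copy data over from the old, still-mounted volume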

--- Additional comment from Artem Russakovskii on 2019-01-30 20:37:13 UTC ---

I just sent a message to the gluster mailing list about this because that's not
how this problem should be resolved. I'm curious to hear what they say.

--- Additional comment from Emerson Gomes on 2019-01-30 20:39:04 UTC ---

Absolutely. That's the second big issue I have had since upgrading. The first
one is https://bugzilla.redhat.com/show_bug.cgi?id=1651246, which is still
unsolved (open for more than two months now).

--- Additional comment from Artem Russakovskii on 2019-01-30 20:40:29 UTC ---

You know, I was *just* going to comment in a follow-up reply about whether the
issue here is possibly related to the one you just linked. Seeing tons of those
too, though at least the dupes are suppressed.



==> mnt-SITE_data1.log <==
[2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fd966fcd329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]

==> mnt-SITE_data3.log <==
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker]
2-epoll: Failed to dispatch handler" repeated 413 times between [2019-01-30
20:36:23.881090] and [2019-01-30 20:38:20.015593]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk]
2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0"
repeated 42 times between [2019-01-30 20:36:23.290287] and [2019-01-30
20:38:20.280306]

==> mnt-SITE_data1.log <==
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk]
2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0"
repeated 50 times between [2019-01-30 20:36:22.247367] and [2019-01-30
20:38:19.459789]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker]
2-epoll: Failed to dispatch handler" repeated 2654 times between [2019-01-30
20:36:22.667327] and [2019-01-30 20:38:20.546355]
[2019-01-30 20:38:21.492319] I [MSGID: 108031]
[afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting
local read_child SITE_data1-client-0

==> mnt-SITE_data3.log <==
[2019-01-30 20:38:22.349689] I [MSGID: 108031]
[afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting
local read_child SITE_data3-client-0

==> mnt-SITE_data1.log <==
[2019-01-30 20:38:22.762941] E [MSGID: 101191]
[event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch
handler

--- Additional comment from Emerson Gomes on 2019-01-30 20:48:52 UTC ---

Yeah, both arose after upgrading from 3.x to 5.1, and they persist in 5.2 and 5.3.

The other issue is even more critical.
It causes crashes, making the mount point inaccessible ("Transport endpoint is
not connected" error), which requires a manual umount/mount.
For now I have a crontab entry doing this, but I will have to downgrade if a
fix is not issued soon...

--- Additional comment from Artem Russakovskii on 2019-01-31 18:00:40 UTC ---

Damn, you weren't kidding. I wish I had seen these bug reports before I
upgraded from rock-solid 4.1.

Less than 24 hours after upgrading, I already got a crash that you referenced:


[2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fcccafcd329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fcccafcd329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fcccafcd329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fcccafcd329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk]
2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-3"
repeated 5 times between [2019-01-31 09:37:54.751905] and [2019-01-31
09:38:03.958061]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker]
2-epoll: Failed to dispatch handler" repeated 72 times between [2019-01-31
09:37:53.746741] and [2019-01-31 09:38:04.696993]
pending frames:
frame : type(1) op(READ)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-01-31 09:38:04
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6]
/lib64/libc.so.6(+0x36160)[0x7fccd622d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa]
/lib64/libc.so.6(+0x2e772)[0x7fccd6225772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f]
---------

--- Additional comment from Artem Russakovskii on 2019-02-02 20:15:09 UTC ---

The fuse crash happened again yesterday, to another volume. Are there any mount
options that could help mitigate this?

In the meantime, I set up a monit (https://mmonit.com/monit/) task to watch and
restart the mount, which works and recovers the mount point within a minute.
Not ideal, but a temporary workaround.

By the way, this "Transport endpoint is not connected" condition can be
reproduced for testing purposes by killing (kill -9) the right "glusterfs
--process-name fuse" process, as sketched below.
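
reproduction sketch (rough outline only: the pgrep pattern, placeholder PID,
and the mount path /mnt/glusterfs_data1 are illustrative and depend on the
setup):
  pgrep -af 'process-name fuse'     # list fuse client processes and their mounts
  kill -9 <pid>                     # kill the one serving the mount under test
  ls /mnt/glusterfs_data1           # now fails with "Transport endpoint is not connected"
  umount -l /mnt/glusterfs_data1 && mount /mnt/glusterfs_data1   # manual recovery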


monit check:
check filesystem glusterfs_data1 with path /mnt/glusterfs_data1
  start program  = "/bin/mount  /mnt/glusterfs_data1"
  stop program  = "/bin/umount /mnt/glusterfs_data1"
  if space usage > 90% for 5 times within 15 cycles
    then alert else if succeeded for 10 cycles then alert


stack trace:
[2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fa0249e4329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
[2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref]
(-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329)
[0x7fa0249e4329]
-->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5)
[0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58)
[0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker]
0-epoll: Failed to dispatch handler" repeated 26 times between [2019-02-01
23:21:20.857333] and [2019-02-01 23:21:56.164427]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk]
0-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-3"
repeated 27 times between [2019-02-01 23:21:11.142467] and [2019-02-01
23:22:03.474036]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-02-01 23:22:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6]
/lib64/libc.so.6(+0x36160)[0x7fa02c12d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa]
/lib64/libc.so.6(+0x2e772)[0x7fa02c125772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f]


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1313567
[Bug 1313567] flooding of "dict is NULL" logging
https://bugzilla.redhat.com/show_bug.cgi?id=1667103
[Bug 1667103] GlusterFS 5.4 tracker
https://bugzilla.redhat.com/show_bug.cgi?id=1671603
[Bug 1671603] flooding of "dict is NULL" logging & crash of client process