[Bugs] [Bug 1731513] New: snapd crashes sometimes

bugzilla at redhat.com bugzilla at redhat.com
Fri Jul 19 15:18:45 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1731513

            Bug ID: 1731513
           Summary: snapd crashes sometimes
           Product: Red Hat Gluster Storage
            Status: NEW
         Component: snapshot
          Severity: high
          Priority: high
          Assignee: rkavunga at redhat.com
          Reporter: rabhat at redhat.com
        QA Contact: rhinduja at redhat.com
                CC: atumball at redhat.com, bugs at gluster.org,
                    rhs-bugs at redhat.com, sankarshan at redhat.com,
                    storage-qa-internal at redhat.com, sunkumar at redhat.com
        Depends On: 1726783
            Blocks: 1731509, 1731512
  Target Milestone: ---
    Classification: Red Hat



+++ This bug was initially created as a clone of Bug #1726783 +++

Description of problem:

This crash was found in the regression runs; it happens only occasionally, not
on every run.


Core was generated by `/build/install/sbin/glusterfsd -s localhost --volfile-id
snapd/patchy -p /var/r'.
Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) bt
#0  0x00007f0a5d00b016 in __strcmp_sse42 () from ./lib64/libc.so.6
#1  0x00007f0a501deceb in mgmt_get_snapinfo_cbk (req=0x7f0a44001808,
iov=0x7f0a44001840, count=1, myframe=0x7f0a440014b8)
    at
/home/raghu/work/src/glusterfs/xlators/features/snapview-server/src/snapview-server-mgmt.c:393
#2  0x00007f0a5e67a58c in rpc_clnt_handle_reply (clnt=0xa4d850,
pollin=0x7f0a4c002080) at
/home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:764
#3  0x00007f0a5e67aab5 in rpc_clnt_notify (trans=0xa4da80, mydata=0xa4d880,
event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0a4c002080)
    at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-clnt.c:931
#4  0x00007f0a5e676a9c in rpc_transport_notify (this=0xa4da80,
event=RPC_TRANSPORT_MSG_RECEIVED, data=0x7f0a4c002080)
    at /home/raghu/work/src/glusterfs/rpc/rpc-lib/src/rpc-transport.c:547
#5  0x00007f0a52c85a19 in socket_event_poll_in (this=0xa4da80,
notify_handled=true) at
/home/raghu/work/src/glusterfs/rpc/rpc-transport/socket/src/socket.c:2582
#6  0x00007f0a52c86a2e in socket_event_handler (fd=9, idx=1, gen=1,
data=0xa4da80, poll_in=1, poll_out=0, poll_err=0, event_thread_died=0 '\000')
    at
/home/raghu/work/src/glusterfs/rpc/rpc-transport/socket/src/socket.c:2984
#7  0x00007f0a5e9574f2 in event_dispatch_epoll_handler (event_pool=0x9ff830,
event=0x7f0a50be7140)
    at /home/raghu/work/src/glusterfs/libglusterfs/src/event-epoll.c:648
#8  0x00007f0a5e957a0b in event_dispatch_epoll_worker (data=0xa4faf0) at
/home/raghu/work/src/glusterfs/libglusterfs/src/event-epoll.c:761
#9  0x00007f0a5d702dd5 in start_thread () from ./lib64/libpthread.so.0
#10 0x00007f0a5cfcaead in clone () from ./lib64/libc.so.6

(gdb) f 1
#1  0x00007f0a501deceb in mgmt_get_snapinfo_cbk (req=0x7f0a44001808,
iov=0x7f0a44001840, count=1, myframe=0x7f0a440014b8)
    at
/home/raghu/work/src/glusterfs/xlators/features/snapview-server/src/snapview-server-mgmt.c:393
393                     if ((!strcmp(old_dirents[i].name, dirents[j].name)) &&
(gdb) l
388         LOCK(&priv->snaplist_lock);
389         {
390             oldcount = priv->num_snaps;
391             for (i = 0; i < priv->num_snaps; i++) {
392                 for (j = 0; j < snapcount; j++) {
393                     if ((!strcmp(old_dirents[i].name, dirents[j].name)) &&
394                         (!strcmp(old_dirents[i].uuid, dirents[j].uuid))) {
395                         dirents[j].fs = old_dirents[i].fs;
396                         old_dirents[i].fs = NULL;
397                         break;
(gdb) p old_dirents
$1 = (snap_dirent_t *) 0x0
(gdb) p snap->dirents
No symbol "snap" in current context.
(gdb) p priv->dirents
$2 = (snap_dirent_t *) 0x7f0a440031e0
(gdb) p *priv->dirents
$3 = {name = "snap1", '\000' <repeats 249 times>, uuid =
"da58a428-260e-4c28-abae-6d426f7417fa", snap_volname =
"11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[0]
$4 = {name = "snap1", '\000' <repeats 249 times>, uuid =
"da58a428-260e-4c28-abae-6d426f7417fa", snap_volname =
"11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[1]
$5 = {name = "snap2", '\000' <repeats 249 times>, uuid =
"985b0bcc-279a-40d3-8a21-1cb7d250ae25", snap_volname =
"202c92fdf9624836931590897482a09b", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[2]
$6 = {name = "snap3", '\000' <repeats 249 times>, uuid =
"9c001fcf-b010-48d6-b6f4-b70f120b69f8", snap_volname =
"11f4f20ef1514be5a572b602ffe8d96d", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[3]
$7 = {name = "snap4", '\000' <repeats 249 times>, uuid =
"3b2eadfe-cad4-4c6f-ba2e-d2c89710d960", snap_volname =
"d28715725bb24411a7f4eebdff70e41a", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p priv->dirents[4]
$8 = {name = "\000\000\000\000\000\000\000\000a\325\001", '\000' <repeats 243
times>, uuid = '\000' <repeats 36 times>, snap_volname = '\000' <repeats 254
times>, fs = 0x0}
(gdb) p dirents
$9 = (snap_dirent_t *) 0x7f0a4c024190
(gdb) p dirents[0]
$10 = {name = "snap1", '\000' <repeats 249 times>, uuid =
"da58a428-260e-4c28-abae-6d426f7417fa", snap_volname =
"11536174454641028b9d901c7962abca", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[1]
$11 = {name = "snap2", '\000' <repeats 249 times>, uuid =
"985b0bcc-279a-40d3-8a21-1cb7d250ae25", snap_volname =
"202c92fdf9624836931590897482a09b", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[2]
$12 = {name = "snap3", '\000' <repeats 249 times>, uuid =
"9c001fcf-b010-48d6-b6f4-b70f120b69f8", snap_volname =
"11f4f20ef1514be5a572b602ffe8d96d", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[3]
$13 = {name = "snap4", '\000' <repeats 249 times>, uuid =
"3b2eadfe-cad4-4c6f-ba2e-d2c89710d960", snap_volname =
"d28715725bb24411a7f4eebdff70e41a", '\000' <repeats 222 times>, fs = 0x0}
(gdb) p dirents[4]
$14 = {
  name =
"\000\000\000\000\000\000\000\000\245\004\000\000\000\000\000\000\000@\315I\n\177\000\000\020\f\000L\n\177\000\000؍\356I\n\177\000\000\000\000\000\000\000\000\000\000@\311\001L\n\177\000\000`J\002L\n\177\000\000\000\000\000\000\000\000\000\000\320N\002L\n\177\000\000\000\000\000\000\000\000\000\000؍\356I\n\177\000\000\270\216\356I\n\177\000\000\250\216\356I\n\177\000\000\000\000\000\000\000\000\000\000h\216\356I\n\177\000\000x\216\356I\n\177\000\000\350\216\356I\n\177\000\000\370\216\356I\n\177\000\000\b\217\356I\n\177\000\000\210\216\356I\n\177\000\000\230\216\356I\n\177\000\000\370\215\356I\n\177\000\000\b\216\356I\n\177\000\000\350\215\356I\n\177",
'\000' <repeats 42 times>..., 
  uuid =
"\000\000\000\000\000\000\000\000\000؎\356I\n\177\000\000\000\000\000\000\000\000\000\000\030\216\356I\n\177\000\000\070\216\356I", 
  snap_volname = "\n\177\000\000(\216\356I\n\177\000\000H\216\356I\n\177",
'\000' <repeats 42 times>,
"H\217\356I\n\177\000\000\070\217\356I\n\177\000\000(\217\356I\n\177\000\000\030\217\356I\n\177",
'\000' <repeats 18 times>, "h\217\356I\n\177", '\000' <repeats 66 times>,
"X\217\356I\n\177", '\000' <repeats 68 times>, fs = 0x0}
(gdb) p snapcount
$15 = 4
(gdb) p oldcount
$16 = 4



Version-Release number of selected component (if applicable):


How reproducible:
Not always


Additional info:

Reason:

In the snapshot daemon, mgmt_get_snapinfo_cbk is the callback for the rpc call
made to glusterd to fetch the list of available snapshots. This rpc call is
issued from two places: once when the snapview-server xlator is spawned (in its
init function), and again whenever a connection is established between snapd
and glusterd (in svs_rpc_notify).

Whenever snapd comes up, it is both initialized and connected to glusterd, so
two rpc calls to glusterd can be in flight at the same time, and the order in
which their responses arrive can lead to the above crash.
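
For illustration, here is a minimal stand-alone sketch of that interleaving
(hypothetical names, plain pthreads, NOT GlusterFS code; build with something
like gcc -pthread): one callback captures the shared snapshot-list pointer
before taking the lock, while the competing callback installs a fresh list and
count under the lock. This is the pattern analysed in the following paragraphs.
Depending on scheduling it may or may not crash, mirroring how this bug shows
up only once in a while.

/* Assumed names: snap_dirent_sketch_t, snap_dirents and num_snaps are
 * illustrative stand-ins for priv->dirents / priv->num_snaps. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    char name[256];
} snap_dirent_sketch_t;

static pthread_mutex_t snaplist_lock = PTHREAD_MUTEX_INITIALIZER;
static snap_dirent_sketch_t *snap_dirents = NULL; /* like priv->dirents   */
static int num_snaps = 0;                         /* like priv->num_snaps */

/* Pre-fix pattern: the list pointer is captured outside the lock, so it can
 * still be NULL when num_snaps has already been bumped by the other
 * callback. */
static void buggy_callback_pattern(void)
{
    snap_dirent_sketch_t *old_dirents = snap_dirents; /* read outside the lock */

    pthread_mutex_lock(&snaplist_lock);
    for (int i = 0; i < num_snaps; i++)
        printf("old snap: %s\n", old_dirents[i].name); /* NULL dereference here */
    pthread_mutex_unlock(&snaplist_lock);
}

/* The competing callback instance: installs the list received from the other
 * rpc response, updating pointer and count together under the lock. */
static void *competing_callback(void *arg)
{
    (void)arg;
    snap_dirent_sketch_t *fresh = calloc(4, sizeof(*fresh));
    for (int i = 0; i < 4; i++)
        snprintf(fresh[i].name, sizeof(fresh[i].name), "snap%d", i + 1);

    pthread_mutex_lock(&snaplist_lock);
    snap_dirents = fresh;
    num_snaps = 4;
    pthread_mutex_unlock(&snaplist_lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, competing_callback, NULL);
    buggy_callback_pattern(); /* crashes only when the update wins the race */
    pthread_join(t, NULL);
    return 0;
}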

In mgmt_get_snapinfo_cbk we do the following:

1) Get the list of existing snapshots from the information kept privately by
the snapview-server xlator.
2) Process the list of snapshots that glusterd sent and compare it against
that local list.

Ideally, mgmt_get_snapinfo_cbk should obtain the current list of snapshots from
its private structure under the snaplist lock. Because it does not, the local
list can still be empty (NULL) when it is fetched, while by the time it is
processed and compared with the new list obtained via rpc, the private
structure has already been updated by the other callback: the snapshot count is
non-zero but the captured list pointer is still NULL, which results in the NULL
pointer access seen in the backtrace.

The fix is therefore to ensure that the local list of snapshots is read from
the private structure inside the lock.
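
A sketch of the fixed pattern, reconstructed from the code listing in the gdb
session above (this is not the verbatim upstream patch; the actual change is in
the reviews referenced below, and the UNLOCK call is assumed to follow the
usual libglusterfs LOCK/UNLOCK pairing). The point is that the list pointer and
the count are both captured while the snaplist lock is held, so they always
describe the same list:

    LOCK(&priv->snaplist_lock);
    {
        /* Previously old_dirents was read from priv->dirents before taking
         * the lock; capturing it here keeps it consistent with num_snaps. */
        old_dirents = priv->dirents;
        oldcount = priv->num_snaps;
        for (i = 0; i < oldcount; i++) {
            for (j = 0; j < snapcount; j++) {
                if ((!strcmp(old_dirents[i].name, dirents[j].name)) &&
                    (!strcmp(old_dirents[i].uuid, dirents[j].uuid))) {
                    dirents[j].fs = old_dirents[i].fs;
                    old_dirents[i].fs = NULL;
                    break;
                }
            }
        }
        /* ... rest of the processing stays under the same lock ... */
    }
    UNLOCK(&priv->snaplist_lock);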

--- Additional comment from Worker Ant on 2019-07-03 18:07:23 UTC ---

REVIEW: https://review.gluster.org/22989 (features/snapview-server: obtain the
list of snapshots inside the lock) posted (#1) for review on master by
Raghavendra Bhat

--- Additional comment from Worker Ant on 2019-07-03 18:08:58 UTC ---

REVIEW: https://review.gluster.org/22990 (features/snapview-server: obtain the
list of snapshots inside the lock) posted (#1) for review on master by
Raghavendra Bhat

--- Additional comment from Worker Ant on 2019-07-12 13:37:34 UTC ---

REVIEW: https://review.gluster.org/22990 (features/snapview-server: obtain the
list of snapshots inside the lock) merged (#2) on master by mohammed rafi  kc


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1726783
[Bug 1726783] snapd crashes sometimes
https://bugzilla.redhat.com/show_bug.cgi?id=1731509
[Bug 1731509] snapd crashes sometimes
https://bugzilla.redhat.com/show_bug.cgi?id=1731512
[Bug 1731512] snapd crashes sometimes