[Bugs] [Bug 1348085] New: [geo-rep]: Worker crashed with "KeyError: "

Mon Jun 20 06:38:18 UTC 2016

https://bugzilla.redhat.com/show_bug.cgi?id=1348085

            Bug ID: 1348085
           Summary: [geo-rep]: Worker crashed with "KeyError: "
           Product: GlusterFS
           Version: 3.7.11
         Component: geo-replication
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: avishwan at redhat.com
                CC: bugs at gluster.org, csaba at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    storage-qa-internal at redhat.com
        Depends On: 1344826, 1345744

+++ This bug was initially created as a clone of Bug #1345744 +++

+++ This bug was initially created as a clone of Bug #1344826 +++

Description of problem:
=======================

While performing rm -rf on a cascaded setup, a worker crashed on both the
primary master and the intermediate master volume, with the following
traceback:

Master Volume:
==============

[2016-06-11 09:41:17.359086] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'

Intermediate Master:
====================

[2016-06-11 09:41:51.681622] E [syncdutils(/rhs/brick1/b1):276:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 201, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 720, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1497, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 571, in crawlwrap
    self.crawl()
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1201, in crawl
    self.changelogs_batch_process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1107, in changelogs_batch_process
    self.process(batch)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 984, in process
    self.datas_in_batch.remove(unlinked_gfid)
KeyError: '.gfid/757b0ad8-b6f5-44da-b71a-1b1c25a72988'
[2016-06-11 09:41:51.684969] I [syncdutils(/rhs/brick1/b1):220:finalize] <top>: exiting.

Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.9-10

How reproducible:
=================

Always, on a cascaded setup, when removing files (rm -rf)

Steps to Reproduce:
===================
1. Create a geo-rep cascaded setup with three volumes (vol0, vol1, vol2) such
that vol0=>vol1 and vol1=>vol2.
2. Mount vol0 and perform fops such as
(cp, create, chmod, chown, chgrp, symlink, hardlink, truncate) on it.
3. Let the changes sync to the slaves (vol1 and vol2).
4. Calculate the arequal checksum after every fop. It should match.
5. Perform rm -rf on vol0.

Actual results:
===============

Worker crashed on vol0 and vol1 with KeyError.

Expected results:
=================

Worker shouldn't crash.

Additional info:
================

Performed rm -rf on a non-cascaded setup and did not see the crash. Also, the
files are eventually removed from all masters and slaves.

--- Additional comment from Vijay Bellur on 2016-06-13 02:33:20 EDT ---

REVIEW: http://review.gluster.org/14706 (geo-rep: Safely handle if unliked GFID
not present in data list) posted (#1) for review on master by Aravinda VK
(avishwan at redhat.com)

--- Additional comment from Vijay Bellur on 2016-06-20 02:37:06 EDT ---

COMMIT: http://review.gluster.org/14706 committed in master by Aravinda VK
(avishwan at redhat.com) 
------
commit 4797ca3778d82a671716d4913c14f285591ae959
Author: Aravinda VK <avishwan at redhat.com>
Date:   Mon Jun 13 12:00:40 2016 +0530

    geo-rep: Safely handle if unliked GFID not present in data list

    If unlinked GFID is not present in data list to be synced then
    Geo-rep worker was crashing with KeyError. Handled KeyError with
    this patch.

    BUG: 1345744
    Change-Id: I5a1c9ca4473e32606df2e5c7e26c95faf55d44c0
    Signed-off-by: Aravinda VK <avishwan at redhat.com>
    Reviewed-on: http://review.gluster.org/14706
    Smoke: Gluster Build System <jenkins at build.gluster.org>
    NetBSD-regression: NetBSD Build System <jenkins at build.gluster.org>
    CentOS-regression: Gluster Build System <jenkins at build.gluster.org>
    Reviewed-by: Kotresh HR <khiremat at redhat.com>
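
For readers unfamiliar with the failure mode: the KeyError in the traceback
is what a Python set raises from remove() when the element is absent, which
suggests datas_in_batch is a set. The following is only a minimal sketch of
tolerating a missing entry, not the committed patch. The helper name
drop_unlinked_gfid and the sample GFID are hypothetical and do not come from
master.py.

```python
def drop_unlinked_gfid(datas_in_batch, unlinked_gfid):
    """Drop unlinked_gfid from the pending-data set.

    Returns True if the entry was present and removed, False otherwise.
    """
    try:
        datas_in_batch.remove(unlinked_gfid)
    except KeyError:
        # The GFID was not queued for data sync in this batch,
        # so there is nothing to drop and no reason to crash.
        return False
    return True


# Hypothetical batch contents, for illustration only.
gfid = ".gfid/00000000-0000-0000-0000-000000000001"
pending = {gfid}

removed_first = drop_unlinked_gfid(pending, gfid)  # entry present
removed_again = drop_unlinked_gfid(pending, gfid)  # entry already gone
```

If the caller does not need to know whether the entry was present,
set.discard() gives the same no-error behaviour in a single call.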

Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1344826
[Bug 1344826] [geo-rep]: Worker crashed with "KeyError: "
https://bugzilla.redhat.com/show_bug.cgi?id=1345744
[Bug 1345744] [geo-rep]: Worker crashed with "KeyError: "