[Bugs] [Bug 1147422] New: dist-geo-rep: Session going into faulty with "Cannot allocate memory" backtrace when pause, rename and resume is performed

bugzilla at redhat.com
Mon Sep 29 08:48:58 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1147422

            Bug ID: 1147422
           Summary: dist-geo-rep: Session going into faulty with "Cannot
                    allocate memory" backtrace when pause, rename and
                    resume is performed
           Product: GlusterFS
           Version: 3.6.0
         Component: geo-replication
          Severity: high
          Assignee: gluster-bugs at redhat.com
          Reporter: avishwan at redhat.com
                CC: aavati at redhat.com, asrivast at redhat.com,
                    avishwan at redhat.com, bugs at gluster.org,
                    csaba at redhat.com, khiremat at redhat.com,
                    nlevinki at redhat.com, rhs-bugs at redhat.com,
                    smanjara at redhat.com, ssamanta at redhat.com,
                    storage-qa-internal at redhat.com, vbhat at redhat.com
        Depends On: 1144428, 1146823



+++ This bug was initially created as a clone of Bug #1146823 +++

+++ This bug was initially created as a clone of Bug #1144428 +++

Description of problem:
The session goes into faulty with an "OSError: [Errno 12] Cannot allocate
memory" backtrace in the logs. The sequence of operations performed was:
sync existing data -> pause the session -> rename all the files -> resume the
session.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once so far; not sure it can be reproduced again.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2x2 dist-rep master volume and
a 2x2 dist-rep slave volume.
2. Create and sync some 5k files in some directory structure.
3. Pause the session.
4. Rename all the files (a sketch of this step follows the list).
5. Resume the session.
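
A minimal sketch of the rename step (step 4), assuming the master volume is
FUSE-mounted at /mnt/master; the mount path and the ".renamed" suffix are
illustrative only:

# Rename every file under the (hypothetical) master mount point while the
# geo-rep session is paused.
import os

MOUNT = "/mnt/master"

for dirpath, _dirnames, filenames in os.walk(MOUNT):
    for name in filenames:
        src = os.path.join(dirpath, name)
        os.rename(src, src + ".renamed")   # rename, keeping the file in place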

Actual results:
The session went into the faulty state:

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS
-----------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      faulty     N/A                  N/A
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Passive    N/A                  N/A
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          faulty     N/A                  N/A


The backtrace in the master logs:

[2014-09-19 16:19:53.933645] I [master(/bricks/brick2):1225:crawl] _GMaster: slave's time: (1411061833, 0)
[2014-09-19 16:20:33.653033] E [repce(/bricks/brick2):207:__call__] RepceClient: call 18787:139727562630912:1411123833.64 (entry_ops) failed on peer with OSError
[2014-09-19 16:20:33.653924] E [syncdutils(/bricks/brick2):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1324, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1236, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:20:33.657620] I [syncdutils(/bricks/brick2):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.663028] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-19 16:20:33.663907] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.795839] I [monitor(monitor):222:monitor] Monitor: worker(/bricks/brick2) died in startup phase


This is a remote backtrace propagated to the master via RPC (a simplified
sketch of this propagation pattern follows the log excerpt). The actual
backtrace in the slave logs is:

[2014-09-19 16:27:45.780600] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in entry_ops
    [ENOENT, ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:27:45.794786] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
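
For context, a minimal, hypothetical sketch of the error-propagation pattern
visible above: the slave-side repce worker catches the exception and ships it
back over the RPC channel, and the master-side client re-raises it locally,
which is why the same OSError appears in both logs. The function names below
are illustrative, not the actual repce.py API.

import os
import pickle

def server_handle(obj, method, args):
    """Slave side: run the call, package either the result or the exception."""
    try:
        return pickle.dumps((True, getattr(obj, method)(*args)))
    except Exception as exc:               # e.g. OSError(12, 'Cannot allocate memory')
        return pickle.dumps((False, exc))

def client_unpack(payload):
    """Master side: unpack the reply and re-raise any remote exception locally."""
    ok, res = pickle.loads(payload)
    if not ok:
        raise res                          # surfaces as the backtrace in the master log
    return res

# Tiny demo: a fake "slave" whose entry_ops always fails with ENOMEM.
class FakeSlave:
    def entry_ops(self, entries):
        raise OSError(12, os.strerror(12))

payload = server_handle(FakeSlave(), "entry_ops", ([],))
try:
    client_unpack(payload)
except OSError as e:
    print("remote failure re-raised on master:", e)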


Expected results:
There should be no backtraces and no faulty sessions.

Additional info:
The slave volume had cluster.hash-range-gfid set to on.

--- Additional comment from Anand Avati on 2014-09-26 03:40:46 EDT ---

REVIEW: http://review.gluster.org/8865 (geo-rep: Fix rename of directory syncing.) posted (#1) for review on master by Kotresh HR (khiremat at redhat.com)

--- Additional comment from Anand Avati on 2014-09-29 02:32:50 EDT ---

COMMIT: http://review.gluster.org/8865 committed in master by Venky Shankar (vshankar at redhat.com)
------
commit 7113d873af1f129effd8c6da21b49e797de8eab0
Author: Kotresh HR <khiremat at redhat.com>
Date:   Thu Sep 25 17:34:43 2014 +0530

    geo-rep: Fix rename of directory syncing.

    Renames of directories are captured in the changelogs of all the
    distributed bricks, and gsyncd processes these changelogs on each
    brick in parallel. The first changelog to be processed succeeds.
    All subsequent ones stat the 'src' and, if it is not present, try
    to create the entry freshly on the slave. This should be done only
    for files and not for directories. When this code path was hit for
    a directory, a regular file's blob was sent as the directory's
    blob, and the gfid-access translator errored out with 'Invalid
    blob length' and errno 'ENOMEM'.

    Change-Id: I50545b02b98846464876795159d2446340155c82
    BUG: 1146823
    Signed-off-by: Kotresh HR <khiremat at redhat.com>
    Reviewed-on: http://review.gluster.org/8865
    Reviewed-by: Aravinda VK <avishwan at redhat.com>
    Tested-by: Gluster Build System <jenkins at build.gluster.com>
    Reviewed-by: Venky Shankar <vshankar at redhat.com>
    Tested-by: Venky Shankar <vshankar at redhat.com>
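
For illustration, below is a simplified, hypothetical Python sketch of the
guard the commit message describes. The names handle_rename, entry and
recreate_file are illustrative only; the actual change lives in gsyncd's
entry_ops path (resource.py), not in this exact form.

import os
import errno
import stat

def handle_rename(entry, recreate_file):
    """Apply one RENAME entry on the slave.

    Only the worker that processes the rename first still finds 'src'.
    The others may fall back to a fresh create only for regular files;
    recreating a directory this way sends a file blob for a directory,
    which the gfid-access translator rejects ('Invalid blob length',
    returned to the caller as ENOMEM).
    """
    try:
        os.rename(entry['src'], entry['dst'])
        return
    except OSError as e:
        if e.errno not in (errno.ENOENT, errno.ESTALE):
            raise

    # 'src' is already gone: another brick's worker performed the rename.
    if stat.S_ISDIR(entry['mode']):
        return                      # directories: nothing more to do
    recreate_file(entry)            # files only: safe to recreate on the slave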

--- Additional comment from Anand Avati on 2014-09-29 04:30:03 EDT ---

REVIEW: http://review.gluster.org/8880 (geo-rep: Fix rename of directory syncing.) posted (#1) for review on release-3.6 by Aravinda VK (avishwan at redhat.com)


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1144428
[Bug 1144428] dist-geo-rep: Session going into faulty with "Cannot allocate
memory" backtrace when pause, rename and resume is performed
https://bugzilla.redhat.com/show_bug.cgi?id=1146823
[Bug 1146823] dist-geo-rep: Session going into faulty with "Cannot allocate
memory" backtrace when pause, rename and resume is performed