[Bugs] [Bug 1159190] New: dist-geo-rep: Session going into faulty with "Cannot allocate memory" backtrace when pause, rename and resume is performed

bugzilla at redhat.com
Fri Oct 31 07:52:34 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1159190

            Bug ID: 1159190
           Summary: dist-geo-rep: Session going into faulty with "Cannot
                    allocate memory" backtrace when pause, rename and
                    resume is performed
           Product: GlusterFS
           Version: 3.6.0
         Component: geo-replication
          Severity: high
          Assignee: bugs at gluster.org
          Reporter: khiremat at redhat.com
                CC: aavati at redhat.com, asrivast at redhat.com,
                    avishwan at redhat.com, bugs at gluster.org,
                    csaba at redhat.com, gluster-bugs at redhat.com,
                    khiremat at redhat.com, nlevinki at redhat.com,
                    rhs-bugs at redhat.com, smanjara at redhat.com,
                    ssamanta at redhat.com, storage-qa-internal at redhat.com,
                    vbhat at redhat.com
        Depends On: 1144428, 1146823
            Blocks: 1147422



+++ This bug was initially created as a clone of Bug #1146823 +++

+++ This bug was initially created as a clone of Bug #1144428 +++

Description of problem:
The session goes into faulty with an "OSError: [Errno 12] Cannot allocate
memory" backtrace in the logs. The sequence performed was: sync existing data
-> pause the session -> rename all the files -> resume the session.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once. Not sure it can be reproduced again.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2
dist-rep slave volume.
2. Create and sync some 5k files in some directory structure.
3. Pause the session.
4. Rename all the files.
5. Resume the session.
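
For reference, steps 3-5 can be driven from a small script. This is a minimal
sketch only: the volume names are taken from the status output below, while
the master mount point /mnt/master is a hypothetical assumption.

#!/usr/bin/env python
# Sketch of the pause -> rename -> resume sequence from the steps above.
import os
import subprocess

MASTER = "master"
SLAVE = "nirvana::slave"
MOUNT = "/mnt/master"   # hypothetical master volume mount point

def georep(action):
    # e.g. gluster volume geo-replication master nirvana::slave pause
    subprocess.check_call(["gluster", "volume", "geo-replication",
                           MASTER, SLAVE, action])

georep("pause")

# Step 4: rename every file under the mount.
for dirpath, _, filenames in os.walk(MOUNT):
    for name in filenames:
        src = os.path.join(dirpath, name)
        os.rename(src, src + ".renamed")

georep("resume")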

Actual results:
The session went into faulty:

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS     CHECKPOINT STATUS    CRAWL STATUS
---------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      faulty     N/A                  N/A
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive    N/A                  N/A
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Passive    N/A                  N/A
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          faulty     N/A                  N/A


The backtrace in the master logs:

[2014-09-19 16:19:53.933645] I [master(/bricks/brick2):1225:crawl] _GMaster: slave's time: (1411061833, 0)
[2014-09-19 16:20:33.653033] E [repce(/bricks/brick2):207:__call__] RepceClient: call 18787:139727562630912:1411123833.64 (entry_ops) failed on peer with OSError
[2014-09-19 16:20:33.653924] E [syncdutils(/bricks/brick2):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1324, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1236, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:20:33.657620] I [syncdutils(/bricks/brick2):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.663028] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-19 16:20:33.663907] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.795839] I [monitor(monitor):222:monitor] Monitor: worker(/bricks/brick2) died in startup phase
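
Note the last two frames: repce.py's __call__ ends in "raise res", i.e. the
OSError was not raised locally but was received from the peer and re-raised
in the master worker. A minimal sketch of that propagation pattern, greatly
simplified from what repce actually does over its pickle-based channel:

import errno
import os
import pickle

def server_dispatch(func):
    # Slave side: run the requested operation; on failure, ship the
    # exception object back instead of a result.
    try:
        return pickle.dumps(("ok", func()))
    except OSError as e:
        return pickle.dumps(("err", e))

def client_call(wire_bytes):
    # Master side: re-raise whatever the peer sent back. This is the
    # "raise res" frame in the traceback above.
    status, payload = pickle.loads(wire_bytes)
    if status == "err":
        raise payload
    return payload

def entry_ops():
    # Stand-in for the slave-side failure.
    raise OSError(errno.ENOMEM, os.strerror(errno.ENOMEM))

client_call(server_dispatch(entry_ops))  # raises OSError: [Errno 12]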


As the sketch above illustrates, this is a remote backtrace propagated to the
master via RPC. The actual backtrace in the slave logs is:

[2014-09-19 16:27:45.780600] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in entry_ops
    [ENOENT, ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:27:45.794786] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
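
The slave-side frames also show why this particular errno is fatal: entry_ops
runs lsetxattr through errno_wrap with [ENOENT, ESTALE, EINVAL] as the
tolerated errors, so ENOMEM falls outside the whitelist and propagates up,
faulting the worker. A minimal sketch of that whitelisting pattern, simplified
from syncdutils.errno_wrap:

import errno
import os

def errno_wrap(call, arg, tolerated):
    # Simplified: errnos in the whitelist are treated as benign races
    # and swallowed; anything else, such as ENOMEM here, is re-raised
    # and faults the session.
    try:
        return call(*arg)
    except OSError as e:
        if e.errno in tolerated:
            return None   # e.g. file vanished between changelog and replay
        raise

def failing_lsetxattr(path, attr, val):
    # Stand-in for libcxattr.lsetxattr failing as in the log above.
    raise OSError(errno.ENOMEM, os.strerror(errno.ENOMEM))

errno_wrap(failing_lsetxattr, ["/bricks/brick2/f", "key", "val"],
           [errno.ENOENT, errno.ESTALE, errno.EINVAL])
# -> raises OSError: [Errno 12] Cannot allocate memory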


Expected results:
There should be no backtraces and no faulty sessions.

Additional info:
The slave volume had cluster.hash-range-gfid enabled.


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1144428
[Bug 1144428] dist-geo-rep: Session going into faulty with "Cannot allocate
memory" backtrace when pause, rename and resume is performed
https://bugzilla.redhat.com/show_bug.cgi?id=1146823
[Bug 1146823] dist-geo-rep: Session going into faulty with "Cannot allocate
memory" backtrace when pause, rename and resume is performed
https://bugzilla.redhat.com/show_bug.cgi?id=1147422
[Bug 1147422] dist-geo-rep: Session going into faulty with "Cannot allocate
memory" backtrace when pause, rename and resume is performed