[Bugs] [Bug 1147422] New: dist-geo-rep: Session going into faulty with "Can no allocate memory" backtrace when pause, rename and resume is performed
bugzilla at redhat.com
Mon Sep 29 08:48:58 UTC 2014
https://bugzilla.redhat.com/show_bug.cgi?id=1147422
Bug ID: 1147422
Summary: dist-geo-rep: Session going into faulty with "Can no
allocate memory" backtrace when pause, rename and
resume is performed
Product: GlusterFS
Version: 3.6.0
Component: geo-replication
Severity: high
Assignee: gluster-bugs at redhat.com
Reporter: avishwan at redhat.com
CC: aavati at redhat.com, asrivast at redhat.com,
avishwan at redhat.com, bugs at gluster.org,
csaba at redhat.com, khiremat at redhat.com,
nlevinki at redhat.com, rhs-bugs at redhat.com,
smanjara at redhat.com, ssamanta at redhat.com,
storage-qa-internal at redhat.com, vbhat at redhat.com
Depends On: 1144428, 1146823
+++ This bug was initially created as a clone of Bug #1146823 +++
+++ This bug was initially created as a clone of Bug #1144428 +++
Description of problem:
The session goes into faulty with an "OSError: [Errno 12] Cannot allocate
memory" backtrace in the logs. The sequence of operations performed was:
sync existing data -> pause session -> rename all the files -> resume the session.
Version-Release number of selected component (if applicable):
mainline
How reproducible:
Hit only once. Not sure I will be able to reproduce again.
Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2
dist-rep slave volume.
2. Create and sync some 5k files in some directory structure.
3. Pause the session.
4. Rename all the files.
5. Resume the session.
Actual results:
The session went faulty:
MASTER NODE               MASTER VOL   MASTER BRICK     SLAVE             STATUS    CHECKPOINT STATUS   CRAWL STATUS
---------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com        master       /bricks/brick0   nirvana::slave    faulty    N/A                 N/A
metallica.blr.redhat.com  master       /bricks/brick1   acdc::slave       Passive   N/A                 N/A
beatles.blr.redhat.com    master       /bricks/brick3   rammstein::slave  Passive   N/A                 N/A
pinkfloyd.blr.redhat.com  master       /bricks/brick2   led::slave        faulty    N/A                 N/A
The backtrace in the master logs:
[2014-09-19 16:19:53.933645] I [master(/bricks/brick2):1225:crawl] _GMaster: slave's time: (1411061833, 0)
[2014-09-19 16:20:33.653033] E [repce(/bricks/brick2):207:__call__] RepceClient: call 18787:139727562630912:1411123833.64 (entry_ops) failed on peer with OSError
[2014-09-19 16:20:33.653924] E [syncdutils(/bricks/brick2):270:log_raise_exception] <top>: FAIL:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 164, in main
    main_i()
  File "/usr/libexec/glusterfs/python/syncdaemon/gsyncd.py", line 643, in main_i
    local.service_loop(*[r for r in [remote] if r])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 1324, in service_loop
    g3.crawlwrap(oneshot=True)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 524, in crawlwrap
    self.crawl(no_stime_update=no_stime_update)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 1236, in crawl
    self.process(changes)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 927, in process
    self.process_change(change, done, retry)
  File "/usr/libexec/glusterfs/python/syncdaemon/master.py", line 891, in process_change
    self.slave.server.entry_ops(entries)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 226, in __call__
    return self.ins(self.meth, *a)
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 208, in __call__
    raise res
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:20:33.657620] I [syncdutils(/bricks/brick2):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.663028] I [repce(agent):92:service_loop] RepceServer: terminating on reaching EOF.
[2014-09-19 16:20:33.663907] I [syncdutils(agent):214:finalize] <top>: exiting.
[2014-09-19 16:20:33.795839] I [monitor(monitor):222:monitor] Monitor: worker(/bricks/brick2) died in startup phase
This is a remote backtrace propagated to the master via RPC. The actual
backtrace in the slave logs is:
[2014-09-19 16:27:45.780600] E [repce(slave):117:worker] <top>: call failed:
Traceback (most recent call last):
  File "/usr/libexec/glusterfs/python/syncdaemon/repce.py", line 113, in worker
    res = getattr(self.obj, rmeth)(*in_data[2:])
  File "/usr/libexec/glusterfs/python/syncdaemon/resource.py", line 662, in entry_ops
    [ENOENT, ESTALE, EINVAL])
  File "/usr/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 470, in errno_wrap
    return call(*arg)
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 78, in lsetxattr
    cls.raise_oserr()
  File "/usr/libexec/glusterfs/python/syncdaemon/libcxattr.py", line 37, in raise_oserr
    raise OSError(errn, os.strerror(errn))
OSError: [Errno 12] Cannot allocate memory
[2014-09-19 16:27:45.794786] I [repce(slave):92:service_loop] RepceServer: terminating on reaching EOF.
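The errno_wrap frame in the slave traceback shows the pattern at play: a
syscall wrapper swallows a whitelist of benign errnos (here ENOENT, ESTALE,
EINVAL) and re-raises everything else, which is how the ENOMEM from
lsetxattr escaped and crossed the RPC boundary. A minimal sketch of that
pattern, with simplified names (this is not gsyncd's actual code):

```python
import errno
import os
import time

def errno_wrap(call, args=None, tolerated=None, retries=5, delay=0.1):
    """Invoke `call`, swallowing whitelisted errnos and retrying transient ones.

    Any other errno (e.g. ENOMEM in this bug) propagates to the caller as
    OSError, which is what then surfaces in the geo-rep logs.
    """
    args = args or []
    tolerated = tolerated or []
    for attempt in range(retries):
        try:
            return call(*args)
        except OSError as e:
            if e.errno in tolerated:
                return None          # benign race (e.g. ENOENT): swallow it
            if e.errno in (errno.EINTR, errno.EBUSY) and attempt < retries - 1:
                time.sleep(delay)    # transient: retry
                continue
            raise                    # ENOMEM etc. bubble up unchanged

def fails_with(err):
    """Helper that simulates a failing syscall."""
    raise OSError(err, os.strerror(err))
```

With this shape, `errno_wrap(fails_with, [errno.ENOENT], [errno.ENOENT])`
returns None quietly, while an un-whitelisted errno raises through.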
Expected results:
There should be no backtraces and no faulty sessions.
Additional info:
The slave volume had cluster.hash-range-gfid enabled.
--- Additional comment from Anand Avati on 2014-09-26 03:40:46 EDT ---
REVIEW: http://review.gluster.org/8865 (geo-rep: Fix rename of directory
syncing.) posted (#1) for review on master by Kotresh HR (khiremat at redhat.com)
--- Additional comment from Anand Avati on 2014-09-29 02:32:50 EDT ---
COMMIT: http://review.gluster.org/8865 committed in master by Venky Shankar
(vshankar at redhat.com)
------
commit 7113d873af1f129effd8c6da21b49e797de8eab0
Author: Kotresh HR <khiremat at redhat.com>
Date: Thu Sep 25 17:34:43 2014 +0530
geo-rep: Fix rename of directory syncing.
The rename of a directory is captured in the changelogs of all distributed
bricks, and gsyncd processes these changelogs on each brick in parallel.
The first changelog to be processed succeeds. All subsequent ones stat the
'src' and, if it is absent, try to create the entry freshly on the slave.
That fallback should be done only for files, not for directories. When
this code path was hit, a regular file's blob was sent as the directory's
blob, and the gfid-access translator errored out with 'Invalid blob
length' and errno 'ENOMEM'.
Change-Id: I50545b02b98846464876795159d2446340155c82
BUG: 1146823
Signed-off-by: Kotresh HR <khiremat at redhat.com>
Reviewed-on: http://review.gluster.org/8865
Reviewed-by: Aravinda VK <avishwan at redhat.com>
Tested-by: Gluster Build System <jenkins at build.gluster.com>
Reviewed-by: Venky Shankar <vshankar at redhat.com>
Tested-by: Venky Shankar <vshankar at redhat.com>
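The fix described in the commit message amounts to gating the fresh-create
fallback on the entry type: only a non-directory entry whose source has
vanished should be recreated from its blob. A hypothetical sketch of that
guard (the function and its dict layout are illustrative only, not
gsyncd's actual entry_ops API):

```python
import stat

def replay_rename(entry, source_exists, mode):
    """Decide how a RENAME changelog record is replayed on the slave.

    `entry` holds the 'src'/'dst' paths; `mode` is the st_mode recorded for
    the entry; `source_exists` is whether 'src' still exists on the slave.
    Returns an (action, path) pair for illustration.
    """
    if source_exists:
        # Normal case: first worker to process the record does the rename.
        return ("rename", entry["dst"])
    if stat.S_ISDIR(mode):
        # Another brick's worker already renamed the directory; recreating
        # it here would send a file blob for a directory (the bug).
        return ("noop", None)
    # Regular file whose src is gone: recreate it at dst from its blob.
    return ("create", entry["dst"])
```

For a directory whose src is already gone, `replay_rename({"src": "/a",
"dst": "/b"}, False, stat.S_IFDIR | 0o755)` now returns a no-op instead of
a bogus create, which is the essence of the patch.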
--- Additional comment from Anand Avati on 2014-09-29 04:30:03 EDT ---
REVIEW: http://review.gluster.org/8880 (geo-rep: Fix rename of directory
syncing.) posted (#1) for review on release-3.6 by Aravinda VK
(avishwan at redhat.com)
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1144428
[Bug 1144428] dist-geo-rep: Session going into faulty with "Can no allocate
memory" backtrace when pause, rename and resume is performed
https://bugzilla.redhat.com/show_bug.cgi?id=1146823
[Bug 1146823] dist-geo-rep: Session going into faulty with "Can no allocate
memory" backtrace when pause, rename and resume is performed