[Bugs] [Bug 1287503] New: Full heal of volume fails on some nodes "Commit failed on X", and glustershd logs "Couldn't get xlator xl-0"
bugzilla at redhat.com
Wed Dec 2 08:36:19 UTC 2015
https://bugzilla.redhat.com/show_bug.cgi?id=1287503
Bug ID: 1287503
Summary: Full heal of volume fails on some nodes "Commit failed
on X", and glustershd logs "Couldn't get xlator xl-0"
Product: GlusterFS
Version: mainline
Component: glusterd
Keywords: Triaged
Severity: medium
Assignee: bugs at gluster.org
Reporter: ravishankar at redhat.com
CC: bugs at gluster.org, bugs at medgen.ugent.be,
gluster-bugs at redhat.com, rkavunga at redhat.com
Depends On: 1284863
+++ This bug was initially created as a clone of Bug #1284863 +++
Description of problem:
-----------------------
Problems with unsuccessful full heals on all volumes started after upgrading the
6-node cluster from 3.7.2 to 3.7.6 on Ubuntu Trusty (kernel 3.13.0-49-generic).
On a Distributed-Replicate volume named test (vol info below), executing
`gluster volume heal test full` is unsuccessful and returns different
messages/errors depending on which node the command was executed on:
- When run from node *a*, *d* or *e* the cli tool returns:
> Launching heal operation to perform full self heal on volume test has been unsuccessful
With the following errors/warnings on the node where the command is run (no log
entries on the other nodes):
> E [glusterfsd-mgmt.c:619:glusterfs_handle_translator_op] 0-glusterfs: Couldn't get xlator xl-0
==> /var/log/glusterfs/cli.log
> I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6
> I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> W [socket.c:588:__socket_rwv] 0-glusterfs: readv on /var/run/gluster/quotad.socket failed (Invalid argument)
> I [cli-rpc-ops.c:8348:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume
> I [input.c:36:cli_batch] 0-: Exiting with: -2
==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
> I [MSGID: 106533] [glusterd-volume-ops.c:861:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume test
- When run from node *b* the cli tool returns:
> Commit failed on d.storage. Please check log file for details.
> Commit failed on e.storage. Please check log file for details.
No errors in any log files on any node at that point in time (only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol" on
the other 4 nodes, for which no commit-failed message was returned by the cli).
- When run from node *c* the cli tool returns:
> Commit failed on e.storage. Please check log file for details.
> Commit failed on a.storage. Please check log file for details.
No errors in any log files on any node at that point in time (only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol" on
the other 4 nodes, for which no commit-failed message was returned by the cli).
- When run from node *f* the cli tool returns:
> Commit failed on a.storage. Please check log file for details.
> Commit failed on d.storage. Please check log file for details.
No errors in any log files on any node at that point in time (only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol" on
the other 4 nodes, for which no commit-failed message was returned by the cli).
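The per-node pattern above (each originating node reports commit failures on a
different pair of peers) is what one would expect if the heal-full operation is
fanned out to every node and each peer accepts or rejects its commit
independently. Purely to illustrate that aggregation, here is a minimal,
self-contained C sketch; the peer struct, hostnames and pass/fail flags are
invented for the example and this is not GlusterFS code.

/* Hypothetical sketch: the originator fans a cluster-wide op out to its
 * peers and prints one line per peer whose commit failed. */
#include <stdio.h>

struct peer {
        const char *host;      /* peer hostname as known to the cluster */
        int         commit_ok; /* 1 = commit succeeded, 0 = commit failed */
};

static int report_commit(const struct peer *peers, int n, const char *volname)
{
        int failures = 0;

        for (int i = 0; i < n; i++) {
                if (!peers[i].commit_ok) {
                        printf("Commit failed on %s. Please check log file "
                               "for details.\n", peers[i].host);
                        failures++;
                }
        }
        if (failures == 0)
                printf("Launching heal operation to perform full self heal "
                       "on volume %s has been successful\n", volname);
        return failures;
}

int main(void)
{
        /* e.g. the run from node *b* above: d.storage and e.storage
         * rejected the commit, the other peers accepted it. */
        struct peer peers[] = {
                { "a.storage", 1 }, { "c.storage", 1 },
                { "d.storage", 0 }, { "e.storage", 0 },
                { "f.storage", 1 },
        };

        return report_commit(peers, (int)(sizeof(peers) / sizeof(peers[0])),
                             "test") ? 1 : 0;
}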
Additional info:
----------------
**Volume info**
Volume Name: test
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: a.storage:/storage/bricks/test/brick
Brick2: b.storage:/storage/bricks/test/brick
Brick3: c.storage:/storage/bricks/test/brick
Brick4: d.storage:/storage/bricks/test/brick
Brick5: e.storage:/storage/bricks/test/brick
Brick6: f.storage:/storage/bricks/test/brick
Options Reconfigured:
performance.readdir-ahead: on
features.trash: off
nfs.disable: off
**Volume status info**
Status of volume: test
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick a.storage:/storage/bricks/test/brick  49156     0          Y       783
Brick b.storage:/storage/bricks/test/brick  49160     0          Y       33394
Brick c.storage:/storage/bricks/test/brick  49156     0          Y       545
Brick d.storage:/storage/bricks/test/brick  49158     0          Y       14983
Brick e.storage:/storage/bricks/test/brick  49156     0          Y       22585
Brick f.storage:/storage/bricks/test/brick  49155     0          Y       2397
NFS Server on localhost                     2049      0          Y       49084
Self-heal Daemon on localhost               N/A       N/A        Y       49092
NFS Server on b.storage                     2049      0          Y       20138
Self-heal Daemon on b.storage               N/A       N/A        Y       20146
NFS Server on f.storage                     2049      0          Y       37158
Self-heal Daemon on f.storage               N/A       N/A        Y       37180
NFS Server on a.storage                     2049      0          Y       35744
Self-heal Daemon on a.storage               N/A       N/A        Y       35749
NFS Server on c.storage                     2049      0          Y       35479
Self-heal Daemon on c.storage               N/A       N/A        Y       35485
NFS Server on e.storage                     2049      0          Y       8512
Self-heal Daemon on e.storage               N/A       N/A        Y       8520
Task Status of Volume test
------------------------------------------------------------------------------
There are no active volume tasks
--- Additional comment from Ravishankar N on 2015-12-02 03:35:36 EST ---
Looks like a regression introduced by http://review.gluster.org/#/c/12344/.
I'll send a fix for this particular error in itself, but it might be worth
noting that heal full does not work as expected in all scenarios (see BZ
1112158). The idea, however, is to eventually eliminate the need for heal full,
because the 'replace-brick' and 'add-brick' use cases will automatically
trigger heals. See comments 2 and 3 in BZ 1112158.
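One way a message of the form "Couldn't get xlator xl-0" can arise is when a
daemon-side handler expects per-xlator entries keyed "xl-0", "xl-1", ... in the
request dictionary sent by glusterd and the very first lookup fails, so the
whole heal-full operation is aborted on that node. The sketch below models only
that lookup pattern; the dict type, helper functions and key layout are
assumptions made for illustration, not the actual GlusterFS implementation.

/* Hypothetical sketch: a brick-op request "dict" that should carry
 * "xl-0" .. "xl-<count-1>" entries naming the xlators to act on.
 * All types and helpers below are invented for illustration. */
#include <stdio.h>
#include <string.h>

struct kv { const char *key; const char *val; };

/* toy dictionary lookup: returns the value for key, or NULL if absent */
static const char *dict_lookup(const struct kv *dict, int n, const char *key)
{
        for (int i = 0; i < n; i++)
                if (strcmp(dict[i].key, key) == 0)
                        return dict[i].val;
        return NULL;
}

static int handle_translator_op(const struct kv *dict, int n, int xl_count)
{
        char key[32];

        for (int i = 0; i < xl_count; i++) {
                snprintf(key, sizeof(key), "xl-%d", i);
                const char *xname = dict_lookup(dict, n, key);
                if (!xname) {
                        /* the key itself is missing from the dict, so no
                         * xlator can be resolved and the op is failed;
                         * this mirrors the shape of the glustershd error. */
                        fprintf(stderr, "Couldn't get xlator %s\n", key);
                        return -1;
                }
                printf("would trigger full heal via xlator %s\n", xname);
        }
        return 0;
}

int main(void)
{
        /* a request dict as a buggy originator might send it: the
         * expected xl-N entries are simply not there. */
        struct kv dict[] = { { "volname", "test" } };

        return handle_translator_op(dict, 1, /* xl_count */ 1) ? 1 : 0;
}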
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1284863
[Bug 1284863] Full heal of volume fails on some nodes "Commit failed on X",
and glustershd logs "Couldn't get xlator xl-0"