[Bugs] [Bug 1287503] New: Full heal of volume fails on some nodes "Commit failed on X", and glustershd logs "Couldn't get xlator xl-0"

bugzilla at redhat.com
Wed Dec 2 08:36:19 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1287503

            Bug ID: 1287503
           Summary: Full heal of volume fails on some nodes "Commit failed
                    on X", and glustershd logs "Couldn't get xlator xl-0"
           Product: GlusterFS
           Version: mainline
         Component: glusterd
          Keywords: Triaged
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: ravishankar at redhat.com
                CC: bugs at gluster.org, bugs at medgen.ugent.be,
                    gluster-bugs at redhat.com, rkavunga at redhat.com
        Depends On: 1284863



+++ This bug was initially created as a clone of Bug #1284863 +++

Description of problem:
-----------------------
Problems with unsuccessful full heals on all volumes started after upgrading the
6-node cluster from 3.7.2 to 3.7.6 on Ubuntu Trusty (kernel 3.13.0-49-generic).
On a Distributed-Replicate volume named test (vol info below), executing
`gluster volume heal test full` fails and returns different messages/errors
depending on which node the command is executed on (a short reproduction sketch
follows the per-node results):

- When run from node *a*, *d* or *e* the cli tool returns:

> Launching heal operation to perform full self heal on volume test has been unsuccessful

With the following errors/warnings logged on the node where the command was run
(no log entries on the other nodes):

==> /var/log/glusterfs/glustershd.log
> E [glusterfsd-mgmt.c:619:glusterfs_handle_translator_op] 0-glusterfs: Couldn't get xlator xl-0

==> /var/log/glusterfs/cli.log

> I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6
> I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> W [socket.c:588:__socket_rwv] 0-glusterfs: readv on /var/run/gluster/quotad.socket failed (Invalid argument)
> I [cli-rpc-ops.c:8348:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume
> I [input.c:36:cli_batch] 0-: Exiting with: -2

==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
> I [MSGID: 106533] [glusterd-volume-ops.c:861:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume test

- When run from node *b* the cli tool returns:

> Commit failed on d.storage. Please check log file for details.
> Commit failed on e.storage. Please check log file for details.

No errors in any log files on any node at that point in time; only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol"
appear on the other 4 nodes for which no commit-failed message was returned by
the cli.

- When run from node *c* the cli tool returns:

> Commit failed on e.storage. Please check log file for details.
> Commit failed on a.storage. Please check log file for details.

No errors in any log files on any node at that point in time; only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol"
appear on the other 4 nodes for which no commit-failed message was returned by
the cli.

- When run from node *f* the cli tool returns:

> Commit failed on a.storage. Please check log file for details.
> Commit failed on d.storage. Please check log file for details.

No errors in any log files on any node at that point in time; only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol"
appear on the other 4 nodes for which no commit-failed message was returned by
the cli.
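
For reference, a minimal reproduction sketch, assuming shell access to one of
the nodes where the cli reports "unsuccessful" (e.g. node a) and default log
locations:

# Trigger the full heal; on the affected nodes the cli reports
# "Launching heal operation ... has been unsuccessful".
gluster volume heal test full

# The corresponding xlator lookup failure is logged by the self-heal daemon on
# that node (default log path assumed).
grep "Couldn't get xlator" /var/log/glusterfs/glustershd.log

# The self-heal daemons themselves show as online in the volume status output
# included further below.
gluster volume status test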


Additional info:
----------------

**Volume info**
Volume Name: test
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: a.storage:/storage/bricks/test/brick
Brick2: b.storage:/storage/bricks/test/brick
Brick3: c.storage:/storage/bricks/test/brick
Brick4: d.storage:/storage/bricks/test/brick
Brick5: e.storage:/storage/bricks/test/brick
Brick6: f.storage:/storage/bricks/test/brick
Options Reconfigured:
performance.readdir-ahead: on
features.trash: off
nfs.disable: off

**Volume status info**
Status of volume: test
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick a.storage:/storage/bricks/test/brick  49156     0          Y       783  
Brick b.storage:/storage/bricks/test/brick  49160     0          Y       33394
Brick c.storage:/storage/bricks/test/brick  49156     0          Y       545  
Brick d.storage:/storage/bricks/test/brick  49158     0          Y       14983
Brick e.storage:/storage/bricks/test/brick  49156     0          Y       22585
Brick f.storage:/storage/bricks/test/brick  49155     0          Y       2397 
NFS Server on localhost                     2049      0          Y       49084
Self-heal Daemon on localhost               N/A       N/A        Y       49092
NFS Server on b.storage                     2049      0          Y       20138
Self-heal Daemon on b.storage               N/A       N/A        Y       20146
NFS Server on f.storage                     2049      0          Y       37158
Self-heal Daemon on f.storage               N/A       N/A        Y       37180
NFS Server on a.storage                     2049      0          Y       35744
Self-heal Daemon on a.storage               N/A       N/A        Y       35749
NFS Server on c.storage                     2049      0          Y       35479
Self-heal Daemon on c.storage               N/A       N/A        Y       35485
NFS Server on e.storage                     2049      0          Y       8512 
Self-heal Daemon on e.storage               N/A       N/A        Y       8520 

Task Status of Volume test
------------------------------------------------------------------------------
There are no active volume tasks

--- Additional comment from Ravishankar N on 2015-12-02 03:35:36 EST ---

Looks like a regression introduced by http://review.gluster.org/#/c/12344/.
I'll send a fix for this particular error, but it is worth noting that heal full
does not work as expected in all scenarios (see BZ 1112158). The idea is to
eventually eliminate the need for heal full altogether, because the
'replace-brick' and 'add-brick' use cases will automatically trigger heals. See
comments 2 and 3 in BZ 1112158.


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1284863
[Bug 1284863] Full heal of volume fails on some nodes "Commit failed on X",
and glustershd logs "Couldn't get xlator xl-0"