[Bugs] [Bug 1284863] New: Full heal of volume fails on some nodes "Commit failed on X", and glustershd logs "Couldn't get xlator xl-0"

bugzilla at redhat.com
Tue Nov 24 11:10:20 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1284863

            Bug ID: 1284863
           Summary: Full heal of volume fails on some nodes "Commit failed
                    on X", and glustershd logs "Couldn't get xlator xl-0"
           Product: GlusterFS
           Version: 3.7.6
         Component: glusterd
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: bugs at medgen.ugent.be
                CC: bugs at gluster.org, gluster-bugs at redhat.com



Description of problem:
-----------------------
Full heal fails on all volumes since upgrading the 6-node cluster from 3.7.2 to
3.7.6 on Ubuntu Trusty (kernel 3.13.0-49-generic).
On a Distributed-Replicate volume named test (volume info below), executing
`gluster volume heal test full` is unsuccessful and returns different
messages/errors depending on which node the command is executed on:

- When run from node *a*, *d* or *e*, the cli tool returns:

> Launching heal operation to perform full self heal on volume test has been unsuccessful

The following errors/warnings appear on the node the command is run on (no log
entries on the other nodes):

> E [glusterfsd-mgmt.c:619:glusterfs_handle_translator_op] 0-glusterfs: Couldn't get xlator xl-0

==> /var/log/glusterfs/cli.log

> I [cli.c:721:main] 0-cli: Started running gluster with version 3.7.6
> I [MSGID: 101190] [event-epoll.c:632:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
> W [socket.c:588:__socket_rwv] 0-glusterfs: readv on /var/run/gluster/quotad.socket failed (Invalid argument)
> I [cli-rpc-ops.c:8348:gf_cli_heal_volume_cbk] 0-cli: Received resp to heal volume
> I [input.c:36:cli_batch] 0-: Exiting with: -2

==> /var/log/glusterfs/etc-glusterfs-glusterd.vol.log
> I [MSGID: 106533] [glusterd-volume-ops.c:861:__glusterd_handle_cli_heal_volume] 0-management: Received heal vol req for volume test

- When run from node *b* the cli tool returns:

> Commit failed on d.storage. Please check log file for details.
> Commit failed on e.storage. Please check log file for details.

No errors in any log file on any node at that point in time; only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol"
appear on the other 4 nodes for which no commit-failed message was returned by
the cli.

- When run from node *c* the cli tool returns:

> Commit failed on e.storage. Please check log file for details.
> Commit failed on a.storage. Please check log file for details.

No errors in any log file on any node at that point in time; only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol"
appear on the other 4 nodes for which no commit-failed message was returned by
the cli.

- When run from node *f* the cli tool returns:

> Commit failed on a.storage. Please check log file for details.
> Commit failed on d.storage. Please check log file for details.

No errors in any log file on any node at that point in time; only the info
messages "starting full sweep on subvol" and "finished full sweep on subvol"
appear on the other 4 nodes for which no commit-failed message was returned by
the cli.
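
For reference, a minimal reproduction sketch of the above. The volume name and
node hostnames are the ones from the volume info below; the quoted outputs are
summarized from the sessions above:

  # Run on any of the 6 peers (same volume, same command on every node)
  gluster volume heal test full
  # Nodes a, d, e return:
  #   Launching heal operation to perform full self heal on volume test has been unsuccessful
  # Nodes b, c, f return two lines of the form:
  #   Commit failed on <peer>. Please check log file for details.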


Additional info:
----------------

**Volume info**
Volume Name: test
Type: Distributed-Replicate
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: a.storage:/storage/bricks/test/brick
Brick2: b.storage:/storage/bricks/test/brick
Brick3: c.storage:/storage/bricks/test/brick
Brick4: d.storage:/storage/bricks/test/brick
Brick5: e.storage:/storage/bricks/test/brick
Brick6: f.storage:/storage/bricks/test/brick
Options Reconfigured:
performance.readdir-ahead: on
features.trash: off
nfs.disable: off
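
For completeness, a sketch of how an equivalent 3 x 2 volume could be created
(brick order and replica count follow the info above; the commands originally
used to create this volume are an assumption, not taken from this report):

  # Replica pairs are consecutive bricks: (a,b), (c,d), (e,f)
  gluster volume create test replica 2 \
      a.storage:/storage/bricks/test/brick b.storage:/storage/bricks/test/brick \
      c.storage:/storage/bricks/test/brick d.storage:/storage/bricks/test/brick \
      e.storage:/storage/bricks/test/brick f.storage:/storage/bricks/test/brick
  gluster volume start test
  # Reconfigured options listed above can then be applied, e.g.:
  gluster volume set test features.trash off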

**Volume status info**
Status of volume: test
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick a.storage:/storage/bricks/test/brick  49156     0          Y       783  
Brick b.storage:/storage/bricks/test/brick  49160     0          Y       33394
Brick c.storage:/storage/bricks/test/brick  49156     0          Y       545  
Brick d.storage:/storage/bricks/test/brick  49158     0          Y       14983
Brick e.storage:/storage/bricks/test/brick  49156     0          Y       22585
Brick f.storage:/storage/bricks/test/brick  49155     0          Y       2397 
NFS Server on localhost                     2049      0          Y       49084
Self-heal Daemon on localhost               N/A       N/A        Y       49092
NFS Server on b.storage                     2049      0          Y       20138
Self-heal Daemon on b.storage               N/A       N/A        Y       20146
NFS Server on f.storage                     2049      0          Y       37158
Self-heal Daemon on f.storage               N/A       N/A        Y       37180
NFS Server on a.storage                     2049      0          Y       35744
Self-heal Daemon on a.storage               N/A       N/A        Y       35749
NFS Server on c.storage                     2049      0          Y       35479
Self-heal Daemon on c.storage               N/A       N/A        Y       35485
NFS Server on e.storage                     2049      0          Y       8512 
Self-heal Daemon on e.storage               N/A       N/A        Y       8520 

Task Status of Volume test
------------------------------------------------------------------------------
There are no active volume tasks
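
A quick way to look for the xlator error on the node where the heal command was
issued (the glustershd log path given here is the default location and is an
assumption; only cli.log and etc-glusterfs-glusterd.vol.log are quoted above):

  grep "Couldn't get xlator" /var/log/glusterfs/glustershd.log \
      /var/log/glusterfs/etc-glusterfs-glusterd.vol.log /var/log/glusterfs/cli.log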
