[Bugs] [Bug 1157979] New: Executing volume status for 2X2 dis-rep volume leads to "Failed to aggregate response from node/brick " errors in logs

bugzilla at redhat.com
Tue Oct 28 07:36:16 UTC 2014


https://bugzilla.redhat.com/show_bug.cgi?id=1157979

            Bug ID: 1157979
           Summary: Executing volume status for 2X2 dis-rep volume leads
                    to "Failed to aggregate response from node/brick "
                    errors in logs
           Product: GlusterFS
           Version: mainline
         Component: glusterd
          Severity: high
          Assignee: kaushal at redhat.com
          Reporter: kaushal at redhat.com
                CC: amukherj at redhat.com, bbandari at redhat.com,
                    bugs at gluster.org, gluster-bugs at redhat.com,
                    nlevinki at redhat.com, sasundar at redhat.com,
                    sbhaloth at redhat.com, ssamanta at redhat.com,
                    storage-qa-internal at redhat.com, vbellur at redhat.com
            Blocks: 1123732



+++ This bug was initially created as a clone of Bug #1123732 +++

Description of problem:
**************************************************
Created a 2x2 dis-rep volume, mounted it via CIFS, and created a few
directories and files on the mount point. Performed the volume set operations
required for Samba shares to be mounted via CIFS, then restarted glusterd on
all the nodes and checked volume status.
After executing volume status, the following errors appear in the logs:
***********************************************
volume req for volume newafr
[2014-07-28 05:50:50.986840] E
[glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status]
0-management: Local tasks count (1) and remote tasks count (0) do not match.
Not aggregating tasks status.
[2014-07-28 05:50:50.986893] E
[glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to
aggregate response from  node/brick
[2014-07-28 05:50:50.987082] E
[glusterd-utils.c:10038:glusterd_volume_status_aggregate_tasks_status]
0-management: Local tasks count (1) and remote tasks count (0) do not match.
Not aggregating tasks status.
[2014-07-28 05:50:50.987106] E
[glusterd-syncop.c:1014:_gd_syncop_commit_op_cbk] 0-management: Failed to
aggregate response from  node/brick


How reproducible:
Tried once.

Steps to Reproduce:
1. Create a 2x2 dis-rep volume.
2. Mount it via CIFS.
3. Create a few directories/files on the mount point.
4. Run an arequal checksum.
5. Do a volume set operation on the mounted volume.
6. Restart glusterd (service glusterd restart).
7. Execute gluster vol status.
8. Check the volume logs.
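As a rough shell transcript of the steps above (a sketch only; the CIFS share
name, mount point, and the particular volume set option are assumptions, not
the exact values used in this test):

    # 1. Create and start a 2x2 distributed-replicate volume
    gluster volume create newafr replica 2 \
        srv1:/rhs/brick1/newafr/b1 srv2:/rhs/brick1/newafr/b2 \
        srv1:/rhs/brick1/newafr/b3 srv2:/rhs/brick1/newafr/b4
    gluster volume start newafr

    # 2-3. Mount via CIFS and create a few directories/files
    mount -t cifs //srv1/gluster-newafr /mnt/newafr -o username=root
    mkdir /mnt/newafr/dir{1..5}
    touch /mnt/newafr/dir1/file{1..10}

    # 4. arequal checksum (assumes the arequal-checksum tool is installed)
    arequal-checksum /mnt/newafr

    # 5. Any volume set operation; stat-prefetch is one shown in the
    #    "Options Reconfigured" list below
    gluster volume set newafr performance.stat-prefetch off

    # 6-7. Restart glusterd on every node, then query status
    service glusterd restart
    gluster volume status newafr

    # 8. Check the glusterd log for the aggregation errors
    grep "Failed to aggregate" /var/log/glusterfs/etc-glusterfs-glusterd.vol.log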

Actual results:
*************************************
The glusterd log shows the same "Local tasks count (1) and remote tasks count
(0) do not match" and "Failed to aggregate response from node/brick" errors
quoted in the description above.

Expected results:
Such errors should not appear in the logs when gluster vol status is executed.

Additional info:
*********************************
Volume Name: newafr
Type: Distributed-Replicate
Volume ID: bd60f186-4bb0-49fa-bdd8-521e07e1b728
Status: Started
Snap Volume: no
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: srv1:/rhs/brick1/newafr/b1
Brick2: srv2:/rhs/brick1/newafr/b2
Brick3: srv1:/rhs/brick1/newafr/b3
Brick4: srv2:/rhs/brick1/newafr/b4
Options Reconfigured:
performance.readdir-ahead: on
storage.batch-fsync-delay-usec: 0
server.allow-insecure: on
performance.stat-prefetch: off
auto-delete: disable
snap-max-soft-limit: 90
snap-max-hard-limit: 256

[root at srv2 glusterfs]# gluster vol status newafr
Status of volume: newafr
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick srv1:/rhs/brick1/newafr/b1                49167   Y       2829
Brick srv2:/rhs/brick1/newafr/b2                49165   Y       5690
Brick srv1:/rhs/brick1/newafr/b3                49168   Y       2834
Brick srv2:/rhs/brick1/newafr/b4                49166   Y       5746
NFS Server on localhost                         2049    Y       24418
Self-heal Daemon on localhost                   N/A     Y       24425
NFS Server on srv3                              2049    Y       21568
Self-heal Daemon on srv3                        N/A     Y       21575
NFS Server on srv4                              2049    Y       16899
Self-heal Daemon on srv4                        N/A     Y       16906
NFS Server on srv1                              2049    Y       17658
Self-heal Daemon on srv1                        N/A     Y       17665


--- Additional comment from Atin Mukherjee on 2014-07-28 16:36:19 IST ---

Surabhi,

Can you please attach the sosreports of all the nodes? Did you execute
remove-brick/rebalance or replace-brick in between? This mismatch can be seen
after executing any of these operations.

--Atin

--- Additional comment from surabhi on 2014-07-28 18:06:48 IST ---

For this particular test, remove-brick and rebalance were not executed when
these errors were observed, but several earlier tests had been run that
included remove-brick/rebalance operations.


--- Additional comment from Kaushal on 2014-10-28 13:01:35 IST ---

This issue is caused by peers that do not participate in a rebalance failing
to store the rebalance task. When a rebalance task is started, the task
details are stored in the node_state.info file, but this store was performed
only on the nodes on which a rebalance process was started. On the
non-participating nodes, the task information was not stored and was present
only in memory. This meant the information was lost when glusterd was
restarted, which leads to the error logs above.
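A way to observe this (a sketch; /var/lib/glusterd is the default glusterd
working directory and may differ on your installation) is to compare
node_state.info across peers while a rebalance task is running:

    # On a peer that hosts bricks of the volume, the rebalance task
    # details are persisted in node_state.info:
    cat /var/lib/glusterd/vols/newafr/node_state.info

    # On a peer with no bricks in the volume, the task details are
    # missing from the file, so restarting glusterd there drops the
    # in-memory copy:
    service glusterd restart

    # A subsequent status query then hits the task-count mismatch:
    gluster volume status newafr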

A simple reproducer for this is:
1. Create a 3 node cluster
2. Create a distribute volume with bricks only on 2 of the peers.
3. Start rebalance on the volume.
4. Restart the 3rd peer.
5. Run 'volume status' from either of the first 2 peers.
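In shell terms, the reproducer is roughly (host names and brick paths are
placeholders; assumes the three peers are already probed into one cluster):

    # Volume with bricks on node1 and node2 only
    gluster volume create distvol node1:/bricks/distvol/b1 \
        node2:/bricks/distvol/b2
    gluster volume start distvol
    gluster volume rebalance distvol start

    # Restart glusterd on node3, which has no bricks in distvol
    ssh node3 'service glusterd restart'

    # Running status from node1 or node2 now logs the aggregation errors
    gluster volume status distvol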

This is not really a serious issue as it doesn't affect any operations. But I
will fix it.


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1123732
[Bug 1123732] Executing volume status for 2X2 dis-rep volume leads to
"Failed to aggregate response from node/brick " errors in logs