[Bugs] [Bug 1672205] New: [GSS] 'gluster get-state' command fails if volume brick doesn't exist.

bugzilla at redhat.com bugzilla at redhat.com
Mon Feb 4 09:32:44 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1672205

            Bug ID: 1672205
           Summary: [GSS] 'gluster get-state' command fails if volume
                    brick doesn't exist.
           Product: GlusterFS
           Version: mainline
            Status: NEW
         Component: glusterd
          Keywords: Improvement
          Severity: medium
          Priority: medium
          Assignee: bugs at gluster.org
          Reporter: srakonde at redhat.com
        Depends On: 1669970
  Target Milestone: ---
             Group: private
    Classification: Community



Description of problem:

The 'gluster get-state' command fails when any brick of a volume is missing or
has been deleted. Instead of failing, the command output should report the
brick failure.

When any brick of a volume is unavailable or has been removed, the 'gluster
get-state' command fails with the following error:

'Failed to get daemon state. Check glusterd log file for more details'

The requirement is that the 'gluster get-state' command should not fail, and
that it should still report each brick's state in its output.


For example:

cat /var/run/gluster/glusterd_state_XYZ
...
Volume3.name: v02
Volume3.id: c194e70d-6738-4ba3-9502-ec5603aab679
Volume3.type: Distributed-Replicate
...
## HERE #
Volume3.Brick1.port: N/A or 0 or empty? 
Volume3.Brick1.rdma_port: 0
Volume3.Brick1.port_registered: N/A or 0 or empty?
Volume3.Brick1.status: Failed
Volume3.Brick1.spacefree: N/A or 0 or empty?
Volume3.Brick1.spacetotal: N/A or 0 or empty?
...

This situation can happen in production when local storage on a node is
broken, or when using Heketi with Gluster: the volumes are present but their
bricks are missing.

How reproducible:
Always

Version-Release number of selected component (if applicable): RHGS 3.X

Steps to Reproduce:
1. Delete the backing directory of a brick belonging to a volume
2. Run the command 'gluster get-state'
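
For example, on a single-node test setup the failure can be reproduced with
commands along these lines (the volume name and brick path here are
illustrative, not taken from the original report):

# create and start a throwaway single-brick volume
gluster volume create testvol $(hostname):/bricks/testvol/brick1 force
gluster volume start testvol

# simulate a lost brick by deleting its backing directory
rm -rf /bricks/testvol/brick1

# currently fails with 'Failed to get daemon state. ...'
gluster get-state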


Actual results:
The command fails with the following message:

'Failed to get daemon state. Check glusterd log file for more details'


Expected results:

The 'gluster get-state' command should not fail. It should report the faulty
brick's state in the output so that one can easily identify the problem with
the volume, and it should include a message about that faulty brick.


--- Additional comment from Atin Mukherjee on 2019-01-28 15:10:36 IST ---

Root cause:

from glusterd_get_state ()

<snip>
            ret = sys_statvfs(brickinfo->path, &brickstat);                     
            if (ret) {                                                          
                gf_msg(this->name, GF_LOG_ERROR, errno, GD_MSG_FILE_OP_FAILED,  
                       "statfs error: %s ", strerror(errno));                   
                goto out;                                                       
            }                                                                   

            memfree = brickstat.f_bfree * brickstat.f_bsize;                    
            memtotal = brickstat.f_blocks * brickstat.f_bsize;                  

            fprintf(fp, "Volume%d.Brick%d.spacefree: %" PRIu64 "Bytes\n",       
                    count_bkp, count, memfree);                                 
            fprintf(fp, "Volume%d.Brick%d.spacetotal: %" PRIu64 "Bytes\n",      
                    count_bkp, count, memtotal);   

</snip>

A statfs call is made on the brick path of every brick of every volume to
calculate total vs. free space. In this case we shouldn't error out on a
statfs failure; instead we should report spacefree and spacetotal as
unavailable or 0 bytes.
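
A minimal sketch of that idea, based on the snippet above (treating the
statfs failure as non-fatal and falling back to 0 bytes is an assumption of
this sketch, not the final patch):

<snip>
            ret = sys_statvfs(brickinfo->path, &brickstat);
            if (ret) {
                /* Brick path is missing or unreadable: log a warning and
                 * fall back to 0 instead of aborting the whole state dump. */
                gf_msg(this->name, GF_LOG_WARNING, errno, GD_MSG_FILE_OP_FAILED,
                       "statfs error: %s ", strerror(errno));
                memfree = 0;
                memtotal = 0;
            } else {
                memfree = brickstat.f_bfree * brickstat.f_bsize;
                memtotal = brickstat.f_blocks * brickstat.f_bsize;
            }

            fprintf(fp, "Volume%d.Brick%d.spacefree: %" PRIu64 "Bytes\n",
                    count_bkp, count, memfree);
            fprintf(fp, "Volume%d.Brick%d.spacetotal: %" PRIu64 "Bytes\n",
                    count_bkp, count, memtotal);
</snip>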

--- Additional comment from Atin Mukherjee on 2019-02-04 07:59:34 IST ---

We need test coverage to ensure that the get-state command generates output
successfully even if underlying brick(s) of volume(s) in the cluster go bad.

--- Additional comment from sankarshan on 2019-02-04 14:48:30 IST ---

(In reply to Atin Mukherjee from comment #4)
> We need test coverage to ensure that the get-state command generates output
> successfully even if underlying brick(s) of volume(s) in the cluster go bad.

The test coverage flag needs to be set
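
As a rough illustration of such coverage, a regression test along these lines
could be added (a sketch only, assuming the standard .t harness helpers from
include.rc/volume.rc; the volume layout and the deleted brick are made up for
this example):

#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup;

TEST glusterd
TEST pidof glusterd

# single-node distribute volume with three bricks (illustrative layout)
TEST $CLI volume create $V0 $H0:$B0/${V0}{1..3}
TEST $CLI volume start $V0

# simulate a bad brick by removing one brick's backing directory
TEST rm -rf $B0/${V0}1

# get-state must still succeed and produce a state dump
TEST $CLI get-state

cleanup;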

-- 
You are receiving this mail because:
You are the assignee for the bug.

