[Bugs] [Bug 1392181] New: [GSS]"gluster vol status all clients --xml" get malformed at times, causes gstatus to fail

bugzilla at redhat.com bugzilla at redhat.com
Sat Nov 5 18:08:48 UTC 2016


https://bugzilla.redhat.com/show_bug.cgi?id=1392181

            Bug ID: 1392181
           Summary: [GSS]"gluster vol status all clients --xml" get
                    malformed at times, causes gstatus to fail
           Product: GlusterFS
           Version: 3.7.15
         Component: cli
          Severity: medium
          Assignee: bugs at gluster.org
          Reporter: giuseppe.ragusa at hotmail.com
                CC: amukherj at redhat.com, ashah at redhat.com,
                    bugs at gluster.org, rcyriac at redhat.com,
                    rhinduja at redhat.com, rhs-bugs at redhat.com,
                    rnalakka at redhat.com, sbairagy at redhat.com



+++ This bug was initially created as a clone of Bug #1359619 +++

Description of problem:

Sometimes the gstatus command prints the traceback below instead of its proper
output. The issue is that glusterd hands malformed XML output to the gstatus
script.

# gstatus

Traceback (most recent call last):
  File "/usr/bin/gstatus", line 221, in <module>
    main()
  File "/usr/bin/gstatus", line 135, in main
    cluster.update_state(self_heal_backlog)
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line
638, in update_state
    self.calc_connections()
  File "/usr/lib/python2.7/site-packages/gstatus/libgluster/cluster.py", line
713, in calc_connections
    cmd.run()
  File "/usr/lib/python2.7/site-packages/gstatus/libcommand/glustercmd.py",
line 100, in run
    xmldoc = ETree.fromstring(''.join(self.stdout))
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1301, in XML
    return parser.close()
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1654, in close
    self._raiseerror(v)
  File "/usr/lib64/python2.7/xml/etree/ElementTree.py", line 1506, in
_raiseerror
    raise err
xml.etree.ElementTree.ParseError: no element found: line 1, column 0
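
For reference, "no element found: line 1, column 0" is exactly what ElementTree
raises when it is handed an empty string, which suggests gstatus received no XML
at all from the CLI rather than a partially written document. A minimal
reproduction in Python:

    import xml.etree.ElementTree as ETree

    try:
        # gstatus joins the command's stdout and parses it; an empty stdout
        # reproduces the traceback above.
        ETree.fromstring(''.join([]))
    except ETree.ParseError as err:
        print(err)  # no element found: line 1, column 0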


Version-Release number of selected component (if applicable):

RHGS 3.1.3 on RHEL 7 

How reproducible:

Not always reproducible. 

Steps to Reproduce:

1. Install RHGS 
2. Install gstatus
3. Run gstatus

Actual results:

gstatus gives a traceback at times instead of the cluster health status.

Expected results:

glusterd should provide proper XML output and gstatus should work properly.

Additional info:

--- Additional comment from Atin Mukherjee on 2016-07-25 03:19:25 EDT ---

Given that the customer hasn't captured the XML output when the issue was hit
and it's not reproducible, this will be hard to debug. We'll do a code
walkthrough of the XML generation part to see if anything is buggy here.

--- Additional comment from Atin Mukherjee on 2016-07-26 00:45:49 EDT ---

I see the following error entries in the CLI log.

[2016-07-21 05:58:45.598277] E [cli-rpc-ops.c:7744:gf_cli_status_cbk] 0-cli:
Error outputting to xml
[2016-07-21 05:58:45.598341] E [cli-rpc-ops.c:8036:gf_cli_status_volume_all]
0-cli: status all failed

This indicates that while dumping the XML output some dict_get_* call failed.
The CLI doesn't log these failures, so it's difficult to figure out which key
was missing. The GlusterD log, however, doesn't show any evidence of an error
at the same timestamp. Having said this, IMO there is an issue in the way we
generate the XML output.

<snip>

/* <volume> */
        ret = xmlTextWriterStartElement (local->writer, (xmlChar *)"volume");
        XML_RET_CHECK_AND_GOTO (ret, out);

        ret = dict_get_str (dict, "volname", &volname);
        if (ret)
                goto out;   /* <volume> is left open if this key is missing */
        ret = xmlTextWriterWriteFormatElement (local->writer,
                                               (xmlChar *)"volName", "%s",
                                               volname);


................
................
out:
        gf_log ("cli", GF_LOG_DEBUG, "Returning %d", ret);
        return ret;


From this we can clearly see that if dict_get_str fails we don't close the
<volume> XML tag. Once some other application (gstatus in this case) tries to
parse that output, it may end up with a traceback as well.
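
To illustrate, a consumer that parses such truncated output hits the same
class of error; for example (the element names here simply mirror the CLI's
own output):

    import xml.etree.ElementTree as ETree

    # Output cut short with an unclosed <volume> element, as would be left
    # behind if the error path above is taken before the element is ended.
    ETree.fromstring('<cliOutput><volStatus><volumes><volume>')
    # raises xml.etree.ElementTree.ParseError: no element found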

--- Additional comment from Atin Mukherjee on 2016-09-01 08:20:07 EDT ---

Riyas,

Can you also file a bug on the gstatus component for handling exceptions?
gstatus also needs to handle these exceptions and should not crash.

Thanks,
Atin
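
For illustration, here is a rough sketch (not gstatus's actual code) of the
kind of guard being asked for: check the CLI's exit status and output before
parsing, and report a readable error instead of crashing. The run_gluster_xml
helper name is hypothetical.

    import subprocess
    import sys
    import xml.etree.ElementTree as ETree

    def run_gluster_xml(args):
        # Hypothetical helper: run a gluster CLI command with --xml and
        # return the parsed document, or None if the command failed or
        # produced empty/unparseable output.
        proc = subprocess.Popen(['gluster'] + args + ['--xml'],
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)
        out, err = proc.communicate()
        if proc.returncode != 0 or not out.strip():
            sys.stderr.write('gluster %s failed (rc=%d)\n'
                             % (' '.join(args), proc.returncode))
            return None
        try:
            return ETree.fromstring(out)
        except ETree.ParseError as perr:
            sys.stderr.write('malformed XML from gluster %s: %s\n'
                             % (' '.join(args), perr))
            return None

A caller can then treat a None result as "status unknown" instead of letting
the ParseError propagate and kill the whole report.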

--- Additional comment from Atin Mukherjee on 2016-09-01 13:40:28 EDT ---

Riyas,

Unfortunately this will not help either. After a detailed code walkthrough I
see that the XML output only gets dumped to stdout at xmlTextWriterEndElement,
which doesn't get called in this case, as highlighted in comment 1.

Looking at case 01674704, I see it mentions the command getting timed out,
which is quite possible if the volume has a long list of connected clients.
But I am not able to map the exit code 90 here: in my setup the CLI doesn't
show anything if the command fails, and echo $? shows 1 instead of 90.

I've identified a fix on the CLI side to write out the XML output and will be
sending a patch by tomorrow. However, I'll continue to analyse why we ended up
with exit code 90.

--- Additional comment from Atin Mukherjee on 2016-09-06 00:34:06 EDT ---

http://review.gluster.org/15384 posted upstream for review.

--- Additional comment from Atin Mukherjee on 2016-09-19 09:03:51 EDT ---

Upstream mainline : http://review.gluster.org/15384
Upstream 3.8 : http://review.gluster.org/15428

And the fix is available in rhgs-3.2.0 as part of rebase to GlusterFS 3.8.4.

--- Additional comment from Anil Shah on 2016-10-19 07:32:28 EDT ---

The gluster volume status all clients --xml command gives output in XML format.

[root at rhs-client46 ~]# gluster volume status all clients --xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>0</opErrno>
  <opErrstr/>
  <volStatus>
    <volumes>
      <volume>
        <volName>arvol</volName>
        <nodeCount>7</nodeCount>
        <node>
          <hostname>10.70.36.70</hostname>
          <path>/rhs/brick1/b1</path>
          <peerid>02cff39e-b4c9-435d-85f0-8b9fe3b33193</peerid>
          <status>1</status>
          <port>49152</port>
          <ports>
            <tcp>49152</tcp>
            <rdma>N/A</rdma>
          </ports>
          <pid>25364</pid>
          <clientsStatus>
            <clientCount>10</clientCount>
            <client>
              <hostname>10.70.44.7:1015</hostname>
              <bytesRead>9678485624</bytesRead>
              <bytesWrite>23700708</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:1023</hostname>
              <bytesRead>18708</bytesRead>
              <bytesWrite>16012</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:1021</hostname>
              <bytesRead>1032</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.44.7:984</hostname>
              <bytesRead>25708</bytesRead>
              <bytesWrite>22956</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:1021</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:994</hostname>
              <bytesRead>32308</bytesRead>
              <bytesWrite>28108</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.46:991</hostname>
              <bytesRead>21680</bytesRead>
              <bytesWrite>19412</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.70:974</hostname>
              <bytesRead>49564</bytesRead>
              <bytesWrite>44564</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:1014</hostname>
              <bytesRead>1040</bytesRead>
              <bytesWrite>644</bytesWrite>
            </client>
            <client>
              <hostname>10.70.36.71:985</hostname>
              <bytesRead>46260</bytesRead>
              <bytesWrite>41644</bytesWrite>
            </client>
          </clientsStatus>


gstatus command output 
======================================

[root at rhs-client46 ~]# gstatus -a

     Product: RHGS Server v3.1Update3  Capacity:   3.90 TiB(raw bricks)
      Status: HEALTHY                       38.00 GiB(raw used)
   Glusterfs: 3.8.4                          1.40 TiB(usable from volumes)
  OverCommit: No                Snapshots:   9

   Nodes       :  4/  4          Volumes:   2 Up
   Self Heal   :  4/  2                     0 Up(Degraded)
   Bricks      :  8/  8                     0 Up(Partial)
   Connections :  4/  68                     0 Down

Volume Information
    arvol            UP - 6/6 bricks up - Distributed-Replicate
                     Capacity: (1% used) 13.00 GiB/1.20 TiB (used/total)
                     Snapshots: 9
                     Self Heal:  6/ 6
                     Tasks Active: None
                     Protocols: glusterfs:on  NFS:off  SMB:on
                     Gluster Connectivty: 4 hosts, 60 tcp connections

    testvol          UP - 2/2 bricks up - Replicate
                     Capacity: (0% used) 33.00 MiB/199.00 GiB (used/total)
                     Snapshots: 0
                     Self Heal:  2/ 2
                     Tasks Active: None
                     Protocols: glusterfs:on  NFS:off  SMB:on
                     Gluster Connectivty: 4 hosts, 8 tcp connections


Status Messages
  - Cluster is HEALTHY, all_bricks checks successful


Bug verified on build glusterfs-3.8.4-2.el7rhgs.x86_64

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

