[Gluster-users] Upgrading from 3.6.2-1 to 3.6.2-2 causes "failed to get the 'volume file' from server"
Michael Bushey
michael at realtymogul.com
Wed Feb 25 19:52:02 UTC 2015
On a Debian testing glusterfs cluster, one node of six (web1) was
upgraded from 3.6.2-1 to 3.6.2-2. Everything looks good on the server
side, and gdash looks happy. The problem is that this node is no longer
able to mount the volumes. The server config is managed with Ansible,
so the nodes should be consistent.
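For what it's worth, something like the following should show the
installed package version and the cluster operating version on each
node (this assumes the stock Debian packaging, with glusterd state
under /var/lib/glusterd):

web1# glusterfs --version | head -1
web1# dpkg -l glusterfs-server glusterfs-client glusterfs-common | grep ^ii
web1# grep operating-version /var/lib/glusterd/glusterd.info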
web1# mount -t glusterfs localhost:/site-private /var/www/html/site.example.com/sites/default/private
Mount failed. Please check the log file for more details.
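The same failure can be reproduced in the foreground with debug output
by running the fuse client by hand; /mnt/test here is just a throwaway
mount point, not the real one:

web1# mkdir -p /mnt/test
web1# /usr/sbin/glusterfs --debug --volfile-server=localhost --volfile-id=/site-private /mnt/test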
web1# gluster volume info site-private
Volume Name: site-private
Type: Distributed-Replicate
Volume ID: 53cb154d-7e44-439f-b52c-ca10414327cb
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: web4:/var/gluster/site-private
Brick2: web5:/var/gluster/site-private
Brick3: web3:/var/gluster/site-private
Brick4: webw:/var/gluster/site-private
Options Reconfigured:
nfs.disable: on
auth.allow: 10.*
web1# gluster volume status site-private
Status of volume: site-private
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick web4:/var/gluster/site-private            49152   Y       18544
Brick web5:/var/gluster/site-private            49152   Y       3460
Brick web3:/var/gluster/site-private            49152   Y       1171
Brick webw:/var/gluster/site-private            49152   Y       8954
Self-heal Daemon on localhost                   N/A     Y       1410
Self-heal Daemon on web3                        N/A     Y       6394
Self-heal Daemon on web5                        N/A     Y       3726
Self-heal Daemon on web4                        N/A     Y       18928
Self-heal Daemon on 10.0.0.22                   N/A     Y       3601
Self-heal Daemon on 10.0.0.153                  N/A     Y       23269
Task Status of Volume site-private
------------------------------------------------------------------------------
There are no active volume tasks
10.0.0.22 is web2 and 10.0.0.153 is webw. It's irritating that gluster
intermittently shows IPs in place of some of the hostnames. Is there a
way to fix this inconsistency?
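If I'm reading the peer-probe docs right, re-probing a peer by its
hostname from a node that currently records it by IP is supposed to
replace the IP in the peer info, roughly like this (hostnames per the
mapping above; a sketch, not a tested fix):

web1# gluster peer status
web1# gluster peer probe web2    # 10.0.0.22 - re-probe by name
web1# gluster peer probe webw    # 10.0.0.153 - re-probe by name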
web1# tail -f /var/log/glusterfs/var-www-html-site.example.com-sites-default-private.log
[2015-02-25 00:49:14.294562] I [MSGID: 100030]
[glusterfsd.c:2018:main] 0-/usr/sbin/glusterfs: Started running
/usr/sbin/glusterfs version 3.6.2 (args: /usr/sbin/glusterfs
--volfile-server=localhost --volfile-id=/site-private
/var/www/html/site.example.com/sites/default/private)
[2015-02-25 00:49:14.303008] E
[glusterfsd-mgmt.c:1494:mgmt_getspec_cbk] 0-glusterfs: failed to get
the 'volume file' from server
[2015-02-25 00:49:14.303153] E
[glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-mgmt: failed to fetch
volume file (key:/site-private)
[2015-02-25 00:49:14.303595] W [glusterfsd.c:1194:cleanup_and_exit]
(--> 0-: received signum (0), shutting down
[2015-02-25 00:49:14.303673] I [fuse-bridge.c:5599:fini] 0-fuse:
Unmounting '/var/www/html/site.example.com/sites/default/private'.
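As a cross-check, glusterd on web1 can be asked for the client volfile
directly; if that also fails, the problem is on the glusterd side
rather than in the fuse client. The getspec syntax below is an internal
CLI command, so take it as a best guess, and glusterfs-server is
assumed to be the Debian service name for glusterd:

web1# gluster system:: getspec site-private
web1# service glusterfs-server restart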
These lines appear in
/var/log/glusterfs/etc-glusterfs-glusterd.vol.log about every 5
seconds:
[2015-02-25 01:00:04.312532] W [socket.c:611:__socket_rwv]
0-management: readv on
/var/run/0ecb037a7fd562bf0d7ed973ccd33ed8.socket failed (Invalid
argument)
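In case it helps narrow things down, these should show whether anything
is actually listening on that socket (hash taken from the log line
above; assumes ss and lsof are installed):

web1# ss -xlp | grep 0ecb037a7fd562bf0d7ed973ccd33ed8
web1# lsof /var/run/0ecb037a7fd562bf0d7ed973ccd33ed8.socket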
Thanks in advance for your time/help. :)