[Bugs] [Bug 1249124] New: Since 3.6.2: failed to get the 'volume file' from server

bugzilla at redhat.com bugzilla at redhat.com
Fri Jul 31 14:40:17 UTC 2015


https://bugzilla.redhat.com/show_bug.cgi?id=1249124

            Bug ID: 1249124
           Summary: Since 3.6.2: failed to get the 'volume file' from
                    server
           Product: GlusterFS
           Version: 3.6.2
         Component: packaging
          Keywords: Triaged
          Severity: urgent
          Assignee: bugs at gluster.org
          Reporter: kkeithle at redhat.com
                CC: bugs at gluster.org, bugzilla.redhat.com at spider007.net,
                    gluster-bugs at redhat.com, kaushal at redhat.com,
                    kkeithle at redhat.com, kparthas at redhat.com,
                    lmohanty at redhat.com, monotek23 at gmail.com,
                    ndevos at redhat.com, pkarampu at redhat.com,
                    rabhat at redhat.com, redhat.bugs at pointb.co.uk,
                    romeo.r at gmail.com
        Depends On: 1191176



+++ This bug was initially created as a clone of Bug #1191176 +++

Description of problem:

I upgraded to GlusterFS 3.6.2 today, and it didn't work. Downgrading to 3.6.1
fixes the issue.

All clients report the following; this happens on localhost as well:

I [MSGID: 100030] [glusterfsd.c:2018:main] 0-/usr/bin/glusterfs: Started
running /usr/bin/glusterfs version  (args: /usr/bin/glusterfs
--volfile-server=xxx --volfile-id=/xxx /xxx)
E [glusterfsd-mgmt.c:1494:mgmt_getspec_cbk] 0-glusterfs: failed to get the
'volume file' from server
E [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-mgmt: failed to fetch volume file
(key:/xxx)
W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (0), shutting
down
I [fuse-bridge.c:5599:fini] 0-fuse: Unmounting '/shared'.
W [glusterfsd.c:1194:cleanup_and_exit] (--> 0-: received signum (15), shutting
down

The `gluster volume info` output looks fine and all volumes come up. I'd like to
debug this further but am not sure how.

--- Additional comment from  on 2015-02-12 04:54:13 EST ---

Are you missing information? This is a critical bug; glusterfs does not work
after upgrading to 3.6.2; don't you have tests for this? How can I debug this?
Does 3.6.2 work for other people?

--- Additional comment from Richard on 2015-02-16 06:36:29 EST ---

I've kept off the 3.5 and 3.6 versions and stuck with 3.4
(glusterfs-3.4.6-1.el6.x86_64), as that works best for me.

On 3.6.x (any version, including the betas) I can't get my volumes to remount
after cleanly shutting down all servers and rebooting them. I get a constant
error about ACLs in my logs, so I gave up on it.

On 3.5.x I can't peer probe a 2nd node after creating the 1st one, which is
obviously a non-starter for using that.

There are bug fixes that made it into 3.4 that just don't seem to be landing in
the 3.5 or 3.6 releases :-(

--- Additional comment from Lalatendu Mohanty on 2015-02-17 07:15:07 EST ---

Have you upgraded all the nodes and then tried to mount the volume? Can you
please tell us more about the upgrade sequence you followed? Or is it just a
yum update on the gluster nodes?

I think that, as of now, having different versions of glusterfs in a single
gluster cluster is not a supported use case, in case you were trying to do that.

--- Additional comment from Niels de Vos on 2015-02-17 07:42:17 EST ---

This sounds like an issue that Pranith debugged over email for a different
user. It should not happen on RPM based installation, but it could be an issue
in Debian/Ubuntu packaging.

> hey, I found the root cause. I am seeing the following log '[2015-02-09
> 09:48:04.312800] E [glusterd-handshake.c:771:__server_getspec] 0-glusterd:
> Unable to stat FILENAME_REDACTED (No such file or directory)' This means the
> configuration file for the mount is not available in the directory. In 3.6.2
> filenames of the vol-files i.e. configuration files changed. How did you
> upgrade to 3.6.2? Is it rpm based installation or something else?  In any
> case, this is what you need to do: stop the volume. killall the glusterds on
> the machines of the cluster.  run 'glusterd --xlator-option *.upgrade=on -N'
> on all the machines. Then restart glusterds on the machines of the cluster
> and everything should be operational after that.
> 
> Pranith.
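
Spelled out as shell commands, the sequence Pranith describes would look
roughly like this ('myvol' is a placeholder; if glusterd runs under a service
manager, stop and start it through that instead):

# once, from any one node: stop the affected volume
gluster volume stop myvol

# then, on every machine in the cluster:
killall glusterd                             # stop the management daemon
glusterd --xlator-option '*.upgrade=on' -N   # regenerate volfiles; -N stays in
                                             # the foreground and exits when done
glusterd                                     # start the management daemon again

# finally, once, from any one node:
gluster volume start myvol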

--- Additional comment from  on 2015-02-18 03:48:41 EST ---

(In reply to Lalatendu Mohanty from comment #3)
> Have you upgraded all the nodes and then tried to mount the volume? Can you
> please tell us more about the upgrade sequence you followed? Or is it just a
> yum update on the gluster nodes?

I use Archlinux; I've upgraded all nodes, stopped all volumes, rebooted all
machines and started the volume successfully. Mounting simply doesn't work
anymore.

(In reply to Niels de Vos from comment #4)
> This sounds like an issue that Pranith debugged over email for a different
> user. It should not happen on RPM based installation, but it could be an
> issue in Debian/Ubuntu packaging.

You're right; this fixes the issue! Is this command supposed to be run after
each upgrade, or was this in a changelog somewhere?

Thanks!

--- Additional comment from Kaushal on 2015-02-18 04:53:29 EST ---

The names of volfiles on disk were changed for improved RDMA support. This
change was introduced in 3.6.2. For reference, the commit-ids of the changes
are:
 50952cd rdma: Client volfile name change for supporting rdma
 605db00 rdma: Wrong volfile fetch on fuse mounting tcp,rdma volume via rdma

This requires that the volfiles of existing volumes be regenerated, so that
they use the new names. Without the regeneration, glusterd would be looking for
files by the new names on a volfile fetch request but would not find them,
which would lead to a mount failure.
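
For illustration, with a placeholder volume named 'myvol', the client volfile
rename looks like this (the exact set of renamed files may differ per volume
configuration):

# before (3.6.1 and earlier):
/var/lib/glusterd/vols/myvol/myvol-fuse.vol
# after (3.6.2, tcp transport):
/var/lib/glusterd/vols/myvol/myvol.tcp-fuse.vol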

This regeneration is done by running glusterd in upgrade mode, 'glusterd
--xlator-option *.upgrade=on -N'.

RPM upgrades run this command as part of the post-update scriptlet. As we
mainly test on RPMs, we didn't hit the issues you ran into.
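
For illustration, a simplified sketch of what such a scriptlet does (not the
literal glusterfs.spec contents):

%post server
if [ $1 -gt 1 ]; then
    # $1 > 1 means this is an upgrade, not a fresh install:
    # regenerate the volfiles under the new naming scheme
    glusterd --xlator-option '*.upgrade=on' -N >/dev/null 2>&1 || :
fi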

I suggest you open a bug on the Arch Linux package, to have a post-upgrade step
added. I'd be happy to open the bug on your behalf.

I'll also make sure we add a release note stating this change for 3.6.3 at
least.

--- Additional comment from Kaushal on 2015-02-18 05:03:00 EST ---

Changing component to doc, as we need a proper release note.

--- Additional comment from  on 2015-02-18 05:12:40 EST ---

(In reply to Kaushal from comment #6)
> The names of volfiles on disk were changed for improved RDMA support. This
> change was introduced in 3.6.2. For reference, the commit-ids of the changes
> are:
>  50952cd rdma: Client volfile name change for supporting rdma
>  605db00 rdma: Wrong volfile fetch on fuse mounting tcp,rdma volume via rdma

Thanks; since I checked all the commits from 3.6.1 to 3.6.2, I had a hunch it
would be related to this. But I didn't find any mention of the need to manually
upgrade the volume file in the commits or the blog post. Is there another
channel I missed, or should I check the rpm sources for this?

--- Additional comment from  on 2015-02-18 05:16:05 EST ---

(In reply to Kaushal from comment #6)
> RPM upgrades run this command as part of the post-update scriptlet. As we
> mainly test on RPMs, we didn't hit the issues you ran into.
> 
> I suggest you open a bug on the Arch Linux package, to have a post-upgrade
> step added. I'd be happy to open the bug on your behalf.

I have reported this @ https://bugs.archlinux.org/task/43872

--- Additional comment from Raghavendra Bhat on 2015-05-20 07:39:27 EDT ---

This has been addressed in glusterfs-3.6.3.

--- Additional comment from Roman on 2015-07-14 08:05:38 EDT ---

Hi,

3.6.4 is out, but .deb pkgs are still affected.

--- Additional comment from André Bauer on 2015-07-29 14:42:33 EDT ---

Imho this also affects 3.7.3. I could not find anything about the volfile
rename in the postinst script of the deb packages.

The postinst script of the deb packages should get something like this:

#!/bin/bash
VOL_DIR="/var/lib/glusterd/vols"
find "${VOL_DIR}" -iname '*-fuse.vol' | while read -r VOLUME; do
    cp "${VOLUME}" "${VOLUME}.dpkg-pre3.6.2"    # keep a backup of the old volfile
    mv "${VOLUME}" "$(echo "${VOLUME}" | sed -e 's/-fuse\.vol$/.tcp-fuse.vol/')"
done

This is untested, because I'm still on 3.5. I don't know if a glusterd restart
or something else is needed afterwards.
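
A possible alternative, also untested: have the postinst regenerate the
volfiles the way the RPM post-update does, instead of renaming files by hand:

#!/bin/bash
# regenerate all volfiles under the new naming scheme; glusterd exits
# again when the upgrade run completes, so it has to be (re)started afterwards
if [ -x /usr/sbin/glusterd ]; then
    /usr/sbin/glusterd --xlator-option '*.upgrade=on' -N
fi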


Referenced Bugs:

https://bugzilla.redhat.com/show_bug.cgi?id=1191176
[Bug 1191176] Since 3.6.2: failed to get the 'volume file' from server