[Gluster-Maintainers] Release 4.0: Unable to complete rolling upgrade tests
Ravishankar N
ravishankar at redhat.com
Fri Mar 2 05:31:06 UTC 2018
On 03/02/2018 10:11 AM, Ravishankar N wrote:
> + Anoop.
>
> It looks like clients on the old (3.12) nodes are not able to talk to
> the upgraded (4.0) node. I see messages like these on the old clients:
>
> [2018-03-02 03:49:13.483458] W [MSGID: 114007] [client-handshake.c:1197:client_setvolume_cbk] 0-testvol-client-2: failed to find key 'clnt-lk-version' in the options
>
I see this with a 2x1 plain distribute volume as well. The old client
gets ENOTCONN for the upgraded brick:
[2018-03-02 04:58:54.559446] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-testvol-client-1: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.
[2018-03-02 04:58:54.559618] I [MSGID: 114018] [client.c:2285:client_rpc_notify] 0-testvol-client-1: disconnected from testvol-client-1. Client process will keep trying to connect to glusterd until brick's port is available
[2018-03-02 04:58:56.973199] I [rpc-clnt.c:1994:rpc_clnt_reconfig] 0-testvol-client-1: changing port to 49152 (from 0)
[2018-03-02 04:58:56.975844] I [MSGID: 114057] [client-handshake.c:1484:select_server_supported_programs] 0-testvol-client-1: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-03-02 04:58:56.978114] W [MSGID: 114007] [client-handshake.c:1197:client_setvolume_cbk] 0-testvol-client-1: failed to find key 'clnt-lk-version' in the options
[2018-03-02 04:58:46.618036] E [MSGID: 114031] [client-rpc-fops.c:2768:client3_3_opendir_cbk] 0-testvol-client-1: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]
The message "W [MSGID: 114031] [client-rpc-fops.c:2577:client3_3_readdirp_cbk] 0-testvol-client-1: remote operation failed [Transport endpoint is not connected]" repeated 3 times between [2018-03-02 04:58:46.609529] and [2018-03-02 04:58:46.618683]
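For anyone reproducing this, the client-log entries above can be triaged mechanically. A minimal sketch (the log-line pattern is my own approximation of the glusterfs client log format, not an official parser) that groups entries by severity and MSGID, e.g. to spot how often the ENOTCONN errors repeat:

```python
import re

# Rough pattern for a glusterfs client log line:
# [timestamp] LEVEL [MSGID: nnnnnn] [file:line:func] subvol: message
LOG_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<lvl>[EWI])\s+\[MSGID:\s*(?P<msgid>\d+)\]")

def count_by_msgid(lines):
    """Count log entries per (level, MSGID) pair."""
    counts = {}
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            key = (m.group("lvl"), m.group("msgid"))
            counts[key] = counts.get(key, 0) + 1
    return counts

# Sample lines taken from the log excerpt above (truncated messages):
sample = [
    "[2018-03-02 04:58:54.559446] E [MSGID: 114058] [client-handshake.c:1571:client_query_portmap_cbk] 0-testvol-client-1: failed to get the port number for remote subvolume.",
    "[2018-03-02 04:58:56.978114] W [MSGID: 114007] [client-handshake.c:1197:client_setvolume_cbk] 0-testvol-client-1: failed to find key 'clnt-lk-version' in the options",
    "[2018-03-02 04:58:46.618036] E [MSGID: 114031] [client-rpc-fops.c:2768:client3_3_opendir_cbk] 0-testvol-client-1: remote operation failed.",
]
print(count_by_msgid(sample))
```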
Also, mkdir on the old mount fails with EIO, even though it physically
succeeds on both bricks. Can the rpc folks offer a helping hand?
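If the 4.0 server has stopped sending 'clnt-lk-version' in its setvolume reply (which the warning suggests, though I haven't confirmed it against the 4.0 code), the fix on the old-client side would be to treat a missing key as a default rather than a hard failure. A minimal sketch of that tolerant behaviour; all names here are hypothetical illustrations, not the actual glusterfs C code:

```python
DEFAULT_LK_VERSION = 1  # hypothetical default when the server omits the key

def setvolume_cbk(reply_options):
    """Backward-compatible handling of the setvolume reply dict:
    fall back to a default instead of failing when a newer server
    no longer sends 'clnt-lk-version'."""
    try:
        lk_version = int(reply_options["clnt-lk-version"])
    except KeyError:
        # Tolerant path: assume the server dropped lock-recovery support.
        lk_version = DEFAULT_LK_VERSION
    return lk_version

print(setvolume_cbk({}))                         # key absent: default is used
print(setvolume_cbk({"clnt-lk-version": "2"}))   # key present: value is honoured
```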
-Ravi
> Is there something more to be done on BZ 1544366?
>
> -Ravi
> On 03/02/2018 08:44 AM, Ravishankar N wrote:
>>
>> On 03/02/2018 07:26 AM, Shyam Ranganathan wrote:
>>> Hi Pranith/Ravi,
>>>
>>> So, to keep a long story short: after upgrading 1 node in a 3-node 3.13
>>> cluster, self-heal is not able to catch up with the heal backlog. This
>>> is a very simple synthetic test, but the end result is that upgrade
>>> testing is failing.
>>
>> Let me try this now and get back. I had done something similar when
>> testing the FIPS patch, and the rolling upgrade had worked.
>> Thanks,
>> Ravi
>>>
>>> Here are the details,
>>>
>>> - Using
>>> https://hackmd.io/GYIwTADCDsDMCGBaArAUxAY0QFhBAbIgJwCMySIwJmAJvGMBvNEA#
>>> I set up 3 server containers to install 3.13 first, as follows (within
>>> the containers)
>>>
>>> (inside the 3 server containers)
>>> yum -y update; yum -y install centos-release-gluster313; yum install
>>> glusterfs-server; glusterd
>>>
>>> (inside centos-glfs-server1)
>>> gluster peer probe centos-glfs-server2
>>> gluster peer probe centos-glfs-server3
>>> gluster peer status
>>> gluster v create patchy replica 3 centos-glfs-server1:/d/brick1
>>> centos-glfs-server2:/d/brick2 centos-glfs-server3:/d/brick3
>>> centos-glfs-server1:/d/brick4 centos-glfs-server2:/d/brick5
>>> centos-glfs-server3:/d/brick6 force
>>> gluster v start patchy
>>> gluster v status
>>>
>>> Create a client container as per the document above, mount the above
>>> volume, and create 1 file, 1 directory, and a file within that
>>> directory.
>>>
>>> Now we start the upgrade process (as laid out for 3.13 here
>>> http://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_3.13/ ):
>>> - killall glusterfs glusterfsd glusterd
>>> - yum install
>>> http://cbs.centos.org/kojifiles/work/tasks/1548/311548/centos-release-gluster40-0.9-1.el7.centos.x86_64.rpm
>>>
>>> - yum upgrade --enablerepo=centos-gluster40-test glusterfs-server
>>>
>>> < Go back to the client and edit the contents of one of the files and
>>> change the permissions of a directory, so that there are things to heal
>>> when we bring up the newly upgraded server>
>>>
>>> - gluster --version
>>> - glusterd
>>> - gluster v status
>>> - gluster v heal patchy
>>>
>>> The above starts failing as follows:
>>> [root@centos-glfs-server1 /]# gluster v heal patchy
>>> Launching heal operation to perform index self heal on volume patchy has been unsuccessful:
>>> Commit failed on centos-glfs-server2.glfstest20. Please check log file for details.
>>> Commit failed on centos-glfs-server3. Please check log file for details.
>>>
>>> From here, if further files or directories are created from the
>>> client, they just get added to the heal backlog, and heal does not
>>> catch up.
>>>
>>> As is obvious, I cannot proceed, as the upgrade procedure is broken.
>>> The issue itself may not be the self-heal daemon but something around
>>> connections; since the process fails here, I'm looking to you guys to
>>> unblock this as soon as possible, as we are already running a day's
>>> slip in the release.
>>>
>>> Thanks,
>>> Shyam
>>
>