[Gluster-users] [Gluster-devel] Upgrade testing to gluster 6

Atin Mukherjee amukherj at redhat.com
Thu Apr 4 17:43:43 UTC 2019


On Thu, 4 Apr 2019 at 22:10, Darrell Budic <budic at onholyground.com> wrote:

> Just the glusterd.log from each node, right?
>

Yes.


>
> On Apr 4, 2019, at 11:25 AM, Atin Mukherjee <amukherj at redhat.com> wrote:
>
> Darrell,
>
> I fully understand that you can't reproduce it and you don't have the
> bandwidth to test it again, but would you be able to send us the glusterd
> log from all the nodes from when this happened? We would like to go through
> the logs and get back to you. I would particularly like to see if something
> has gone wrong with the transport.socket.listen-port option, but without
> the log files we can't find out anything. Hope you understand.
>
> On Thu, Apr 4, 2019 at 9:27 PM Darrell Budic <budic at onholyground.com>
> wrote:
>
>> I didn’t follow any specific documents, just a generic rolling upgrade,
>> one node at a time. Once the first node didn’t reconnect, I tried to follow
>> the workaround in the bug during the upgrade. The basic procedure was:
>>
>> - take 3 nodes that were initially installed with 3.12.x (I forget which,
>> but a low number) and had been upgraded directly to 5.5 from 3.12.15
>>   - op-version was 50400
>> - on node A:
>>   - yum install centos-release-gluster6
>>   - yum upgrade (this time it was some oVirt cockpit components, gluster,
>> and a lib or two), hit yes
>>   - discover glusterd was dead
>>   - systemctl restart glusterd
>>   - no peer connections, try iptables -F; systemctl restart glusterd, no
>> change
>> - following the workaround in the bug, try iptables -F & restart glusterd
>> on other 2 nodes, no effect
>>   - nodes B & C were still connected to each other and all bricks were
>> fine at this point
>> - try upgrading other 2 nodes and restarting gluster, no effect (iptables
>> still empty)
>>   - lost quorum here, so all bricks went offline
>> - read logs, not finding much, but looked at glusterd.vol and compared to
>> new versions
>> - updated glusterd.vol on A and restarted glusterd
>>   - A doesn’t show any connected peers, but both other nodes show A as
>> connected
>> - update glusterd.vol on B & C, restart glusterd
>>   - all nodes show connected and volumes are active and healing
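>>
>> For reference, a rough sketch of the commands involved per node (package
>> and service names as on my CentOS boxes, so treat this as an illustration
>> of the steps above rather than an exact transcript):
>>
>>   yum install centos-release-gluster6    # add the Gluster 6 repo
>>   yum upgrade                            # pulls gluster, ovirt cockpit bits, libs
>>   systemctl restart glusterd             # glusterd was dead after the upgrade
>>
>>   # workaround attempted when peers stayed disconnected
>>   iptables -F
>>   systemctl restart glusterd
>>
>>   # state checks between steps
>>   gluster peer status
>>   gluster volume status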
>>
>> The only odd thing in my process was that node A did not have any active
>> bricks on it at the time of the upgrade. It doesn’t seem like this mattered
>> since B & C showed the same symptoms between themselves while being
>> upgraded, but I don’t know. The only log entry that referenced anything
>> about peer connections is included below already.
>>
>> Looks like it was related to my glusterd settings, since that’s what
>> fixed it for me. Unfortunately, I don’t have the bandwidth or the systems
>> to test different versions of that specifically, but maybe you guys can on
>> some test resources? Otherwise, I’ve got another cluster (my production
>> one!) that’s midway through the upgrade from 3.12.15 -> 5.5. I paused when
>> I started getting multiple brick processes on the two nodes that had gone
>> to 5.5 already. I think I’m going to jump the last node right to 6 to try
>> and avoid that mess, and it has the same glusterd.vol settings. I’ll try
>> to capture its logs during the upgrade and see if there’s any new info,
>> or if it has the same issues as this group did.
>>
>>   -Darrell
>>
>> On Apr 4, 2019, at 2:54 AM, Sanju Rakonde <srakonde at redhat.com> wrote:
>>
>> We don't hit https://bugzilla.redhat.com/show_bug.cgi?id=1694010 while
>> upgrading to glusterfs-6. We tested it in different setups and concluded
>> that this issue is seen because of something specific to the setup.
>>
>> Regarding the issue you have faced, can you please let us know which
>> documentation you followed for the upgrade? During our testing we didn't
>> hit any such issue, and we would like to understand what went wrong.
>>
>> On Thu, Apr 4, 2019 at 2:08 AM Darrell Budic <budic at onholyground.com>
>> wrote:
>>
>>> Hari-
>>>
>>> I was upgrading my test cluster from 5.5 to 6 and I hit this bug (
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1694010) or something
>>> similar. In my case, the workaround did not work, and I was left with a
>>> gluster that had gone into no-quorum mode and stopped all the bricks.
>>> There wasn’t much in the logs either, but I noticed my
>>> /etc/glusterfs/glusterd.vol files were not the same as the newer versions,
>>> so I updated them, restarted glusterd, and suddenly the updated node showed
>>> as peer-in-cluster again. Once I updated the other nodes the same way, things
>>> started working again. Maybe a place to look?
>>>
>>> My old config (all nodes):
>>> volume management
>>>     type mgmt/glusterd
>>>     option working-directory /var/lib/glusterd
>>>     option transport-type socket
>>>     option transport.socket.keepalive-time 10
>>>     option transport.socket.keepalive-interval 2
>>>     option transport.socket.read-fail-log off
>>>     option ping-timeout 10
>>>     option event-threads 1
>>>     option rpc-auth-allow-insecure on
>>> #   option transport.address-family inet6
>>> #   option base-port 49152
>>> end-volume
>>>
>>> changed to:
>>> volume management
>>>     type mgmt/glusterd
>>>     option working-directory /var/lib/glusterd
>>>     option transport-type socket,rdma
>>>     option transport.socket.keepalive-time 10
>>>     option transport.socket.keepalive-interval 2
>>>     option transport.socket.read-fail-log off
>>>     option transport.socket.listen-port 24007
>>>     option transport.rdma.listen-port 24008
>>>     option ping-timeout 0
>>>     option event-threads 1
>>>     option rpc-auth-allow-insecure on
>>> #   option lock-timer 180
>>> #   option transport.address-family inet6
>>> #   option base-port 49152
>>>     option max-port  60999
>>> end-volume
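>>>
>>> For completeness, the sequence for rolling the updated file out was
>>> essentially the following on each node (a sketch, not an exact transcript;
>>> glusterd only reads glusterd.vol at startup, so a restart is needed for
>>> the change to take effect):
>>>
>>>   # after editing /etc/glusterfs/glusterd.vol
>>>   systemctl restart glusterd
>>>   gluster peer status      # peers should return to "Peer in Cluster (Connected)"
>>>   gluster volume status    # bricks should come back online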
>>>
>>> The only thing I found in the glusterd logs that looked relevant was the
>>> following (repeated for both of the other nodes in this cluster), so I have
>>> no clue why it happened:
>>> [2019-04-03 20:19:16.802638] I [MSGID: 106004]
>>> [glusterd-handler.c:6427:__glusterd_peer_rpc_notify] 0-management: Peer
>>> <ossuary-san> (<0ecbf953-681b-448f-9746-d1c1fe7a0978>), in state <Peer in
>>> Cluster>, has disconnected from glusterd.
>>>
>>>
>>> On Apr 2, 2019, at 4:53 AM, Atin Mukherjee <atin.mukherjee83 at gmail.com>
>>> wrote:
>>>
>>>
>>>
>>> On Mon, 1 Apr 2019 at 10:28, Hari Gowtham <hgowtham at redhat.com> wrote:
>>>
>>>> Comments inline.
>>>>
>>>> On Mon, Apr 1, 2019 at 5:55 AM Sankarshan Mukhopadhyay
>>>> <sankarshan.mukhopadhyay at gmail.com> wrote:
>>>> >
>>>> > Quite a considerable amount of detail here. Thank you!
>>>> >
>>>> > On Fri, Mar 29, 2019 at 11:42 AM Hari Gowtham <hgowtham at redhat.com>
>>>> > wrote:
>>>> > >
>>>> > > Hello Gluster users,
>>>> > >
>>>> > > As you are all aware, glusterfs-6 is out. We would like to inform you
>>>> > > that we have spent a significant amount of time testing glusterfs-6 in
>>>> > > upgrade scenarios. We have done upgrade testing to glusterfs-6 from
>>>> > > various releases such as 3.12, 4.1 and 5.3.
>>>> > >
>>>> > > As glusterfs-6 brings in a lot of changes, we wanted to test those
>>>> > > portions. Xlators (and the respective options to enable/disable them)
>>>> > > have been added and deprecated in glusterfs-6 relative to various
>>>> > > versions [1].
>>>> > >
>>>> > > We had to check the following upgrade scenarios for all such options
>>>> > > identified in [1]:
>>>> > > 1) option never enabled and upgraded
>>>> > > 2) option enabled and then upgraded
>>>> > > 3) option enabled and then disabled and then upgraded
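>>>> > >
>>>> > > As an illustration (the volume name and option below are placeholders),
>>>> > > scenarios 2 and 3 for a generic volume option amount to:
>>>> > >
>>>> > >   gluster volume set <volname> <option> <value>   # enable, then upgrade (scenario 2)
>>>> > >   gluster volume reset <volname> <option>         # disable again before upgrading (scenario 3)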
>>>> > >
>>>> > > We weren't able to manually check all the combinations for all the
>>>> > > options, so the options involving enabling and disabling xlators were
>>>> > > prioritized. Below are the results of the ones tested.
>>>> > >
>>>> > > Never enabled and upgraded:
>>>> > > Checked from 3.12, 4.1 and 5.3 to 6; the upgrade works.
>>>> > >
>>>> > > Enabled and upgraded:
>>>> > > Tested for tier, which is deprecated; this is not a recommended
>>>> > > upgrade. As expected, the volume won't be consumable and will have a
>>>> > > few more issues as well. Tested with the 3.12, 4.1 and 5.3 to 6
>>>> > > upgrades.
>>>> > >
>>>> > > Enabled, then disabled before upgrade:
>>>> > > Tested for tier with 3.12 and the upgrade went fine.
>>>> > >
>>>> > > There is one common issue to note in every upgrade: the node being
>>>> > > upgraded goes into a disconnected state. You have to flush iptables
>>>> > > and then restart glusterd on all nodes to fix this.
>>>> > >
>>>> >
>>>> > Is this something that is written in the upgrade notes? I do not seem
>>>> > to recall; if not, I'll send a PR.
>>>>
>>>> No, this wasn't mentioned in the release notes. PRs are welcome.
>>>>
>>>> >
>>>> > > The testing for enabling new options is still pending. The new
>>>> > > options won't cause as many issues as the deprecated ones, so this
>>>> > > was put at the end of the priority list. It would be nice to get
>>>> > > contributions for this.
>>>> > >
>>>> >
>>>> > Did the range of tests lead to any new issues?
>>>>
>>>> Yes. In the first round of testing we found an issue and had to postpone
>>>> the release of 6 until the fix was made available:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1684029
>>>>
>>>> We then tested it again after this patch was made available and came
>>>> across this:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1694010
>>>
>>>
>>> This isn’t a bug, as we found that the upgrade worked seamlessly in two
>>> different setups. So we have no issues in the upgrade path to the
>>> glusterfs-6 release.
>>>
>>>>
>>>> I have mentioned in the second mail how to get around this situation
>>>> for now until the fix is available.
>>>>
>>>> >
>>>> > > For the disable testing, tier was used as it covers most of the
>>>> > > xlators that were removed. All of these tests were done on a
>>>> > > replica 3 volume.
>>>> > >
>>>> >
>>>> > I'm not sure if the Glusto team is reading this, but it would be
>>>> > pertinent to understand if the approach you have taken can be
>>>> > converted into a form of automated testing pre-release.
>>>>
>>>> I don't have an answer for this; I have CCed Vijay.
>>>> He might have an idea.
>>>>
>>>> >
>>>> > > Note: this is only for upgrade testing of the newly added and removed
>>>> > > xlators. It does not involve the normal tests for the xlators.
>>>> > >
>>>> > > If you have any questions, please feel free to reach out to us.
>>>> > >
>>>> > > [1]
>>>> > > https://docs.google.com/spreadsheets/d/1nh7T5AXaV6kc5KgILOy2pEqjzC3t_R47f1XUXSVFetI/edit?usp=sharing
>>>> > >
>>>> > > Regards,
>>>> > > Hari and Sanju.
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Hari Gowtham.
>>>>
>>> --
>>> --Atin
>>
>>
>>
>> --
>> Thanks,
>> Sanju
>>
>>
>
>
> --
- Atin (atinm)