[Gluster-infra] Postmortem for Gerrit Upgrade
joe at julianfamily.org
Tue Jun 7 05:34:47 UTC 2016
On 06/06/2016 06:43 PM, Nigel Babu wrote:
> Hello folks,
> Here's a postmortem for the Gerrit migration issues.
> # Timeline of Events
> May 25 - Test migration to PostgreSQL
> May 27 - Migration to PostgreSQL on production
> May 30 - Staging with Gerrit 2.12.2 available for testing
> Jun 01 - Gerrit upgrade complete (0310)
> Jun 01 - First notification of login issues (0628)
> Jun 03 - Fix applied on test server and email sent out to affected users to test.
> Jun 06 - Fix applied on production server
> # Problems
> Over the years, Gerrit has changed how it handles user accounts. In Gerrit 2.9,
> the Github plugin allowed users to sign up and then set their username. As we
> upgraded to 2.12, we discovered that your username defaults to your Github
> username. In our instance, this affects a small subset of people. Additionally,
> very few people who were affected by this bug actually tested out the staging
> instance (only one person ran into the bug). Even in production, only those
> users who signed out of review.gluster.org after the upgrade were actually
> affected. Quite a few affected users did not realize there was a problem
> because they didn't log out. By the time the issues were reported, we were a
> few hours into our upgrade and rollback wasn't an option.
> # Solution
> The first preference was given to checking if we had an easy fix. I looked at a
> different plugin for Github authentication. This plugin claims to allow
> users to map different external identities onto one Gerrit user. I timeboxed
> this testing down to a few hours and I found that it wasn't working during my
> limited testing. Now that quick fixes were eliminated, I spent some time
> diagnosing the issue in detail. In the meantime, I reached out to the Gerrit
> mailing list for help. From conversations with Michael and Raghavendra, I
> learned that we've run into problems like this before and Justin has fixed
> them. I reached out to Justin as well for help.
> By the next morning (Jun 2), I had a good idea of what was wrong and a few
> ideas on how to fix it. Justin had gotten back to me as well, so I had more
> information to confirm my diagnosis. People whose Github username matched
> their gerrit username had no issues. Some people had a completely different
> Github username from their gerrit username. And some people had multiple
> usernames against the same account_id (one of them matching their Github
> account). The older version of Gerrit + Github plugin seemed to tolerate both
> of these situations. The newer version was less forgiving about this
> inconsistency. When I removed the entry in accounts_external_ids which
> corresponded to gerrit:<github-username>, on next login, a new account would be
> created for those who had issues. This was the safest method of all. However,
> this had the side effect that the new user would have none of the history of
> the old one. I tried renaming the username instead; the side effects were
> unknown, but it also seemed to work. This meant that users would retain their
> history, but their first git push/pull would fail until they changed the clone
> path. I checked with the Gerrit mailing list about side effects of renaming
> usernames. There are side effects, but they don't affect our particular use
> case, so we were free to do so.
> On Friday (Jun 3), I wrote a SQL script to update everyone's accounts to a
> consistent state. If you had a different username in gerrit from Github, your
> gerrit username would be changed to match your Github account. If you had
> multiple usernames, only the one matching the Github username would be kept. I
> ran it on the test server and emailed everyone affected, asking them to test
> logging in and doing reviews. Huge thanks to Niels, Prashanth, Jiffin, and
> others for testing the instance and reporting issues they came across.
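
The two rules described above can be sketched in Python. This is only an illustration of the decision logic, not the SQL script that was actually run: the function name, the action labels, and the idea of passing usernames in as plain lists are all assumptions, and the real data lives in Gerrit's accounts_external_ids table. Treating an account with several usernames, none of which match Github, as a rename is also an assumption, since the mail does not cover that case.

```python
def plan_username_fix(gerrit_usernames, github_username):
    """Classify one account and list the Gerrit usernames that would go away.

    Hypothetical helper: returns (action, usernames_to_remove).
    """
    # Case 1: the single Gerrit username already matches Github.
    if gerrit_usernames == [github_username]:
        return ("unchanged", [])

    # Case 2: several usernames, one of which matches Github.
    # Only the matching one is kept.
    if github_username in gerrit_usernames:
        extras = [name for name in gerrit_usernames if name != github_username]
        return ("drop_extras", extras)

    # Case 3: nothing matches, so the Gerrit username is changed
    # to the Github one (assumed to apply to every non-matching name).
    return ("rename", list(gerrit_usernames))


# Made-up accounts: account_id -> (Gerrit usernames, Github username).
accounts = {
    1001: (["alice"], "alice"),
    1002: (["bob-old"], "bob"),
    1003: (["carol", "carol-gh"], "carol-gh"),
}
for account_id, (names, github) in accounts.items():
    print(account_id, plan_username_fix(names, github))
```

Running a dry classification like this before touching the database gives a list of who to email, which is the same group that would later need to update their clone paths.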
> On Monday (Jun 6), I backed up the database and ran this script in production.
> A few people had issues pushing/pulling, but everyone has now figured out the
> changes they need to make in their .git/config to get things working.
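
For anyone still stuck, the .git/config change amounts to swapping the username inside the remote URL. A minimal Python sketch follows, assuming the remote uses an ssh:// style URL; the helper name and the example URLs are made up for illustration and are not taken from anyone's actual config.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_username(remote_url, new_username):
    """Return remote_url with the user part replaced by new_username.

    Hypothetical helper; only handles scheme://user@host[:port]/path URLs.
    """
    parts = urlsplit(remote_url)
    if parts.hostname is None:
        # scp-style remotes (user@host:path) have no netloc to rewrite.
        raise ValueError("expected a URL such as ssh://user@host/path")
    host = parts.hostname
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=f"{new_username}@{host}"))


# Example with made-up usernames.
print(swap_username("ssh://bob-old@review.gluster.org/glusterfs", "bob"))
```

The resulting URL can then be applied with `git remote set-url origin <new-url>`, which edits .git/config for you.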
> # What we Learned
> * Gerrit's ssh-based flush-cache command needs to be used after changing
> anything in the user table.
> * After a Gerrit restart, it takes a while for login to start working again.
> This delay depends on the machine's CPU/RAM and is much shorter on production.
> * We now have a reasonably good understanding of Gerrit's accounts_external_ids
> table.
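
As a reminder of the first point, the flush can be scripted. The sketch below only builds the ssh command line for `gerrit flush-caches --all`; the helper name, host, and admin user are placeholders, and 29418 is Gerrit's default SSH port, which may not match a given installation.

```python
def flush_caches_command(host, user, port=29418):
    """Build the ssh argument list that asks Gerrit to flush all caches.

    Hypothetical helper: host and user are placeholders supplied by the caller.
    """
    return [
        "ssh", "-p", str(port), f"{user}@{host}",
        "gerrit", "flush-caches", "--all",
    ]


# Example with a placeholder host and user; nothing is executed here.
print(" ".join(flush_caches_command("gerrit.example.org", "admin")))
```

Passing the returned list to `subprocess.run(..., check=True)` right after a database change would run the flush and fail loudly if it did not succeed.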
> # What Went Well
> * We had a fix deployed within 3 working days of the issue being reported.
> * We've eliminated the cause, so this particular issue should not recur.
> * This instance is documented very well including the different approaches and
> their outcomes.
> # What Went Badly
> * We did not have documentation of previous Gerrit upgrade issues.
> * We did not test the new version of Gerrit enough and did not allow enough
> time for testing.
> * When issues were noticed, the rollback plans were non-viable. We'd like to be
> in a place where we catch these in staging, or at least soon enough in
> production that we can roll back.
> # Notes for Future
> * Document previous issues and post-mortems. I will be working on creating a
> place for this. This post-mortem and any future ones will be available in a
> public place.
Github issues or bugzilla, imho.
> * Dogfood Gerrit. Most of the code other than our project code goes directly
> into Github. I would like for new projects that I maintain to be running and
> reviewed on Gerrit with a replication to Github.
> * Establish an official staging site for gerrit.
> * Establish a week-long testing period before every upgrade with a small team
> of volunteers.
> * Have a small team of developers be around during upgrades, so we can do
> immediate tests of the upgrade.
> Alternative Github authentication plugin: https://github.com/davido/gerrit-oauth-provider
> Gerrit flush-caches documentation: https://gerrit-review.googlesource.com/Documentation/cmd-flush-caches.html