[Gluster-users] Problem with glusterd locks on gluster 3.6.1

Atin Mukherjee amukherj at redhat.com
Fri Jun 17 09:59:54 UTC 2016



On 06/17/2016 03:21 PM, B.K.Raghuram wrote:
> Thanks a ton Atin. That fixed the cherry-pick. Will build it and let you
> know how it goes. Does it make sense to try and merge the whole upstream
> glusterfs repo for the 3.6 branch in order to get all the other bug
> fixes? That may bring in many more merge conflicts though..

Yup, I'd not recommend that. Applying your local changes on the latest
version is a much easier option :)
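
For instance, something along these lines would carry your local changes
forward (a rough sketch; the release tag and branch name below are only
placeholders, pick whichever newer version you settle on):

    # fetch the newer upstream tags
    git fetch origin --tags
    # replay only your local ZFS/snapshot commits on top of the newer tag
    git rebase --onto v3.6.9 v3.6.1 your-zfs-branch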

> 
> On Fri, Jun 17, 2016 at 3:07 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
> 
>     I've resolved the merge conflicts and files are attached. Copy these
>     files and follow the instructions from the cherry pick command which
>     failed.
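>
>     In practice that boils down to something like this (a rough sketch;
>     /path/to/attached/ below is just a placeholder for wherever you saved
>     the attached files):
>
>         cp /path/to/attached/mem-types.h libglusterfs/src/mem-types.h
>         cp /path/to/attached/glusterd-utils.c xlators/mgmt/glusterd/src/glusterd-utils.c
>         cp /path/to/attached/glusterd-utils.h xlators/mgmt/glusterd/src/glusterd-utils.h
>         git add libglusterfs/src/mem-types.h \
>                 xlators/mgmt/glusterd/src/glusterd-utils.c \
>                 xlators/mgmt/glusterd/src/glusterd-utils.h
>         git cherry-pick --continue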
> 
>     ~Atin
> 
>     On 06/17/2016 02:55 PM, B.K.Raghuram wrote:
>     >
>     > Thanks Atin, I had three merge conflicts in the third patch.. I've
>     > attached the files with the conflicts. Would any of the intervening
>     > commits be needed as well?
>     >
>     > The conflicts were in :
>     >
>     >     both modified:      libglusterfs/src/mem-types.h
>     >     both modified:      xlators/mgmt/glusterd/src/glusterd-utils.c
>     >     both modified:      xlators/mgmt/glusterd/src/glusterd-utils.h
>     >
>     >
>     > On Fri, Jun 17, 2016 at 2:17 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
>     >
>     >
>     >
>     >     On 06/17/2016 12:44 PM, B.K.Raghuram wrote:
>     >     > Thanks Atin.. I'm not familiar with pulling patches from the review
>     >     > system but will try :)
>     >
>     >     It's not that difficult. Open the gerrit review link, go to the download
>     >     drop-down box at the top right corner, click on it and you will see a
>     >     cherry-pick option; copy that command and run it in the source code repo
>     >     you host. If there are no merge conflicts, it should apply automatically,
>     >     otherwise you'd need to fix them manually.
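>     >
>     >     For example, the command the cherry-pick option copies looks roughly
>     >     like this for the first patch (the exact ref and patch-set number come
>     >     from the download box on each review page, so treat the trailing "/1"
>     >     as a placeholder):
>     >
>     >         git fetch http://review.gluster.org/glusterfs refs/changes/28/9328/1 \
>     >             && git cherry-pick FETCH_HEAD
>     >
>     >     The other two reviews (9393 and 10023) follow the same pattern and
>     >     should be applied in the same order as in my earlier mail.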
>     >
>     >     HTH.
>     >     Atin
>     >
>     >     >
>     >     > On Fri, Jun 17, 2016 at 12:35 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
>     >     >
>     >     >
>     >     >
>     >     >     On 06/16/2016 06:17 PM, Atin Mukherjee wrote:
>     >     >     >
>     >     >     >
>     >     >     > On 06/16/2016 01:32 PM, B.K.Raghuram wrote:
>     >     >     >> Thanks a lot Atin,
>     >     >     >>
>     >     >     >> The problem is that we are using a forked version of 3.6.1 which has
>     >     >     >> been modified to work with ZFS (for snapshots) but we do not have the
>     >     >     >> resources to port that over to the later versions of gluster.
>     >     >     >>
>     >     >     >> Would you know of anyone who would be willing to take this on?!
>     >     >     >
>     >     >     > If you can cherry-pick the patches, apply them to your source, and
>     >     >     > rebuild it, I can point you to the patches, but you'd need to give me
>     >     >     > a day's time as I have some other items on my plate to finish.
>     >     >
>     >     >
>     >     >     Here is the list of patches that need to be applied in the following
>     >     >     order:
>     >     >
>     >     >     http://review.gluster.org/9328
>     >     >     http://review.gluster.org/9393
>     >     >     http://review.gluster.org/10023
>     >     >
>     >     >     >
>     >     >     > ~Atin
>     >     >     >>
>     >     >     >> Regards,
>     >     >     >> -Ram
>     >     >     >>
>     >     >     >> On Thu, Jun 16, 2016 at 11:02 AM, Atin Mukherjee <amukherj at redhat.com> wrote:
>     >     >     >>
>     >     >     >>
>     >     >     >>
>     >     >     >>     On 06/16/2016 10:49 AM, B.K.Raghuram wrote:
>     >     >     >>     >
>     >     >     >>     >
>     >     >     >>     > On Wed, Jun 15, 2016 at 5:01 PM, Atin Mukherjee <amukherj at redhat.com> wrote:
>     >     >     >>     >
>     >     >     >>     >
>     >     >     >>     >
>     >     >     >>     >     On 06/15/2016 04:24 PM, B.K.Raghuram wrote:
>     >     >     >>     >     > Hi,
>     >     >     >>     >     >
>     >     >     >>     >     > We're using gluster 3.6.1 and we periodically find that gluster
>     >     >     >>     >     > commands fail, saying that it could not get the lock on one of the
>     >     >     >>     >     > brick machines. The logs on that machine then say something like:
>     >     >     >>     >     >
>     >     >     >>     >     > [2016-06-15 08:17:03.076119] E [glusterd-op-sm.c:3058:glusterd_op_ac_lock] 0-management: Unable to acquire lock for vol2
>     >     >     >>     >
>     >     >     >>     >     This is a possible case if concurrent volume operations are run. Do
>     >     >     >>     >     you have any script which checks for volume status on an interval
>     >     >     >>     >     from all the nodes? If so, then this is expected behavior.
>     >     >     >>     >
>     >     >     >>     >
>     >     >     >>     > Yes, I do have a couple of scripts that check on volume and quota
>     >     >     >>     > status.. Given this, I do get an "Another transaction is in progress.."
>     >     >     >>     > message, which is ok. The problem is that sometimes I get the volume
>     >     >     >>     > lock held message, which never goes away. This sometimes results in
>     >     >     >>     > glusterd consuming a lot of memory and CPU, and the problem can only
>     >     >     >>     > be fixed with a reboot. The log files are huge so I'm not sure if it's
>     >     >     >>     > ok to attach them to an email.
>     >     >     >>
>     >     >     >>     Ok, so this is known. We have fixed lots of stale lock issues in the
>     >     >     >>     3.7 branch, and some of them, if not all, were also backported to the
>     >     >     >>     3.6 branch. The issue is that you are using 3.6.1, which is quite old.
>     >     >     >>     If you can upgrade to the latest version of 3.7, or at worst of 3.6,
>     >     >     >>     I am confident that this will go away.
>     >     >     >>
>     >     >     >>     ~Atin
>     >     >     >>     >
>     >     >     >>     >     >
>     >     >     >>     >     > After some time, glusterd then seems to give up and die..
>     >     >     >>     >
>     >     >     >>     >     Do you mean glusterd shuts down or segfaults? If so, I am more
>     >     >     >>     >     interested in analyzing this part. Could you provide us the glusterd
>     >     >     >>     >     log and cmd_history log file, along with the core (in case of SEGV),
>     >     >     >>     >     from all the nodes for further analysis?
>     >     >     >>     >
>     >     >     >>     >
>     >     >     >>     > There is no segfault. glusterd just shuts down. As I said above,
>     >     >     >>     > sometimes this happens and sometimes it just continues to hog a lot
>     >     >     >>     > of memory and CPU..
>     >     >     >>     >
>     >     >     >>     >
>     >     >     >>     >     >
>     >     >     >>     >     > Interestingly, I also find the following line in the beginning of
>     >     >     >>     >     > etc-glusterfs-glusterd.vol.log and I don't know if this has any
>     >     >     >>     >     > significance to the issue:
>     >     >     >>     >     >
>     >     >     >>     >     > [2016-06-14 06:48:57.282290] I [glusterd-store.c:2063:glusterd_restore_op_version] 0-management: Detected new install. Setting op-version to maximum : 30600
>     >     >     >>     >     >
>     >     >     >>     >
>     >     >     >>     >
>     >     >     >>     > What does this line signify?
>     >     >     >>
>     >     >     >>
>     >     >
>     >     >
>     >
>     >
> 
> 

