[Bugs] [Bug 1432542] Glusterd crashes when restarted with many volumes

bugzilla at redhat.com bugzilla at redhat.com
Sun Mar 19 03:58:47 UTC 2017


https://bugzilla.redhat.com/show_bug.cgi?id=1432542



--- Comment #17 from Jeff Darcy <jeff at pl.atyp.us> ---
Further investigation shows that all symptoms (including cases where I observed
multiple threads/synctasks executing concurrently even though both were
supposed to be under glusterd's big_lock) can be explained by memory
corruption.  What's happening is that the lock/unlock behavior in attach_brick
allows other operations to overlap.  In particular, if we're
calling attach_brick from glusterd_restart_bricks (as we do when starting up),
an overlapping volume delete will modify the very list we're in the middle of
walking, so we'll either be working on a pointer to a freed brickinfo_t or
we'll follow its list pointers off into space.  Either way, *all sorts* of
crazy and horrible things can happen.

Moving the attach-RPC retries into a separate thread/task really doesn't help,
and introduces its own problems.  What has worked fairly well is forcing volume
deletes to wait if glusterd_restart_bricks is still in progress, but the
approach I'm currently using is neither as complete nor as airtight as it
really needs to be.  I'm still trying to think of a better solution, but at
least now it's clear what the problem is.
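The "make deletes wait" approach can be sketched with a flag and a condition
variable (again illustrative only; the function names below are hypothetical
and this is not the patch under discussion): the restart pass marks itself
in progress, and volume delete blocks until that mark is cleared.

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t guard = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  restart_done = PTHREAD_COND_INITIALIZER;
static bool            restart_in_progress = false;

/* Called when the brick-restart walk begins. */
void restart_bricks_begin(void)
{
    pthread_mutex_lock(&guard);
    restart_in_progress = true;
    pthread_mutex_unlock(&guard);
}

/* Called when the walk (including attach retries) has finished. */
void restart_bricks_end(void)
{
    pthread_mutex_lock(&guard);
    restart_in_progress = false;
    pthread_cond_broadcast(&restart_done);
    pthread_mutex_unlock(&guard);
}

/* Called at the top of volume delete: blocks until no walk is live. */
void wait_for_restart(void)
{
    pthread_mutex_lock(&guard);
    while (restart_in_progress)
        pthread_cond_wait(&restart_done, &guard);
    pthread_mutex_unlock(&guard);
}

/* Small accessor so callers/tests can observe the flag safely. */
bool restart_is_in_progress(void)
{
    pthread_mutex_lock(&guard);
    bool v = restart_in_progress;
    pthread_mutex_unlock(&guard);
    return v;
}
```

As the comment above notes, a simple flag like this is not airtight on its
own: it serializes deletes against one walker, but anything else that frees
bricks would need the same gate.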

BTW, debugging and testing are being severely hampered by constant
disconnections between glusterd and glusterfsd.  This might be related to the
sharp increase in client/server disconnect/reconnect cycles that others have
reported dating back approximately to RHGS 3.1.2, and we really need to figure
out why our RPC got flakier around that time.
