[Bugs] [Bug 1739884] glusterfsd process crashes with SIGSEGV

bugzilla at redhat.com
Wed Aug 14 21:31:04 UTC 2019


https://bugzilla.redhat.com/show_bug.cgi?id=1739884



--- Comment #3 from Chad Feller <cfeller at gmail.com> ---
Hi, I still have the core dumps, so I will include the output of your commands
as attachments.  

I'm going to attach the cluster configuration as well. Native clients are
mounted with the 'reader-thread-count=4' option.
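
For reference, a native client mount with that option looks roughly like this
(just a sketch; the /mnt/gv0 mount point is an example):

  mount -t glusterfs -o reader-thread-count=4 gluster00:/gv0 /mnt/gv0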

The only additional notes are that we had switched from copper 1GbE to fiber
10GbE about a week before the first crash. At the same time we also added 18
additional disks per node, which would eventually make up two additional
bricks per node.

I've been using custom Ansible playbooks to manage not only the system but
Gluster as well. When I used Ansible's gluster_volume module
(https://docs.ansible.com/ansible/latest/modules/gluster_volume_module.html) to
add both additional brick pairs at the same time, it paired the bricks
incorrectly (a bug?).

Before the incorrect addition, my cluster configuration was as follows:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv

After Ansible incorrectly paired the bricks, the layout was something like this, IIRC:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv
  Brick3: gluster00:/export/brick1/srv
  Brick4: gluster00:/export/brick2/srv
  Brick5: gluster01:/export/brick1/srv
  Brick6: gluster01:/export/brick2/srv

After adding the bricks, I issued a rebalance command (not realizing the
incorrect pairing), but about a minute into the rebalance I noticed that
something was amiss. Once I realized what had happened, I issued:

  gluster volume remove-brick gv0 gluster01:/export/brick2/srv \
    gluster01:/export/brick1/srv gluster00:/export/brick2/srv \
    gluster00:/export/brick1/srv start

After the remove completed, I did a commit to confirm the remove-brick
operation (the command is sketched after the list below). After the commit I
was back to the original configuration:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv
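
(The confirmation was just the same remove-brick command with 'commit' in
place of 'start', roughly:)

  gluster volume remove-brick gv0 gluster01:/export/brick2/srv \
    gluster01:/export/brick1/srv gluster00:/export/brick2/srv \
    gluster00:/export/brick1/srv commit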

While the data was intact, my directory permissions and file ownership were
wiped out due to a bug that may have been related to Bug #1716848. After
correcting the directory permissions and ownership, the cluster ran fine for
several hours, and I had planned to reattempt the brick add (via Ansible) but
with one brick pair at a time so I didn't end up with mismatched brick pairs
again. At the end of the day however, before I was able to re-add the brick
pair, Gluster crashed with the first core dump. It was still in the two brick
setup, as I had not yet re-attempted the brick add. 

(Note: I reformatted the bricks before attempting to re-use them.)

I rebooted the cluster, and upon coming back up, the self heal daemon resync'd
everything. After examining the volume, I was happy with everything so I went
ahead and added a brick pair via Ansible. It worked and everything was paired
correctly. I then added the next pair to Ansible and ran the playbook again.
Again, everything paired correctly. At this point I had the correct brick
setup:

  Brick1: gluster00:/export/brick0/srv
  Brick2: gluster01:/export/brick0/srv
  Brick3: gluster00:/export/brick1/srv
  Brick4: gluster01:/export/brick1/srv
  Brick5: gluster00:/export/brick2/srv
  Brick6: gluster01:/export/brick2/srv
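
For anyone retracing the steps, the CLI equivalent of the one-pair-at-a-time
addition would be roughly this (a sketch, assuming a plain replica 2 volume;
I actually did it through the Ansible module):

  gluster volume add-brick gv0 gluster00:/export/brick1/srv gluster01:/export/brick1/srv
  gluster volume add-brick gv0 gluster00:/export/brick2/srv gluster01:/export/brick2/srv

With replica 2, bricks are paired in the order they are listed, which is why
adding one pair per command keeps each replica pair on different nodes.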

From here I issued a rebalance command and watched. Everything was working fine
for about 10 hours, which was when the second crash happened. That is, the
second crash happened in the middle of a rebalance. After everything came back
up, the self heal daemon did its thing. I examined the volume, saw no issues
and went ahead and started the rebalance again. This time the rebalance ran to
completion (took somewhere between 1-2 days). I've had zero crashes since then.
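
The rebalance was started and monitored with the standard commands, roughly:

  gluster volume rebalance gv0 start
  gluster volume rebalance gv0 status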

I'm not sure whether any particular access pattern caused this, but the timing
around the administrative work is interesting, which is why I covered it in
such detail above.

I should also note that I'm using both the munin-gluster
(https://github.com/burner1024/munin-gluster) and gluster-prometheus
(https://github.com/gluster/gluster-prometheus) plugins on the nodes for
monitoring (although Munin is legacy at this point and will go away once
Prometheus is fully built out).
