[Bugs] [Bug 1622821] New: Prevent hangs while increasing replica-count/replace-brick for directory hierarchy

bugzilla at redhat.com bugzilla at redhat.com
Tue Aug 28 06:39:08 UTC 2018


https://bugzilla.redhat.com/show_bug.cgi?id=1622821

            Bug ID: 1622821
           Summary: Prevent hangs while increasing
                    replica-count/replace-brick for directory hierarchy
           Product: GlusterFS
           Version: mainline
         Component: replicate
          Assignee: bugs at gluster.org
          Reporter: pkarampu at redhat.com
                CC: bugs at gluster.org



Description of problem:
Problem Statement:

When a brick is replaced or the replica count is increased, access patterns on
the mounts that have to perform lookups on multiple directories to reach a leaf
file/directory can introduce hangs because of name-heal and metadata-heal.

Here is a simulation of an access pattern that shows the problem.

#!/bin/bash
MAX_MOUNTS=40
DEPTH=4
glusterd

# Create a 2x2 distributed-replicate volume and start it.
gluster --mode=script --wignore volume create r2 replica 2 \
    localhost.localdomain:/home/gfs/r2_0 localhost.localdomain:/home/gfs/r2_1 \
    localhost.localdomain:/home/gfs/r2_2 localhost.localdomain:/home/gfs/r2_3
gluster --mode=script volume start r2

# Mount the volume from MAX_MOUNTS clients.
for i in $(seq 1 $MAX_MOUNTS); do
    mkdir /mnt/$i
    mount -t glusterfs localhost.localdomain:/r2 /mnt/$i
done

# Build a directory hierarchy DEPTH levels deep and one file per client in it.
depth_str=""
for i in $(seq 1 $DEPTH); do depth_str="1/$depth_str"; done
mkdir -p /mnt/1/$depth_str
for i in $(seq 1 $MAX_MOUNTS); do touch /mnt/1/$depth_str/$i; done

# Increase the replica count from 2 to 3; the new bricks start out empty.
gluster v add-brick r2 replica 3 localhost.localdomain:/home/gfs/r2_{4,5} force
sleep 5 # for the graph switch to complete
gluster volume profile r2 start
gluster volume profile r2 info clear

# From every mount, create a new file in the same hierarchy at the same time.
for i in $(seq 1 $MAX_MOUNTS); do time touch /mnt/$i/$depth_str/${i}_1 & done
wait
sleep 2
gluster volume profile r2 info incremental

This creates 40 clients that access different files in the same directory
hierarchy, increasing the chances of metadata-heals and name-heals. I modified
the afr code to print log messages when metadata and name heals are launched.
What I observed is that lookups on the directories from different clients all
trigger metadata and name heals, so these lookups get serialized; the last
mount to get the lock needed to perform the heal took more than a second to
acquire the metadata lock and, similarly, more than a second to acquire the
lock for name-heal. This was the worst case I saw after running the script
above multiple times, sometimes enabling/disabling different heals. The more
activity there is on the mount, the more users will find the mount unusable
after a point because of the hangs. It also leads to timeouts with
applications that expect a response within a bounded time, like web servers.
Outputs:
A single 'touch' takes more than 2 seconds:

    real 0m2.042s
    user 0m0.001s
    sys 0m0.002s

    root@localhost - /var/log/glusterfs
    14:42:40 :) ⚡ grep "performing metadata selfheal" mnt-* | awk '{print $12}' | sort | uniq -c
    8 805420ef-fff1-4b0a-9430-b145124a756b
    41 ab2c7d71-735d-4341-b6ce-6158ebab210a
    14 f0314b17-0fe4-45df-ba13-ea42066ce063
    Profile info (%-latency / avg latency / min latency / max latency / no. of calls / fop):
    61.15 106739.19 us 42.00 us 1358447.00 us 278 INODELK

...

    14:20:28 :) ⚡ grep "completed name-heal" mnt-* | awk '{print $8}' | sort | uniq -c
    44 00000000-0000-0000-0000-000000000001/1
    5 1a43321e-5060-4185-9e49-de889f4f5071/1
    19 999feb5b-118c-42ff-973b-0f95c6dfb927/1
    11 aaa924bf-2891-4eba-b923-2ea3877061c4/1
    Profile info (%-latency / avg latency / min latency / max latency / no. of calls / fop):
    54.23 41022.43 us 16.00 us 1239715.00 us 551 ENTRYLK

Potential Solution:
Entry self-heal and data self-heal don't have this problem because lookups
don't wait for them to complete. That is not the case with metadata-heal and
name-heal: lookup waits for them to finish. To address this problem we need a
way of serving lookup without blocking on these heals, except in very rare
situations.

Proposed changes to name-heal:
For 2-way replication we can't disable name-heal: when a filename exists on
one replica and not on the other, we have to find out which of the two bricks
is the source before we know what to respond to the application. But in the
case of replica-3 and replica-2+arbiter, if two bricks have the name linked to
the same gfid and only one brick is missing the entry, we don't need to block
lookup until name-heal completes; we can trigger it in the background.
Similarly, when lookup finds that the name is absent on multiple bricks and
the readables for the parent inode say that only the bricks without the name
are sources, we can move name-heal to the background. The same idea can be
applied to thin-arbiter as well.

Proposed changes to metadata heal:
The only case that requires metadata-heal to happen in the foreground is when
the metadata mismatches across bricks without any pending markers on the inode;
the remaining cases don't need metadata-heal in the foreground.
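Along the same lines, here is a hedged sketch of the metadata-heal decision.
The types and helper (inode_md_state, needs_foreground_metadata_heal) are
hypothetical and only illustrate the rule above; 'blames_others' stands for
the presence of AFR metadata pending markers against another brick.

#include <stdbool.h>
#include <stdio.h>

#define MAX_BRICKS 3

struct inode_md_state {
    unsigned int mode;       /* permissions as seen on this brick          */
    unsigned int uid, gid;   /* ownership as seen on this brick            */
    bool blames_others;      /* does this brick carry metadata pending     */
                             /* markers against any other brick?           */
};

/*
 * Returns true only when lookup must block on metadata self-heal, i.e. the
 * metadata differs between bricks but no pending markers identify a source.
 */
static bool
needs_foreground_metadata_heal(struct inode_md_state m[], int replica_count)
{
    bool mismatch = false, any_pending = false;

    for (int i = 1; i < replica_count; i++)
        if (m[i].mode != m[0].mode ||
            m[i].uid != m[0].uid || m[i].gid != m[0].gid)
            mismatch = true;

    for (int i = 0; i < replica_count; i++)
        if (m[i].blames_others)
            any_pending = true;

    /* Mismatching metadata with no pending markers: the source is
     * ambiguous, so heal before answering the lookup. */
    if (mismatch && !any_pending)
        return true;

    /* Either nothing differs, or the pending markers already say which
     * bricks are sources: reply from the readables, heal in background. */
    return false;
}

int main(void)
{
    /* Typical add-brick/replace-brick case: the old bricks blame the new
     * one, whose metadata is stale -> background heal is enough. */
    struct inode_md_state m[MAX_BRICKS] = {
        { 0755, 0, 0, true  },
        { 0755, 0, 0, true  },
        { 0600, 0, 0, false },
    };

    printf("foreground metadata-heal needed: %s\n",
           needs_foreground_metadata_heal(m, 3) ? "yes" : "no");
    return 0;
}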

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Always
2.
3.

Actual results:


Expected results:


Additional info:

-- 
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.
