[Gluster-users] Possible SHD stalling

Jaco Kroon jaco at uls.co.za
Sat May 9 16:33:14 UTC 2020


Hi,

On 2020/05/08 17:47, Ravishankar N wrote:
>
> On 08/05/20 7:07 pm, Jaco Kroon wrote:
>>
>> I'm not sure "stuck" is the right word, but looking at the
>> "statistics heal-count" values it goes into a form of "go slow" mode,
>> and ends up adding more entries for heal in some cases at a rate of
>> about 2 every second, sometimes 4 at worst (based on ~10-15 second
>> intervals depending on how long it takes to gather the heal-counts).
>>
>> My guess currently is that since the log entries that scroll by
>> relates to metadata and entry heals, these are when it's healing
>> files, and when it hits a folder it somehow locks something, then
>> iterates through the folder (very) slowly for some reason, and whilst
>> this is happening, directory listings are outright unusable.
>>
>> Is there lock profiling code available in glusterfs?
>>
> Nope, nothing as of today as far as I know. You can have the locks
> xlator log everything by setting the locks.trace option to ON though.

Might be worthwhile to look at building something in.  In the meantime,
here is my strace analysis of one of the glusterfsd brick processes:

strace -p 2827 -f -T >/tmp/strace.txt 2>&1

This pid represents the brick process for zoidberg:/mnt/gluster/a

Processed output (a b c) means "system call c took between [*b*,
*b*+1) seconds to return, *a* number of times during the sample period".
I've marked the futex calls (used for contended mutex cases) in bold.  The
only system calls that ever took longer than 1.000000s were:  nanosleep,
futex, epoll_wait (hard to say what that maps to in terms of
operations, or whether it's just blocking while waiting for client
requests), select (same comment as for epoll_wait) and restart_syscall
(which I'm betting is safe to assume is continuations of the other
syscalls mentioned here).

      5 30 nanosleep   <-- 5 cases of nanosleep sleeping for 30 seconds.
*     2 27 futex*
      1 27 epoll_wait
      1 23 restart_syscall
      1 21 restart_syscall
*     1 21 futex*
      1 21 epoll_wait
*     2 20 futex*
      2 20 epoll_wait
*     1 18 futex*
      1 18 epoll_wait
      1 16 restart_syscall
*     2 13 futex*
*     1 12 futex*
*    19 11 futex*
      2 11 epoll_wait
      2 10 restart_syscall
*    10 10 futex*
      1 10 epoll_wait
      1 8 restart_syscall
*     4 8 futex*
      1 8 epoll_wait
      2 7 restart_syscall
*     5 6 futex*
      2 6 epoll_wait
     24 5 nanosleep
*     2 5 futex*
      1 4 restart_syscall
*     9 4 futex*
      2 4 epoll_wait
*    19 3 futex*
      4 3 epoll_wait
      1 2 restart_syscall
*    69 2 futex*
     12 2 epoll_wait
    522 1 select
      1 1 restart_syscall
      7 1 nanosleep
*   338 1 futex*
     43 1 epoll_wait
  19630 0 writev
   2467 0 unlink
      2 0 tls=0x7f1338009480,
      1 0 tls=0x7f1318df9480,
     32 0 statfs
   3882 0 stat
      8 0 set_robust_list
      8 0 select
      4 0 rt_sigqueueinfo
     49 0 rt_sigprocmask
     16 0 restart_syscall
  39018 0 readv
  23490 0 readlink
     62 0 pwrite64
    158 0 pread64
    836 0 openat
     87 0 mprotect
      4 0 mkdir
      8 0 madvise
  95834 0 lstat
   3068 0 lsetxattr
     37 0 lseek
  16474 0 llistxattr
   3067 0 linkat
  36345 0 lgetxattr
      4 0 getuid
      4 0 getpid
    111 0 getdents64
 399210 0 futex
    300 0 fstat
     62 0 fsetxattr
      2 0 flistxattr
    291 0 fgetxattr
  19172 0 epoll_wait
  19244 0 epoll_ctl
    843 0 close
      5 0 clone
      2 0 chown
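
For reference, the (a b c) summary is just a post-processing pass over the
raw trace; a rough awk sketch along these lines produces the same kind of
per-second buckets (a sketch, assuming the raw -T output is in /tmp/strace.txt):

awk '/ <[0-9.]+>$/ {
       name = ""
       for (i = 1; i <= NF; i++)                 # first field that looks like "syscall(args"
         if ($i ~ /^[a-z_0-9]+\(/) { name = $i; sub(/\(.*/, "", name); break }
       if (name == "") next                      # skips "<... resumed>" continuation lines
       secs = $NF; gsub(/[<>]/, "", secs)        # "<0.000135>" -> "0.000135"
       bucket[int(secs) " " name]++              # whole-second bucket per syscall
     }
     END { for (k in bucket) print bucket[k], k }' /tmp/strace.txt | sort -k2,2nr -k3,3r

(Calls that strace splits across "<... resumed>" lines are skipped by this
sketch, so long blocking calls would if anything be under-counted.)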

So, looking at just the futex calls:

      2 27 futex
      1 21 futex
      2 20 futex
      1 18 futex
      2 13 futex
      1 12 futex
     19 11 futex
     10 10 futex
      4 8 futex
      5 6 futex
      2 5 futex
      9 4 futex
     19 3 futex
     69 2 futex
    338 1 futex
 399210 0 futex

That still indicates that the VAST MAJORITY of mutex locks were fast.
Performance stats for the same brick (again, I've marked the entries I find
interesting/relevant in bold):

Brick: zoidberg:/mnt/gluster/a
-----------------------------------------
Cumulative Stats:
   Block Size:                  1b+                   4b+                   8b+
 No. of Reads:                    0                     0                     0
No. of Writes:                   23                    10                    74

   Block Size:                 16b+                  32b+                  64b+
 No. of Reads:                    0                   248                     1
No. of Writes:                  155                  1340                    32

   Block Size:                128b+                 256b+                 512b+
 No. of Reads:                    1                    27                    71
No. of Writes:                   63                   779                   186

   Block Size:               1024b+                2048b+                4096b+
 No. of Reads:                   19                    31                    43
No. of Writes:                  387                   797                  1490

   Block Size:               8192b+               16384b+               32768b+
 No. of Reads:                  147                   254                   508
No. of Writes:                41360                  3995                 34916

   Block Size:              65536b+              131072b+
 No. of Reads:                  941                 52418
No. of Writes:                 9285                 18319

 %-latency   Avg-latency   Min-Latency   Max-Latency   No. of calls         Fop
 ---------   -----------   -----------   -----------   ------------        ----
      0.00       0.00 us       0.00 us       0.00 us         627326      FORGET
      0.00       0.00 us       0.00 us       0.00 us         922811     RELEASE
      0.00       0.00 us       0.00 us       0.00 us          35290  RELEASEDIR
      0.00      10.31 us       5.72 us      14.90 us              2         IPC
      0.00      40.70 us      40.70 us      40.70 us              1          LK
      0.00      65.64 us      65.64 us      65.64 us              1        STAT
      0.00      70.26 us      70.26 us      70.26 us              1        READ
      0.00     176.04 us     176.04 us     176.04 us              1        LINK
      0.00      42.11 us      29.10 us      63.81 us             11    FINODELK
      0.00     282.23 us     192.42 us     372.03 us              2    TRUNCATE
      0.00      65.89 us      20.84 us     316.68 us             11       FLUSH
      0.00     211.73 us      25.36 us     411.40 us              4    READDIRP
      0.00      60.37 us      37.28 us     135.28 us             20      STATFS
      0.00     509.01 us     176.30 us    1213.35 us              4      UNLINK
      0.00     414.76 us      51.19 us    2698.67 us             28     OPENDIR
      0.01     349.84 us     115.94 us    1342.20 us             48       WRITE
      0.04      90.44 us      36.50 us     615.58 us           1199    READLINK
      0.06   25088.70 us    7133.79 us   59634.44 us              7      CREATE
      0.12    7024.50 us      32.80 us   57327.60 us             50     READDIR
      0.16     624.73 us      49.54 us   94746.41 us            751        OPEN
*     0.62  167828.35 us     543.79 us  387838.03 us             11    FXATTROP*
      0.65     430.52 us      16.83 us  485714.99 us           4526     INODELK
      0.70     618.44 us      14.05 us  178158.97 us           3401     ENTRYLK
*     2.12   71297.23 us      16.61 us 2286254.75 us             89    GETXATTR*
      2.59    1054.74 us      25.63 us  178938.38 us           7323      LOOKUP
*    92.94   88173.46 us     141.60 us  519829.21 us           3149     XATTROP*
      0.00       0.00 us       0.00 us       0.00 us           2360      UPCALL
      0.00       0.00 us       0.00 us       0.00 us              2     CI_IATT
      0.00       0.00 us       0.00 us       0.00 us              1   CI_UNLINK
      0.00       0.00 us       0.00 us       0.00 us           2358   CI_FORGET

    Duration: 163988 seconds
   Data Read: 6998476545 bytes
Data Written: 5064990911 bytes
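
In case anyone wants to reproduce this view: the above is the per-brick
output of the volume profiler, roughly like this (the volume name is a
placeholder here):

gluster volume profile VOLNAME start   # start profiling on the bricks, if not already running
gluster volume profile VOLNAME info    # dumps the cumulative and interval stats per brick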



>>
>> I saw this "issue": https://github.com/gluster/glusterfs/issues/275
>>
>> And there was some related PR I can't track down now which aims to
>> wrap pthread mutex_* calls with some other library to time locks etc
>> ... having looked at that code it made sense but could trivially
>> introduce extra latency of a few micro-seconds even in the optimal
>> case, and at worst could change lock behaviour (all _lock first does
>> _trylock then re-issues _lock) - although I haven't fully analysed
>> the code change.  And one would need to modify everywhere where
>> mutex's are called, alternatively LD_PRELOAD hacks are useful, but
>> then obtaining code locations where locks are initialised and used
>> becomes much, much harder, but it would make an analysis of whether
>> locks are contended on for extended periods of time (more than a few
>> milliseconds) or held for extended times quite trivial, and I'm
>> guessing backtrace() from execinfo could be helpful here to at least
>> get an idea of which locks are involved.
>>
>> This looks like the approach used by mutrace:
>> http://0pointer.de/blog/projects/mutrace.html so might just end up
>> using that.  Ironically enough if the author of that project just
>> added a little bit more graph tracking code he could turn it into a
>> deadlock detection tool as well.
>>
>>> I also see from the profile info on your first email that XATTROP
>>> fop has the highest latency. Can you do an strace of the brick
>>> processes to see if there is any syscall (setxattr in particular,
>>> for the XATTROP fop) that is taking higher than usual times?
>>>
>> I'm familiar with strace.  What am I looking for?  How will I
>> identify XATTROP fops and their associated system calls in this trace?
>
> XATTROP is a FOP meant for the index xlator. Index serializes this FOP
> and sends a setxattr to posix xlator and in the callback performs some
> link/unlinks operations inside .glusterfs/indices/. So I was wondering
> if the bottleneck is in the setxattr syscall itself (which can be
> found in the strace -i.e. the time taken for the syscall) or because
> of the serialization inside index. The xattr keys will be
> trusted.afr*. like so:
>
> [pid 517071] fsetxattr(16, "trusted.afr.dirty",
> "\0\0\0\1\0\0\0\0\0\0\0", 12, 0) = 0 <0.000135>
> [pid 517071]
> unlink("/home/ravi/bricks/brick1/.glusterfs/indices/xattrop/6ea17fe6-9080-4054-ab98-775d37ea143b")
> = 0 <0.000086>

Ok.  So all fsetxattr calls were sub 1ms (during a time when ls took
90s±5s).

unlink varies wildly: out of 2467 unlink calls, 1729 were sub 1ms.

21 were in [1,5) ms
17 were in [5,10) ms
24 were in [10,20) ms
112 were in [20,50) ms
368 were in [50,100) ms
196 were in [100,200) ms
27 were in [200,300) ms
6 were >= 300 ms, with the longest clocking in at 347ms.

This, in my opinion, does not justify ls stalls of upwards of a minute
for a listing.  Currently it's upwards of 3 minutes and counting.
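
The bucket counts above come from the same sort of post-processing of
/tmp/strace.txt.  To just eyeball the slowest unlink/fsetxattr calls, a quick
sketch like this also works (the duration is the trailing <...> annotation):

grep -E ' (unlink|fsetxattr)\(' /tmp/strace.txt \
  | awk -F'[<>]' 'NF > 2 { print $(NF-1), $0 }' | sort -nr | head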

>
> But anyway you can check if there are any syscalls(not just setxattr)
> that take a long time on the drives.

Nothing on the drives, at least as far as I can tell.  Unless those
epoll and select calls possibly relate, and even then, that was at most 7
epoll_wait calls over 10s, with the rest all shorter, so nothing that in
my opinion can justify this level of badness.

>
>>> For the slow ls from the client, an ongoing selfheal should not
>>> affect it, since readdir, stat etc are all read-only operations that
>>> do not take any locks that can compete with heal, and there is no
>>> hardware bottleneck at the brick side to process requests from what
>>> you say. Maybe a tcp dump between the client and the server can help
>>> find if the slowness is from the server side by looking at the
>>> request and response times of the FOPS in the dump.
>>>
>> Well, it does.  I've sampled a few hundred ls's since my email this
>> morning.  All of them were sub 0.5s.
> I guess you need to check when the heal does happen, *which* FOP is
> slow for the `ls -l` operation. Is it readdirp? stat?

Ok, so I'm assuming strace -T on ls itself should be revealing.  For the
moment ls is OK under strace (clocking in at 3.5s).  There is one
solitary fstat() call in the trace (only copying the relevant part):

openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
<1.800382>
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000019>
getdents64(3, /* 18 entries */, 131072) = 600 <1.346974>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000174>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000239>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000165>
getdents64(3, /* 18 entries */, 131072) = 560 <0.343684>
getdents64(3, /* 18 entries */, 131072) = 568 <0.000272>
getdents64(3, /* 18 entries */, 131072) = 592 <0.000103>
getdents64(3, /* 18 entries */, 131072) = 600 <0.000096>
getdents64(3, /* 18 entries */, 131072) = 584 <0.000166>
getdents64(3, /* 18 entries */, 131072) = 608 <0.000167>
getdents64(3, /* 1 entries */, 131072)  = 40 <0.000160>
getdents64(3, /* 0 entries */, 131072)  = 0 <0.000060>
close(3)                                = 0 <0.000011>

In a second trace, the long calls were:

openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
<1.501978>
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <2.526342>

For some reason I'm unable to get really, really long ls calls right at
the moment.

But in all cases it seems that the openat call takes a long time,
followed by either a long(ish) fstat or a long(ish) getdents64 system call.
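
For anyone wanting to catch these in the act, roughly the recipe is (a
sketch; the directory path is a placeholder for one that stalls):

cd /path/on/fuse/mount                                     # a directory that is slow to list
strace -T -o /tmp/ls.strace ls > /dev/null
awk -F'[<>]' 'NF > 2 && $(NF-1)+0 >= 0.5' /tmp/ls.strace   # show any syscall that took >= 0.5s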

Doing strace against another process that reliably gets tripped up by
this (blocking for up to a minute and a half under current conditions),
I'm also not seeing anything that makes sense.  So I'm starting to think
that in this case it's more like death by a million small cuts.

I will keep on digging, but for now I have to switch the SHD off ... I'll
be creating a separate mount point in a moment and doing a recursive find
. -exec stat {} \; -exec sleep 0.01 \; to see if that helps ...
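
Roughly like this (a sketch; the volume name below is a placeholder):

mkdir -p /mnt/heal-crawl
mount -t glusterfs zoidberg:/VOLNAME /mnt/heal-crawl       # dedicated FUSE mount just for the crawl
cd /mnt/heal-crawl && find . -exec stat {} \; -exec sleep 0.01 \; > /dev/null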

>> The only difference is that all three SHDs were killed.  Almost
>> without a doubt in my mind there has to be some unforeseen
>> interaction. We've been having a few discussions, and a very, very
>> long time ago, we effectively actioned a full heal with "find
>> /path/to/fuse-mount -exec stat {} \;" and other times with "find
>> /path/to/fuse-mount -print0 | xargs -0 -n50 -P5 -- stat" - and if we
>> recall correctly we've seen similar there.  We thought at that time
>> it was the FUSE process causing trouble, and it's entirely possible
>> that that is indeed the case, but we are wondering now whether that
>> too not maybe related to the heal process that was just happening due
>> to the stats (heal on stat).  We generally run with
>> cluster.*-self-heal off nowadays and rather rely purely on the SHD
>> for performance reasons.
>
> We've had complaints of client-side heal (data/metadata/entry)
> stealing too much i/o bandwidth, which is why we turned the default
> values to off in AFR.

I've never had an issue with i/o bandwidth (mbit/s) here, but the
additional latency (ms) on pretty much all filesystem access was really
bad.  I suspect this again relates to how we use it: we're using it as a
general-purpose filesystem, whereas most folks seem to be using it as a
backing store for VM disk images.

I'm not sure how to proceed from here ... If I'm really, really lucky,
the above process should finish in a few days at best, but it could
still take weeks.  What worries me greatly is that I do have an upcoming
rebalance to add another distribute triplet ... and I'm very, very
worried that this will have the same potential impact on client
performance.

Kind Regards,
Jaco
