[Gluster-users] Possible SHD stalling

Ravishankar N ravishankar at redhat.com
Fri May 8 15:47:15 UTC 2020


On 08/05/20 7:07 pm, Jaco Kroon wrote:
>
> I'm not sure "stuck" is the right word, but looking at the "statistics 
> heal-count" values it goes into a form of "go slow" mode, and ends up 
> adding more entries for heal, in some cases at a rate of about 2 per 
> second, sometimes 4 at worst (based on ~10-15 second intervals, 
> depending on how long it takes to gather the heal-counts).
>
> My guess currently is that, since the log entries that scroll by 
> relate to metadata and entry heals, these are when it's healing 
> files; when it hits a folder it somehow locks something, then 
> iterates through the folder (very) slowly for some reason, and whilst 
> this is happening, directory listings are outright unusable.
>
> Is there lock profiling code available in glusterfs?
>
Nope, nothing as of today as far as I know. You can have the locks 
xlator log everything by setting the locks.trace option to ON though.
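
Something like this should do it if you want to try (standard 
volume-set syntax; I haven't double-checked the exact option key on 
your version, so treat it as an assumption - and it is very verbose, 
so remember to turn it off again):

    # enable trace logging in the locks xlator (assumed key: locks.trace)
    gluster volume set <VOLNAME> locks.trace on
    # ...and disable it again once you have captured enough
    gluster volume reset <VOLNAME> locks.trace
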
>
> I saw this "issue": https://github.com/gluster/glusterfs/issues/275
>
> And there was some related PR I can't track down now which aims to 
> wrap the pthread mutex_* calls with some other library to time locks 
> etc.  Having looked at that code it made sense, but it could 
> trivially introduce extra latency of a few microseconds even in the 
> optimal case, and at worst could change locking behaviour (all _lock 
> calls first do a _trylock and then re-issue _lock) - although I 
> haven't fully analysed the code change.  One would also need to 
> modify every place where mutexes are used; alternatively LD_PRELOAD 
> hacks are useful, but then obtaining the code locations where locks 
> are initialised and used becomes much, much harder.  It would, 
> however, make it quite trivial to analyse whether locks are contended 
> on for extended periods of time (more than a few milliseconds) or 
> held for extended times, and I'm guessing backtrace() from execinfo 
> could be helpful here to at least get an idea of which locks are 
> involved.
>
> This looks like the approach used by mutrace: 
> http://0pointer.de/blog/projects/mutrace.html so we might just end up 
> using that.  Ironically enough, if the author of that project added a 
> little more graph-tracking code it could serve as a deadlock 
> detection tool as well.
>
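For what it's worth, mutrace is normally used by simply prefixing the 
command line, so to point it at gluster you would have to start the 
brick (or shd) process in the foreground yourself rather than letting 
glusterd spawn it. A rough, untested sketch, assuming you copy the 
full argument list from the running process:

    # grab the brick's current command line
    ps axww | grep glusterfsd
    # re-run it in the foreground (-N) under mutrace with the same args
    mutrace /usr/sbin/glusterfsd -N <args copied from the ps output>
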
>> I also see from the profile info in your first email that the 
>> XATTROP fop has the highest latency.  Can you do an strace of the 
>> brick processes to see if there is any syscall (setxattr in 
>> particular, for the XATTROP fop) that is taking longer than usual?
>>
> I'm familiar with strace.  What am I looking for?  How will I identify 
> XATTROP fops and their associated system calls in this trace?

XATTROP is a FOP meant for the index xlator. Index serializes this FOP 
and sends a setxattr to the posix xlator, and in the callback performs 
link/unlink operations inside .glusterfs/indices/. So I was wondering 
if the bottleneck is in the setxattr syscall itself (which can be seen 
in the strace, i.e. the time taken for the syscall) or in the 
serialization inside index. The xattr keys will be trusted.afr*, like so:

[pid 517071] fsetxattr(16, "trusted.afr.dirty", "\0\0\0\1\0\0\0\0\0\0\0", 12, 0) = 0 <0.000135>
[pid 517071] unlink("/home/ravi/bricks/brick1/.glusterfs/indices/xattrop/6ea17fe6-9080-4054-ab98-775d37ea143b") = 0 <0.000086>

But anyway, you can check whether there are any syscalls (not just 
setxattr) that take a long time on the drives.
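
Something along these lines should work (untested; adjust the pid - 
-f attaches to all of the brick's threads and -T prints the time spent 
in each syscall, which is where the <0.000135> style numbers above 
come from):

    # attach to a running brick and log every syscall with its duration
    strace -f -T -p <brick-pid> -o /tmp/brick.strace
    # or restrict it to the calls of interest
    strace -f -T -e trace=fsetxattr,setxattr,link,unlink -p <brick-pid>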

>> For the slow ls from the client, an ongoing self-heal should not 
>> affect it, since readdir, stat etc. are all read-only operations 
>> that do not take any locks that can compete with heal, and from what 
>> you say there is no hardware bottleneck at the brick side to process 
>> requests.  Maybe a tcpdump between the client and the server can 
>> help determine whether the slowness is from the server side, by 
>> looking at the request and response times of the FOPs in the dump.
>>
> Well, it does.  I've sampled a few hundred ls's since my email this 
> morning.  All of them were sub 0.5s. 
I guess you need to check, when the heal does happen, *which* FOP is 
slow for the `ls -l` operation. Is it readdirp? stat?
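
One quick way to get a first impression from the client side (only a 
sketch - this shows where the time goes at the syscall level on the 
fuse mount, not which gluster FOP is responsible, and the exact 
syscall names may differ with your libc/kernel):

    # summary of time spent per syscall for a single listing
    strace -c ls -l /path/on/fuse/mount
    # or with per-call timings to spot outliers (readdir vs. stat)
    strace -T -e trace=getdents64,lstat,newfstatat,statx ls -l /path/on/fuse/mount

For the tcpdump approach, capturing everything to/from one of the 
brick hosts is enough to start with; wireshark's gluster dissector (if 
your build has it) can then show the individual FOPs and their 
response times:

    tcpdump -i any -s 0 -w /tmp/gluster-client.pcap host <brick-host>
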
> The only difference is that all three SHDs were killed.  Almost 
> without a doubt in my mind there has to be some unforeseen 
> interaction.  We've been having a few discussions, and a very, very 
> long time ago we effectively actioned a full heal with "find 
> /path/to/fuse-mount -exec stat {} \;" and at other times with "find 
> /path/to/fuse-mount -print0 | xargs -0 -n50 -P5 -- stat", and if we 
> recall correctly we saw similar behaviour there.  We thought at the 
> time that the FUSE process was causing the trouble, and it's entirely 
> possible that that is indeed the case, but we are now wondering 
> whether that too may not be related to the heal process that was 
> happening as a result of the stats (heal on stat).  We generally run 
> with cluster.*-self-heal off nowadays and rely purely on the SHD, for 
> performance reasons.

We've had complaints of client-side heal (data/metadata/entry) stealing 
too much I/O bandwidth, which is why we turned the default values to 
off in AFR.
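
For reference, those are the cluster.data-self-heal, 
cluster.metadata-self-heal and cluster.entry-self-heal volume options, 
i.e. what you are already doing:

    gluster volume set <VOLNAME> cluster.data-self-heal off
    gluster volume set <VOLNAME> cluster.metadata-self-heal off
    gluster volume set <VOLNAME> cluster.entry-self-heal off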

Thanks,
Ravi


