[Gluster-users] Issues in AFR and self healing

Pablo Schandin pablo.schandin at avature.net
Wed Aug 15 17:37:43 UTC 2018

I found another log that I wasn't aware of in /var/log/glusterfs/brick; 
that is the mount log, and I had confused the log files. In this file I 
see a lot of entries like this one:

[2018-08-15 16:41:19.568477] I [addr.c:55:compare_addr_and_update] 
0-/mnt/brick1/gv1: allowed = "", received addr = ""
[2018-08-15 16:41:19.568527] I [addr.c:55:compare_addr_and_update] 
0-/mnt/brick1/gv1: allowed = "", received addr = ""
[2018-08-15 16:41:19.568547] I [login.c:76:gf_auth] 0-auth/login: 
allowed user names: 7107ccfa-0ba1-4172-aa5a-031568927bf1
[2018-08-15 16:41:19.568564] I [MSGID: 115029] 
[server-handshake.c:793:server_setvolume] 0-gv1-server: accepted client 
(version: 3.1
[2018-08-15 16:41:19.582710] I [MSGID: 115036] 
[server.c:527:server_rpc_notify] 0-gv1-server: disconnecting connection 
[2018-08-15 16:41:19.582830] I [MSGID: 101055] 
[client_t.c:443:gf_client_unref] 0-gv1-server: Shutting down connection 

So I am seeing a lot of disconnections, right? Could this be why the 
self-healing is triggered all the time?
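One quick way to quantify this is to count the connect and disconnect events in the brick log. The sketch below runs against a sample file in the format shown above; in practice you would point LOG at the real brick log (the path under /var/log/glusterfs/bricks/ is an assumption, adjust it to your layout):

```shell
#!/bin/sh
# Count connection/disconnection cycles in a brick log.
# Sketch only: uses inline sample data in the format shown above; point
# LOG at the real log, e.g. /var/log/glusterfs/bricks/mnt-brick1-gv1.log.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[2018-08-15 16:41:19.568564] I [MSGID: 115029] [server-handshake.c:793:server_setvolume] 0-gv1-server: accepted client
[2018-08-15 16:41:19.582710] I [MSGID: 115036] [server.c:527:server_rpc_notify] 0-gv1-server: disconnecting connection
[2018-08-15 16:45:02.100000] I [MSGID: 115029] [server-handshake.c:793:server_setvolume] 0-gv1-server: accepted client
EOF

accepted=$(grep -c 'accepted client' "$LOG")          # successful handshakes
dropped=$(grep -c 'disconnecting connection' "$LOG")  # server-side drops
echo "accepted=$accepted disconnected=$dropped"
rm -f "$LOG"
```

If the two counts are both large and grow together over a short window, clients are reconnecting in a tight loop rather than holding a stable connection.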





On 08/14/2018 09:15 AM, Pablo Schandin wrote:
> Thanks for the info!
> I cannot see any logs in the mount log besides one line every time it 
> rotates
> [2018-08-13 06:25:02.246187] I 
> [glusterfsd-mgmt.c:1821:mgmt_getspec_cbk] 0-glusterfs: No change in 
> volfile,continuing
> But I did find, in the glfsheal-gv1.log of the volumes, some kind of 
> server-client connection that was disconnected and now connects 
> using a different port. The block of log lines for each run is kind 
> of long, so I'm copying it into a pastebin:
> https://pastebin.com/bp06rrsT
> Maybe this has something to do with it?
> Thanks!
> Pablo.
> On 08/11/2018 12:19 AM, Ravishankar N wrote:
>> On 08/10/2018 11:25 PM, Pablo Schandin wrote:
>>> Hello everyone!
>>> I'm having some trouble, but I'm not quite sure with what yet. I'm 
>>> running GlusterFS 3.12.6 on Ubuntu 16.04. I have two servers 
>>> (nodes) in the cluster in replica mode, and each server has two 
>>> bricks. The servers are KVM hosts running several VMs: on each 
>>> server, one brick holds the locally defined VMs and the other brick 
>>> holds the replica from the other server, so it has data but no 
>>> actual writing is done to it except for the replication.
>>>                   Server 1                                Server 2
>>> Volume 1 (gv1):   Brick 1: defined VMs (read/write) --->  Brick 1: replicated qcow2 files
>>> Volume 2 (gv2):   Brick 2: replicated qcow2 files  <---   Brick 2: defined VMs (read/write)
>>> So, the main issue arose when I got a Nagios alarm warning about a 
>>> file listed to be healed, which then disappeared. I came to find 
>>> out that every 5 minutes the self-heal daemon triggers the healing 
>>> and fixes it. But looking at the logs, I have a lot of entries in 
>>> the glustershd.log file like this:
>>> [2018-08-09 14:23:37.689403] I [MSGID: 108026] 
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0: 
>>> Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c. 
>>> sources=[0]  sinks=1
>>> [2018-08-09 14:44:37.933143] I [MSGID: 108026] 
>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0: 
>>> Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f. 
>>> sources=[0]  sinks=1
>>> The qcow2 files are being healed several times a day (up to 30 
>>> times on occasion). As I understand it, this means that a data heal 
>>> occurred on the files with gfids 407b... and 7371..., from source 
>>> to sink. Local server to replica server? Is it OK for the shd to 
>>> heal files in the replicated brick that supposedly has no writing 
>>> on it besides the mirroring? How does that work?
>> In AFR, for writes, there is no notion of local/remote brick. No 
>> matter from which client you write to the volume, it gets sent to 
>> both bricks. i.e. the replication is synchronous and real time.
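Ravi's point, that every client write is fanned out to both bricks synchronously, can be pictured with a toy sketch. This is not Gluster code: two temporary directories stand in for the bricks, and afr_write is a hypothetical helper, not a real CLI:

```shell
#!/bin/sh
# Toy model of AFR's synchronous write fan-out: a "client" write is
# applied to both replica "bricks" before it is considered complete.
# Neither brick is primary; both receive the same bytes in real time.
brick1=$(mktemp -d)
brick2=$(mktemp -d)

afr_write() {  # afr_write <file> <data> -- hypothetical helper for illustration
    printf '%s\n' "$2" | tee "$brick1/$1" > "$brick2/$1"
}

afr_write disk.qcow2 "some VM data"

if cmp -s "$brick1/disk.qcow2" "$brick2/disk.qcow2"; then
    status="in sync"
else
    status="split"
fi
echo "replicas are $status"
rm -rf "$brick1" "$brick2"
```

In this model there is no "source" brick that pushes to a "mirror": the client itself writes both copies, which is why heals on the "replicated" brick are normal whenever a write to either copy fails.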
>>> How does AFR replication work? The file with gfid 7371... is the 
>>> qcow2 root disk of an ownCloud server with 17 GB of data. It does 
>>> not seem to be big enough to be a bottleneck of some sort, I think.
>>> Also, I was investigating the directory tree in 
>>> brick/.glusterfs/indices and I noticed that in both xattrop and 
>>> dirty there is always a file named xattrop-xxxxxx or dirty-xxxxxx. 
>>> I read that the xattrop file is like a parent file, or handle, that 
>>> other files are hardlinked against with gfid names for the shd to 
>>> heal. Is it the same for the ones in the dirty dir?
>> Yes, before the write, the gfid gets captured inside dirty on all 
>> bricks. If the write is successful, it gets removed. In addition, if 
>> the write fails on one brick, the other brick will capture the gfid 
>> inside xattrop.
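The index mechanism Ravi describes can be mimicked with plain hardlinks. This is a toy sketch of the idea only (a real brick keeps gfid-named links under .glusterfs/indices/{xattrop,dirty}, maintained by the brick process, not by hand):

```shell
#!/bin/sh
# Toy model of the dirty index: before a write, the brick records the
# file's gfid as a hardlink under indices/dirty; on a successful write
# the link is removed. A leftover link marks a self-heal candidate.
brick=$(mktemp -d)
mkdir -p "$brick/indices/dirty"
gfid="407bd97b-e76c-4f81-8f59-7dae11507b0c"   # gfid from the log above
echo "vm image bytes" > "$brick/disk.qcow2"

ln "$brick/disk.qcow2" "$brick/indices/dirty/$gfid"   # write begins: mark dirty
pending_before=$(ls "$brick/indices/dirty" | wc -l)

rm "$brick/indices/dirty/$gfid"                       # write succeeded: clear mark
pending_after=$(ls "$brick/indices/dirty" | wc -l)

echo "pending before=$pending_before after=$pending_after"
rm -rf "$brick"
```

Because the entry is a hardlink, it costs no extra data; the self-heal daemon only has to scan the index directory to know which gfids still need attention.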
>>> Any help will be greatly appreciated it. Thanks!
>> If frequent heals are triggered, it could mean there are frequent 
>> network disconnects from the clients to the bricks as writes happen. 
>> You can check the mount logs and see if that is the case and 
>> investigate possible network issues.
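One way to follow this advice is to interleave disconnect events from the brick or mount log with completed heals from glustershd.log and sort them by timestamp; if heals consistently follow disconnects by a few seconds, the network is the likely trigger. The sketch uses inline sample lines taken from the thread; substitute the real log paths:

```shell
#!/bin/sh
# Merge disconnect and self-heal events into one timeline.
# Sample data only: point $brick and $shd at the real logs.
shd=$(mktemp)
brick=$(mktemp)
cat > "$brick" <<'EOF'
[2018-08-09 14:23:31.000000] I [server.c:527:server_rpc_notify] 0-gv1-server: disconnecting connection
EOF
cat > "$shd" <<'EOF'
[2018-08-09 14:23:37.689403] I [MSGID: 108026] [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0: Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c. sources=[0]  sinks=1
EOF

# Keep only "[timestamp] event" and sort chronologically.
events=$( { grep 'disconnecting connection' "$brick" | sed 's/\].*/] disconnect/'
            grep 'Completed data selfheal'   "$shd"  | sed 's/\].*/] selfheal/'
          } | sort )
echo "$events"
rm -f "$shd" "$brick"
```

Here the heal lands six seconds after the disconnect, which is the pattern to look for across a full day of logs.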
>> HTH,
>> Ravi
>>> Pablo.
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-users
