[Gluster-users] Issues in AFR and self healing

Thu Aug 16 04:06:58 UTC 2018

On 08/15/2018 11:07 PM, Pablo Schandin wrote:
>
> I found another log that I wasn't aware of in 
> /var/log/glusterfs/brick, that is the mount log; I had confused the 
> log files. In this file I see a lot of entries like this one:
>
> [2018-08-15 16:41:19.568477] I [addr.c:55:compare_addr_and_update] 
> 0-/mnt/brick1/gv1: allowed = "172.20.36.10", received addr = 
> "172.20.36.11"
> [2018-08-15 16:41:19.568527] I [addr.c:55:compare_addr_and_update] 
> 0-/mnt/brick1/gv1: allowed = "172.20.36.11", received addr = 
> "172.20.36.11"
> [2018-08-15 16:41:19.568547] I [login.c:76:gf_auth] 0-auth/login: 
> allowed user names: 7107ccfa-0ba1-4172-aa5a-031568927bf1
> [2018-08-15 16:41:19.568564] I [MSGID: 115029] 
> [server-handshake.c:793:server_setvolume] 0-gv1-server: accepted 
> client from 
> physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0 
> (version: 3.12.6)
> [2018-08-15 16:41:19.582710] I [MSGID: 115036] 
> [server.c:527:server_rpc_notify] 0-gv1-server: disconnecting 
> connection from 
> physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0
> [2018-08-15 16:41:19.582830] I [MSGID: 101055] 
> [client_t.c:443:gf_client_unref] 0-gv1-server: Shutting down 
> connection 
> physinfra-hb2.xcade.net-21091-2018/08/15-16:41:03:103872-gv1-client-0-0-0
>
> So I see a lot of disconnections, right? This might be why the self 
> healing is triggered all the time?
>
Not necessarily. These disconnects could also come from the glfsheal 
binary, which is invoked when you run `gluster vol heal volname info` 
and similar commands; such disconnects do not cause heals. It would be 
better to check your client mount logs for disconnect messages like these:

[2018-08-16 03:59:32.289763] I [MSGID: 114018] 
[client.c:2285:client_rpc_notify] 0-testvol-client-4: disconnected from 
testvol-client-0. Client process will keep trying to connect to glusterd 
until brick's port is available
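
A minimal sketch of that check, assuming the mount log keeps each entry on a single line (the wrapping above comes from the mail client) and that MSGID 114018 identifies the client-side disconnect shown above. The function names and the idea of passing in a log path are illustrative only and are not part of any Gluster tooling.

```python
import re

# Header of a GlusterFS log entry: [timestamp] LEVEL [MSGID: n] (MSGID is optional).
ENTRY = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+(?:\[MSGID: (?P<msgid>\d+)\])?"
)


def find_disconnects(lines, msgid="114018"):
    """Return (timestamp, full line) for every entry carrying the given MSGID."""
    hits = []
    for line in lines:
        m = ENTRY.match(line)
        if m and m.group("msgid") == msgid:
            hits.append((m.group("ts"), line.strip()))
    return hits


def scan_file(path):
    """Scan one log file; 'path' is whatever mount log you want to inspect."""
    with open(path, errors="replace") as fh:
        return find_disconnects(fh)
```

Calling `scan_file()` on the client mount log and comparing the returned timestamps with the heal times in glustershd.log should show whether the heals follow disconnects.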

If there are no disconnects and you are still seeing files undergoing 
heal, then you might want to check the brick logs to see if there are 
any write failures.
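
As a rough starting point for that, the sketch below tallies non-informational entries in a brick log by the function that logged them. It assumes the single-letter level field seen in the entries quoted in this thread ("I" for informational) and assumes "E" and "W" mark errors and warnings. It does not know what a write failure looks like specifically, so the output is only a list of places to read more closely. The function name is hypothetical.

```python
import re
from collections import Counter

# [timestamp] LEVEL [MSGID: n] [file.c:line:function] ...   (MSGID is optional)
LINE = re.compile(
    r"^\[[^\]]+\]\s+(?P<level>[A-Z])\s+(?:\[MSGID: \d+\]\s+)?"
    r"\[(?P<src>[^\]]+)\]"
)


def summarize_problems(lines, levels=("E", "W")):
    """Count entries at the given levels, keyed by (level, logging function)."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line)
        if m and m.group("level") in levels:
            # src looks like "server.c:527:server_rpc_notify"; keep the function name.
            func = m.group("src").split(":")[-1]
            counts[(m.group("level"), func)] += 1
    return counts
```

Passing an open brick log file to `summarize_problems()` and looking at `most_common()` on the result gives the noisiest sources first.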
Thanks,
Ravi
>
> Thanks!
>
> Pablo.
>
>
> On 08/14/2018 09:15 AM, Pablo Schandin wrote:
>>
>> Thanks for the info!
>>
>> I cannot see any logs in the mount log besides one line every time it 
>> rotates
>>
>> [2018-08-13 06:25:02.246187] I 
>> [glusterfsd-mgmt.c:1821:mgmt_getspec_cbk] 0-glusterfs: No change in 
>> volfile,continuing
>>
>> But I did find in the glfsheal-gv1.log of the volumes some kind of 
>> server-client connection that was disconnected and now connects 
>> using a different port. The block of log output for each run is 
>> rather long, so I'm copying it into a pastebin.
>>
>> https://pastebin.com/bp06rrsT
>>
>> Maybe this has something to do with it?
>>
>> Thanks!
>>
>> Pablo.
>>
>> On 08/11/2018 12:19 AM, Ravishankar N wrote:
>>>
>>>
>>>
>>> On 08/10/2018 11:25 PM, Pablo Schandin wrote:
>>>>
>>>> Hello everyone!
>>>>
>>>> I'm having some trouble with something, but I'm not quite sure 
>>>> what yet. I'm running GlusterFS 3.12.6 on Ubuntu 16.04. I have 
>>>> two servers (nodes) in the cluster in replica mode. Each server 
>>>> has 2 bricks. As the servers are KVM hosts running several VMs, one 
>>>> brick holds the VMs defined locally and the second brick is the 
>>>> replica from the other server. It has data, but no actual 
>>>> writing is being done on it except for the replication.
>>>>
>>>>                  Server 1                                  Server 2
>>>> Volume 1 (gv1):  Brick 1 defined VMs (read/write)   ---->  Brick 1 replicated qcow2 files
>>>> Volume 2 (gv2):  Brick 2 replicated qcow2 files     <----  Brick 2 defined VMs (read/write)
>>>>
>>>> So, the main issue arose when I got a nagios alarm that warned 
>>>> about a file listed to be healed. And then it disappeared. I came 
>>>> to find out that every 5 minutes, the self heal daemon triggers the 
>>>> healing and this fixes it. But looking at the logs I have a lot of 
>>>> entries in the glustershd.log file like this:
>>>>
>>>> [2018-08-09 14:23:37.689403] I [MSGID: 108026] 
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv1-replicate-0: 
>>>> Completed data selfheal on 407bd97b-e76c-4f81-8f59-7dae11507b0c. 
>>>> sources=[0] sinks=1
>>>> [2018-08-09 14:44:37.933143] I [MSGID: 108026] 
>>>> [afr-self-heal-common.c:1656:afr_log_selfheal] 0-gv2-replicate-0: 
>>>> Completed data selfheal on 73713556-5b63-4f91-b83d-d7d82fee111f. 
>>>> sources=[0] sinks=1
>>>>
>>>> The qcow2 files are being healed several times a day (up to 30 
>>>> times on occasion). As I understand it, this means that a data heal 
>>>> occurred on the files with gfids 407b... and 7371..., from source to 
>>>> sink. Local server to replica server? Is it OK for the shd to heal 
>>>> files in the replicated brick that supposedly has no writing on it 
>>>> besides the mirroring? How does that work?
>>>>
>>> In AFR, for writes, there is no notion of local/remote brick. No 
>>> matter from which client you write to the volume, it gets sent to 
>>> both bricks. i.e. the replication is synchronous and real time.
>>>
>>>> How does AFR replication work? The file with gfid 7371... is the 
>>>> qcow2 root disk of an owncloud server with 17GB of data. It does 
>>>> not seem big enough to be a bottleneck of some sort, I think.
>>>>
>>>> Also, I was investigating the directory tree in 
>>>> brick/.glusterfs/indices and I noticed that in both xattrop and 
>>>> dirty I always have a file named xattrop-xxxxxx or dirty-xxxxxx, 
>>>> respectively. I read that the xattrop file is like a parent file or 
>>>> handle, and that other files are created there as hardlinks to it, 
>>>> named by gfid, for the shd to heal. Is it the same for the ones in 
>>>> the dirty dir?
>>>>
>>> Yes, before the write, the gfid gets captured inside dirty on all 
>>> bricks. If the write is successful, it gets removed. In addition, if 
>>> the write fails on one brick, the other brick will capture the gfid 
>>> inside xattrop.
>>>>
>>>> Any help will be greatly appreciated. Thanks!
>>>>
>>> If frequent heals are triggered, it could mean there are frequent 
>>> network disconnects from the clients to the bricks as writes happen. 
>>> You can check the mount logs and see if that is the case and 
>>> investigate possible network issues.
>>>
>>> HTH,
>>> Ravi
>>>>
>>>> Pablo.
>>>>
>>>> _______________________________________________
>>>> Gluster-users mailing list
>>>> Gluster-users at gluster.org
>>>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>>
>>
>
