[Gluster-users] Sparse files and heal full bug fix backport to 3.6.x

Ravishankar N ravishankar at redhat.com
Thu Feb 11 01:07:44 UTC 2016


On 02/11/2016 03:10 AM, Steve Dainard wrote:
> Most recently this happened on Gluster 3.6.6, I know it happened on
> another earlier minor release of 3.6, maybe 3.6.4. Currently on 3.6.8,
> I can try to re-create on another replica volume.
>
> Which logs would give some useful info, under which logging level?

Okay, I wanted to see whether the loss of sparseness is indeed an effect of
the self-heal. The client (fuse mount?) logs and the glustershd logs at
GF_LOG_DEBUG should be sufficient, since these are the two places where
self-heal can happen. The other things that would be useful when you try
to re-create are:
1. Size reported by `ls -l` and `du -h` of the file on all bricks before 
heal
2. The same thing after heal
3. Do a `sync && echo 3 > /proc/sys/vm/drop_caches` on the bricks and
see if it makes any difference to the du size (XFS does some strange
things with pre-allocation). A rough command sketch of all three steps
follows below.
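
Something like the following on each brick host should capture all of that.
This is only a rough sketch: vm-storage is the volume name from your earlier
mails, the brick path is made up, and `diagnostics.client-log-level` is the
volume option I would use to get DEBUG logs from the fuse mount (I am assuming
glustershd honours it as well):

  # raise client-side (fuse mount, and presumably glustershd) logging to DEBUG
  gluster volume set vm-storage diagnostics.client-log-level DEBUG

  # 1. sizes on the brick before the heal: apparent size vs. allocated blocks
  ls -l /bricks/vm-storage/images/vm1.img
  du -h /bricks/vm-storage/images/vm1.img

  # trigger the heal, then repeat the two commands above (step 2)
  gluster volume heal vm-storage full

  # 3. drop caches on the brick host and see whether the du size changes
  sync && echo 3 > /proc/sys/vm/drop_caches
  du -h /bricks/vm-storage/images/vm1.img

  # set the log level back to the default when done
  gluster volume set vm-storage diagnostics.client-log-level INFO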

You say that you re-sparsed the files using fallocate. How did you do 
that? The -d flag?
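
For reference, a minimal sketch of re-sparsing with util-linux fallocate and
confirming the result (the file path is only a placeholder):

  # dig holes in the zero-filled regions of an existing file
  fallocate -d /path/to/vm-image.img

  # compare allocated size vs. apparent size to confirm the file is sparse again
  du -h /path/to/vm-image.img
  du -h --apparent-size /path/to/vm-image.img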

>
>  From host with brick down. 2016-02-06 00:40 was approximately when I
> restarted glusterd to get the brick to start properly.
> glfsheal-vm-storage.log
> ...
> [2015-11-30 20:37:17.348673] I
> [glfs-resolve.c:836:__glfs_active_subvol] 0-vm-storage: switched to
> graph 676c7573-7465-7230-312e-706369632e75 (0)
> [2016-02-06 00:27:15.282280] E
> [client-handshake.c:1496:client_query_portmap_cbk]
> 0-vm-storage-client-0: failed to get the port number for remote
> subvolume. Please run 'gluster volume status' on server to see if
> brick process is running.
> [2016-02-06 00:27:49.797465] E
> [client-handshake.c:1496:client_query_portmap_cbk]
> 0-vm-storage-client-0: failed to get the port number for remote
> subvolume. Please run 'gluster volume status' on server to see if
> brick process is running.
> [2016-02-06 00:27:54.126627] E
> [client-handshake.c:1496:client_query_portmap_cbk]
> 0-vm-storage-client-0: failed to get the port number for remote
> subvolume. Please run 'gluster volume status' on server to see if
> brick process is running.
> [2016-02-06 00:27:58.449801] E
> [client-handshake.c:1496:client_query_portmap_cbk]
> 0-vm-storage-client-0: failed to get the port number for remote
> subvolume. Please run 'gluster volume status' on server to see if
> brick process is running.
> [2016-02-06 00:31:56.139278] E
> [client-handshake.c:1496:client_query_portmap_cbk]
> 0-vm-storage-client-0: failed to get the port number for remote
> subvolume. Please run 'gluster volume status' on server to see if
> brick process is running.
> <nothing newer in logs>
>
> The brick log has a massive number of these errors
> (https://dl.dropboxusercontent.com/u/21916057/mnt-lv-vm-storage-vm-storage.log-20160207.tar.gz):
> [2016-02-06 00:43:43.280048] E [socket.c:1972:__socket_read_frag]
> 0-rpc: wrong MSG-TYPE (1700885605) received from 142.104.230.33:38710
> [2016-02-06 00:43:43.280159] E [socket.c:1972:__socket_read_frag]
> 0-rpc: wrong MSG-TYPE (1700885605) received from 142.104.230.33:38710
> [2016-02-06 00:43:43.280325] E [socket.c:1972:__socket_read_frag]
> 0-rpc: wrong MSG-TYPE (1700885605) received from 142.104.230.33:38710
>
> But I only peer and mount gluster on a private subnet, so it's a bit
> odd, but I don't know if it's related.
>
>
> On Tue, Feb 9, 2016 at 5:38 PM, Ravishankar N <ravishankar at redhat.com> wrote:
>> Hi Steve,
>> The patch already went in for 3.6.3
>> (https://bugzilla.redhat.com/show_bug.cgi?id=1187547). What version are you
>> using? If it is 3.6.3 or newer, can you share the logs if this happens
>> again? (or possibly try if you can reproduce the issue on your setup).
>> Thanks,
>> Ravi
>>
>>
>> On 02/10/2016 02:25 AM, FNU Raghavendra Manjunath wrote:
>>
>>
>> Adding Pranith, maintainer of the replicate feature.
>>
>>
>> Regards,
>> Raghavendra
>>
>>
>> On Tue, Feb 9, 2016 at 3:33 PM, Steve Dainard <sdainard at spd1.com> wrote:
>>> There is a thread from 2014 mentioning that the heal process on a
>>> replica volume was de-sparsing sparse files.(1)
>>>
>>> I've been experiencing the same issue on Gluster 3.6.x. I see there is
>>> a bug closed for a fix on Gluster 3.7 (2) and I'm wondering if this
>>> fix can be back-ported to Gluster 3.6.x?
>>>
>>> My experience has been:
>>> Replica 3 volume
>>> 1 brick went offline
>>> Brought brick back online
>>> Heal full on volume
>>> My 500G vm-storage volume went from ~280G used to >400G used.
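
In commands, that sequence is roughly the following sketch (the brick
mount point is a placeholder, and the recovery step is my reading of the
earlier mail in this thread about restarting glusterd):

  # bring the downed brick process back online; restarting glusterd (as
  # described earlier in the thread) or a forced volume start are two ways
  gluster volume start vm-storage force

  # kick off a full self-heal and watch the pending entries drain
  gluster volume heal vm-storage full
  gluster volume heal vm-storage info

  # check used space on the brick filesystem (run before and after the heal
  # to compare)
  df -h /bricks/vm-storage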
>>>
>>> I've experienced this a couple of times previously, and used fallocate to
>>> re-sparse the files, but this is cumbersome at best, and the lack of proper
>>> heal support on sparse files could be disastrous if I didn't have
>>> enough free space and ended up crashing my VMs when my storage domain
>>> ran out of space.
>>>
>>> Seeing as 3.6 is still a supported release, and 3.7 feels too bleeding
>>> edge for production systems, I think it makes sense to back-port this
>>> fix if possible.
>>>
>>> Thanks,
>>> Steve
>>>
>>>
>>>
>>> 1.
>>> https://www.gluster.org/pipermail/gluster-users/2014-November/019512.html
>>> 2. https://bugzilla.redhat.com/show_bug.cgi?id=1166020