[Gluster-users] Self-heal still not finished after 2 days

Pranith Kumar Karampuri pkarampu at redhat.com
Mon Jun 30 03:16:11 UTC 2014


On 06/30/2014 07:32 AM, John Gardeniers wrote:
> Hi Pranith,
>
> Thanks for your reply.
>
> On 30/06/14 11:42, Pranith Kumar Karampuri wrote:
>> On 06/30/2014 04:03 AM, John Gardeniers wrote:
>>> Hi All,
>>>
>>> We have 2 servers, each with one 5TB brick, configured as replica 2.
>>> After a series of events caused the two bricks to get badly out of
>>> sync, gluster was turned off on one server and its brick was wiped of
>>> all data, although the attributes were left untouched.
>>>
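A quick aside on the wiped brick: since the data was removed but the 
attributes kept, you can confirm the brick root is still usable with 
something like the following (the brick path here is only a placeholder):

    getfattr -d -m . -e hex /bricks/gluster-rhev

The wiped brick should still show a trusted.glusterfs.volume-id that 
matches the surviving brick; without it the brick process will not start.
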
>>> This weekend we stopped the client and gluster and made a backup of the
>>> remaining brick, just to play safe. Gluster was then turned back on,
>>> first on the "master" and then on the "slave". Self-heal kicked in and
>>> started rebuilding the second brick. However, after 2 full days all
>>> files in the volume are still showing heal failed errors.
>>>
>>> The rebuild was, in my opinion at least, very slow, taking most of a day
>>> even though the system is on a 10Gb LAN. The data is a little under
>>> 1.4TB committed, 2TB allocated.
>> How much more is left to be healed? 0.6TB?
> The "slave" brick now has more data on it than the "master", which I
> find a bit strange. 'du' reports 1400535604 KB on the master and
> 1405361404 KB on the slave.
>
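A rough back-of-envelope on the speed: roughly 1.4TB in the better part 
of a day works out to around 20MB/s, a small fraction of what a 10Gb 
link can carry, so the limit is almost certainly the per-file self-heal 
work rather than the network. The small difference in the 'du' numbers 
is not alarming by itself; what matters is whether entries are still 
pending heal, which you can see per brick with something like:

    gluster volume heal gluster-rhev info
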
>>> Once the 2 bricks were very close to having the same amount of space
>>> used, things slowed right down. For the last day both bricks have
>>> shown a very slow increase in used space, even though there are no
>>> changes being written by the client. By slow I mean just a few KB per
>>> minute.
>> Is the I/O still in progress on the mount? In 3.4.x, self-heal doesn't
>> happen on files that have I/O going on through the mounts, so that
>> could be the reason if I/O is still active.
> Yes, I/O is active and we are running 3.4.2. It sounds like this could
> be the issue. This is extremely broken behaviour. It sounds like the
> only way to fix it is to shut everything down and disconnect the client.
> This is impractical, as we are using gluster as our storage for RHEV and
> we need those VMs running. Unless the self-heal completes, we cannot
> upgrade either gluster server.
I agree. Apologies for the inconvenience. Let me check if there is a way
to get you out of this situation without any upgrade. Give me a day and I
will get back to you.
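
In the meantime, a couple of read-only checks you can run to keep an eye 
on things; the volume name is taken from your logs and the brick path 
below is only a placeholder:

    # pending entries per brick (run on either server)
    gluster volume heal gluster-rhev info

    # for an individual file, look at its trusted.afr.* xattrs directly
    # on each brick copy; non-zero counters on one copy record changes
    # that the other copy has not yet received
    getfattr -d -m trusted.afr -e hex /bricks/gluster-rhev/path/to/image

Neither of these changes anything on the volume.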

Pranith
>>> The logs are confusing, to say the least. In
>>> etc-glusterfs-glusterd.vol.log on both servers there are thousands of
>>> entries like the following (possibly because I was using watch to
>>> monitor self-heal progress):
>>>
>>> [2014-06-29 21:41:11.289742] I
>>> [glusterd-volume-ops.c:478:__glusterd_handle_cli_heal_volume]
>>> 0-management: Received heal vol req for volume gluster-rhev
>> What version of gluster are you using?
> 3.4.2
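
Those glusterd log entries are expected, by the way: every 
"gluster volume heal <vol> ..." command from the CLI, including plain 
"info", reaches glusterd and is logged as "Received heal vol req", so a 
monitoring loop along these lines (the interval is only an example)

    watch -n 60 gluster volume heal gluster-rhev info

generates one such line per iteration all by itself. Noisy, but harmless.
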
>>> That timestamp is the latest on either server; it's about 9 hours ago
>>> as I type this. I find that a bit disconcerting, as I have requested
>>> volume heal-failed info since then.
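
One note on that: the heal-failed listing in 3.4.x 
("gluster volume heal gluster-rhev info heal-failed", which I assume is 
what you ran) is a record of past failures rather than a live view, so 
entries there can be stale. Comparing it with plain 
"gluster volume heal gluster-rhev info" usually gives a better picture 
of what is genuinely still pending.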
>>>
>>> The brick log on the "master" server (the one from which we are
>>> rebuilding the new brick) contains no entries since before the rebuild
>>> started.
>>>
>>> On the "slave" server the brick log shows a lot of entries such as:
>>>
>>> [2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk]
>>> 0-gluster-rhev-marker: Numerical result out of range occurred while
>>> creating symlinks
>>> [2014-06-28 08:49:47.887382] I
>>> [server-rpc-fops.c:745:server_removexattr_cbk] 0-gluster-rhev-server:
>>> 10311315: REMOVEXATTR
>>> /44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec
>>>
>>> (1c1f53ac-afe2-420d-8c93-b1eb53ffe8b1) of key  ==> (Numerical result out
>>> of range)
>> CC'ing Raghavendra, who knows about the marker translator.
>>> Those entries are around the time the rebuild was starting. The final
>>> entries in that same log (immediately after those listed above) are:
>>>
>>> [2014-06-29 12:47:28.473999] I
>>> [server-rpc-fops.c:243:server_inodelk_cbk] 0-gluster-rhev-server: 2869:
>>> INODELK (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file
>>> or directory)
>>> [2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk]
>>> 0-gluster-rhev-server: 2870: OPEN (null)
>>> (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)
>> These logs are harmless and were fixed in 3.5 I think. Are you on 3.4.x?
>>
>>> As I type it's 2014-06-30 08:31.
>>>
>>> What do they mean and how can I rectify this?
>>>
>>> regards,
>>> John
>>>



