[Gluster-users] Self-heal still not finished after 2 days

Mon Jun 30 01:42:42 UTC 2014

On 06/30/2014 04:03 AM, John Gardeniers wrote:
> Hi All,
>
> We have 2 servers, each with one 5TB brick, configured as replica 2.
> After a series of events that caused the 2 bricks to become way out of
> step, gluster was turned off on one server and its brick was wiped of
> everything, but the attributes were untouched.
>
> This weekend we stopped the client and gluster and made a backup of the
> remaining brick, just to play safe. Gluster was then turned back on,
> first on the "master" and then on the "slave". Self-heal kicked in and
> started rebuilding the second brick. However, after 2 full days all
> files in the volume are still showing heal failed errors.
>
> The rebuild was, in my opinion at least, very slow, taking most of a day
> even though the system is on a 10Gb LAN. The data is a little under
> 1.4TB committed, 2TB allocated.
How much more is left to be healed? 0.6TB?
>
> Once the 2 bricks were very close to having the same amount of space
> used things slowed right down. For the last day both bricks show a very
> slow increase in used space, even though there are no changes being
> written by the client. By slow I mean just a few KB per minute.
Is I/O still in progress on the mount? In 3.4.x, self-heal doesn't
happen on files that have I/O in progress from the mounts, so that
could be the reason.
>
> The logs are confusing, to say the least. In
> etc-glusterfs-glusterd.vol.log on both servers there are thousands of
> entries such as (possibly because I was using watch to monitor self-heal
> progress):
>
> [2014-06-29 21:41:11.289742] I
> [glusterd-volume-ops.c:478:__glusterd_handle_cli_heal_volume]
> 0-management: Received heal vol req for volume gluster-rhev
What version of gluster are you using?
> That timestamp is the latest on either server, that's about 9 hours ago
> as I type this. I find that a bit disconcerting. I have requested volume
> heal-failed info since then.
>
> The brick log on the "master" server (the one from which we are
> rebuilding the new brick) contains no entries since before the rebuild
> started.
>
> On the "slave" server the brick log shows a lot of entries such as:
>
> [2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk]
> 0-gluster-rhev-marker: Numerical result out of range occurred while
> creating symlinks
> [2014-06-28 08:49:47.887382] I
> [server-rpc-fops.c:745:server_removexattr_cbk] 0-gluster-rhev-server:
> 10311315: REMOVEXATTR
> /44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec
> (1c1f53ac-afe2-420d-8c93-b1eb53ffe8b1) of key  ==> (Numerical result out
> of range)
CC'ing Raghavendra, who knows about the marker translator.
>
> Those entries are around the time the rebuild was starting. The final
> entries in that same log (immediately after those listed above) are:
>
> [2014-06-29 12:47:28.473999] I
> [server-rpc-fops.c:243:server_inodelk_cbk] 0-gluster-rhev-server: 2869:
> INODELK (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file
> or directory)
> [2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk]
> 0-gluster-rhev-server: 2870: OPEN (null)
> (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)
These log messages are harmless and were fixed in 3.5, I think. Are you on 3.4.x?

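To see which messages dominate a large brick log, a rough tally by log level and function name can help. The following is only a sketch: it assumes each entry sits on a single line in the actual log file (the excerpts quoted above are wrapped by the mail client) and follows the layout visible in those excerpts. The names LOG_RE and summarize are made up for this example and are not part of gluster.

```python
import re
from collections import Counter

# Layout assumed from the quoted excerpts:
# [timestamp] LEVEL [file.c:line:function] domain: message
LOG_RE = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\]\s+"
    r"(?P<level>[A-Z])\s+"
    r"\[(?P<src>[^:\]]+):(?P<line>\d+):(?P<func>[^\]]+)\]\s+"
    r"(?P<domain>\S+):\s*(?P<msg>.*)"
)


def summarize(lines):
    """Count log entries per (level, function); skip lines that do not match."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m:
            counts[(m.group("level"), m.group("func"))] += 1
    return counts


# Two entries from the quoted brick log, joined back onto single lines.
sample = [
    "[2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk] "
    "0-gluster-rhev-marker: Numerical result out of range occurred while "
    "creating symlinks",
    "[2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk] "
    "0-gluster-rhev-server: 2870: OPEN (null) "
    "(c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)",
]

for (level, func), n in summarize(sample).most_common():
    print(level, func, n)
```

To run it against a real log, pass an open file object to summarize() instead of the sample list.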
>
> As I type it's 2014-06-30 08:31.
>
> What do they mean and how can I rectify it?
>
> regards,
> John
>