[Gluster-users] Self-heal still not finished after 2 days

John Gardeniers jgardeniers at objectmastery.com
Mon Jun 30 03:18:35 UTC 2014


Hi again Pranith,

On 30/06/14 11:58, Pranith Kumar Karampuri wrote:
> Oops, I see you are the same user who posted about VM files self-heal.
> Sorry I couldn't get back to you in time. So you are using 3.4.2.
> Could you please post the log files of the mount and the bricks? That
> should help us find more information about any issues.
>
When you say the log for the mount, which log is that? There is none
that I can identify as belonging to the mount.
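
If you mean the FUSE client log on the machine where the volume is
mounted, my understanding is that it's named after the mount point,
something like this (the mount path is only a guess on my part):

    # On the client, assuming the volume is mounted at /mnt/gluster-rhev
    ls -l /var/log/glusterfs/mnt-gluster-rhev.log
    tail -n 100 /var/log/glusterfs/mnt-gluster-rhev.log

Is that the one you want, or something else?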

> gluster volume heal <volname> info heal-failed records the last 1024
> failures. It also prints the timestamp of when each failure occurred.
> Even after the heal is successful it keeps showing the errors, so the
> timestamp of when the heal failed is important. Because these commands
> are causing such confusion, we have deprecated them in the upcoming
> release (3.6).
>
So far I've been focusing on the heal-failed count, which I fully, and I
believe understandably, expect to show zero when there are no errors.
Now that I look at the timestamps of those errors I realise they are all
from *before* the slave brick was added back in. May I assume, then, that
in reality there are no unhealed files? If that is correct, I must point
out that reporting errors when there are none is a massive design flaw.
It renders nagios checks, such as the one we use (sketched below),
useless, and makes monitoring near enough to impossible.
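
For reference, our check boils down to counting the entries reported by
heal-failed, roughly along these lines (a simplified sketch rather than
our exact plugin, and the parsing is only my approximation of the 3.4.2
output format):

    #!/bin/bash
    # Sketch of the nagios check: alert if "heal info heal-failed"
    # reports any entries for the volume. Volume name is ours.
    VOLUME="gluster-rhev"

    # Sum the "Number of entries:" lines printed for each brick.
    COUNT=$(gluster volume heal "$VOLUME" info heal-failed \
            | awk -F': ' '/^Number of entries/ {sum += $2} END {print sum+0}')

    if [ "$COUNT" -gt 0 ]; then
        echo "CRITICAL: $COUNT heal-failed entries on $VOLUME"
        exit 2
    fi
    echo "OK: no heal-failed entries on $VOLUME"
    exit 0

Anything non-zero trips the alert, which is why stale entries from
before the brick was re-added make the check useless for us.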

> This is probably a stupid question but let me ask it anyway. When a
> brick's contents are erased from the backend,
> we need to make sure of the following two things:
> 1) The extended attributes on the brick root show pending operations
> for the brick that was erased
> 2) Execute "gluster volume heal <volname> full"

1) While gluster was stopped I merely did an rm -rf on both the data
sub-directory and the .gluster sub-directory. How do I show that there
are pending operations? (See my guess below.)
2) Yes, I did run that.
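
Regarding 1), is checking the AFR changelog xattrs on the root of the
surviving brick what you mean? Something like this, where the brick
path is made up, so please correct me if that's not it:

    # On the "master" server; /bricks/gluster-rhev is only my guess at
    # the brick path.
    getfattr -d -m . -e hex /bricks/gluster-rhev
    # I'm assuming non-zero trusted.afr.gluster-rhev-client-* values in
    # that output are what indicate pending operations for the wiped brick.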

>
> Did you do the steps above?
>
> Since you are on 3.4.2, I think the best way to check which files are
> healed is to look at the extended attributes in the backend. Could you
> please post them again?

I don't quite understand what you're asking for. I understand attributes
as belonging to files and directories, not to operations. Please elaborate.
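
If you mean the per-file xattrs in the brick backends, is this the kind
of thing you want from each server? The brick path is my guess and the
relative file path is taken from the brick log quoted below:

    # Run on both servers against their own brick; /bricks/gluster-rhev
    # is an assumed brick path.
    getfattr -d -m . -e hex \
        /bricks/gluster-rhev/44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec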

>
> Pranith
>
> On 06/30/2014 07:12 AM, Pranith Kumar Karampuri wrote:
>>
>> On 06/30/2014 04:03 AM, John Gardeniers wrote:
>>> Hi All,
>>>
>>> We have 2 servers, each with one 5TB brick, configured as replica 2.
>>> After a series of events that caused the 2 bricks to become way out of
>>> step, gluster was turned off on one server and its brick was wiped of
>>> everything, but the attributes were left untouched.
>>>
>>> This weekend we stopped the client and gluster and made a backup of the
>>> remaining brick, just to play it safe. Gluster was then turned back on,
>>> first on the "master" and then on the "slave". Self-heal kicked in and
>>> started rebuilding the second brick. However, after 2 full days all
>>> files in the volume are still showing heal-failed errors.
>>>
>>> The rebuild was, in my opinion at least, very slow, taking most of a
>>> day
>>> even though the system is on a 10Gb LAN. The data is a little under
>>> 1.4TB committed, 2TB allocated.
>> How much more is left to be healed? 0.6TB?
>>>
>>> Once the 2 bricks were very close to having the same amount of space
>>> used, things slowed right down. For the last day both bricks have shown
>>> a very slow increase in used space, even though there are no changes
>>> being written by the client. By slow I mean just a few KB per minute.
>> Is I/O still in progress on the mount? In 3.4.x, self-heal doesn't
>> happen on files that have I/O in progress on the mount, so that could
>> be the reason if I/O is going on.
>>>
>>> The logs are confusing, to say the least. In
>>> etc-glusterfs-glusterd.vol.log on both servers there are thousands of
>>> entries such as (possibly because I was using watch to monitor
>>> self-heal progress):
>>>
>>> [2014-06-29 21:41:11.289742] I
>>> [glusterd-volume-ops.c:478:__glusterd_handle_cli_heal_volume]
>>> 0-management: Received heal vol req for volume gluster-rhev
>> What version of gluster are you using?
>>> That timestamp is the latest on either server; that's about 9 hours ago
>>> as I type this. I find that a bit disconcerting, as I have requested
>>> volume heal-failed info since then.
>>>
>>> The brick log on the "master" server (the one from which we are
>>> rebuilding the new brick) contains no entries since before the rebuild
>>> started.
>>>
>>> On the "slave" server the brick log shows a lot of entries such as:
>>>
>>> [2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk]
>>> 0-gluster-rhev-marker: Numerical result out of range occurred while
>>> creating symlinks
>>> [2014-06-28 08:49:47.887382] I
>>> [server-rpc-fops.c:745:server_removexattr_cbk] 0-gluster-rhev-server:
>>> 10311315: REMOVEXATTR
>>> /44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec
>>>
>>> (1c1f53ac-afe2-420d-8c93-b1eb53ffe8b1) of key  ==> (Numerical result
>>> out of range)
>> CC Raghavendra, who knows about the marker translator.
>>>
>>> Those entries are around the time the rebuild was starting. The final
>>> entries in that same log (immediately after those listed above) are:
>>>
>>> [2014-06-29 12:47:28.473999] I
>>> [server-rpc-fops.c:243:server_inodelk_cbk] 0-gluster-rhev-server: 2869:
>>> INODELK (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file
>>> or directory)
>>> [2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk]
>>> 0-gluster-rhev-server: 2870: OPEN (null)
>>> (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)
>> These logs are harmless and were fixed in 3.5 I think. Are you on 3.4.x?
>>
>>>
>>> As I type it's 2014-06-30 08:31.
>>>
>>> What do they mean and how can I rectify it?
>>>
>>> regards,
>>> John
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>
>
>



