[Gluster-users] how to check/fix underlaying partition error?

Pavel Riha pavel.riha at trilogic.cz
Thu Apr 16 08:25:12 UTC 2015


Hi Jiri,

yes, the glusterd restart started the brick .. there is only a few 
seconds' delay, which is why I was confused the first time.
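
btw for next time, I guess the brick state can be checked right after 
the restart with something like this (not 100% sure how the output 
looks on 3.4):

(root@karamel)~# gluster volume status storage2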

btw the server was running for some months without problems .. and now 
the XFS partition (only this one) had a problem .. no corruption, but 
after a few minutes of use it stopped, with
XFS (dev): xlog_write: reservation summary
XFS (dev): xlog_write: reservation ran out. Need to up reservation
in the log
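
(the lines show up in the kernel log, so something like
(root@karamel)~# dmesg | grep -i xfs
or grepping /var/log/messages finds them)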

I found it could be related to this bug
https://bugzilla.redhat.com/show_bug.cgi?id=1092853
as I had a 3.14 kernel.

so I upgraded to a 3.18 kernel during the night, and the partition has 
now been working for some hours already


now I'm only a little confused by the self-heal: there is something 
shown in the split-brain info, but I don't see how that is possible. 
There was a problem only with one server, never with the second, so 
there is no legitimate reason for a split-brain, or is there?

(root@karamel)~# gluster volume heal storage2 info split-brain
Gathering Heal info on volume storage2 has been successful

Brick karamel:/mnt/gl/storage2/brick
Number of entries: 0

Brick champion:/mnt/gl/storage2/brick
Number of entries: 2
at                    path on brick
-----------------------------------
2015-04-15 18:00:21 /web02d/web02d.raw
2015-04-15 14:45:38 /web02d/web02d.raw


but plain 'heal info' shows nothing

(root@karamel)~# gluster volume heal storage2 info
Gathering Heal info on volume storage2 has been successful

Brick karamel:/mnt/gl/storage2/brick
Number of entries: 0

Brick champion:/mnt/gl/storage2/brick
Number of entries: 0


'info healed' shows this web02d as healed, but I don't understand why 
everything else is listed under the "champion" server and only this one 
under the "karamel" server (karamel was the one with the problem; its 
brick was not working for a few days)


(root@karamel)~# gluster volume heal storage2 info healed
Gathering Heal info on volume storage2 has been successful

Brick karamel:/mnt/gl/storage2/brick
Number of entries: 1
at                    path on brick
-----------------------------------
2015-04-15 23:20:45 /web02d/web02d.raw

Brick champion:/mnt/gl/storage2/brick
Number of entries: 6
at                    path on brick
-----------------------------------
2015-04-15 23:53:46 /web02d/web02d.conf
2015-04-15 23:53:46 /web02d/tmp.raw
...
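
anyway, maybe I should compare the AFR changelog xattrs of the file on 
both bricks to see whether it is a real split-brain .. if I understand 
the docs right, something like this on each server should dump them 
(run against the brick path, not the client mount):

(root@karamel)~# getfattr -d -m . -e hex /mnt/gl/storage2/brick/web02d/web02d.raw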


Pavel

On 16.4.2015 08:49, Jiri Hoogeveen wrote:
> Hi Pavel,
>
> killing the brick process is the way to go.
> This way, all other bricks on that server will keep working.
>
> After you replace/fix the disk, a restart of the glusterd process should be enough to get the brick back online. (The self-healing scan can take some IO.)
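>
> If the self-heal does not kick in on its own, something like this should trigger a full one (syntax from memory, so please double-check it):
>
> gluster volume heal storage2 full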
>
> Do you have some logs about the brick that would not start?
>
> Btw, an IO error on XFS? Did you lose some files from brick/.glusterfs? That could explain why the brick will not start up.
>
> Grtz, Jiri
>
>
>> On 15 Apr 2015, at 17:05, Pavel Riha <pavel.riha at trilogic.cz> wrote:
>>
>> Thank you for your reply.
>>
>> but btw what is the right way to do this?
>> stopping the glusterd service does not stop the glusterfsd daemons themselves
>> https://bugzilla.redhat.com/show_bug.cgi?id=988946
>>
>> and I have more volumes running, but only one has this problem.
>> I haven't found any official way to stop the process, so I just KILLed them.
>> It worked.. partition repaired, seems OK for now.
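>>
>> btw to find the right process to kill: as far as I can tell the
>> --volfile-id argument in the ps output names the volume, so roughly
>> (the PID here is just a placeholder):
>>
>> ps ax | grep glusterfsd    # pick the line whose --volfile-id mentions storage2
>> kill <PID>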
>>
>> But how do I start the brick again??
>> I didn't save the command line shown in ps, but it was crazy. Looking at the other bricks still running .. there are crazy numbers (UUID, socket, port)
>> and the port (for example) is not the same as on the other server...
>>
>> so I restarted the glusterd service .. nothing happened .. I was hopeless
>> .. but after a while I noticed that the process was running, so maybe glusterd started it after a delay
>>
>> there should be some way to stop, or at least start, a single brick
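>>
>> maybe
>>
>> gluster volume start storage2 force
>>
>> would do it? as far as I can tell that is supposed to restart a missing
>> brick process without touching the ones already running, but I haven't
>> tried it yet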
>>
>>
>>
>> Pavel
>>
>>
>>
>> On 15.4.2015 11:59, Sander Zijlstra wrote:
>>> Hi Pavel,
>>>
>>> you can simply stop the glusterd service and run the fsck; it's similar to rebooting a server which is part of a replicated volume. If all is OK beforehand, you can simply take down one of the two, and once it comes back online it will heal each file which hasn't been copied already.
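>>>
>>> For XFS that means xfs_repair rather than a plain fsck; roughly (the
>>> device name and mountpoint are just examples, adjust to your setup):
>>>
>>> umount /mnt/gl/storage2
>>> xfs_repair -n /dev/sdX1    # -n = check only, no modifications
>>> xfs_repair /dev/sdX1
>>> mount /mnt/gl/storage2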
>>>
>>> Do take care of any client which has the volume mounted via the server you take down; it will lose its connection as well.
>>>
>>> Met vriendelijke groet / kind regards,
>>>
>>> Sander Zijlstra
>>>
>>> Linux Engineer | SURFsara | Science Park 140 | 1098XG Amsterdam |
>>> +31 (0)6 43 99 12 47 | sander.zijlstra at surfsara.nl | www.surfsara.nl |
>>>
>>> ----- Original Message -----
>>> From: "Pavel Riha" <pavel.riha at trilogic.cz>
>>> To: gluster-users at gluster.org
>>> Sent: Wednesday, 15 April, 2015 10:28:50
>>> Subject: [Gluster-users] how to check/fix underlaying partition error?
>>>
>>> Hi guys,
>>>
>>> I have a replicated GlusterFS (v3.4.2) setup on two servers, and I found
>>> the logs filled with IO errors on one server only. But there is no hardware
>>> error in /var/log/messages, only XFS errors, so I guess the filesystem could be corrupted
>>>
>>> My question is: how do I stop or pause this brick and run fsck?
>>>   Given the replicate feature, I expect there is no need to stop the gluster
>>> volume (there are some Xen VMs running)
>>>
>>> what is the right way to do it? with the later re-adding and a fast
>>> rebuild/sync in mind..
>>>
>>> thanks for any tips
>>>
>>> Pavel
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-users
>>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>

