[Gluster-users] Pretty much any operation related to Gluster mounted fs hangs for a while

Tiago Santos tiago at musthavemenus.com
Tue Feb 10 23:32:03 UTC 2015


Alright, it seems we're fine now!


We basically took two actions, and the network issue seems to be gone.

1. These servers are VMs at a cloud provider, so I don't really have
access to the details here. The assigned sysadmin reported that one of
my Gluster VMs was on a crowded host, which could potentially have been
affecting both load (CPU/memory) and network performance. He moved this
one VM to a new (and less busy) host. The other VM in this Gluster
setup was left as before.

2. I set up a new internet-isolated subnet between these VMs, allowing
me to get the firewall out of the way (port details sketched below).
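
Before the isolated subnet, the firewall had to pass at least the
Gluster ports between the peers. A rough sketch of what that takes,
assuming GlusterFS 3.4+ default ports and iptables (10.0.0.0/24 stands
in for the Gluster subnet):

# on each Gluster server, allow peer traffic on the private subnet
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 24007:24008 -j ACCEPT  # glusterd
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 49152:49251 -j ACCEPT  # brick ports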

It seems #1 was the culprit, and #2 ended up as a nice-to-have.


Before:

root at web3:~# date; time ls -ltrh
/var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png
Mon Jan 26 07:00:27 PST 2015
-rwx---r-- 1 mhmadmin mhmadmin 61K Jan 22 14:37
/var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png

real 0m33.651s
user 0m0.001s
sys 0m0.004s


After:

root at web3:~# date; time ls -ltrh
/var/www/site-images/templates/assets/prod/temporary/13/user_1410560.png
Tue Feb 10 15:28:18 PST 2015
-rwx---r-- 1 mhmadmin mhmadmin 17K Feb 10 12:41
/var/www/site-images/templates/assets/prod/temporary/13/user_1410560.png

real 0m0.031s
user 0m0.001s
sys 0m0.006s


The case seems closed. If you guys have any questions that I can
answer, please let me know.


Thanks Anirban, Joe and selected audience :)


-- 
Tiago Santos







On Wed, Jan 28, 2015 at 2:45 PM, Tiago Santos <tiago at musthavemenus.com>
wrote:

> Since I stopped writing to the clients (so I could cleanly work on the
> split-brain), I've gotten no more entries in /var/log/gluster.log (this
> is the client log, right?)
>
>
> While working with the diff command to fix the split-brain, I saw
> several entries like these:
>
> diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558482:
> Transport endpoint is not connected
> diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558483:
> Transport endpoint is not connected
> diff: r2/webhost/sites/clipart/assets/apache/images/13/templates/558484:
> Transport endpoint is not connected
>
> They happen a lot, then stop, then happen again, and so on.
>
> While the errors are showing, a ping from the system where I'm working
> on the split-brain to the system that is failing to connect (r2) shows this:
>
> 64 bytes from r2-server (r2-ip): icmp_seq=662 ttl=64 time=1.21 ms
> 64 bytes from r2-server (r2-ip): icmp_seq=663 ttl=64 time=0.990 ms
> 64 bytes from r2-server (r2-ip): icmp_seq=664 ttl=64 time=1.01 ms
>
> I know this is a very trivial network check that may not be showing me
> what I want to see, and I'm working on a more elaborate one. I'm
> completely open to suggestions on how to do this properly, to verify
> whether this really is a network issue as far as Gluster is concerned.
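>
> For now I'm planning something along these lines, beyond plain ICMP
> (assuming the GlusterFS 3.4+ default ports):
>
> # can we reach glusterd and a brick port on r2?
> nc -zv r2-server 24007
> nc -zv r2-server 49152
>
> # and what does gluster itself think of its peers and bricks?
> gluster peer status
> gluster volume status site-images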
>
>
> So far, thank you so much, guys!
>
>

>
> On Mon, Jan 26, 2015 at 8:36 PM, Joe Julian <joe at julianfamily.org> wrote:
>
>>  Check your client logs. Perhaps the client isn't actually connecting to
>> both servers.
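>>
>> Something like this should show it (assuming the client logs to
>> /var/log/gluster.log as you mentioned):
>>
>> grep -E 'Connected to|disconnected' /var/log/gluster.log | tail
>>
>> Both site-images-client-0 and site-images-client-1 should show up as
>> connected.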
>>
>> On 01/26/2015 02:12 PM, Tiago Santos wrote:
>>
>> That's what I meant. Sorry for the confusion.
>>
>> I'm writing on Client1 (same server as Brick1). Client2 (the client
>> mounted on server2, alongside Brick2) has nothing writing to it (so far).
>>
>> What I'm wondering is how I ended up with a split-brain if I'm only
>> writing on one client.
>>
>>
>>
>>
>>
>> On Mon, Jan 26, 2015 at 8:04 PM, Joe Julian <joe at julianfamily.org> wrote:
>>
>>>  Nothing but GlusterFS should be writing to bricks. Mount a client and
>>> write there.
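>>>
>>> For example (the rsync source here is a placeholder):
>>>
>>> mount -t glusterfs web3:/site-images /var/www/site-images
>>> # point the rsyncs at the mount, not at the brick:
>>> rsync -a prod-server:/source/images/ /var/www/site-images/
>>>
>>> That way every write goes through Gluster and both bricks stay in sync.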
>>>
>>>
>>> On 01/26/2015 01:38 PM, Tiago Santos wrote:
>>>
>>> Right.
>>>
>>> I have Brick1 being constantly written to, but I have nothing writing
>>> to Brick2. It just gets "healed" data from Brick1.
>>>
>>> This setup is not in production yet, and there are no applications
>>> using that data. I have rsyncs constantly updating Brick1 (bringing
>>> data from the production servers), and then Gluster updates Brick2.
>>>
>>> Which makes me wonder how I could be creating multiple replicas
>>> during a split-brain.
>>>
>>>
>>> Could it be that, during a split-brain event, I'm updating versions
>>> of the same file on Brick1 (only), and Gluster sees them as different
>>> versions and things get confused?
>>>
>>>
>>> Anyway, while we talk I'm going to run Joe's invaluable split-brain
>>> recovery procedure.
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jan 26, 2015 at 7:23 PM, Joe Julian <joe at julianfamily.org>
>>> wrote:
>>>
>>>> Mismatched GFIDs would happen if a file is created on multiple replicas
>>>> during a split-brain event. The GFID is assigned at file creation.
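>>>>
>>>> You can read the GFID straight off a brick, and it also determines the
>>>> file's hard link under .glusterfs (paths taken from this thread):
>>>>
>>>> getfattr -n trusted.gfid -e hex \
>>>>     /export/images1-1/brick/templates/assets/prod/temporary/13/user_1339200.png
>>>> # gfid 0x10e5894c474a4cb1898b71e872cdf527 maps to
>>>> # .glusterfs/10/e5/10e5894c-474a-4cb1-898b-71e872cdf527 on that brick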
>>>>
>>>>
>>>> On 01/26/2015 01:04 PM, A Ghoshal wrote:
>>>>
>>>>>  Yep, so it is indeed a split-brain caused by a mismatch of the
>>>>> trusted.gfid attribute.
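>>>>>
>>>>> (For reference, each trusted.afr value packs three 32-bit pending-op
>>>>> counters, in the order data/metadata/entry. So web3's
>>>>> trusted.afr.site-images-client-1=0x000000020000000900000000 says web3
>>>>> holds 2 pending data and 9 pending metadata operations against web4's
>>>>> copy, while the differing trusted.gfid values mean the two bricks hold
>>>>> entirely different files.)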
>>>>>
>>>>> Sadly, I don't know precisely what causes it. Communication loss
>>>>> might be one of the triggers. I am guessing the files with the problem
>>>>> are dynamic, correct? In our setup (also replica 2), communication is
>>>>> never a problem, but we do see this when one of the servers reboots.
>>>>> Maybe some obscure, difficult-to-understand race between background
>>>>> self-heal and the self-heal daemon...
>>>>>
>>>>> In any case, a normal split-brain recovery procedure will work if you
>>>>> wish to get your files back in action. It's easy to find on Google; I
>>>>> use the instructions on Joe Julian's blog myself.
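>>>>>
>>>>> In outline it goes something like this (only a sketch, assuming
>>>>> web4's copy is the one you want to discard; do read the full write-up
>>>>> first):
>>>>>
>>>>> # on web4: remove the bad copy and its .glusterfs hard link
>>>>> rm /export/images2-1/brick/templates/assets/prod/temporary/13/user_1339200.png
>>>>> rm /export/images2-1/brick/.glusterfs/d0/2f/d02f14fc-b672-4ceb-a4a3-30eb606910f3
>>>>> # then stat the file through a client mount to trigger self-heal
>>>>> stat /var/www/site-images/templates/assets/prod/temporary/13/user_1339200.png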
>>>>>
>>>>>
>>>>>   -----Tiago Santos <tiago at musthavemenus.com> wrote: -----
>>>>>
>>>>>   =======================
>>>>>   To: A Ghoshal <a.ghoshal at tcs.com>
>>>>>   From: Tiago Santos <tiago at musthavemenus.com>
>>>>>   Date: 01/27/2015 02:11AM
>>>>>   Cc: gluster-users <gluster-users at gluster.org>
>>>>>   Subject: Re: [Gluster-users] Pretty much any operation related to
>>>>> Gluster mounted fs hangs for a while
>>>>>   =======================
>>>>>     Oh, right!
>>>>>
>>>>> Follow the outputs:
>>>>>
>>>>>
>>>>> root at web3:/export/images1-1/brick# time getfattr -m . -d -e hex
>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>> # file: templates/assets/prod/temporary/13/user_1339200.png
>>>>> trusted.afr.site-images-client-0=0x000000000000000400000000
>>>>> trusted.afr.site-images-client-1=0x000000020000000900000000
>>>>> trusted.gfid=0x10e5894c474a4cb1898b71e872cdf527
>>>>>
>>>>> real 0m0.024s
>>>>> user 0m0.001s
>>>>> sys 0m0.001s
>>>>>
>>>>>
>>>>>
>>>>> root at web4:/export/images2-1/brick# time getfattr -m . -d -e hex
>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>> # file: templates/assets/prod/temporary/13/user_1339200.png
>>>>> trusted.afr.site-images-client-0=0x000000000000000000000000
>>>>> trusted.afr.site-images-client-1=0x000000000000000000000000
>>>>> trusted.gfid=0xd02f14fcb6724ceba4a330eb606910f3
>>>>>
>>>>> real 0m0.003s
>>>>> user 0m0.000s
>>>>> sys 0m0.006s
>>>>>
>>>>>
>>>>> Not sure exactly what that means. I'm googling, and would appreciate
>>>>> it if you guys could shed some light.
>>>>>
>>>>> Thanks!
>>>>> --
>>>>> Tiago
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 26, 2015 at 6:16 PM, A Ghoshal <a.ghoshal at tcs.com> wrote:
>>>>>
>>>>>> Actually, you ran getfattr on the volume (the client mount), which is
>>>>>> why the requisite extended attributes never showed up...
>>>>>>
>>>>>> Your bricks are mounted elsewhere:
>>>>>> /export/images1-1/brick and /export/images2-1/brick
>>>>>>
>>>>>> Btw, what version of Linux do you use? And are the files you observe
>>>>>> the input/output errors on soft links?
>>>>>>
>>>>>>   -----Tiago Santos <tiago at musthavemenus.com> wrote: -----
>>>>>>
>>>>>>   =======================
>>>>>>   To: A Ghoshal <a.ghoshal at tcs.com>
>>>>>>   From: Tiago Santos <tiago at musthavemenus.com>
>>>>>>   Date: 01/27/2015 12:20AM
>>>>>>   Cc: gluster-users <gluster-users at gluster.org>
>>>>>>   Subject: Re: [Gluster-users] Pretty much any operation related to
>>>>>> Gluster
>>>>>> mounted fs hangs for a while
>>>>>>   =======================
>>>>>>     Thanks for your input, Anirban.
>>>>>>
>>>>>> I ran the commands on both servers, with the following results:
>>>>>>
>>>>>>
>>>>>> root at web3:/var/www/site-images# time getfattr -m . -d -e hex
>>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>>>
>>>>>> real 0m34.524s
>>>>>> user 0m0.004s
>>>>>> sys 0m0.000s
>>>>>>
>>>>>>
>>>>>> root at web4:/var/www/site-images# time getfattr -m . -d -e hex
>>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>>> getfattr: templates/assets/prod/temporary/13/user_1339200.png: Input/output error
>>>>>>
>>>>>> real 0m11.315s
>>>>>> user 0m0.001s
>>>>>> sys 0m0.003s
>>>>>> root at web4:/var/www/site-images# ls
>>>>>> templates/assets/prod/temporary/13/user_1339200.png
>>>>>> ls: cannot access templates/assets/prod/temporary/13/user_1339200.png: Input/output error
>>>>>>
>>>>>