[Gluster-users] Advise on recovering from a bad replica please

Pranith Kumar Karampuri pkarampu at redhat.com
Thu Jun 26 17:32:50 UTC 2014


On 06/26/2014 04:10 AM, John Gardeniers wrote:
> Hi Pranith,
>
> jupiter currently has no gluster processes running.
> jupiter.om.net:/gluster_backup is a geo-replica.
>
> [root at nix ~]# gluster volume info
>   
> Volume Name: gluster-backup
> Type: Distribute
> Volume ID: 0905fb11-f95a-4533-ae1c-05be43a8fe1f
> Status: Started
> Number of Bricks: 1
> Transport-type: tcp
> Bricks:
> Brick1: jupiter.om.net:/gluster_backup
>   
> Volume Name: gluster-rhev
> Type: Replicate
> Volume ID: b210cba9-56d3-4e08-a4d0-2f1fe8a46435
> Status: Started
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: jupiter.om.net:/gluster_brick_1
> Brick2: nix.om.net:/gluster_brick_1
> Options Reconfigured:
> geo-replication.indexing: on
> nfs.disable: on

I am extremely sorry, I should have asked for this information 
yesterday as well.
1) What version of gluster are you using? In 3.4.x there is an issue 
where self-heal will not start while I/O is happening on a VM, which I 
believe is not the case in 3.5. So the version is important. I only 
remembered it this morning.
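
(If you are not sure of the version, one way to check, assuming the CLI 
is installed on both nodes, is:

gluster --version

The first line of the output shows the installed version.)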

2) I believe the number of files on the bricks should be quite small, 
considering it is a RHEV-M setup. Could you please also attach the 
output of the following command?

For each brick:
find <brick-path> | xargs getfattr -d -m. -e hex > file-you-need-to-send-us.txt
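
For example, on nix (assuming you run it as root directly against the 
brick directory; the output file name is just an example) that would be 
something like:

find /gluster_brick_1 | xargs getfattr -d -m. -e hex > nix-brick-xattrs.txt

and similarly for /gluster_brick_1 on jupiter once that filesystem is 
accessible again.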

This will let us see the xattrs of the files, so we can advise you on 
how to fix the split-brains where necessary.
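
(As a side note, if the volume and both bricks were up, and assuming 
your version supports it, you could also list the entries gluster 
itself has flagged with:

gluster volume heal gluster-rhev info split-brain

but the getfattr output above is what we need to decide on the fix.)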

Pranith

> regards,
> John
>
>
> On 25/06/14 19:05, Pranith Kumar Karampuri wrote:
>> On 06/25/2014 04:29 AM, John Gardeniers wrote:
>>> Hi All,
>>>
>>> We're using Gluster as the storage for our virtualization. This consists
>>> of 2 servers with a single brick each configured as a replica pair. We
>>> also have a geo-replica on one of those two servers.
>>>
>>> For reasons that don't really matter, last weekend we had a situation
>>> which caused one server to reboot a number of times, which in turn
>>> resulted in a lot of heal-failed and split-brain errors. Because VMs
>>> were being migrated across hosts at the same time, we ended up with
>>> many crashed VMs.
>>>
>>> Due to the need to get the VMs up and running as quickly as possible,
>>> we decided to shut down one Gluster replica and use the "primary" one
>>> alone. As the geo-replica is also on the node we shut down, that leaves
>>> us with just a single copy, which makes us rather nervous.
>>>
>>> As we have decided to treat the files on the currently running node as
>>> "correct", I'd appreciate advice on the best way to get the other node
>>> back into the replication. Should we simply bring it back online and
>>> try to correct the errors, which I expect will be many, or should we
>>> treat it as a failed server and bring it back with an empty brick,
>>> rather than with what is currently in the existing brick? The
>>> volume/bricks are 5TB, of which we're currently using around 2TB, and
>>> the servers are on a 10Gb network, so I imagine it shouldn't take too
>>> long to rebuild, and this would all be done out of hours anyway.
>> Considering you are saying there were split-brain related errors as
>> well, I suggest you bring up an empty brick.
>> Could you give the "gluster volume info" output and tell me which
>> brick went down? Based on that I will tell you what you need to do.
>>
>> Pranith
>>> regards,
>>> John
>>>
>>> _______________________________________________
>>> Gluster-users mailing list
>>> Gluster-users at gluster.org
>>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>



