[Gluster-devel] Automated split-brain resolution

Fri Aug 8 10:02:45 UTC 2014

On 08/08/2014 01:09 PM, Harshavardhana wrote:
> On Thu, Aug 7, 2014 at 1:35 AM, Ravishankar N <ravishankar at redhat.com> wrote:
>> Manual resolution of split-brains [1] has been a tedious task involving
>> understanding and modifying AFR's changelog extended attributes. To simplify
>> and to an extent automate this task, we are proposing a new CLI command with
>> which the user can specify what the source brick/file is, and
>> automatically heal the files in the appropriate direction.
>>
>> Command: gluster volume resolve-split-brain <VOLNAME> {<bigger_file>  |
>> source-brick <brick_name> [<file>] }
>>
>> Breaking up the command into its possible options, we have:
>>
>> a) gluster volume resolve-split-brain <VOLNAME> <bigger_file>
>> When this command is executed, AFR will consider the brick having the
>> highest file size as the source and heal it to all other bricks (including
>> all other sources and sinks) in that replica subvolume. If the file size is
>> same in all the bricks, it does *not* heal the file.
>>
>> b) gluster volume resolve-split-brain <VOLNAME> source-brick <brick_name>
>> [<file>]
>>
>> When this command is executed, if <file> is specified, AFR heals the file
>> from the source-brick <brick_name> to all other bricks of that replica
>> subvolume. For resolving multiple files, the command must be run
>> iteratively, once per file.
>> If <file> is not specified, AFR heals all the files that have an entry in
>> .glusterfs/indices/xattrop *and* are in split-brain. As before, heals happen
>> from source-brick <brick_name> to all other bricks.
>>
>> Future work could also include extending the command to add other policies
>> like choosing the file having the latest mtime as the source, integration
>> with trash xlator wherein the files deleted from the sink are moved to the
>> trash dir etc.
>>
> I have a few queries regarding the overall design itself.
>
> Here are the caveats
>
>     - Adding a new option rather than extending an existing option
> 'gluster volume heal'.

This does make sense.

>     - Asking user to input the filename which is not necessary as
> default since such files are already
>       available through the 'gluster volume heal <volname> info split-brain'

As of today, `info split-brain` is not 100% accurate. It does not list
entries that are in gfid split-brain (but we are not attempting to heal
those through the gluster CLI for now anyway), and for files that are in
(meta)data split-brain, it lists only the last 1024 entries and
sometimes contains stale entries. But this will be fixed soon with a
gfapi based implementation, much like the `heal info` command
(glfs-heal.c) in the 3.5 release.

>
> What would be ideal is the following making it seamless and much more
> user friendly
>
> Extend the existing CLI as following
>
>   - 'gluster volume heal <volname> split-brain'

Agreed.

>
> Healing split-brained files is more palpable and has a rather more
> convincing tone for a sys-admin IMHO.
>
> An example version of this extension would be.
>
> 'gluster volume heal <volname> split-brain [<file>|<gfid as canonical form>]'
>
> In-fact since we already know the list of split-brained files we can
> just loop through them and ask interactive questions
>
> # gluster volume heal <volname> split-brain
> WARNING: About to start fixing split brained files on an active
> GlusterFS volume, do you wish to proceed? y
>
> WARNING: files removed would be actively backed up in '.trash' under
> your brick path for future recovery.
> ...
> WARNING: Found 1000 files in split brain
> ...
> File on pair 'host1:host2' is in split brain, file with latest
> time-stamp found on host1 - Fix? y
> File on pair 'host3:host5' is in split brain. file with biggest size
> found on host5 - Fix? y
> ....
> ....
> ....
> ....
> ************ Fixed (1000 split brain files) ************

While we could extend the existing heal command, we also need to provide 
a policy flag. Entering "y/n" for 1000 files does not make the process 
any easier.

> # gluster volume heal <volname> split-brain
> INFO: no split brains present on this <volume>
>
> The real pain point of fixing the split brain is not taking getfattr
> outputs and figuring out what is the file under conflict, the real
> pain point is doing the gfid to the actual file translation when there
> are millions of files. Gathering this list takes more time than
> actually fixing the split brain and i have personally spent countless
> hrs doing these.

I don't follow this part completely. If `info split-brain` gives you the
gfid instead of the file path, you could just go to the .glusterfs/<gfid
hardlink> and do a setfattr there.

>
> Now this list is easily available to GlusterFS and also its gfid to
> path translation - why isn't it simple enough for us to ask the user
> what we think is the right choice - we do certainly know which is the
> bigger file too.
>
> My general contention is when we know what is the right thing to do
> under certain conditions we should be making it easier for example:
> Directory metadata split brains - we just fix it automatically today
> but certainly wasn't the case in the past. We learnt to do the right
> thing when its necessary from experience.

Sure, we have info on which file is bigger or which one has the
latest ctime, but the bigger file need not always be the source (a
truncated file could be the pristine copy). So the choice has to be
given to the user via a policy. To make automation easier, it makes
more sense to either apply the policy to all files as a whole, or run
the command once per file with a policy of choice. Running the command
only once and asking for the policy (per file) in the middle of
execution is not amenable to automation. The user can redirect
`info split-brain` to a file and then script something to run the
command for each entry in the file. This also makes it easy to
integrate with a GUI: click 'get files in sb' and you have a scroll-down
list of files with policies against each file. Select a file, tick the
policy, click 'resolve-sb' and done!
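
A rough sketch of the "script something" approach, in Python. It assumes
the `info split-brain` output has already been reduced to one file path
per line (the real output has more structure), and it uses the command
syntax proposed in this thread, which is not an existing CLI. The helper
name `build_heal_commands` and the sample volume, brick and file names
are hypothetical. It only builds the argument lists and prints them;
handing each list to `subprocess.run` would actually execute it.

```python
def build_heal_commands(volname, entries, policy, source_brick=None):
    """Return one argv list per split-brain entry.

    policy is "bigger-file" or "source-brick", following the syntax
    proposed in this thread (an assumption, not a released CLI).
    """
    if policy not in ("bigger-file", "source-brick"):
        raise ValueError("unknown policy: %s" % policy)
    if policy == "source-brick" and not source_brick:
        raise ValueError("source-brick policy needs a brick name")

    commands = []
    for line in entries:
        path = line.strip()
        if not path:
            continue  # skip blank lines in the saved list
        cmd = ["gluster", "volume", "heal", volname, "split-brain", policy]
        if policy == "source-brick":
            cmd.append(source_brick)
        cmd.append(path)
        commands.append(cmd)
    return commands


# Hypothetical saved list, as if read from the redirected file.
saved_list = ["/dir/a.txt\n", "\n", "/dir/b.txt\n"]
for cmd in build_heal_commands("testvol", saved_list,
                               "source-brick", "host1:/bricks/b1"):
    print(" ".join(cmd))
```

Since the policy is fixed up front, the same loop works unattended for
one file or for a thousand.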
>
> A greater UI experience make it really 'automated' as you intend to
> do, to make larger decisions ourselves and users are left with simple
> choices to be made so that its not confusing.
>

So we now have the command:
# gluster volume heal <VOLNAME> [full | info [split-brain] | split-brain
{bigger-file | source-brick <brick_name>} [<file>] ]

The relevant new extension being:
gluster volume heal <VOLNAME> split-brain {bigger-file | source-brick
<brick_name>} [<file>]
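
For reference, the source selection behind `bigger-file` could be
sketched roughly as below, in Python. This is only an illustration of
the rule from the original proposal (largest copy is the source, no heal
when all sizes are equal), not AFR code. The function name
`pick_bigger_file_source`, the brick names and the dict input are
hypothetical, and skipping the heal when several bricks tie for largest
is an assumption of the sketch.

```python
def pick_bigger_file_source(sizes):
    """sizes maps brick name -> size in bytes of the file on that brick.

    Returns the brick to use as the heal source, or None when no heal
    should happen.
    """
    if not sizes:
        return None
    largest = max(sizes.values())
    candidates = [brick for brick, size in sizes.items() if size == largest]
    # All sizes equal: the proposal says the file is *not* healed.
    # Several (but not all) bricks tied for largest: also skipped here,
    # which is an assumption of this sketch, not part of the proposal.
    if len(candidates) != 1:
        return None
    return candidates[0]


print(pick_bigger_file_source({"host1:/b1": 4096, "host2:/b1": 1024}))
print(pick_bigger_file_source({"host1:/b1": 4096, "host2:/b1": 4096}))
```

The first call picks host1:/b1 and the second returns None, leaving the
file for the `source-brick` form of the command.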

Does this look good? Thanks for your feedback :)

-Ravi