[Gluster-users] Initial sync

Ravishankar N ravishankar at redhat.com
Thu Nov 6 15:24:03 UTC 2014

On 11/05/2014 10:53 PM, Andreas Hollaus wrote:
> Hi,
> Maybe my GlusterFS source code is a bit old, but these scripts seem to 
> be referred to as filters.
I do not know what the below references are either.
> ### glusterd-volgen.c ###
> .
> .
> .
> static void
> volgen_apply_filters (char *orig_volfile)
> {
>     DIR           *filterdir = NULL;
>     struct dirent  entry = {0,};
>     struct dirent *next = NULL;
>     char          *filterpath = NULL;
>     struct stat    statbuf = {0,};
>     filterdir = opendir(FILTERDIR);
> .
> .
> .
> Anyway, I was previously told to use a sed script as the volume files 
> will be overwritten whenever an option is set using a CLI command. I 
> have created such a script, but I wonder where I shall store it 
> (according to the Makefile, the path depends on installation dir and 
> release version and I'm only sure about the latter)?
> Maybe if I, as you say, edit the file before the volume is started 
> this favorite-child setting will be read and then part of the volume 
> settings that are stored whenever a CLI command is executed. No, I 
> just tried this and the favorite-child option was removed when I later 
> on set a new ping-timeout value. Seems like a script is required after 
> all to make sure that the setting is persistent. I hope though that 
> this setting is read from the volume file and handled, even though it 
> is not rewritten in case I set other options(?).
Right, edits to the volfile will be lost when a new volume set operation is
made, as the files get rewritten. So you would have to use hook
scripts[1]. Since you want to retain the favorite-child option after every
volume set operation, you must place your sed script in
/var/lib/glusterd/hooks/1/set/post/.
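
As a rough illustration of what such a post-set script has to do, here is a
minimal sketch in Python rather than sed. It only handles the text
transformation: it adds the favorite-child line to every cluster/replicate
block that does not already have it, so it is safe to run after each volume
set. The function names are made up for this example, and how the script is
named, made executable and passed the volume name is not shown; check [1]
for those details.

```python
def _patch_block(block, child):
    """Add the favorite-child option to one volume block if it is a
    cluster/replicate block that does not carry the option yet."""
    stripped = [line.strip() for line in block]
    if "type cluster/replicate" not in stripped:
        return block
    if any(s.startswith("option favorite-child") for s in stripped):
        return block  # already present, leave the block untouched
    i = stripped.index("type cluster/replicate")
    return block[:i + 1] + ["    option favorite-child " + child] + block[i + 1:]


def add_favorite_child(volfile_text, child):
    """Return volfile_text with 'option favorite-child <child>' present in
    every cluster/replicate block. Hypothetical helper, not part of GlusterFS."""
    out, block = [], []
    for line in volfile_text.splitlines():
        block.append(line)
        if line.strip() == "end-volume":
            out.extend(_patch_block(block, child))
            block = []
    out.extend(block)  # lines after the last end-volume, if any
    return "\n".join(out) + "\n"
```

A hook script would read the client volfile, pass its contents through
add_favorite_child() with the name of the preferred client subvolume (for
example testvol-client-1), and write the result back.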

Here is a nice example of how to use them: 

[1] http://www.gluster.org/community/documentation/index.php/Features/Hooks
> By the way, you told me to edit the 'trusted-<volname>.vol' file, but 
> I also have a '<volname>.vol' file with similar contents. What's the 
> difference between these and is it only the 'trusted-<volname>.vol' 
> that is supposed to be edited?

Sorry, I left out the difference. If the client (mount) is on the same
node as the server(s), you need to edit the trusted-<volname>.vol file. If
your client is a separate node, you need to edit the other one
(<volname>.vol).
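
To make the rule concrete, here is a small sketch that picks the file to
edit. The directory and the trusted-<volname>-fuse.vol name follow the
testvol example quoted further down; the non-trusted name
(<volname>-fuse.vol) is an assumption by analogy, so check what your
glusterd actually writes under /var/lib/glusterd/vols/<volname>/. The
function name is made up for this example.

```python
import posixpath


def client_volfile_path(volname, client_on_server,
                        base="/var/lib/glusterd/vols"):
    """Return the client volfile to edit (hypothetical helper).

    client_on_server: True if the mount runs on one of the server nodes,
    in which case the 'trusted-' variant is the one that gets used.
    The '-fuse.vol' suffix is taken from the example in this thread and
    may differ between GlusterFS releases.
    """
    prefix = "trusted-" if client_on_server else ""
    return posixpath.join(base, volname, prefix + volname + "-fuse.vol")
```

For a mount on the same board as the server, client_volfile_path("testvol",
True) gives the trusted-testvol-fuse.vol path used in the example below.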

> Regards
> Andreas
> On 11/05/14 16:43, Ravishankar N wrote:
>> On 11/05/2014 06:54 PM, Andreas Hollaus wrote:
>>> On 11/05/14 12:23, Ravishankar N wrote:
>>>> On 11/05/2014 03:18 PM, Andreas Hollaus wrote:
>>>>> Hi,
>>>>> I'm curious about this 5 phase transaction scheme that is described in the document
>>>>> (lock, pre-op, op, post-op, unlock).
>>>>> Are these stage switches all triggered from the client or can the server do it
>>>>> without notifying the client, for instance switching from 'op' to 'post-op'?
>>>> All stages are performed by the AFR translator in the client graph, where it is
>>>> loaded, in the sequence you listed.
>>> So the counters are stored on the servers (as extended attributes on the bricks), but
>>> increased and decreased by the client after fetching them from the servers? If so, I
>>> guess that the messages between those are just synchronous file system operations
>>> like read extended attributes, write file etc.
>> You got it right. Lock the file on the bricks, set xattrs on bricks, 
>> write to bricks, clear xattrs on bricks (success case), unlock file 
>> on bricks.
>>> Is the client created whenever a GlusterFS volume is mounted?
>> Correct. You give the hostname and volume name to the mount process, which
>> uses them to fetch the volfile graph from the server, reads it and loads
>> the appropriate xlators.
>>>   As I'm running both
>>> server and client on the same board it's a bit hard to distinguish them from each other.
>>>>> Decreasing the counter for the local pending operations could be done without talking
>>>>> to the client, even though I realize a message has to be sent to the other server(s),
>>>>> possibly through the client.
>>>>> The reason I ask is that I'm trying to estimate the risk of ending up in a split
>>>>> brain situation, or at least understand if our servers will 'accuse' each other
>>>>> temporarily during this 5 phase transaction under normal circumstances. If I
>>>>> understand who sends messages to whom and in what order, I'll have a better chance to
>>>>> see if we require any solution to split brain situations. As I've experienced
>>>>> problems setting up the 'favorite-child' option, I want to know if it's required or
>>>>> not. In our use case, quorum is not a solution, but losing some data is acceptable as
>>>>> long as the bricks are in sync.
>>>> If a file is split-brained, AFR does not allow modifications by clients on it
>>>> until the split-brain is resolved. The afr xattrs and heal mechanisms ensure that
>>>> the bricks are in sync, so no worries on that front.
>>> I know about the input/output error in case of a split brain and that is something we
>>> must avoid at any cost. That's the reason why 'favorite-child' seems like a good idea
>>> for us, but my filter script is not executed even though I tried a couple of probable
>>> locations to store it at. It's a bit hard to be absolutely sure what that filter path
>>> macro contained at the time the GlusterFS package was built. It would have been
>>> easier if the path existed, even though it was empty if no filters were used.
>>> According to the source code, there are some return statements due to errors that
>>> could also be the reason for not running the filter script. Are there any ways to set
>>> verbose level to get some more clues to what's going on?
>> Not sure I follow you on what a filter script is (hook scripts?), but
>> yes, you can use the favorite-child option to pick the source for
>> split-brained files. I don't think it's a supported/tested feature
>> though. It can't be set using the gluster CLI. You will have to edit the
>> volfile manually and add this option before starting the volume, like so:
>> # cat /var/lib/glusterd/vols/testvol/trusted-testvol-fuse.vol
>> <snip>
>> volume testvol-replicate-0
>>     type cluster/replicate
>>     option favorite-child testvol-client-1
>>     subvolumes testvol-client-0 testvol-client-1
>> end-volume
>> </snip>
>> -Ravi
>>> Regards
>>> Andreas
>>>> Thanks,
>>>> Ravi
>>>>> Regards
>>>>> Andreas
>>>>> On 10/31/14 15:37, Ravishankar N wrote:
>>>>>> On 10/30/2014 07:23 PM, Andreas Hollaus wrote:
>>>>>>> Hi,
>>>>>>> Thanks! Seems like an interesting document. Although I've read blogs about how
>>>>>>> extended attributes are used as a change log, this seems like a more comprehensive
>>>>>>> document.
>>>>>>> I won't write directly to any brick. That's the reason I first have to create a
>>>>>>> volume which consists of only one brick, until the other server is available, and
>>>>>>> then add that second brick. I don't want to delay the file system clients until the
>>>>>>> second server is available, hence the reason for add-brick.
>>>>>>> I guess that this procedure is only needed the first time the volume is configured,
>>>>>>> right? If any of these bricks would fail later on, the change log would keep
>>>>>>> track of all changes to the file system even though only one of the bricks is
>>>>>>> available(?).
>>>>>> Yes, if one brick of a replica pair goes down, the other one keeps track of
>>>>>> file modifications by the client, and would sync it back to the first one when it
>>>>>> comes back up.
>>>>>>> After a restart, volume settings stored in the configuration file would be accepted
>>>>>>> even though not all servers were up and running yet at that time, wouldn't they?
>>>>>> glusterd running on all nodes ensures that the volume configurations stored on each
>>>>>> node are in sync.
>>>>>>> Speaking about configuration files. When are these copied to each server?
>>>>>>> If I create a volume which consists of two bricks, I guess that those servers will
>>>>>>> create the configuration files, independently of each other, from the information
>>>>>>> sent from the client (gluster volume create...).
>>>>>> All volume config/management commands must be run from any of the servers that make
>>>>>> up the volume and not the client (unless both happen to be in the same machine). As
>>>>>> mentioned above, when any of the volume commands are run on any one server,
>>>>>> glusterd orchestrates the necessary action on all servers and keeps them in sync.
>>>>>>> In case I later on add a brick, I guess that the settings have to be copied
>>>>>>> to the new brick after they have been modified on the first one, right (or will
>>>>>>> they be recreated on all servers from the information specified by the client,
>>>>>>> like in the previous case)?
>>>>>>> Will configuration files be copied in other situations as well, for instance in
>>>>>>> case one of the servers which is part of the volume for some reason would be
>>>>>>> missing those files? In my case, the root file system is recreated from an image
>>>>>>> at each reboot, so everything created in /etc will be lost. Will GlusterFS
>>>>>>> settings be restored from the other server automatically
>>>>>> No, it is expected that servers have persistent file-systems.  There are ways to
>>>>>> restore such bricks; see
>>>>>> http://gluster.org/community/documentation/index.php/Gluster_3.4:_Brick_Restoration_-_Replace_Crashed_Server
>>>>>> -Ravi
>>>>>>> or do I need to back up and restore those myself? Even
>>>>>>> though the brick doesn't know that it is part of a volume in case it loses the
>>>>>>> configuration files, both the other server(s) and the client(s) will probably
>>>>>>> recognize it as being part of the volume. I therefore believe that such a
>>>>>>> self-healing would actually be possible, even though it may not be implemented.
>>>>>>> Regards
>>>>>>> Andreas
>>>>>>>    On 10/30/14 05:21, Ravishankar N wrote:
>>>>>>>> On 10/28/2014 03:58 PM, Andreas Hollaus wrote:
>>>>>>>>> Hi,
>>>>>>>>> I'm curious about how GlusterFS manages to sync the bricks in the initial phase,
>>>>>>>>> when the volume is created or extended.
>>>>>>>>> I first create a volume consisting of only one brick, which clients will start to
>>>>>>>>> read and write.
>>>>>>>>> After a while I add a second brick to the volume to create a replicated volume.
>>>>>>>>> If this new brick is empty, I guess that files will be copied from the first
>>>>>>>>> brick to get the bricks in sync, right?
>>>>>>>>> However, if the second brick is not empty but rather contains a subset of the
>>>>>>>>> files on the first brick, I don't see how GlusterFS will solve the problem of
>>>>>>>>> syncing the bricks.
>>>>>>>>> I guess that all files which lack extended attributes could be removed in this
>>>>>>>>> scenario, because they were created when the disk was not part of a GlusterFS
>>>>>>>>> volume. However, in case the brick was used in the volume previously, for
>>>>>>>>> instance before that server restarted, there will be extended attributes for
>>>>>>>>> the files on the second brick which weren't updated during the downtime (when
>>>>>>>>> the volume consisted of only one brick). There could be multiple changes to the
>>>>>>>>> files during this time. In this case I don't understand how the extended
>>>>>>>>> attributes could be used to determine which of the bricks contains the most
>>>>>>>>> recent file.
>>>>>>>>> Can anyone explain how this works? Is it only allowed to add empty bricks to a
>>>>>>>>> volume?
>>>>>>>> It is allowed to add only empty bricks to the volume. Writing directly to
>>>>>>>> bricks is not supported. One needs to access the volume only from a mount
>>>>>>>> point or using libgfapi.
>>>>>>>> After adding a brick to increase the distribute count, you need to run the
>>>>>>>> volume rebalance command so that some of the existing files are hashed
>>>>>>>> (moved) to this newly added brick.
>>>>>>>> After adding a brick to increase the replica count, you need to run the
>>>>>>>> volume heal full command to sync the files from the other replica into the
>>>>>>>> newly added brick.
>>>>>>>> https://github.com/gluster/glusterfs/blob/master/doc/features/afr-v1.md will give
>>>>>>>> you an idea of how the replicate translator uses xattrs to keep files in sync.
>>>>>>>> HTH,
>>>>>>>> Ravi
