[Gluster-devel] Consistent time attributes (ctime, atime and mtime) across replica set and distribution set

Tue Feb 28 07:41:22 UTC 2017

Thanks for the reply. Comments are inline.

On 02/28/2017 12:50 PM, Niels de Vos wrote:
> On Tue, Feb 28, 2017 at 11:21:55AM +0530, Mohammed Rafi K C wrote:
>> Hi All,
>>
>>
>> We discussed the problem in $subject in the mail thread [1]. Based on the
>> comments and suggestions, I will summarize the design (as a list of points,
>> for simplicity).
>>
>>
>> 1) As part of each fop, the top layer will generate a timestamp and pass it
>> down along with the other parameters.
>>
>>     1.1) This will bring in a dependency on NTP-synced clients, along with
>> the servers.
> What do you mean by "top layer"? Is this on the Gluster client, or
> does the time get inserted on the bricks?
It is the top layer (master xlator) in the client graph, such as fuse, gfapi
or nfs. My mistake, I should have mentioned that. Sorry about that.

>
> I think we should not require a hard dependency on NTP, but have it
> strongly suggested. Having a synced time in a clustered environment is
> always helpful for reading and matching logs.
Agreed, but if we go with option 1, where the timestamp is generated on the
client, the times will not be in sync unless the clients are synced with NTP.

>
>>     1.2) There can be a difference in time if the fop is stuck in an xlator
>> for various reasons, for example because of locks.
> Or just slow networks? Blocking (mandatory?) locks should be handled
> correctly. The time a FOP is blocked can be long.
True. The question is whether this delay can be included in the timestamp
value, because if the timestamp is generated in, say, fuse, then by the time
the fop reaches the brick the time may have moved ahead. What do you think?
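
To make the question concrete, here is a minimal sketch in Python (not Gluster
code; `Fop`, `stamp_fop` and `brick_delay` are hypothetical names used only for
illustration). The fop is stamped once at the top layer, and the brick can only
see the total gap between that stamp and its own clock:

```python
import time
from dataclasses import dataclass


@dataclass
class Fop:
    """Hypothetical stand-in for a file operation travelling down the graph."""
    name: str
    timestamp: float = 0.0  # seconds since the epoch, set by the top layer


def stamp_fop(name, clock=time.time):
    """Top layer (fuse, gfapi, nfs): generate the timestamp once per fop."""
    return Fop(name=name, timestamp=clock())


def brick_delay(fop, clock=time.time):
    """On the brick: how far the brick's clock is ahead of the fop's stamp.

    The value mixes time spent blocked (locks, slow network) with any clock
    skew between client and brick; the two cannot be told apart here.
    """
    return clock() - fop.timestamp


# Simulated clocks: the fop is stamped at t=100 and reaches the brick at t=103.
fop = stamp_fop("create", clock=lambda: 100.0)
delay = brick_delay(fop, clock=lambda: 103.0)
print(fop.timestamp, delay)  # 100.0 3.0
```

In this sketch the stored time would be 100, not 103, which is exactly the
difference the question above is about.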

>
>> 2) On the server, the posix layer stores the value in memory (inode ctx)
>> and syncs it periodically to disk as an extended attribute.
>>
>>      2.1) Of course, a sync call will also force it. And if a fop comes for
>> an inode which is not linked, we do the sync immediately.
> Does it need to be in the posix layer?

You mean storing the time attributes? Then no, it need not be; protocol/server
is another candidate, but I feel posix is ahead in the race ;).
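
Whichever layer ends up owning it, the store-in-memory-and-sync-later behaviour
of point 2 could look roughly like the sketch below. It is written in Python
for brevity and is only an illustration: `TimeAttrCache`, `write_xattr` and the
5 second interval are assumptions, not existing Gluster names or settings.

```python
import time


class TimeAttrCache:
    """Hypothetical in-memory cache of time attributes, keyed by gfid."""

    def __init__(self, write_xattr, sync_interval=5.0, clock=time.monotonic):
        self._write_xattr = write_xattr      # callback that persists to disk
        self._sync_interval = sync_interval  # assumed value, in seconds
        self._clock = clock
        self._ctx = {}                       # gfid -> {"mtime": ..., ...}
        self._dirty = set()
        self._last_sync = clock()

    def update(self, gfid, times, linked=True):
        self._ctx.setdefault(gfid, {}).update(times)
        self._dirty.add(gfid)
        if not linked:
            # Point 2.1: an inode that is not linked is synced immediately.
            self.sync(gfid)
        elif self._clock() - self._last_sync >= self._sync_interval:
            self.sync()

    def sync(self, gfid=None):
        """Flush one inode, or every dirty inode when gfid is None."""
        targets = [gfid] if gfid is not None else list(self._dirty)
        for g in targets:
            if g in self._dirty:
                self._write_xattr(g, dict(self._ctx[g]))
                self._dirty.discard(g)
        if gfid is None:
            self._last_sync = self._clock()


# Usage with a fake disk and a fake clock.
disk = {}
now = [0.0]


def write_xattr(gfid, times):
    disk[gfid] = times


cache = TimeAttrCache(write_xattr, sync_interval=5.0, clock=lambda: now[0])
cache.update("gfid-1", {"mtime": 10})
on_disk_before_interval = dict(disk)    # still empty, the value is only in memory
now[0] = 6.0
cache.update("gfid-1", {"mtime": 11})   # interval elapsed, dirty inodes are flushed
cache.update("gfid-2", {"mtime": 12}, linked=False)  # flushed immediately
print(on_disk_before_interval, disk)
```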

>
>> 3) Each time an inode is created or initialized, it reads the data
>> from disk and stores it.
>>
>>
>> 4) Before setting the value in inode_ctx, we compare the stored timestamp
>> with the received timestamp, and only store the received one if the stored
>> value is less than it.
>>
>>
>> 5) So in the best case, data will be stored in and retrieved from memory. We
>> replace the values in iatt with the values in inode_ctx.
>>
>>
>> 6) File ops that change the time attributes of the parent directory need to
>> be consistent across all the distributed directories across the subvolumes
>> (for example, a create call will change the ctime and mtime of the parent dir).
>>
>>      6.1) This has to be handled separately because we only send the fop to
>> the hashed subvolume.
>>
>>      6.2) We can asynchronously send the time-update setattr fop to the
>> other subvolumes and change the values for the parent directory if the file
>> fop is successful on the hashed subvolume.
>>
>>      6.3) This will have a window where the times are inconsistent
>> across dht subvolumes. (Please provide your suggestions.)
> Isn't this the same problem for 'normal' AFR volumes? I guess self-heal
> needs to know how to pick the right value for the [cm]time xattr.

Yes, and it needs to be healed, by both self-heal and dht. But until then
there can be a difference in the values.
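
As a rough illustration of the comparison in point 4 and of how a heal could
pick the value, here is a small Python sketch. The function names are
hypothetical, and "latest value wins" during heal is an assumption made for
this sketch, not a decided policy:

```python
def merge_time(stored, received):
    """Point 4: keep the later of the stored and the received timestamp."""
    if stored is None or stored < received:
        return received
    return stored


def heal_times(copies):
    """Pick, per attribute, the latest value seen on any replica or subvolume.

    Assumption for this sketch only: the latest value is the right one.
    """
    healed = {}
    for times in copies:
        for attr, value in times.items():
            healed[attr] = merge_time(healed.get(attr), value)
    return healed


brick_a = {"ctime": 105, "mtime": 105}
brick_b = {"ctime": 101, "mtime": 107}
kept = merge_time(105, 101)               # the older update is ignored
healed = heal_times([brick_a, brick_b])
print(kept, healed)  # 105 {'ctime': 105, 'mtime': 107}
```

Until such a heal has run, a reader hitting brick_a and a reader hitting
brick_b would still see different values, which is the window mentioned in 6.3.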

>
>> 7) Currently we have a couple of mount options for time attributes, like
>> noatime, relatime, nodiratime etc. But we do not explicitly handle those
>> options, even if they are given as mount options for a gluster mount. [2]
> Where is the URL for [2]?

Sorry, I missed it. Here is [2]:
http://lists.gluster.org/pipermail/gluster-devel/2008-March/032304.html
You can see the previous mails there.
>
>>      7.1) We always rely on the behaviour of the back-end storage layer: if
>> you have given those mount options when mounting your disk, you will get
>> this behaviour.
> These options are for "not writing the atime", so if there is a client
> that does not use these options for mounting, the atime will be updated
> upon each access. Using these options on the brick-level, and not
> through fuse, nfs or smb would prevent it for all clients. Those are two
> use-cases, they probably need to be handled both in the future as well.

In that case, whoever is masking the time from the brick has to keep those
options in mind and act accordingly.
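
For reference, the decision such a layer would have to reproduce is small. The
Python sketch below follows the Linux relatime rule as I understand it (update
atime if it is not later than mtime or ctime, or if it is at least a day old).
The function name and the single `mode` argument are simplifications invented
for this sketch; real mounts can combine these options.

```python
DAY = 24 * 60 * 60  # seconds


def should_update_atime(atime, mtime, ctime, now, mode="relatime", is_dir=False):
    """Decide whether an access should update atime (simplified sketch)."""
    if mode == "noatime":
        return False
    if mode == "nodiratime":
        # Simplification: only directories are exempt from atime updates.
        return not is_dir
    if mode == "relatime":
        return atime <= mtime or atime <= ctime or now - atime >= DAY
    return True  # any other mode is treated as "always update"


# File modified after it was last read: atime is updated.
stale = should_update_atime(atime=100, mtime=200, ctime=200, now=300)
# File read after its last change, less than a day ago: atime is left alone.
fresh = should_update_atime(atime=300, mtime=200, ctime=200, now=400)
# Same file, but the last read was a day ago: atime is updated again.
day_old = should_update_atime(atime=300, mtime=200, ctime=200, now=300 + DAY)
print(stale, fresh, day_old)  # True False True
```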

>
>>      7.2) Now, if we are making the effort to fix the consistency issue, do
>> we need to honour those options on our own?
> I do not think you need to handle them, and just rely on the filesystems
> (fuse, nfs and smb) to take care of it. However, check if Samba or
> NFS-Ganesha have config options for these, in that case, we might need
> to be able to tune it too.
>
>> Please provide your comments and suggestions.
> Please update https://bugzilla.redhat.com/show_bug.cgi?id=1318493 with
> your findings too.
>
> When this is fixed, caching solutions (like FS-Cache for NFS, SMB) will
> work much better. As mentioned in the BUG, we would be able to add a
> "birth time" attribute as well.
Cool, I will do that. I also see a comment asking for the same for
Elasticsearch and other applications.

>
> Thanks,
> Niels
>
>
>>
>> [1] :
>> http://lists.gluster.org/pipermail/gluster-devel/2016-January/048003.html
>>
>>
>> Regards
>>
>> Rafi KC
>>
>>