[Gluster-devel] Consistent time attributes (ctime, atime and mtime) across replica set and distribution set

Niels de Vos ndevos at redhat.com
Tue Feb 28 07:20:18 UTC 2017

On Tue, Feb 28, 2017 at 11:21:55AM +0530, Mohammed Rafi K C wrote:
> Hi All,
> We discussed the problem $subject in the mail thread [1]. Based on the
> comments and suggestions I will summarize the design (Made as points for
> simplicity.)
> 1) As part of each fop, the top layer will generate a timestamp and pass
> it down along with the other parameters.
>     1.1) This will introduce a dependency on NTP-synced clients as well
> as servers.

What do you mean by "top layer"? Is this on the Gluster client, or
does the time get inserted on the bricks?

I think we should not require a hard dependency on NTP, but have it
strongly suggested. Having a synced time in a clustered environment is
always helpful for reading and matching logs.

>     1.2) There can be a difference in time if the fop is stuck in an
> xlator for various reasons, for example because of locks.

Or just slow networks? Blocking (mandatory?) locks should be handled
correctly. The time a FOP is blocked can be long.
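
To make the concern concrete, here is a minimal Python sketch (not Gluster
code; `Fop`, `issue_fop` and `apply_on_brick` are hypothetical names, and
the clock values are made up). It assumes the timestamp is generated once
where the FOP enters the stack and is carried along unchanged, so the value
that is finally stored is the issue time, however long the FOP was blocked:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fop:
    """A file operation carrying the timestamp assigned at the top layer."""
    name: str
    issued_at: float  # seconds since the epoch, set once and never changed


def issue_fop(name, now):
    """Top layer (hypothetical): stamp the FOP once with the given clock value."""
    return Fop(name=name, issued_at=now)


def apply_on_brick(fop, brick_now):
    """Brick side (hypothetical): store the carried timestamp, report the lag."""
    lag = brick_now - fop.issued_at
    return {"mtime": fop.issued_at, "ctime": fop.issued_at, "lag": lag}


fop = issue_fop("writev", now=1000.0)
# The FOP reaches the brick 12.5 seconds later, e.g. after waiting on a lock.
result = apply_on_brick(fop, brick_now=1012.5)
```

The stored mtime/ctime then trail the moment the data actually changed on
disk by the full blocking time, which is the difference point 1.2 refers to.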

> 2) On the server, the posix layer stores the value in memory (inode ctx)
> and will sync the data periodically to disk as an extended attribute.
>      2.1) Of course a sync call will also force it. And if a fop comes
> for an inode which is not linked, we do the sync immediately.

Does it need to be in the posix layer?
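
Whichever layer ends up owning it, the behaviour described in point 2 could
be sketched roughly as below. This is a Python illustration only, not
Gluster code: `InodeTimeCache`, the dict standing in for on-disk xattrs and
the 5 second flush interval are all assumptions made for the example.

```python
class InodeTimeCache:
    """In-memory time attributes for one inode, written back lazily."""

    def __init__(self, xattr_store, key, flush_interval=5.0):
        self.xattr_store = xattr_store  # dict standing in for on-disk xattrs
        self.key = key                  # hypothetical inode identifier
        self.flush_interval = flush_interval
        # Start from whatever is already persisted for this inode.
        self.times = dict(xattr_store.get(key, {}))
        self.dirty = False
        self.last_flush = 0.0

    def update(self, name, value, now, linked=True):
        """Record a new value; flush if the interval passed or inode is unlinked."""
        self.times[name] = value
        self.dirty = True
        if not linked or now - self.last_flush >= self.flush_interval:
            self.flush(now)

    def flush(self, now):
        """Write the cached values to the backing store (also used for sync)."""
        if self.dirty:
            self.xattr_store[self.key] = dict(self.times)
            self.dirty = False
        self.last_flush = now


disk = {}
cache = InodeTimeCache(disk, "inode-1")
cache.update("mtime", 100.0, now=1.0)  # kept in memory only
cache.update("mtime", 101.0, now=6.0)  # interval elapsed, written to "disk"
```

Anything updated between two flushes is lost if the brick process dies,
so the flush interval is effectively the durability window for the times.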

> 3) Each time an inode is created or initialized, it reads the data
> from disk and stores it.
> 4) Before setting it in the inode_ctx we compare the stored timestamp
> with the timestamp received, and only store the received value if the
> stored value is less than it.
> 5) So in the best case data will be stored in and retrieved from memory.
> We replace the values in iatt with the values in inode_ctx.
> 6) File ops that change the parent directory's time attributes need to
> be consistent across all the distributed directories across the
> subvolumes (e.g. a create call will change ctime and mtime of the parent
> dir).
>      6.1) This has to be handled separately because we only send the fop
> to the hashed subvolume.
>      6.2) We can asynchronously send the time-update setattr fop to the
> other subvolumes and change the values for the parent directory if the
> file fop is successful on the hashed subvolume.
>      6.3) This will have a window where the times are inconsistent
> across dht subvolumes (please provide your suggestions).

Isn't this the same problem for 'normal' AFR volumes? I guess self-heal
needs to know how to pick the right value for the [cm]time xattr.
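
One possible rule, purely as an assumption and not something self-heal
does today, is "newest value wins" per attribute, which is the same
comparison as in point 4. A small Python sketch (`merge_time` and
`heal_times` are hypothetical names, the timestamps are made up):

```python
def merge_time(stored, received):
    """Keep the stored value unless the received one is newer (point 4)."""
    if stored is None or stored < received:
        return received
    return stored


def heal_times(replicas):
    """Pick, per attribute, the newest value seen on any replica or subvolume."""
    healed = {}
    for times in replicas:
        for name, value in times.items():
            healed[name] = merge_time(healed.get(name), value)
    return healed


bricks = [
    {"ctime": 105.0, "mtime": 105.0},  # hashed subvolume, already updated
    {"ctime": 100.0, "mtime": 100.0},  # other subvolume, update still in flight
]
healed = heal_times(bricks)
```

Because the comparison never moves a time backwards, applying the healed
values in any order, or more than once, gives the same result. It does
assume the clocks that produced the timestamps are reasonably in sync.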

> 7) Currently we have a couple of mount options for time attributes, like
> noatime, relatime, nodiratime etc. But we do not explicitly handle
> those options even if they are given as mount options for a gluster
> mount. [2]

Where is the URL for [2]?

>      7.1) We always rely on the behaviour of the backend storage layer:
> if you have given those mount options when mounting your disk, you will
> get this behaviour.

These options are for "not writing the atime", so if there is a client
that does not use these options for mounting, the atime will be updated
upon each access. Using these options on the brick-level, and not
through fuse, nfs or smb, would prevent it for all clients. Those are two
separate use-cases, and both probably need to be handled in the future.
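
For reference, the decision these options control can be sketched like
this in Python. It is modelled on the Linux mount option semantics as I
understand them (relatime updates the atime only when it is not newer than
mtime or ctime, or when it is at least a day old); it is not Gluster code
and `should_update_atime` is a hypothetical helper:

```python
DAY = 24 * 60 * 60  # seconds


def should_update_atime(option, atime, mtime, ctime, now, is_dir=False):
    """Return True if an access at time `now` should rewrite the atime."""
    if option == "noatime":
        return False       # never update on access
    if option == "nodiratime":
        return not is_dir  # skip directories only
    if option == "relatime":
        # Update only if the atime is stale relative to a modification,
        # or has not been refreshed for at least 24 hours.
        return atime <= mtime or atime <= ctime or now - atime >= DAY
    return True            # strictatime: update on every access


# File read shortly after its last access, no modification in between.
recent = should_update_atime("relatime", atime=10, mtime=5, ctime=5, now=20)
```

With relatime the atime is left alone in that last case, since it is
already newer than both mtime and ctime and less than a day old.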

>      7.2) Now if we are making the effort to fix the consistency issue,
> do we need to honour those options on our own?

I do not think you need to handle them; just rely on the filesystems
(fuse, nfs and smb) to take care of it. However, check whether Samba or
NFS-Ganesha have config options for these. If they do, we might need to
be able to tune them too.

> Please provide your comments and suggestions.

Please update https://bugzilla.redhat.com/show_bug.cgi?id=1318493 with
your findings too.

When this is fixed, caching solutions (like FS-Cache for NFS, SMB) will
work much better. As mentioned in the bug, we would be able to add a
"birth time" attribute as well.


> [1] :
> http://lists.gluster.org/pipermail/gluster-devel/2016-January/048003.html
> Regards
> Rafi KC