[Gluster-devel] Consistent time attributes (ctime, atime and mtime) across replica set and distribution set

Wed Mar 15 21:43:48 UTC 2017

On Wed, Mar 15, 2017 at 11:31 PM, Soumya Koduri <skoduri at redhat.com> wrote:

> Hi Rafi,
>
> I haven't gone through the design thoroughly, but I have a few
> comments/queries, which I have posted inline for now.
>
> On 02/28/2017 01:11 PM, Mohammed Rafi K C wrote:
>
>> Thanks for the reply. Comments are inline.
>>
>>
>>
>> On 02/28/2017 12:50 PM, Niels de Vos wrote:
>>
>>> On Tue, Feb 28, 2017 at 11:21:55AM +0530, Mohammed Rafi K C wrote:
>>>
>>>> Hi All,
>>>>
>>>>
>>>> We discussed the problem $subject in the mail thread [1]. Based on the
>>>> comments and suggestions I will summarize the design (Made as points for
>>>> simplicity.)
>>>>
>>>>
>>>> 1) As part of each fop, the top layer will generate a timestamp and pass
>>>> it down along with the other parameters.
>>>>
>>>>     1.1) This will introduce a dependency on NTP-synced clients as well
>>>> as servers.
>>>>
>>> What do you mean with "top layer"? Is this on the Gluster client, or
>>> does the time get inserted on the bricks?
>>>
>> It is the top layer (master xlator) in the client graph, like fuse, gfapi,
>> nfs. My mistake, I should have mentioned it. Sorry for that.
>>
>
> These clients shouldn't include internal client processes like rebalance,
> self-heal daemons, right? IIUC from [1], we should avoid changing times
> during rebalance and self-heals.
>
> Also what about fops generated from the underlying layers -
> getxattr/setxattr which may modify these time attributes?
>
>
>>
>>
>>> I think we should not require a hard dependency on NTP, but have it
>>> strongly suggested. Having a synced time in a clustered environment is
>>> always helpful for reading and matching logs.
>>>
>> Agreed, but if we go with option 1, where we generate the time from the
>> client, then the time will not be in sync unless it is synced with NTP.
>>
>>
>>
>>
>>>>     1.2) There can be a difference in time if the fop is stuck in the
>>>> xlator for various reasons, e.g. because of locks.
>>>>
>>> Or just slow networks? Blocking (mandatory?) locks should be handled
>>> correctly. The time a FOP is blocked can be long.
>>>
>> True. The question is whether this can be included in the timestamp value,
>> because if it is generated from, say, fuse, then by the time it reaches the
>> brick the time may have moved ahead. What do you think about it?
>>
>>
>>
>>>> 2) On the server, the posix layer stores the value in memory (inode ctx)
>>>> and will sync the data periodically to disk as an extended attribute.
>>>>
>>> Will you use any timer thread for asynchronous update?
>
>
>>>>      2.1) Of course, a sync call will also force it. And if a fop comes
>>>> for an inode which is not linked, we do the sync immediately.
>>>>
>>> Does it need to be in the posix layer?
>>>
>>
>> You mean storing the time attr? Then it need not be; protocol/server
>> is another candidate, but I feel posix is ahead in the race ;).
>>
>
> I agree with Shyam and Niels that the posix layer doesn't seem right. Since
> having this support comes with a performance cost, how about a separate
> xlator (which shall be optional)?
>
>
>>
>>
>>>> 3) Each time an inode is created or initialized, it reads the data
>>>> from disk and stores it.
>>>>
>>>>
>>>> 4) Before setting the inode_ctx we compare the stored timestamp with the
>>>> received timestamp, and only store it if the stored value is less than
>>>> the current value.
>>>>
> If we choose not to set this attribute for the self-heal/rebalance daemons
> (as stated above), we would need special handling for the requests sent
> by them (i.e., to heal this time attribute as well on the destination
> file/dir).
>
>
>>>>
>>>> 5) So in the best case the data will be stored in and retrieved from
>>>> memory. We replace the values in iatt with the values in inode_ctx.
>>>>
>>>>
>>>> 6) File ops that change the parent directory's time attributes need to
>>>> be consistent across all the distributed directories across the
>>>> subvolumes. (e.g. a create call will change ctime and mtime of the
>>>> parent dir)
>>>>
>>>>      6.1) This has to be handled separately because we only send the fop
>>>> to the hashed subvolume.
>>>>
>>>>      6.2) We can asynchronously send the timeupdate setattr fop to the
>>>> other subvolumes and change the values for the parent directory if the
>>>> file fop is successful on the hashed subvolume.
>>>>
>>>
> The same needs to be handled even during DHT directory healing, right?
>
>
>>>>      6.3) This will have a window where the times are inconsistent
>>>> across DHT subvolumes. (Please provide your suggestions.)
>>>>
>>> Isn't this the same problem for 'normal' AFR volumes? I guess self-heal
>>> needs to know how to pick the right value for the [cm]time xattr.
>>>
>>
>> Yes, and they need to be healed, by both self-heal and DHT. But until then
>> there can be a difference in values.
>>
>
> Is this design targeting synchronization of only ctime/mtime? If 'atime'
> is also considered, then since the read/stat done by AFR shall modify atime
> only on the first subvol, even the AFR xlator needs to take care of
> updating the other subvols. The same goes for EC as well.
>

atime is updated on open, which is sent to all subvols in AFR/EC.
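
For reference, point 4 of the quoted design (store a received timestamp only
when it is newer than the stored one) can be sketched roughly as below. This
is only an illustration in Python, not Gluster code: the TimeAttrs type and
the merge_times helper are hypothetical names, and applying the comparison
separately to each of atime/mtime/ctime is an assumption, since the design
does not say whether the check is per attribute.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeAttrs:
    """Hypothetical stand-in for the time values kept in the inode ctx."""
    atime: float = 0.0
    mtime: float = 0.0
    ctime: float = 0.0


def merge_times(stored: TimeAttrs, received: TimeAttrs) -> TimeAttrs:
    """Return the values to keep: per attribute, the later of the two.

    A received value replaces the stored one only if the stored value is
    smaller, so a fop that arrives late with an old timestamp cannot move
    a time attribute backwards.
    """
    return TimeAttrs(
        atime=max(stored.atime, received.atime),
        mtime=max(stored.mtime, received.mtime),
        ctime=max(stored.ctime, received.ctime),
    )


if __name__ == "__main__":
    stored = TimeAttrs(atime=100.0, mtime=100.0, ctime=100.0)
    # A fop that was delayed (e.g. blocked on a lock) carries an older mtime.
    delayed = TimeAttrs(atime=90.0, mtime=95.0, ctime=120.0)
    print(merge_times(stored, delayed))
```

With a rule like this, the order in which replicas apply the updates does not
matter, because taking the maximum is commutative; whether that is acceptable
for atime as well is part of the open question above.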

>
> Thanks,
> Soumya
>

-- 
Pranith