[Gluster-devel] Consistent time attributes (ctime, atime and mtime) across replica set and distribution set
Shyam
srangana at redhat.com
Wed Mar 15 14:19:10 UTC 2017
On 03/08/2017 04:26 AM, Mohammed Rafi K C wrote:
> On 03/07/2017 08:28 PM, Shyam wrote:
>> On 02/28/2017 12:51 AM, Mohammed Rafi K C wrote:
>> - I would like to see the design in such a form that, when server-side
>> replication and disperse become a reality, parts of the
>> design/implementation remain the same. This way the work need not be
>> redone when anything moves server-side. (I would have extended this
>> to adapt to DHT2 as well, but will leave that to you)
>>
>> - As this evolves we need to assign clear responsibilities to
>> different xlators that we are discussing, which will eventually happen
>> anyway, but just noting it ahead for clarity as we discuss this further.
>
> I feel it is much more flexible with server side replication/ec,
> because we have a leader where we can control the behavior, like
> generating the timestamp, syncing the data to the other replicas,
> etc. I need to think about dht2. Nevertheless I will include these
> details in a document.
Yes, I agree that with server side replication/ec this becomes clearer,
and possibly with fewer timesync requirements on the clients. Noting
them down would help, thanks.
>>> 2) On the server, the posix layer stores the value in memory (inode
>>> ctx) and periodically syncs the data to disk as an extended attr
>>
>> I believe this should not be a part of the posix xlator. Our posix
>> store definition is that it uses a local file system underneath that
>> is POSIX compliant. This need to store the time information outside
>> of the POSIX specification (as an xattr or otherwise) is something
>> that is gluster specific. As a result we should not fold this into
>> the posix xlator.
>>
>> For example, if we later replace posix with a db store, or a key-value
>> store, we would need to code this cache management of time information
>> again for those stores, but if we abstract it out, then we do not need
>> to do the same.
>
> I agree that we may have to re-implement this if it is coupled with
> the posix xlator. But this is a very small piece of code, where we
> store this time in the inode ctx and sync it when required. Also, as
> Amar pointed out, each back-end store may have different behavior. We
> can write this in an abstract way so that we can re-use it tomorrow.
> But IMHO, I don't see this as an xlator.
Re-use is primarily what I am looking at. This looks more like a problem
of "when will an inode's metadata be flushed to disk", time stamps
being one of the criteria, so implementing it with good abstractions
will help, thanks.
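To make the abstraction concrete, here is a minimal C sketch of such a
time cache kept outside the posix xlator. All names here (time_cache,
TIME_XATTR_KEY, and the xattr key itself) are hypothetical and only
illustrate the shape, not a proposed implementation:

#include <sys/xattr.h>
#include <time.h>

#define TIME_XATTR_KEY "trusted.gluster.time"  /* hypothetical key */

struct time_cache {
        struct timespec ctime;
        struct timespec mtime;
        struct timespec atime;
        int             dirty;  /* set on update, cleared on flush */
};

/* Update the in-memory copy; this would live in the inode ctx. */
static void
time_cache_update (struct time_cache *tc, const struct timespec *ts)
{
        tc->ctime = *ts;
        tc->mtime = *ts;
        tc->dirty = 1;
}

/* Periodic (or sync-triggered) flush to the backing store. Only this
 * function knows the store is a POSIX file system with xattrs; a db or
 * key-value store would supply a different flush. */
static int
time_cache_flush (struct time_cache *tc, const char *path)
{
        if (!tc->dirty)
                return 0;
        if (lsetxattr (path, TIME_XATTR_KEY, tc, sizeof (*tc), 0) == -1)
                return -1;
        tc->dirty = 0;
        return 0;
}

The point being that only the flush routine is store-specific, so
swapping the backend does not mean rewriting the cache.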
>>> 6) File ops that change the parent directory's time attributes need
>>> to be consistent across all the distributed directories across the
>>> subvolumes. (for eg: a create call will change the ctime and mtime
>>> of the parent dir)
>>>
>>> 6.1) This has to be handled separately because we only send the fop
>>> to the hashed subvolume.
>>>
>>> 6.2) We can asynchronously send the timeupdate setattr fop to the
>>> other subvolumes and change the values for the parent directory if
>>> the file fop is successful on the hashed subvolume.
>>
>> Am I right in understanding that this is not part of the solution and
>> just a suggestion on what we may do in the future, or is it part of
>> the proposed solution?
>
> If we have agreement from the dht maintainers, I'm ready to take the
> dht part on as part of this effort :) .
For a full solution, this would be a must, right? Else, we again solve
it partway and leave some gaps. I would like to see the problem
addressed as a whole if possible, as the timestamp issue has been
addressed in parts for way too long, IMHO.
From a DHT POV, we assimilate the time stamps from all copies of a
directory and use the highest. Considering this, if the subvolume that
was last updated for a directory (due to a create/unlink or other call
that updated the parent timestamps) goes down, the time information
returned may be incorrect. But the cluster is running in a degraded
mode anyway, as an entire subvol of DHT is down (and hence accesses,
modifications, and creations that land there are not possible), so in
such a situation returning stale timestamps may be an option. Du,
thoughts?
Basically, yes, we can asynchronously update the time xattr, as stated,
but maybe we can live without it as well?
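For reference, the "use the highest" aggregation amounts to a per-reply
max-merge, roughly like the sketch below (illustrative C, not the
actual dht code). It also shows why the merged result goes stale when
the subvol holding the newest times is down: that reply simply never
arrives.

#include <time.h>

struct dir_times {
        struct timespec ctime;
        struct timespec mtime;
};

static int
ts_newer (const struct timespec *a, const struct timespec *b)
{
        if (a->tv_sec != b->tv_sec)
                return a->tv_sec > b->tv_sec;
        return a->tv_nsec > b->tv_nsec;
}

/* Merge one subvolume's reply into the aggregate: keep the max. If
 * the subvol with the newest times is down, its reply is missing and
 * the aggregate silently stays older (the staleness noted above). */
static void
dir_times_merge (struct dir_times *agg, const struct dir_times *reply)
{
        if (ts_newer (&reply->ctime, &agg->ctime))
                agg->ctime = reply->ctime;
        if (ts_newer (&reply->mtime, &agg->mtime))
                agg->mtime = reply->mtime;
}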
>
>>
>> If the latter (i.e., part of the proposed solution), which layer has
>> the responsibility to asynchronously update the other DHT subvolumes?
>>
>> For example, the posix xlator does not have that knowledge, so it
>> should not be that xlator, which means it is some other xlator. Now,
>> is that on the client or on the server, and how is it crash
>> consistent? Those are some things that come to mind when reading
>> this, but I will wait for the details before thinking aloud.
>
> Yes, in the proposed solution it is dht that has to initiate the fop
> to sync the time attributes to the other subvolumes (synchronously or
> asynchronously) after, let's say, a create fop on the hashed subvol
> (just an eg ;) ).
>
> I'm totally in agreement on crash consistency, but thinking broadly,
> posix normally doesn't guarantee the persistence of the data unless
> there is an explicit sync call. I thought we could treat this as a
> cache coherence problem as well. What do you think?
The issue that I see here is that one of the subvolumes is updated and
the other is not, so when the subvolume that was updated goes down and
then comes back up again, we are really not keeping the guarantee: we
will return an older timestamp at one time and a newer one later.
IOW, in POSIX this would still be the older timestamp, whereas in our
case this would be a newer timestamp, as one subvol of the directory got
updated.
The reasoning above is to question whether what we end up doing can be
punted on under a posix-like guarantee at all.
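To pin down the window being discussed (see 6.3 below), the async
fan-out after a successful entry fop would look roughly like this.
Everything here is an illustrative stand-in, not actual dht code:

#include <stdio.h>
#include <time.h>

/* Illustrative stand-ins only. */
struct subvol {
        const char *name;
};

/* Stub standing in for a setattr wound to one subvolume. */
static void
update_parent_times (struct subvol *sv, const struct timespec *now)
{
        printf ("setattr parent times on %s -> %lld\n", sv->name,
                (long long) now->tv_sec);
}

/* After a create succeeds on the hashed subvolume, fan the parent
 * time update out to the remaining subvolumes, best effort. A crash
 * between the create and the completion of this loop is exactly the
 * inconsistency window 6.3 talks about. */
static void
sync_parent_times (struct subvol *subvols, int cnt, int hashed_idx,
                   const struct timespec *now)
{
        int i;

        for (i = 0; i < cnt; i++) {
                if (i == hashed_idx)
                        continue;
                update_parent_times (&subvols[i], now);
        }
}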
>>> 6.3) This will have a window where the times are inconsistent
>>> across dht subvolumes (please provide your suggestions)
>>>