[Gluster-devel] distributed files/directories and [cm]time updates

Xavier Hernandez xhernandez at datalab.es
Tue Jan 26 09:51:35 UTC 2016


Hi Joseph,

On 26/01/16 10:42, Joseph Fernandes wrote:
> Hi Xavi,
>
> Answer inline:
>
> ----- Original Message -----
> From: "Xavier Hernandez" <xhernandez at datalab.es>
> To: "Joseph Fernandes" <josferna at redhat.com>
> Cc: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
> Sent: Tuesday, January 26, 2016 2:09:43 PM
> Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates
>
> Hi Joseph,
>
> On 26/01/16 09:07, Joseph Fernandes wrote:
>> Answer inline:
>>
>>
>> ----- Original Message -----
>> From: "Xavier Hernandez" <xhernandez at datalab.es>
>> To: "Pranith Kumar Karampuri" <pkarampu at redhat.com>, "Gluster Devel" <gluster-devel at gluster.org>
>> Sent: Tuesday, January 26, 2016 1:21:37 PM
>> Subject: Re: [Gluster-devel] distributed files/directories and [cm]time updates
>>
>> Hi Pranith,
>>
>> On 26/01/16 03:47, Pranith Kumar Karampuri wrote:
>>> hi,
>>>          Traditionally gluster has used the ctime/mtime of the
>>> files/dirs on the bricks as the stat output. The problem we are seeing
>>> with this approach is that software which depends on these times gets
>>> confused when they differ between bricks. Tar in particular reports
>>> "file changed as we read it" whenever it detects a ctime difference
>>> because stat was served from different bricks. The way we have been
>>> trying to solve it is to serve the stat structures from the same brick
>>> in afr and to use the maximum time in dht, but that doesn't avoid the
>>> problem completely. Because there is currently no way to change ctime
>>> (lutimes() only allows mtime and atime), there is little we can do to
>>> make sure ctimes match after self-heals/xattr updates/rebalance. I am
>>> wondering if any of you have solved these problems before; if so, how
>>> did you go about it? Applications which depend on this for backups seem
>>> to get confused in the same way. The only way out I see is to move
>>> ctime into an xattr, but that would need more iops and gluster would
>>> have to keep updating it on quite a few fops.
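
Just to make Pranith's point above concrete: the kernel interfaces for
setting file times (utimensat()/lutimes()) take exactly two timespecs,
atime and mtime, and there is simply no slot for ctime. A minimal
illustration, nothing gluster specific:

    #define _POSIX_C_SOURCE 200809L
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <time.h>

    int
    set_atime_mtime(const char *path, struct timespec atime,
                    struct timespec mtime)
    {
            struct timespec times[2] = { atime, mtime }; /* [0]=atime, [1]=mtime */

            /* There is no way to pass a ctime here; the kernel bumps ctime
             * itself as a side effect of this very call. */
            return utimensat(AT_FDCWD, path, times, AT_SYMLINK_NOFOLLOW);
    }
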
>>
>> I did think about this when I was writing ec at the beginning. The idea
>> was that the point in time at which each fop is executed would be
>> controlled by the client by adding a special xattr to each regular fop.
>> Of course this would require support inside the storage/posix xlator.
>> At that time, adding the needed support to the other xlators seemed too
>> complex to me, so I decided to do something similar to afr.
>>
>> Anyway, the idea was like this: for example, when a write fop needs to
>> be sent, dht/afr/ec sets the current time in a special xattr, for
>> example 'glusterfs.time'. It can be done in a way that, if the time is
>> already set by a higher xlator, it is not modified. This way DHT could
>> set the time for fops involving multiple afr subvolumes; for other
>> fops, it would be afr that sets the time. It could also be set directly
>> by the topmost xlator (fuse), but that time could be incorrect because
>> lower xlators could delay the fop execution and reorder it. This would
>> need more thinking.
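
To make the client side of this a bit more concrete, something roughly
along these lines. It is only a sketch: the xattr name, the dict helper
names and the include path are written from memory and are not meant as
a final API.

    #include <time.h>
    #include <glusterfs/dict.h>   /* include path may vary per version */

    #define GF_XATTR_FOP_TIME "glusterfs.time"

    static int
    stamp_fop_time(dict_t *xdata)
    {
            struct timespec now;

            if (dict_get(xdata, GF_XATTR_FOP_TIME ".sec"))
                    return 0;     /* already stamped by a higher xlator */

            clock_gettime(CLOCK_REALTIME, &now);

            /* A real implementation would probably pack this into a single
             * binary value instead of using two keys. */
            if (dict_set_uint64(xdata, GF_XATTR_FOP_TIME ".sec", now.tv_sec))
                    return -1;
            return dict_set_uint64(xdata, GF_XATTR_FOP_TIME ".nsec",
                                   now.tv_nsec);
    }
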
>>
>> That xattr will be received by storage/posix. This xlator will
>> determine which times need to be modified and will change them. In the
>> case of a write, it can decide to modify mtime and, maybe, atime. For a
>> mkdir or create, it will set the times of the new file/directory and
>> also the mtime of the parent directory. It depends on the specific fop
>> being processed.
>>
>> mtime, atime and ctime (or even others) could be saved in a special
>> posix xattr instead of relying on the file system attributes, which
>> cannot be modified (at least not ctime).
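
The brick side could then look roughly like this. The xattr name
'trusted.glusterfs.times' and the struct layout are invented for the
example; a real version would also care about endianness and proper
error handling.

    #include <stdint.h>
    #include <string.h>
    #include <sys/xattr.h>

    struct gf_times {
            uint64_t atime_sec, atime_nsec;
            uint64_t mtime_sec, mtime_nsec;
            uint64_t ctime_sec, ctime_nsec;
    };

    static int
    posix_set_write_times(const char *real_path, uint64_t sec, uint64_t nsec)
    {
            struct gf_times t;

            memset(&t, 0, sizeof(t));
            /* Load whatever we already stored; ignoring errors for brevity. */
            lgetxattr(real_path, "trusted.glusterfs.times", &t, sizeof(t));

            /* A write updates mtime (and ctime); whether to also touch atime
             * is a policy decision, as said above. */
            t.mtime_sec  = t.ctime_sec  = sec;
            t.mtime_nsec = t.ctime_nsec = nsec;

            return lsetxattr(real_path, "trusted.glusterfs.times",
                             &t, sizeof(t), 0);
    }
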
>>
>> This solution doesn't require extra fops, so it seems quite clean to
>> me. The additional I/O needed in posix could be minimized by
>> implementing a metadata cache in storage/posix that would read all
>> metadata on lookup and update it on disk only at regular intervals
>> and/or on invalidation. All fops would read from and write into the
>> cache. This would even reduce the number of I/O operations we are
>> currently doing for each fop.
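
A very rough sketch of that write-back idea, reusing struct gf_times
from the previous snippet; every name here is invented for illustration
only.

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/xattr.h>
    #include <time.h>

    struct posix_md_cache {
            struct gf_times times;      /* loaded once, at lookup time */
            bool            dirty;      /* modified since the last flush? */
            time_t          last_flush;
    };

    static void
    md_cache_set_mtime(struct posix_md_cache *c, const char *real_path,
                       uint64_t sec, uint64_t nsec, int flush_interval)
    {
            c->times.mtime_sec  = sec;
            c->times.mtime_nsec = nsec;
            c->dirty = true;

            /* Write-back policy: only hit the disk if the configured
             * interval has passed since the last flush. */
            if (time(NULL) - c->last_flush < flush_interval)
                    return;             /* stay in memory for now */

            if (lsetxattr(real_path, "trusted.glusterfs.times",
                          &c->times, sizeof(c->times), 0) == 0) {
                    c->dirty      = false;
                    c->last_flush = time(NULL);
            }
    }
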
>>
>>>>>>>>>>> JOE: the idea of a metadata cache is cool for read workloads,
>> but for writes we would end up doing double writes to the disk, i.e.
>> one for the actual write and one to update the xattr. IMHO we cannot
>> keep it in a write-back cache (periodic flush to disk), as
>> ctime/mtime/atime data loss or inconsistency would be a problem. Your
>> thoughts?
>
> If we want to have everything in physical storage at all times, gluster
> will be slow. We only need to be POSIX compliant, and POSIX allows some
> degree of "inconsistency" here, i.e. we are not forced to write to
> physical storage until the user application sends a flush or a similar
> request. Note that there are xlators that already take advantage of
> this, for example write-behind and md-cache.
>
> Almost all file systems (if not all) rely on this to improve
> performance; otherwise they would be really slow.
>>>>>>>>>>>> JOE : Agree
>
> Of course this could cause a temporary inconsistency between bricks,
> but since all cluster xlators (dht, afr and ec) use special xattrs to
> track consistency, a crash before flushing the metadata could be
> detected and repaired (with additional care, even a crash while
> flushing the metadata could be detected).
>
>>>>>>>>>>> JOE: Well, I am fine with the cache approach, but what level
> of fault tolerance is acceptable is another question here. Remember we
> are building a cache over a cache (the Linux system cache) for posix
> metadata. IMHO a configurable option should be provided to get
> deterministic consistency, for example how often the metadata should be
> flushed. I understand the performance implication, but it should be
> configurable.

I agree. The metadata should be written to disk periodically, even if no 
flush is received. It's a good compromise between performance and 
consistency. Having this time period configurable would be great.
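
For example, it could simply be one more entry in storage/posix's
options table; the option name and default value below are made up for
the sake of the example:

    struct volume_options options[] = {
            { .key           = {"metadata-flush-interval"},
              .type          = GF_OPTION_TYPE_INT,
              .default_value = "5",
              .description   = "Seconds between write-backs of the cached "
                               "[acm]time metadata to disk."
            },
            { .key = {NULL} },
    };

Admins could then tune it per volume like any other xlator option.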

> The reason I am excited about this is that, a long time back when we
> were thinking about WORM-Retention, our major worry was gluster's
> control over utime/mtime/ctime and what the cost of maintaining this
> extra metadata would be. By giving control over the server-side
> metadata cache settings, we could configure it for the desired
> consistency and performance.
>
> ~Joe
>

