[Gluster-devel] Solving Ctime Issue with legacy files [BUG 1593542]

Xavi Hernandez jahernan at redhat.com
Tue Jun 18 06:58:15 UTC 2019


Hi Kotresh,

On Tue, Jun 18, 2019 at 8:33 AM Kotresh Hiremath Ravishankar <
khiremat at redhat.com> wrote:

> Hi Xavi,
>
> Reply inline.
>
> On Mon, Jun 17, 2019 at 5:38 PM Xavi Hernandez <jahernan at redhat.com>
> wrote:
>
>> Hi Kotresh,
>>
>> On Mon, Jun 17, 2019 at 1:50 PM Kotresh Hiremath Ravishankar <
>> khiremat at redhat.com> wrote:
>>
>>> Hi All,
>>>
>>> The ctime feature is enabled by default from release gluster-6. But as
>>> explained in bug [1], there is a known issue with legacy files, i.e., the
>>> files which were created before the ctime feature was enabled. These files
>>> do not have the "trusted.glusterfs.mdata" xattr, which maintains the time
>>> attributes. So, on accessing those files, the xattr gets created with the
>>> latest time attributes. This is not correct because all the time attributes
>>> (atime, mtime, ctime) get updated instead of only the required ones.
>>>
>>> There are a couple of approaches to solve this.
>>>
>>> 1. On accessing the files, let posix update the time attributes
>>> from the back end file on the respective replicas. This obviously results
>>> in inconsistent "trusted.glusterfs.mdata" xattr values within the replica
>>> set. AFR/EC should heal this xattr as part of metadata heal upon accessing
>>> the file. It can choose to replicate from any subvolume. Ideally we should
>>> consider the highest time from the replica set and treat it as the source,
>>> but I think any subvolume should be fine, as replica time attributes are
>>> mostly in sync, with a max difference in the order of a few seconds, if I
>>> am not wrong.
>>>
>>>    But client side self heal is disabled by default because of
>>> performance reasons [2]. If we choose to go with this approach, we need to
>>> consider enabling at least client side metadata self heal by default.
>>> Please share your thoughts on enabling the same by default.
>>>
>>> 2. Don't let posix update the legacy files from the backend. On lookup
>>> cbk, let the utime xlator update the time attributes from statbuf received
>>> synchronously.
>>>
>>> Both approaches are similar, as both result in updating the xattr during
>>> lookup. Please share your inputs on which approach is better.
>>>
>>
>> I prefer the second approach. The first approach is not feasible for EC
>> volumes because self-heal requires that k bricks (in a k+r configuration)
>> agree on the value of this xattr; otherwise it considers the metadata
>> damaged and manual intervention is needed to fix it. During upgrade, the
>> first r bricks will be upgraded without problems, but
>> trusted.glusterfs.mdata won't be healed because r < k. In fact this xattr
>> will be removed from the new bricks because the majority of bricks agree on
>> the xattr not being present. Once the r+1 brick is upgraded, it's possible
>> that posix sets different values for trusted.glusterfs.mdata, which will
>> cause self-heal to fail.
>>
>> Second approach seems better to me if guarded by a new option that
>> enables this behavior. utime xlator should only update the mdata xattr if
>> that option is set, and that option should only be settable once all nodes
>> have been upgraded (controlled by op-version). In this situation the first
>> lookup on a file where utime detects that mdata is not set, will require a
>> synchronous update. I think this is good enough because it will only happen
>> once per file. We'll need to consider cases where different clients do
>> lookups at the same time, but I think this can be easily solved by ignoring
>> the request if mdata is already present.
>>
>
> Initially there were two issues.
> 1. Upgrade issue with EC volumes, as described by you.
>    This is solved with the patch [1]. There was a bug in ctime posix
>    where it was creating the xattr even when ctime is not set on the client
>    (during the utimes system call). With patch [1], the utimes system call
>    will only update the "trusted.glusterfs.mdata" xattr if it is present;
>    otherwise it won't create it. New xattr creation should only happen
>    during entry operations (i.e. create, mknod and others).
>    So there won't be any problems with upgrade. I think we don't need a new
>    option dependent on op-version, if I am not wrong.
>

If I'm not missing something, we cannot allow creation of mdata xattr even
for create/mknod/setattr fops. Doing so could cause the same problem if
some of the bricks are not upgraded and do not support mdata yet (or they
have ctime disabled by default).
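
To make the concern concrete, here is a rough sketch of the k+r agreement
rule described earlier in this thread. It is only a simplified model, not the
actual EC self-heal code: the function name ec_metadata_verdict and the
"ok"/"damaged" labels are made up for illustration, and None stands for a
brick where the xattr is absent.

```python
from collections import Counter


def ec_metadata_verdict(values, k):
    """Simplified model of the agreement rule (hypothetical helper).

    values: one entry per brick holding the mdata xattr value seen there,
            or None when the xattr is absent on that brick.
    k:      number of bricks that must agree in a k+r configuration.
    """
    # Pick the most common value across bricks (absence counts as a value).
    value, count = Counter(values).most_common(1)[0]
    if count >= k:
        # Enough bricks agree, so the remaining bricks get healed to it.
        return ("ok", value)
    # Fewer than k bricks agree: metadata is treated as damaged.
    return ("damaged", None)


# 4+2 volume, first r=2 bricks upgraded and holding their own values:
# the four old bricks agree on "absent", so the xattr is dropped.
print(ec_metadata_verdict(["t1", "t2", None, None, None, None], k=4))

# Brick r+1 upgraded with yet another value: no group of 4 agrees.
print(ec_metadata_verdict(["t1", "t2", "t3", None, None, None], k=4))
```

The second call is the case I'm worried about: once more than r bricks carry
values that were set independently, no set of k bricks agrees anymore.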


> 2. After upgrade, how do we update the "trusted.glusterfs.mdata" xattr?
>    This mail thread was for this. Which approach is better here? I
>    understand that from the EC point of view the second approach is the best
>    one. The question I had was: can't EC treat 'trusted.glusterfs.mdata' as
>    a special xattr and add the logic to heal it from one subvolume (i.e. to
>    remove the requirement of having consistent data on k subvolumes in a
>    k+r configuration)?
>

Yes, we can do that. But this would require a newer client with support for
this new xattr, which won't be possible during an upgrade, where bricks are
upgraded before the clients. So, even if we add this intelligence to the
client, the upgrade process is still broken. The only consideration here is
whether we can rely on the self-heal daemon being on the server side (and
thus upgraded at the same time as the server) to ensure that files can really
be healed even if other bricks/shd daemons are not yet updated. Not sure if
it could work, but anyway I don't like it very much.


>
>    The second approach is independent of AFR and EC. So if we choose this,
>    do we need a new option to guard it? If the upgrade procedure is to
>    upgrade servers first and then clients, we don't need to guard, I think?
>

I think you are right for regular clients. Is there any server-side daemon
that acts as a client and could use the utime xlator? If not, I think we
don't need an additional option here.
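
For the second approach, the handling of concurrent lookups mentioned earlier
(ignore the request if mdata is already present) could look roughly like the
sketch below. This is a toy model and not glusterfs code: the Brick class,
set_mdata_if_absent and on_lookup are hypothetical names, and a Python dict
stands in for the xattrs stored on the brick.

```python
import threading

MDATA_KEY = "trusted.glusterfs.mdata"


class Brick:
    """Toy stand-in for a brick holding the xattrs of a single file."""

    def __init__(self):
        self.xattrs = {}
        self._lock = threading.Lock()

    def set_mdata_if_absent(self, times):
        # Check and set under one lock so that racing clients cannot
        # overwrite each other: the first request wins, later ones are
        # ignored.
        with self._lock:
            if MDATA_KEY in self.xattrs:
                return False
            self.xattrs[MDATA_KEY] = dict(times)
            return True


def on_lookup(brick, statbuf):
    """Model of the lookup callback: seed mdata from statbuf only once."""
    if MDATA_KEY not in brick.xattrs:
        # Legacy file: synchronous update using the times from the lookup.
        brick.set_mdata_if_absent(statbuf)
    return brick.xattrs[MDATA_KEY]


brick = Brick()
first = on_lookup(brick, {"atime": 1, "mtime": 2, "ctime": 3})
second = on_lookup(brick, {"atime": 9, "mtime": 9, "ctime": 9})
print(first == second)
```

The point is only that the "is it present?" check and the update have to be
atomic on the brick side, so the synchronous update happens once per file.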


>> Xavi
>>
>>
>>>
>>>
>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1593542
>>> [2] https://github.com/gluster/glusterfs/issues/473
>>>
>>> --
>>> Thanks and Regards,
>>> Kotresh H R
>>>
>>
>
> --
> Thanks and Regards,
> Kotresh H R
>