[Gluster-devel] Solving Ctime Issue with legacy files [BUG 1593542]

Tue Jun 18 07:25:44 UTC 2019

Hi Xavi,

On Tue, Jun 18, 2019 at 12:28 PM Xavi Hernandez <jahernan at redhat.com> wrote:

> Hi Kotresh,
>
> On Tue, Jun 18, 2019 at 8:33 AM Kotresh Hiremath Ravishankar <
> khiremat at redhat.com> wrote:
>
>> Hi Xavi,
>>
>> Reply inline.
>>
>> On Mon, Jun 17, 2019 at 5:38 PM Xavi Hernandez <jahernan at redhat.com>
>> wrote:
>>
>>> Hi Kotresh,
>>>
>>> On Mon, Jun 17, 2019 at 1:50 PM Kotresh Hiremath Ravishankar <
>>> khiremat at redhat.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> The ctime feature is enabled by default from release gluster-6. But as
>>>> explained in bug [1], there is a known issue with legacy files, i.e. the
>>>> files created before the ctime feature was enabled. These files do not
>>>> have the "trusted.glusterfs.mdata" xattr, which maintains the time
>>>> attributes. So, on accessing those files, the xattr gets created with the
>>>> latest time attributes. This is not correct because all the time
>>>> attributes (atime, mtime, ctime) get updated instead of only the required
>>>> ones.
>>>>
>>>> There are a couple of approaches to solve this.
>>>>
>>>> 1. On accessing the files, let posix update the time attributes from
>>>> the backend file on the respective replicas. This obviously results in
>>>> inconsistent "trusted.glusterfs.mdata" xattr values within the replica
>>>> set. AFR/EC should heal this xattr as part of metadata heal upon accessing
>>>> the file. It can choose to replicate from any subvolume. Ideally we should
>>>> consider the highest time from the replica and treat it as the source, but
>>>> I think replicating from any subvolume should be fine, as replica time
>>>> attributes are mostly in sync, with a maximum difference on the order of a
>>>> few seconds, if I am not wrong.
>>>>
>>>>    But client-side self-heal is disabled by default for performance
>>>> reasons [2]. If we choose to go with this approach, we need to consider
>>>> enabling at least client-side metadata self-heal by default. Please share
>>>> your thoughts on enabling it by default.
>>>>
>>>> 2. Don't let posix update the legacy files from the backend. On lookup
>>>> cbk, let the utime xlator update the time attributes from statbuf received
>>>> synchronously.
>>>>
>>>> Both approaches are similar, as both result in updating the xattr
>>>> during lookup. Please share your inputs on which approach is better.
>>>>
>>>
>>> I prefer the second approach. The first approach is not feasible for EC
>>> volumes because self-heal requires that k bricks (in a k+r configuration)
>>> agree on the value of this xattr; otherwise it considers the metadata
>>> damaged and needs manual intervention to fix it. During upgrade, the first
>>> r bricks will be upgraded without problems, but trusted.glusterfs.mdata
>>> won't be healed because r < k. In fact this xattr will be removed from the
>>> new bricks because the majority of bricks agree on the xattr not being
>>> present. Once the (r+1)-th brick is upgraded, it's possible that posix sets
>>> different values for trusted.glusterfs.mdata, which will cause self-heal
>>> to fail.
>>>
>>> The second approach seems better to me if guarded by a new option that
>>> enables this behavior. The utime xlator should only update the mdata xattr
>>> if that option is set, and that option should only be settable once all
>>> nodes have been upgraded (controlled by op-version). In this situation the
>>> first lookup on a file where utime detects that mdata is not set will
>>> require a synchronous update. I think this is good enough because it will
>>> only happen once per file. We'll need to consider cases where different
>>> clients do lookups at the same time, but I think this can be easily solved
>>> by ignoring the request if mdata is already present.
>>>
>>
>> Initially there were two issues.
>> 1. Upgrade issue with EC volumes, as described by you.
>>    This is solved with the patch [1]. There was a bug in ctime posix
>> where it was creating the xattr even when ctime was not set on the client
>> (during the utimes system call). With patch [1], the behavior is that the
>> utimes system call will only update the "trusted.glusterfs.mdata" xattr if
>> it is present; otherwise it won't create it. New xattr creation should
>> only happen during entry operations (i.e. create, mknod and others).
>>    So there won't be any problems with upgrade. I think we don't need a
>> new option dependent on op-version, if I am not wrong.
>>
>
> If I'm not missing something, we cannot allow creation of mdata xattr even
> for create/mknod/setattr fops. Doing so could cause the same problem if
> some of the bricks are not upgraded and do not support mdata yet (or they
> have ctime disabled by default).
>

Yes, that's right: even create/mknod and other fops won't create the xattr
if the client doesn't set ctime (this holds for older clients). I have
commented on the patch [1]. All other fops where the xattr gets created have
the check that, if ctime is not set, the xattr is not created. The check was
missed only in the utimes syscall, and that caused the upgrade issues.
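
To make the intended behavior concrete, here is a minimal sketch of the two
checks described above. It is a toy Python model, not the actual posix xlator
code (which is C): the function names and the dict standing in for a file's
xattrs are hypothetical, and a client_ctime of None stands for "the client
did not set ctime".

```python
MDATA_KEY = "trusted.glusterfs.mdata"


def entry_fop_set_mdata(xattrs, client_ctime):
    """Entry fops (create, mknod, ...): create the xattr only when the
    client supplied ctime. Older clients (client_ctime is None) leave the
    file without the xattr. Hypothetical helper, for illustration only."""
    if client_ctime is None:
        return False
    xattrs[MDATA_KEY] = {
        "atime": client_ctime,
        "mtime": client_ctime,
        "ctime": client_ctime,
    }
    return True


def utimes_set_mdata(xattrs, client_ctime, atime, mtime):
    """utimes: update the xattr only if it is already present, and never
    create it. Hypothetical helper, for illustration only."""
    if client_ctime is None or MDATA_KEY not in xattrs:
        return False
    xattrs[MDATA_KEY].update(atime=atime, mtime=mtime, ctime=client_ctime)
    return True


legacy_file = {}                                  # no mdata xattr yet
utimes_set_mdata(legacy_file, 100, 1, 2)          # no-op: xattr not present
entry_fop_set_mdata(legacy_file, None)            # no-op: old client
```

In this model neither call creates the xattr on the legacy file, which is the
property that matters while some bricks are still on the old version.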

>
>
>> 2. After upgrade, how do we update the "trusted.glusterfs.mdata" xattr?
>>    This mail thread was for this. Which approach is better here? I
>> understand that from the EC point of view the second approach is the best
>> one. The question I had was: can't EC treat 'trusted.glusterfs.mdata' as a
>> special xattr and add the logic to heal it from one subvolume (i.e. remove
>> the requirement of having consistent data on k subvolumes in a k+r
>> configuration)?
>>
>
> Yes, we can do that. But this would require a newer client with support
> for this new xattr, which won't be possible during an upgrade, where bricks
> are upgraded before the clients. So, even if we add this intelligence to
> the client, the upgrade process is still broken. The only consideration here
> is whether we can rely on the self-heal daemon being on the server side (and
> thus upgraded at the same time as the server) to ensure that files can
> really be healed even if other bricks/shd daemons are not yet updated. Not
> sure if it could work, but anyway I don't like it very much.
>
>
>>
>>    The second approach is independent of AFR and EC. So if we choose
>> this, do we need a new option to guard it? If the upgrade procedure is to
>> upgrade the servers first and then the clients, we don't need to guard it,
>> I think?
>>
>
> I think you are right for regular clients. Is there any server-side daemon
> that acts as a client and could use the utime xlator? If not, I think we
> don't need an additional option here.
>
No, no other server-side daemon has the utime xlator loaded.
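
For reference, the second approach discussed in this thread could be sketched
roughly as below. This is a toy Python model under stated assumptions, not the
utime xlator itself: the function names, the list of per-brick xattr dicts,
and the statbuf dict are all hypothetical. It shows the two points raised
above: the same values (taken from the single statbuf the client received)
are written to every brick, and a request is ignored if mdata is already
present, so a second client doing a lookup at the same time cannot overwrite
the first one's values.

```python
MDATA_KEY = "trusted.glusterfs.mdata"
TIME_FIELDS = ("atime", "mtime", "ctime")


def set_mdata_if_absent(brick_xattrs, mdata):
    """Create the xattr on one brick only if it is missing.
    Returns True if it was created, False if the request was ignored.
    In a real system this check-and-set would have to be atomic on the
    brick; this single-threaded model only illustrates the rule."""
    if MDATA_KEY in brick_xattrs:
        return False
    brick_xattrs[MDATA_KEY] = dict(mdata)
    return True


def lookup_cbk(bricks, statbuf):
    """On lookup of a legacy file, derive mdata from the statbuf seen by
    the client and push the same values to all bricks."""
    mdata = {field: statbuf[field] for field in TIME_FIELDS}
    return [set_mdata_if_absent(xattrs, mdata) for xattrs in bricks]


bricks = [{}, {}, {}]   # three bricks, legacy file without mdata
lookup_cbk(bricks, {"atime": 10, "mtime": 20, "ctime": 30})   # first client
lookup_cbk(bricks, {"atime": 99, "mtime": 99, "ctime": 99})   # ignored
```

Because every brick receives identical values from one statbuf, there is
nothing for AFR or EC to heal afterwards in this model.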

[1]  https://review.gluster.org/#/c/glusterfs/+/22858/

>
>>> Xavi
>>>
>>>
>>>>
>>>>
>>>> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1593542
>>>> [2] https://github.com/gluster/glusterfs/issues/473
>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Kotresh H R
>>>>
>>>
>>
>> --
>> Thanks and Regards,
>> Kotresh H R
>>
>

-- 
Thanks and Regards,
Kotresh H R