[Gluster-users] Files losing permissions

Fri Aug 2 18:10:39 UTC 2013

It sounds like this is related to NFS.

Anand, thank you for the response.  I was under the impression that DHT
linkfiles live only in the .glusterfs subdirectory on the brick; in my case,
these files are outside that directory.  Furthermore, they aren't named
like DHT linkfiles (using that hash key); they are named the same as the
actual files.  Finally, once I removed the bad files and their DHT
linkfiles, the issue went away and the files remained accessible.  I had to
remove around 100,000 of these bad zero-length files with mode 000 or 1000
(and their DHT linkfiles) last night; since then, only 324 additional files
have been detected.
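
A minimal sketch of how a suspect file on a brick could be checked before
deciding what it is.  It assumes GNU stat and the getfattr tool from the attr
package; the function names are made up for illustration, and
trusted.glusterfs.dht.linkto is the xattr that DHT is understood to set on
its linkfiles.  Reading trusted.* xattrs normally requires root on the server.

```shell
# Succeeds if the file is zero-length with mode exactly 1000
# (shown by ls as ---------T), which is the shape of a DHT linkfile.
has_linkfile_shape() {
    [ "$(stat -c '%a %s' "$1")" = "1000 0" ]
}

# Prints the linkto xattr if the file has one; fails otherwise.
# Only meaningful when run against a path on a brick, not on a client mount.
show_linkto() {
    getfattr -n trusted.glusterfs.dht.linkto -e text "$1"
}
```

On a brick, this could be used as, for example:
has_linkfile_shape /export/brick2/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump && show_linkto /export/brick2/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump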

On my volumes, I use a hashing scheme for regular files that is similar to
the DHT one: the top level of the volume contains only directories 00
through ff, and so on down the tree.  Perhaps this caused some confusion
for you?

For transparency, here is the command I run from the client to detect and
correct bad file permissions:

find ./?? -type f -perm 000 -ls -exec chmod -v 644 {} \; \
    -o -type f -perm 1000 -ls -exec chmod -v 644 {} \;

If this number does not grow, I will conclude that I just missed these 324
files.  If the number gets larger, I can only conclude that GlusterFS is
somehow introducing this corruption.  If that is the case, I'll dig some
more.
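
To watch whether the number grows without changing anything, a read-only
variant of the same find can be logged over time.  This is only a sketch: the
function names are hypothetical, and the count is off if any filename
contains a newline.

```shell
# Count regular files under the given directories whose mode is exactly
# 000 or 1000.  Nothing is modified.
count_bad_modes() {
    find "$@" -type f \( -perm 000 -o -perm 1000 \) -print | wc -l
}

# Print a UTC timestamp followed by the current count, one line per call.
log_bad_count() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(count_bad_modes "$@")"
}
```

Run from the top of the client mount, "log_bad_count ./??" could be appended
to a log file from cron to see the trend.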

Maik, I may have experienced the same thing.  I used rsync over NFS without
--inplace to load my data into the GlusterFS volume, and I wound up with
all those bad files on the wrong bricks.  For example, a file that should be
only on server1-brick1 and server2-brick1 also had "bad" versions (mode 1000,
zero-length) on server3-brick1 and server4-brick1, leading to confusing
results on the clients.  Since then, I've switched to the native client for
data loads and added the --inplace flag to rsync.
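
A minimal sketch of such a load, assuming the volume is already mounted with
the native client.  The function name and both paths are placeholders.  By
default rsync writes each file under a temporary name and renames it into
place; --inplace writes straight to the final name, which presumably avoids
the rename step on the volume.

```shell
# Copy a tree into a destination directory, updating files in place.
load_inplace() {
    src=$1
    dst=$2
    # -a preserves permissions and timestamps (and ownership when run as
    # root); --inplace writes directly to the destination file instead of
    # to a temporary file that is renamed afterwards.
    rsync -a --inplace "$src"/ "$dst"/
}
```

For example: load_inplace /data/source /mnt/vol1 (both paths hypothetical).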

Other factors which may have caused the issue I had:

1. During a large rebalance, one GlusterFS node exceeded its system max
open files limit, and was rebooted.  The rebalance did not stop while this
took place.
2. Three times during the same rebalance, the Gluster NFS daemon used an
excessive amount of memory and was killed by the kernel oom-killer.  The
system in question has 8 GB of memory, was the rebalance master, and is not
running any significant software besides GlusterFS.  Each time, I
restarted glusterfs and the NFS server daemon started serving files again.
The rebalance was not interrupted.
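
For anyone wanting to keep an eye on the first factor during a rebalance, the
kernel exposes the relevant counters under /proc on Linux.  The sketch below
uses hypothetical function names and makes no assumption about whether the
limit that was hit was the system-wide one or a per-process one, so it shows
how to read both.

```shell
# Print "allocated max" for system-wide file handles.
# /proc/sys/fs/file-nr holds three fields: allocated, unused, maximum.
file_handle_usage() {
    read -r allocated unused max < /proc/sys/fs/file-nr
    printf '%s %s\n' "$allocated" "$max"
}

# Count the open file descriptors of one process, given its PID.
# Requires permission to read /proc/PID/fd (same user or root).
open_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Show the per-process open-files limit of a running process.
open_files_limit() {
    grep 'Max open files' "/proc/$1/limits"
}
```

The PID of a running gluster process could be passed to open_fds and
open_files_limit, for example one found with pgrep.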

On Fri, Aug 2, 2013 at 4:54 AM, Maik Kulbe <info at linux-web-development.de> wrote:

> Hi,
>
> I've just had a problem removing a directory with test files. I had an
> inaccessible folder which I could neither delete nor read on the
> client (both NFS and FUSE client). On the backend, the folder had completely
> 0'd permissions and the files showed the 0'd permissions with the sticky
> bit. I can't remove the folder on the client (it fails with 'directory not
> empty') but if I delete the empty files on the backend, it's gone. Is there
> any explanation for this?
>
> I also found that this only happens if I remove the folder recursively
> over NFS. When I remove the files in the folder first, there are no 0-size
> files on the backend and I can delete the directory with rmdir without any
> problem.
>
>> Justin,
>> What you are seeing are internal DHT linkfiles. They are zero-byte files
>> with mode 01000. Forcefully changing their mode in the backend to
>> something else WILL render your files inaccessible from the mount point. I
>> am assuming that you have seen these files only in the backend and not
>> from the mount point. Accessing or modifying files like this directly
>> from the backend is very dangerous for your data, as this very example
>> shows.
>> Avati
>>
>> On Thu, Aug 1, 2013 at 2:25 PM, Justin Dossey <jbd at podomatic.com> wrote:
>>
>> One thing I do see with the issue we're having is that the files which
>> have lost their permissions have "bad" versions on multiple bricks.
>> Since the replica count is 2 for any given file, there should be only
>> two copies of each, no?
>> For example, the file below has zero-length, zero-permission versions on
>> uds-06/brick2 and uds-07/brick2, but good versions on uds-05/brick1 and
>> uds-06/brick1.
>> FILE is /09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
>> uds-05 -rw-r--r-- 2 apache apache 2233 Jul 6 2008
>>   /export/brick1/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
>> uds-06 -rw-r--r-- 2 apache apache 2233 Jul 6 2008
>>   /export/brick1/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
>> uds-06 ---------T 2 apache apache 0 Jul 23 03:11
>>   /export/brick2/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
>> uds-07 ---------T 2 apache apache 0 Jul 23 03:11
>>   /export/brick2/vol1/09/38/1f/eastar/mail/entries/trash/2008-07-06T13_41_56-07_00.dump
>> Is it acceptable for me to just delete the zero-length copies?
>>
>> On Thu, Aug 1, 2013 at 12:57 PM, Justin Dossey <jbd at podomatic.com>
>> wrote:
>>
>> Do you know whether it's acceptable to modify permissions on the brick
>> itself (as opposed to over NFS or via the fuse client)?  It seems that
>> as long as I don't modify the xattrs, the permissions I set on files
>> on the bricks are passed through.
>>
>> On Thu, Aug 1, 2013 at 12:32 PM, Joel Young <jdy at cryregarder.com>
>> wrote:
>>
>> I am not seeing exactly that, but I am experiencing the ownership of
>> the root directory of a gluster volume reverting from a particular
>> user.user to root.root.  I have to periodically do a
>> "cd /share; chown user.user ."
>> On Thu, Aug 1, 2013 at 12:25 PM, Justin Dossey <jbd at podomatic.com>
>> wrote:
>> > Hi all,
>> >
>> > I have a relatively-new GlusterFS 3.3.2 4-node cluster in
>> > distributed-replicated mode running in a production environment.
>> >
>> > After adding bricks from nodes 3 and 4 (which changed the cluster type
>> > from simple replicated-2 to distributed-replicated-2), I've discovered
>> > that files are randomly losing their permissions.  These are files that
>> > aren't being accessed by our clients-- some of them haven't been touched
>> > for years.
>> >
>> > When I say "losing their permissions", I mean that regular files are
>> > going from 0644 to 0000 or 1000.
>> >
>> > Since this is a real production issue, I run a parallel find process to
>> > correct them every ten minutes.  It has corrected approximately 40,000
>> > files in the past 18 hours.
>> >
>> > Is anyone else seeing this kind of issue?  My searches have turned up
>> > nothing so far.
>> > --
>> > Justin Dossey
>> > CTO, PodOmatic
>> >
>> >
>> > _______________________________________________
>> > Gluster-users mailing list
>> > Gluster-users at gluster.org
>> > http://supercolony.gluster.org/mailman/listinfo/gluster-users
>>
>

-- 
Justin Dossey
CTO, PodOmatic