[Gluster-users] Gluster 3.12.14: wrong quota in Distributed Dispersed Volume

Mon Nov 26 10:25:53 UTC 2018

Hi,

We have no notifications of OOM kills in /var/log/messages. So if I understood this correctly, the crawls finished but my attributes weren't set
correctly? And this script should fix them?

Thanks for your help so far

Gudrun
On Thursday, 22.11.2018, at 13:03 +0530, Hari Gowtham wrote:
> On Wed, Nov 21, 2018 at 8:55 PM Gudrun Mareike Amedick
> <g.amedick at uni-luebeck.de> wrote:
> > 
> > 
> > Hi Hari,
> > 
> > I disabled and re-enabled the quota and I saw the crawlers starting. However, this caused a pretty high load on my servers (200+) and this seems to
> > have gotten them killed again. At least, I have no crawlers running, the quotas do not match the output of du -h, and the crawler logs all
> > contain this line:
> The quota crawl is an intensive process because it has to crawl the
> entire file system. The intensity depends on the number of bricks, the
> number of files, the depth of the file system, the ongoing I/O to the
> file system, and so on. Since this is a disperse volume, the crawl has
> to talk to all the bricks, and given the huge size, the increase in
> CPU load is expected.
> 
> > 
> > 
> > [2018-11-20 14:16:35.180467] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7494) [0x7f0e3d6fe494] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xf5) [0x561eb7952d45] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x561eb7952ba4] ) 0-: received signum (15), shutting down
> This can mean either that the file attributes were set and the process
> then stopped, or, as you said, that the process was killed while it
> still had attributes to set on some of the files.
> 
> This message is logged on every shutdown (both the one triggered after
> the job has finished and the one triggered to stop the process early).
> Can you check the /var/log/messages file for "OOM" kills?
> If you see those messages, then the shutdown was caused by the increase
> in memory consumption, which is expected.
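A minimal sketch of the log check described above. It assumes the kernel wrote its usual OOM-killer messages to the file; the exact wording differs between kernel versions, so the patterns and the helper name `find_oom_lines` are illustrative assumptions, not something taken from Gluster.

```python
import re
from typing import Iterable, List

# Phrases the Linux kernel typically logs when the OOM killer acts.
# The exact wording varies by kernel version, so this list is an assumption.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom_kill", re.IGNORECASE)


def find_oom_lines(lines: Iterable[str], process_hint: str = "gluster") -> List[str]:
    """Return log lines that look like OOM-killer activity.

    Lines that mention process_hint come first, so kills of Gluster
    processes stand out from unrelated ones.
    """
    hits = [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
    related = [h for h in hits if process_hint in h.lower()]
    other = [h for h in hits if process_hint not in h.lower()]
    return related + other


# Possible usage on a server (path as named in this thread):
#     with open("/var/log/messages", errors="replace") as log:
#         for hit in find_oom_lines(log):
#             print(hit)
```

An empty result only says that no matching line is in that file; rotated logs would have to be checked separately.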
> 
> > 
> > 
> > I suspect this means my file attributes are not set correctly. Would the script you sent me fix that? And the script seems to be part of the Git
> > GlusterFS 5.0 repo. We are running 3.12. Would it still work on 3.12 (or 4.1, since we'll be upgrading soon) or could it break things?
> Quota is not actively developed because of its performance issues,
> which need a major redesign. So the script holds true for newer
> versions as well, because no changes have gone into the quota code.
> The advantage of the script is that it can be run over just the
> faulty directory (it need not be the root), which reduces the number
> of directories and files and the depth that have to be traversed.
> The crawl is necessary for the quota to work correctly. The script can
> help only if the xattrs were set by the crawl, which I think isn't the
> case here, so we can't use the script.
> (To verify whether the xattrs are set on all the directories, we need
> to do a getxattr and check.)
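A rough sketch of such a check, to be run as root directly on a brick directory, since trusted.* xattrs are not visible to ordinary users. It assumes that the quota xattrs all start with "trusted.glusterfs.quota." (the prefix of the "dirty" flag mentioned in this thread); the exact attribute names and any version suffix are not confirmed here, and the function names are made up for illustration.

```python
import os
from typing import Callable, Iterable, List, Optional

# Assumed prefix of the quota xattrs on brick directories. The full names
# (and any version suffix) are an assumption, not confirmed in this thread.
QUOTA_PREFIX = "trusted.glusterfs.quota."


def dirs_missing_quota_xattrs(
    dirs: Iterable[str],
    list_xattrs: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Return the directories that carry no quota xattr at all."""
    if list_xattrs is None:
        list_xattrs = os.listxattr  # Linux only
    missing = []
    for d in dirs:
        names = list_xattrs(d)
        if not any(name.startswith(QUOTA_PREFIX) for name in names):
            missing.append(d)
    return missing


def walk_dirs(root: str) -> Iterable[str]:
    """Yield root and every directory below it, skipping .glusterfs."""
    for current, subdirs, _files in os.walk(root):
        # Prune Gluster's internal metadata directory from the walk.
        subdirs[:] = [s for s in subdirs if s != ".glusterfs"]
        yield current
```

Calling `dirs_missing_quota_xattrs(walk_dirs(brick_path))` on one brick would list the directories the crawl apparently never reached there. On a tree of this size the walk is itself expensive, so starting with a single suspect subdirectory is probably wiser.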
> 
> 
> > 
> > 
> > Kind regards
> > 
> > Gudrun Amedick
> > On Tuesday, 20.11.2018, at 16:59 +0530, Hari Gowtham wrote:
> > > 
> > > reply inline.
> > > On Tue, Nov 20, 2018 at 3:53 PM Gudrun Mareike Amedick
> > > <g.amedick at uni-luebeck.de> wrote:
> > > > 
> > > > 
> > > > 
> > > > Hi,
> > > > 
> > > > I think I know what happened. According to the logs, the crawlers received a signum(15). They seem to have died before they had finished.
> > > > Probably too much to do simultaneously. I have disabled and re-enabled quota and will set the quotas again with more time.
> > > > 
> > > > Is there a way to restart a crawler that was killed too soon?
> > > No. Disabling and re-enabling quota starts a new crawl.
> > > 
> > > > 
> > > > 
> > > > 
> > > > If I restart a server while a crawler is running, will the crawler be restarted, too? We'll need to do some hardware fixing on one of the
> > > > servers soon and I need to know whether I have to check the crawlers first before shutting it down.
> > > During the shutdown of the server the crawl will be killed. (data
> > > usage shown will be updated as per what has been crawled)
> > > The crawl won't be restarted on starting the server. Only quotad will
> > > be restarted (which is not the same as crawl).
> > > For the crawl to happen you will have to restart the quota.
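A small sketch of the restart sequence discussed here: disable quota, enable it again (which starts a new crawl), then set the limits again. The enable and limit-usage subcommands are the ones quoted in the original report; "disable" is assumed to be the matching counterpart, and the helper name, volume name and limits are placeholders. The sketch only builds the command lines and does not run them, because the CLI may ask for confirmation before disabling.

```python
from typing import Dict, List


def quota_restart_commands(volume: str, limits: Dict[str, str]) -> List[List[str]]:
    """Build the CLI calls for a quota restart followed by re-applying limits.

    limits maps a directory path inside the volume to a size string
    such as "10TB". Both are placeholders supplied by the caller.
    """
    commands = [
        ["gluster", "volume", "quota", volume, "disable"],
        ["gluster", "volume", "quota", volume, "enable"],
    ]
    # Re-apply every limit in a stable order so the output is reproducible.
    for path, size in sorted(limits.items()):
        commands.append(
            ["gluster", "volume", "quota", volume, "limit-usage", path, size]
        )
    return commands


# Example: print the commands for review before running them by hand.
for cmd in quota_restart_commands("myvol", {"/projects": "10TB"}):
    print(" ".join(cmd))
```

Printing the commands first leaves room to space them out in time, which matters here since the crawl that follows the enable step is what caused the high load.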
> > > 
> > > > 
> > > > 
> > > > 
> > > > Thanks for the pointers
> > > > 
> > > > Gudrun Amedick
> > > > On Tuesday, 20.11.2018, at 11:38 +0530, Hari Gowtham wrote:
> > > > > 
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > Can you check if the quota crawl finished? Without it having finished
> > > > > the quota list will show incorrect values.
> > > > > Looking at the under-accounting, it looks like the crawl is not yet
> > > > > finished (it does take a lot of time as it has to crawl the whole
> > > > > filesystem).
> > > > > 
> > > > > If the crawl has finished and the usage is still showing wrong values,
> > > > > then there is probably an accounting issue.
> > > > > The easy way to fix this is to try restarting quota. This will not
> > > > > cause any problems. The only downside is that the limits won't be
> > > > > enforced while the quota is disabled, until it is re-enabled and the
> > > > > crawl finishes.
> > > > > Or you can try using the quota fsck script
> > > > > https://review.gluster.org/#/c/glusterfs/+/19179/ to fix your
> > > > > accounting issue.
> > > > > 
> > > > > Regards,
> > > > > Hari.
> > > > > On Mon, Nov 19, 2018 at 10:05 PM Frank Ruehlemann
> > > > > <f.ruehlemann at uni-luebeck.de> wrote:
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > we're running a Distributed Dispersed volume with Gluster 3.12.14 on
> > > > > > Debian 9.6 (Stretch).
> > > > > > 
> > > > > > We migrated our data (>300TB) from a pure Distributed volume into this
> > > > > > Dispersed volume with cp, followed by multiple rsyncs.
> > > > > > After the migration was successful we enabled quotas again with "gluster
> > > > > > volume quota $VOLUME enable", which finished successfully.
> > > > > > And we set our required quotas with "gluster volume quota $VOLUME
> > > > > > limit-usage $PATH $QUOTA", which finished without errors too.
> > > > > > 
> > > > > > But our "gluster volume quota $VOLUME list" shows wrong values.
> > > > > > For example:
> > > > > > A directory with ~170TB of data shows only 40.8TB Used.
> > > > > > When we sum up all directories with quotas we're way under the ~310TB that
> > > > > > "df -h /$volume" shows.
> > > > > > And "df -h /$volume/$directory" shows wrong values for nearly all
> > > > > > directories.
> > > > > > 
> > > > > > All 72 8TB-bricks and all quota daemons of the 6 servers are visible and
> > > > > > online in "gluster volume status $VOLUME".
> > > > > > 
> > > > > > 
> > > > > > In quotad.log I found multiple warnings like this:
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > [2018-11-16 09:21:25.738901] W [dict.c:636:dict_unref] (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/features/quotad.so(+0x1d58) [0x7f6844be7d58] -->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.14/xlator/features/quotad.so(+0x2b92) [0x7f6844be8b92] -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_unref+0xc0) [0x7f684b0db640] ) 0-dict: dict is NULL [Invalid argument]
> > > > > > In some brick logs I found those:
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > [2018-11-19 07:23:30.932327] I [MSGID: 120020] [quota.c:2198:quota_unlink_cbk] 0-$VOLUME-quota: quota context not set inode (gfid:f100f7a9-0779-4b4c-880f-c8b3b4bdc49d) [Invalid argument]
> > > > > > and (replaced the volume name with "$VOLUME") those:
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > The message "W [MSGID: 120003] [quota.c:821:quota_build_ancestry_cbk] 0-$VOLUME-quota: parent is NULL [Invalid argument]" repeated 13 times between [2018-11-19 15:28:54.089404] and [2018-11-19 15:30:12.792175]
> > > > > > > [2018-11-19 15:31:34.559348] W [MSGID: 120003] [quota.c:821:quota_build_ancestry_cbk] 0-$VOLUME-quota: parent is NULL [Invalid argument]
> > > > > > I already found that setting the flag "trusted.glusterfs.quota.dirty" might help, but I'm unsure about the consequences that this will
> > > > > > trigger.
> > > > > > And I'm unsure about the necessary version flag.
> > > > > > 
> > > > > > Does anyone have an idea how to fix this?
> > > > > > 
> > > > > > Best Regards,
> > > > > > --
> > > > > > Frank Rühlemann
> > > > > >    IT-Systemtechnik
> > > > > > 
> > > > > > UNIVERSITÄT ZU LÜBECK
> > > > > >     IT-Service-Center
> > > > > > 
> > > > > >     Ratzeburger Allee 160
> > > > > >     23562 Lübeck
> > > > > >     Tel +49 451 3101 2034
> > > > > >     Fax +49 451 3101 2004
> > > > > >     ruehlemann at itsc.uni-luebeck.de
> > > > > >     www.itsc.uni-luebeck.de
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > _______________________________________________
> > > > > > Gluster-users mailing list
> > > > > > Gluster-users at gluster.org
> > > > > > https://lists.gluster.org/mailman/listinfo/gluster-users
> > > 
> 
> 
> --
> Regards,
> Hari Gowtham.