[Gluster-devel] How long should metrics collection on a cluster take?

Wed Jul 25 09:44:43 UTC 2018

On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
sankarshan.mukhopadhyay at gmail.com> wrote:

> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
> > hi,
> >       Quite a few commands to monitor gluster at the moment take almost a
> > second to give output.
>
> Is this at the (most) minimum recommended cluster size?
>

Yes, with a single volume with 3 bricks, i.e. 3 nodes in the cluster.

>
> > Some categories of these commands:
> > 1) Any command that needs to do some sort of mount/glfs_init.
> >      Examples: 1) the heal info family of commands 2) statfs to find
> > space availability, etc. (On my laptop, with a replica 3 volume and all
> > local bricks, glfs_init takes 0.3 seconds on average.)
> > 2) glusterd commands that need to wait for the previous command to
> > unlock. If the previous command is something related to an lvm snapshot,
> > which takes quite a few seconds, it would be even more time consuming.
> >
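
To make the statfs case above concrete, here is a rough sketch of timing a
space-availability query on a path that is already mounted. It uses only
Python's standard os.statvfs; the helper names (free_bytes, timed) are made up
for illustration. It does not measure glfs_init or the mount itself, which is
where the 0.3 seconds mentioned above goes, so treat it only as a way to
separate the cost of the query from the cost of the mount.

```python
import os
import time


def free_bytes(path):
    """Return the bytes available to unprivileged users on the filesystem at path."""
    st = os.statvfs(path)
    # f_bavail counts free blocks for non-root users, f_frsize is the block size.
    return st.f_bavail * st.f_frsize


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # "." is a placeholder; point this at an existing mount of the volume.
    available, elapsed = timed(free_bytes, ".")
    print(f"{available} bytes available, statvfs took {elapsed:.6f}s")
```
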
> > Nowadays container workloads have hundreds of volumes, if not thousands.
> > If we want to serve any monitoring solution at this scale (I have seen
> > customers use up to 600 volumes at a time, and it will only get bigger),
> > and let's say collecting metrics takes 2 seconds per volume (let us take
> > the worst example, which has all major features enabled, like
> > snapshot/geo-rep/quota etc.), that will mean it takes 20 minutes
> > to collect metrics for a cluster with 600 volumes. What are the ways in
> > which we can make this number more manageable? I was initially thinking
> > it may be possible to get gd2 to execute commands in parallel on different
> > volumes, so potentially we could get this done in ~2 seconds. But quite a
> > few of the metrics need a mount or the equivalent of a mount (glfs_init) to
> > collect different information like statfs, number of pending heals, quota
> > usage etc. This may lead to high memory usage, as the size of the mounts
> > tends to be high.
> >
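
The arithmetic above (600 volumes x 2 seconds = 1200 seconds = 20 minutes when
done one after another) and the effect of running collections in parallel can
be sketched roughly as follows. This is only an illustration using Python's
standard thread pool, not how gd2 schedules commands. collect_one is a
hypothetical per-volume callable, and fake_collect just sleeps to stand in for
real work. The max_workers cap is there because each in-flight collection may
hold a mount or glfs_init instance, so it also bounds memory use.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sequential_estimate(num_volumes, seconds_per_volume):
    """Total seconds if volumes are collected one after another."""
    return num_volumes * seconds_per_volume


def collect_all(volumes, collect_one, max_workers):
    """Run collect_one(volume) for every volume, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each volume with its result.
        return dict(zip(volumes, pool.map(collect_one, volumes)))


def fake_collect(volume):
    """Hypothetical stand-in for a real per-volume metrics collection."""
    time.sleep(0.01)
    return {"volume": volume, "ok": True}


if __name__ == "__main__":
    print(sequential_estimate(600, 2) / 60, "minutes if collected sequentially")
    names = [f"vol{i}" for i in range(20)]
    start = time.perf_counter()
    results = collect_all(names, fake_collect, max_workers=10)
    print(len(results), "volumes collected in", time.perf_counter() - start, "seconds")
```

With unlimited parallelism the wall-clock time approaches that of the slowest
single volume (~2 seconds in the example above); with a cap of N workers it is
roughly the sequential total divided by N.
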
>
> I am not sure if starting from the "worst example" (it certainly is
> not) is a good place to start from.

I didn't understand your statement. Are you saying 600 volumes is a worst
example?

> That said, for any environment
> with that number of disposable volumes, what kind of metrics do
> actually make any sense/impact?
>

The same metrics you track for long-running volumes. It is just that the way
the metrics are interpreted will be different. On a long-running volume, you
would look at the metrics and try to find out why the volume has not been
performing as expected in the last hour. Whereas in this case, you would look
at the metrics to find out why volumes that were created and deleted in the
last hour didn't perform as expected.

>
> > I wanted to seek suggestions from others on how to come to a conclusion
> > about which path to take and what problems to solve.
> >
> > I will be happy to raise github issues based on our conclusions on this
> > mail thread.
> >
> > --
> > Pranith
> >
>
>
>
>
>
> --
> sankarshan mukhopadhyay
> <https://about.me/sankarshan.mukhopadhyay>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel at gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>

-- 
Pranith