[Gluster-devel] How long should metrics collection on a cluster take?
Pranith Kumar Karampuri
pkarampu at redhat.com
Wed Jul 25 09:44:43 UTC 2018
On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <
sankarshan.mukhopadhyay at gmail.com> wrote:
> On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
> <pkarampu at redhat.com> wrote:
> > hi,
> > Quite a few of the commands used to monitor gluster at the moment take
> > almost a second to give output.
> Is this at the (most) minimum recommended cluster size?
Yes, with a single volume with 3 bricks, i.e. 3 nodes in the cluster.
> > Some categories of these commands:
> > 1) Any command that needs to do some sort of mount/glfs_init.
> > Examples: 1) the heal info family of commands 2) statfs to find
> > space availability etc. (On my laptop, on a replica 3 volume with all
> > bricks local, glfs_init takes 0.3 seconds on average.)
> > 2) glusterd commands that need to wait for the previous command to complete.
> > If the previous command is something related to an lvm snapshot, which takes
> > quite a few seconds, it would be even more time consuming.
> > Nowadays container workloads have hundreds of volumes, if not thousands.
> > We want to serve any monitoring solution at this scale (I have seen
> > customers use up to 600 volumes at a time, and it will only get bigger). Say
> > collecting metrics takes 2 seconds per volume (let us take the
> > worst example, which has all major features enabled, like
> > snapshot/geo-rep/quota etc.); that will mean that it will take 20 minutes
> > to collect metrics for a cluster with 600 volumes. What are the ways in
> > which we can make this number more manageable? I was initially thinking maybe
> > it is possible to get gd2 to execute commands in parallel on different
> > volumes, so potentially we could get this done in ~2 seconds. But quite a
> > few of the metrics need a mount, or the equivalent of a mount (glfs_init), to
> > collect different information like statfs, number of pending heals, quota
> > usage etc. This may lead to high memory usage, as the size of the mounts
> > tends to be high.
> I am not sure if starting from the "worst example" (it certainly is
> not) is a good place to start.
I didn't understand your statement. Are you saying 600 volumes is a worst case?
> That said, for any environment
> with that number of disposable volumes, what kind of metrics do
> actually make any sense/impact?
Same metrics you track for long-running volumes. It is just that the way they
are interpreted will be different. On a long-running volume, you would look at
the metrics and try to find why the volume has not been giving the expected
performance over the last hour. Whereas in this case, you would look at the
metrics and find the reason why volumes created and deleted in the last hour
didn't give the expected performance.
> > I wanted to seek suggestions from others on how to come to a conclusion
> > about which path to take and what problems to solve.
> > I will be happy to raise github issues based on our conclusions on this
> > thread.
> > --
> > Pranith
> sankarshan mukhopadhyay
> Gluster-devel mailing list
> Gluster-devel at gluster.org