[Gluster-devel] How long should metrics collection on a cluster take?
Pranith Kumar Karampuri
pkarampu at redhat.com
Tue Jul 24 16:18:53 UTC 2018
Quite a few commands to monitor gluster at the moment take almost a
second to give output.
Some categories of these commands:
1) Any command that needs to do some sort of mount/glfs_init.
Examples: a) the heal info family of commands, b) statfs to find
space availability, etc. (On my laptop, with a replica 3 volume and all
local bricks, glfs_init takes 0.3 seconds on average.)
2) glusterd commands that need to wait for the previous command to unlock.
If the previous command is something related to an lvm snapshot, which takes
quite a few seconds, the wait is even longer.
Nowadays container workloads have hundreds of volumes, if not thousands.
Suppose we want to serve a monitoring solution at this scale (I have seen
customers use up to 600 volumes at a time, and it will only get bigger), and
let's say collecting metrics takes 2 seconds per volume (taking the worst
case, where all major features like snapshot/geo-rep/quota are enabled).
Collected one volume at a time, that is 600 x 2 = 1200 seconds, i.e. 20
minutes, to collect metrics for a cluster with 600 volumes. What are the
ways in which we can make this number more manageable? I was initially
thinking that maybe it is possible to get gd2 to execute commands in parallel
on different volumes, so potentially we could get this done in ~2 seconds.
But quite a few of the metrics need a mount, or the equivalent of a
mount (glfs_init), to collect information such as statfs, number of
pending heals, quota usage, etc. Doing this for many volumes in parallel may
lead to high memory usage, as the memory footprint of each mount tends to
be high.
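To make the time-versus-memory trade-off concrete, here is a rough sketch in Python (not gd2 code). It assumes collection runs through a bounded worker pool and that every in-flight collection holds one mount. The names estimate, collect_all, collect_one and mem_per_mount_mb are hypothetical, and the 100 MB per-mount figure is only a placeholder, not a measurement.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def estimate(volumes, secs_per_volume, workers, mem_per_mount_mb):
    """Return (wall_clock_seconds, peak_mount_memory_mb) for a bounded pool.

    Assumes every volume takes the same time and that each in-flight
    collection holds exactly one mount (or glfs_init instance).
    """
    batches = math.ceil(volumes / workers)
    wall_clock = batches * secs_per_volume
    peak_memory = min(workers, volumes) * mem_per_mount_mb
    return wall_clock, peak_memory


def collect_all(volumes, collect_one, workers):
    """Run collect_one(volume) for every volume, at most `workers` at a time.

    collect_one is a hypothetical callable standing in for whatever
    actually gathers the metrics of a single volume.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each volume
        # with its own result.
        return dict(zip(volumes, pool.map(collect_one, volumes)))


if __name__ == "__main__":
    # 600 volumes at 2 seconds each, as in the example above.
    # mem_per_mount_mb=100 is a placeholder value, not a measurement.
    for workers in (1, 20, 600):
        secs, mem = estimate(600, 2.0, workers, mem_per_mount_mb=100)
        print(f"workers={workers} wall={secs:.0f}s peak_mem={mem}MB")
```

With one worker this gives the 20 minutes from above; with 600 workers it gives ~2 seconds but 600 simultaneous mounts. A pool size somewhere in between trades collection time against peak memory.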
I wanted to seek suggestions from others on how to come to a conclusion
about which path to take and what problems to solve.
I will be happy to raise GitHub issues based on our conclusions on this.