[Gluster-Maintainers] Metrics: and how to get them out from gluster

Fri Sep 1 09:37:22 UTC 2017

On Fri, Sep 01, 2017 at 10:57:37AM +0530, Amar Tumballi wrote:
> Disclaimer: This email is long, and did take significant time to write. Do
> take time and read, review and give feedback, so we can have some metrics
> related tasks done by Gluster 4.0
> 
> ---
> *History:*
> 
> To understand what is happening inside a GlusterFS process, over the years,
> we have opened many bugs and also coded a few things with regard to
> statedump, and did put some effort into the io-stats translator to improve
> gluster's monitoring capabilities.
> 
> But surely there is more required! And some glimpse of it is captured in
> [1], [2], [3] & [4]. Also, I did send an email to this group [5] about
> possibilities of capturing this information.
> 
> *Current problem:*
> 
> When we talk about metrics or monitoring, we have to consider giving out
> this data to a tool which can preserve the readings at periodic intervals;
> without a time graph, no metrics will make sense! So, the first challenge
> itself is how to get them out. Should getting the metrics out from each
> process need 'glusterd' interacting? Or should we use signals? Which leads
> us to *'challenge #1'.*

gfapi processes cannot use signals; those are reserved for the
application that provides main(). So some other way of communicating is
needed. Currently io-stats supports writing out JSON files; is that not
something that could be (re)used and extended? Or maybe collecting
metrics can piggyback on the eventing framework?

> Next is, should we depend on io-stats to do the reporting? If yes, how to
> get information from between any two layers? Should we provide io-stats in
> between all the nodes of translator graph? or should we utilize
> STACK_WIND/UNWIND framework to get the details? This is our *'challenge #2'*

Who would the metrics be for? Latencies of all xlators in the graph will
give loads of data, but who in our user base would seriously use this? I
don't doubt it is useful, but there might be more practical metrics that
can be gathered.

> Once the above decision is taken, then the question is, "what about
> 'metrics' from other translators? Who gives it out (i.e., dumps it)? Why do
> we need something similar to statedump, and can't we read info from
> statedump itself?". But when we say 'metrics', we should have a key and a
> number associated with it; statedump has a lot more, and no format. If it's
> different from statedump, then what is our answer for translator code to
> give out metrics? This is our *'challenge #3'*

This sounds more useful for users to me. Something like a statedump, but
formatted in a standard way and a mechanism to continuously dump/gather
the data.

> If we get a solution to the above challenges, then I guess we are in decent
> shape for further development. Let's go through them one by one, in detail.
> 
> *Problems and proposed solutions:*
> 
> *a) how to dump metrics data ?*
> 
> Currently, I propose the signal handler way, as it will give us control to
> choose which processes we need to capture information on, and will
> be much faster than communicating through another tool. Also, considering we
> need to have these metrics taken every 10sec or so, there will be a need
> for an efficient way to get this out.
> 
> But even there, we have challenges, because we have already chosen both
> USR1 and USR2 signal handlers, one for statedump, another for toggling
> latency monitoring respectively. It makes sense to continue to have
> statedump use USR1, but toggling options should technically (for
> correctness too) be handled by glusterd volume set options, and there
> should be a way to handle it in a better way by our 'reconfigure()'
> framework in graph-switch. Proposal sent in github issue #303 [6].
> 
> If we are good with above proposal, then we can make use of USR2 for
> metrics dump. Next issue will be about the format of the file itself, which
> we will discuss at the end of the email.

Signals are horrible for libgfapi applications; we just cannot use them
there. For example, QEMU already expects to be able to use signals for
certain actions, and so does NFS-Ganesha.

A local socket per process might be better. The first read (after
opening) can give identification of the process, and subsequent reads
can dump the statistics. I actually would like something like this for
gfapi in any case, but mainly for triggering statedumps without relying
on the glusterd connection. A simple single character command socket, of
some kind.
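
To make the idea a bit more concrete, here is a minimal sketch in Python
(not Gluster code). The one-byte commands 'm' and 'q', the function
names and the reply layout are all assumptions for illustration only;
nothing like this exists in gfapi today.

```python
import os
import socket

# Hypothetical single-byte commands; not an existing Gluster protocol.
CMD_METRICS = b"m"
CMD_QUIT = b"q"

def identify(name):
    """First message on every connection: which process is this?"""
    return f"{name} pid={os.getpid()}\n".encode()

def dump_metrics(metrics):
    """Render counters as '<key> <value>' lines, ended by a blank line."""
    lines = [f"{key} {value}" for key, value in sorted(metrics.items())]
    return ("\n".join(lines) + "\n\n").encode()

def serve_connection(conn, name, metrics):
    """Send the identification, then answer one-byte commands until 'q' or EOF."""
    with conn:
        conn.sendall(identify(name))
        while True:
            cmd = conn.recv(1)
            if not cmd or cmd == CMD_QUIT:
                break
            if cmd == CMD_METRICS:
                conn.sendall(dump_metrics(metrics))
            else:
                conn.sendall(b"error unknown-command\n\n")

def serve_forever(path, name, metrics):
    """Listen on a local (AF_UNIX) socket at 'path', one client at a time."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
        server.bind(path)  # fails if 'path' already exists
        server.listen(1)
        while True:
            conn, _ = server.accept()
            serve_connection(conn, name, metrics)
```

A collector would connect, read the identification line, and then write
'm' every time it wants a fresh dump. A statedump trigger could be one
more command character on the same socket.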

> NOTE: Above approach is already implemented in 'experimental' branch,
> excluding handling of [6].
> 
> *b) where to measure the latency and fops counts?*
> 
> One possible way is to load io-stats in between all the nodes, but
> it has its own limitations. Mainly, how do we configure options in each of
> these translators, and will having too many translators slow down operations?
> (i.e., create one extra 'frame' for every fop, and in a graph of 20 xlators,
> it will be 20 extra frame creates for a single fop).
> 
> I propose we handle this in the 'STACK_WIND/UNWIND' macros themselves, and
> provide a placeholder to store all this data in the translator structure
> itself. This will be cleaner, and no changes are required in the code base,
> other than in 'stack.h (and some in xlator.h)'.
> 
> Also, we can provide 'option monitoring enable' (or disable) option as a
> default option for every translator, and can handle it at xlator_init()
> time itself. (This is not a blocker for 4.0, but good to have). Idea
> proposed @ github #304 [7].
> 
> NOTE: this approach is working pretty well already on the 'experimental'
> branch, excluding [7]. Depending on feedback, we can improve it further.

No real concerns about the above, but I would prefer to have the
additional work disabled by default.

I think this should also be a mount option, similar to setting the
log-level. When a performance problem is found, only a few client
systems should need to provide the metrics, and not all potentially
hundreds of them. Some environments may need to keep the data
collection to a minimum.
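
As a rough illustration of "measure at the wind/unwind point, off by
default", here is a small Python sketch. It is a synchronous
simplification: the real STACK_WIND/STACK_UNWIND are callback-based C
macros, and the Layer, LayerStats and wind() names are made up for this
example.

```python
import time
from collections import defaultdict

class LayerStats:
    """Per-translator placeholder for fop counts and cumulative latency."""
    def __init__(self):
        self.count = defaultdict(int)
        self.latency_ns = defaultdict(int)

class Layer:
    """Stand-in for a translator; 'child' is the next layer in the graph."""
    def __init__(self, name, child=None, monitoring=False):
        self.name = name
        self.child = child
        self.monitoring = monitoring  # disabled unless explicitly enabled
        self.stats = LayerStats()

    def handle(self, fop, *args):
        """Default behaviour: pass the fop down, or finish at the bottom."""
        if self.child is None:
            return "ok"
        return wind(self.child, fop, *args)

def wind(layer, fop, *args):
    """Call into 'layer', timing the round trip only when monitoring is on."""
    if not layer.monitoring:
        return layer.handle(fop, *args)
    start = time.monotonic_ns()
    try:
        return layer.handle(fop, *args)
    finally:
        layer.stats.count[fop] += 1
        layer.stats.latency_ns[fop] += time.monotonic_ns() - start
```

With monitoring off, the only extra cost per fop is one flag check, which
is the property I care about for the clients that are not being
investigated.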

> *c) framework for xlators to provide private metrics*
> 
> One possible solution is to use statedump functions. But to cause the least
> disruption to existing code, I propose adding 2 new methods, 'dump_metrics()'
> and 'reset_metrics()', to the xlator methods, which can be dl_open()'d into
> the xlator structure.

Sounds acceptable to me. (Make them class_methods, so that we can move
the entry points of the xlators to the 'new' structure at the same time
when adding these two functions.)

> 'dump_metrics()' dumps the private metrics in the expected format, and will
> be called from the global dump-metrics framework, and 'reset_metrics()'
> would be called from a CLI command when someone wants to restart metrics
> from 0 to check / validate few things in a running cluster. Helps
> debug-ability.
> 
> Further feedback welcome.

I'm not sure a reset_metrics() is really needed. But that all depends on
the API that the dump-metrics framework presents. It would be good if the
xlators themselves had to do as little (repetitive) work as possible.
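
To show what I mean by keeping the work out of the xlators, here is a
hypothetical sketch in Python. MetricsRegistry and its methods are
invented names, not the API from the 'experimental' branch. The xlator
only registers and increments counters, while dumping and resetting are
done once, in the framework.

```python
class MetricsRegistry:
    """Framework-owned store; xlators only register and bump counters."""

    def __init__(self):
        self._counters = {}

    def counter(self, xlator, name):
        """Register a counter (initially 0) and return its key."""
        key = f"{xlator}.{name}"
        self._counters.setdefault(key, 0)
        return key

    def add(self, key, amount=1):
        """Increment a previously registered counter."""
        self._counters[key] += amount

    def dump(self):
        """Render every counter as one '<key> <value>' line."""
        return "".join(
            f"{key} {value}\n" for key, value in sorted(self._counters.items())
        )

    def reset(self):
        """Set all counters back to 0 without unregistering them."""
        for key in self._counters:
            self._counters[key] = 0
```

With an API shaped like this, a per-xlator reset_metrics() would not be
needed at all, and dump_metrics() would only be required for values that
have to be computed at dump time.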

> NOTE: a sample code is already implemented in 'experimental' branch, and
> protocol/server xlator uses this framework to dump metrics from rpc layer,
> and client connections.
> 
> *d) format of the 'metrics' file.*
> 
> If you want any plottable data on a graph, you need a key (should be a
> string) and a value (should be a number), collected over time. So, this file
> should output data for the monitoring systems and not exactly for
> debug-ability. We have 'statedump' for debug-ability.
> 
> So, I propose a plain text file, where data would be dumped like below.
> 
> ```
> # anything starting from # would be treated as comment.
> <key><space><value>
> # anything after the value would be ignored.
> ```
> Any better solutions are welcome. Ideally, we should keep this friendly for
> external projects to consume, like tendrl [8] or graphite, prometheus etc.
> Also note that, once we agree to the format, it would be very hard to
> change it as external projects would use it.
> 
> I would like to hear the feedback from people who are experienced with
> monitoring systems here.

The pcp.io community should be pretty responsive, and they do have
integrations with other projects. I do not know what format they use, or
how other projects export metrics to PCP, but it is one of the large
Open Source projects for performance monitoring.

Did you look at other projects that provide monitoring statistics yet?

> NOTE: the above format works fine with 'glustermetrics' project [9] and is
> working decently on 'experimental' branch.
> 
> ------
> 
> *Discussions:*
> 
> Let me know how you all want to take the discussion forward?
> 
> Should we get to github, and discuss on each issue? or should I rebase and
> send the current patches from experimental to 'master' branch and discuss
> in our review system?  Or should we continue on the email here!

As an introduction email this is good. I still get lost in GitHub issues
and the comments there, so that is not my preference. Sending patches
for review, and sending an email with further questions/suggestions/...,
is best for me. The number of patches that are under review is sometimes a
little overwhelming, and reading/replying to an email is more efficient.
That also means one email per topic, and not a long combined email like
this (except for kickstarting the discussion).

Thanks!
Niels

> Regards,
> Amar
> 
> References:
> 
> [1] - https://github.com/gluster/glusterfs/issues/137
> [2] - https://github.com/gluster/glusterfs/issues/141
> [3] - https://github.com/gluster/glusterfs/issues/275
> [4] - https://github.com/gluster/glusterfs/issues/168
> [5] - http://lists.gluster.org/pipermail/maintainers/2017-August/002954.html
> (last email of the thread).
> [6] - https://github.com/gluster/glusterfs/issues/303
> [7] - https://github.com/gluster/glusterfs/issues/304
> [8] - https://github.com/Tendrl
> [9] - https://github.com/amarts/glustermetrics
> 
> -- 
> Amar Tumballi (amarts)
