[Gluster-devel] Metrics: and how to get them out from gluster

Amar Tumballi atumball at redhat.com
Fri Sep 8 06:19:18 UTC 2017


Thanks all for the feedback!

On Sat, Sep 2, 2017 at 12:21 AM, John Strunk <jstrunk at redhat.com> wrote:

>
>
> On Fri, Sep 1, 2017 at 1:27 AM, Amar Tumballi <atumball at redhat.com> wrote:
>
>> Disclaimer: This email is long, and did take significant time to write.
>> Do take time and read, review and give feedback, so we can have some
>> metrics related tasks done by Gluster 4.0
>>
>> ---
>> ** History:*
>>
>> To understand what is happening inside a GlusterFS process, over the
>> years we have opened many bugs, coded a few things around statedump, and
>> put some effort into the io-stats translator to improve Gluster's
>> monitoring capabilities.
>>
>> But surely more is required! Some glimpses of it are captured in [1],
>> [2], [3] and [4]. I also sent an email to this group [5] about
>> possibilities for capturing this information.
>>
>> ** Current problem:*
>>
>> When we talk about metrics or monitoring, we have to consider handing
>> this data to a tool which can preserve the readings over time; without a
>> time series, no metric makes sense! So the first challenge is how to get
>> the data out. Should getting the metrics out of each process require
>> 'glusterd' to be involved, or should we use signals? This leads us to
>> *'challenge #1'.*
>>
>> Next: should we depend on io-stats to do the reporting? If yes, how do we
>> get information from between any two layers? Should we load io-stats
>> between all the nodes of the translator graph, or should we utilize the
>> STACK_WIND/UNWIND framework to get the details? This is our *'challenge
>> #2'*.
>>
>> Once the above decision is taken, the next question is: "what about
>> 'metrics' from other translators? Who gives them out (ie, dumps them)?
>> Why do we need something similar to statedump; can't we read the info
>> from statedump itself?". When we say 'metrics', each should be a key with
>> a number associated with it, whereas statedump contains a lot more and
>> has no fixed format. If it is different from statedump, then what is our
>> answer for translator code to give out metrics? This is our *'challenge
>> #3'*.
>>
>> If we solve the above challenges, then I guess we are in decent shape for
>> further development. Let's go through them one by one, in detail.
>>
>> ** Problems and proposed solutions:*
>>
>> *a) how to dump metrics data ?*
>>
>> Currently, I propose the signal-handler approach, as it gives us control
>> over which processes we capture information from, and it is much faster
>> than communicating through another tool. Also, considering these metrics
>> need to be collected every 10 seconds or so, we need an efficient way to
>> get them out.
>>
>> But even there we have challenges, because we have already claimed both
>> the USR1 and USR2 signal handlers: one for statedump, the other for
>> toggling latency monitoring. It makes sense to continue having statedump
>> use USR1, but toggling options should technically (and for correctness)
>> be handled by glusterd volume-set options, and there should be a better
>> way to handle it through our 'reconfigure()' framework during graph
>> switch. Proposal sent in GitHub issue #303 [6].
>>
>> If we are good with the above proposal, then we can make use of USR2 for
>> the metrics dump. The next issue is the format of the file itself, which
>> we will discuss at the end of this email.
>>
>> NOTE: The above approach is already implemented in the 'experimental'
>> branch, excluding the handling of [6].
>>
>
>
This was done with SIGUSR2 mainly because, for the 'implementation' used to
test out other things, it was just a one-line change :-)

We should surely plan something else here too, IMO. I will wait some more
time before doing anything on this.
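
For reference, a minimal sketch of what the signal-triggered dump could look
like; the handler only sets a flag (async-signal-safe) and a worker thread or
main loop does the writing. gf_proc_dump_metrics() and the dump path below
are placeholders, not the actual glusterfs symbols:

```c
/* Minimal sketch of a SIGUSR2-triggered metrics dump. The handler only
 * sets a flag; a worker thread (or the main loop) notices it and writes
 * the file. gf_proc_dump_metrics() and the dump path are placeholders. */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t metrics_dump_requested;

static void
metrics_signal_handler (int sig)
{
        (void) sig;
        metrics_dump_requested = 1;   /* real work happens outside the handler */
}

static void
gf_proc_dump_metrics (void)
{
        char path[128];
        snprintf (path, sizeof (path), "/var/run/gluster/metrics.%d",
                  (int) getpid ());
        FILE *fp = fopen (path, "w");
        if (!fp)
                return;
        /* each xlator would append its "<key> <value>" lines here */
        fclose (fp);
}

int
metrics_signal_setup (void)
{
        struct sigaction sa;
        memset (&sa, 0, sizeof (sa));
        sa.sa_handler = metrics_signal_handler;
        sigemptyset (&sa.sa_mask);
        return sigaction (SIGUSR2, &sa, NULL);
}
```

With something along these lines, `kill -USR2 <pid>` from a collector would
request a fresh dump without involving glusterd.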



> I'm going to pile on with the others discouraging the use of signals and
> put a vote in favor of using a network socket.
> In a previous project [10], we used a listening TCP socket to provide
> metrics to graphite. This has the ability to support multiple receivers by
> just sending a copy to each currently open connection, and if there is a
> concern about overwhelming receivers and/or slowing down the gluster
> sending side, these could be non-blocking sockets that simply drop data if
> there is no room in the outbound buffer. The data format we used was
> exactly the Graphite text format [11], which includes a timestamp directly
> with each metric. The downside is extra data, but it removes
> transmission/processing/queuing latency concerns. In practice, we
> calculated the timestamp once and used it for all metrics sent in the
> interval to minimize the overhead imposed by repeated gettimeofday().
> Another reason I like the socket approach is that in containerized
> environments, I can easily run a sidecar that grabs the metrics and
> forwards or processes them and it doesn't have to share anything more than
> a network port.
>
> The biggest drawback to the socket approach is its passive nature. The
> receiver is stuck with whatever stat frequency gluster chooses, though this
> could be configured either globally or per connection.
>
> [10] https://github.com/NTAP/chronicle/
> [11] https://graphite.readthedocs.io/en/latest/feeding-carbon.html#the-plaintext-protocol
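
To make the trade-off concrete, here is a rough sketch (hypothetical, not
taken from Chronicle or glusterfs) of the "drop if the buffer is full" send
in the Graphite plaintext format; the metric names and wiring are
illustrative only:

```c
/* Rough sketch: push one metric in Graphite plaintext format
 * ("<name> <value> <timestamp>\n") over a non-blocking socket,
 * silently dropping the sample if the outbound buffer is full.
 * Metric names and surrounding wiring are illustrative only. */
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

static int
send_metric (int fd, const char *name, long long value, time_t ts)
{
        char line[256];
        int len = snprintf (line, sizeof (line), "%s %lld %lld\n",
                            name, value, (long long) ts);
        if (len < 0 || len >= (int) sizeof (line))
                return -1;

        /* MSG_DONTWAIT: never block the I/O path; MSG_NOSIGNAL: no SIGPIPE */
        ssize_t ret = send (fd, line, len, MSG_DONTWAIT | MSG_NOSIGNAL);
        if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                return 0;   /* receiver is slow: drop the sample */
        return (ret == len) ? 0 : -1;
}

/* Compute the timestamp once per interval and reuse it, as described
 * above, to avoid repeated clock calls. */
static void
send_interval (int fd)
{
        time_t now = time (NULL);
        send_metric (fd, "gluster.vol0.brick0.fop.write.count", 1234, now);
        send_metric (fd, "gluster.vol0.brick0.fop.write.latency_usec", 87, now);
}
```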
>
>
>
>>
>> *b) where to measure the latency and fops counts?*
>>
>> One possible way is to load io-stats between all the nodes, but that has
>> its own limitations. Mainly: how do we configure options in each of these
>> translator instances, and will having too many translators slow down
>> operations? (ie, each loads one extra 'frame' for every fop, so in a
>> graph of 20 xlators that is 20 extra frame creations for a single fop.)
>>
>> I propose we handle this in the 'STACK_WIND/UNWIND' macros themselves,
>> and provide a placeholder to store all this data in the translator
>> structure itself. This is cleaner, and no changes are required in the
>> code base other than in 'stack.h' (and some in 'xlator.h').
>>
>> Also, we can provide an 'option monitoring enable' (or disable) option as
>> a default option for every translator, and handle it at xlator_init()
>> time itself. (This is not a blocker for 4.0, but good to have.) Idea
>> proposed in GitHub issue #304 [7].
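
A rough sketch of what the "placeholder in the translator structure" above
could look like; the type, field and function names are hypothetical, not
the actual stack.h/xlator.h changes:

```c
/* Hypothetical sketch of per-xlator fop accounting driven from the
 * STACK_WIND/UNWIND path; names are illustrative, not the actual
 * stack.h / xlator.h changes. */
#include <stdint.h>

#define GF_FOP_MAXVALUE 64              /* stand-in for the real fop count */

typedef struct {
        uint64_t count[GF_FOP_MAXVALUE];        /* fops wound through this xlator */
        uint64_t latency_nsec[GF_FOP_MAXVALUE]; /* cumulative wind->unwind latency */
} xlator_metrics_t;

/* Would hang off xlator_t (e.g. this->metrics) and be updated from the
 * wind/unwind macros, roughly like: */
static inline void
metrics_on_wind (xlator_metrics_t *m, int fop)
{
        __sync_fetch_and_add (&m->count[fop], 1);
}

static inline void
metrics_on_unwind (xlator_metrics_t *m, int fop, uint64_t elapsed_nsec)
{
        __sync_fetch_and_add (&m->latency_nsec[fop], elapsed_nsec);
}
```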
>>
>>
> I really don't like disabling the monitoring. If we design the
> infrastructure correctly, the overhead is minimal, and it's always easily
> available to users. If there is a question of whether the monitoring is
> fast and/or robust enough for continuous use in production, we've missed
> the mark.
>
>
I am also of the opinion that monitoring should be enabled by default; that
doesn't mean 'debugging' is enabled by default. My thinking is that the
option to disable it is there only for those who want those few extra
nanoseconds of performance.

Also, I would like to use clock_gettime() [20] instead of gettimeofday(),
as it should be faster in a multi-threaded process. Any suggestions there?
I can send a separate patch for this, and we can discuss the merits and
issues there.
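
For illustration, the kind of helper I have in mind, assuming
CLOCK_MONOTONIC is acceptable for latency intervals (wall-clock time is
only needed when stamping the dump itself); whether a coarse clock would be
"good enough" is something to validate on the actual patch:

```c
/* Sketch: measuring an interval with clock_gettime() instead of
 * gettimeofday(). CLOCK_MONOTONIC is immune to wall-clock jumps. */
#include <stdint.h>
#include <time.h>

static inline uint64_t
now_nsec (void)
{
        struct timespec ts;
        clock_gettime (CLOCK_MONOTONIC, &ts);
        return (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* usage:
 *   uint64_t start = now_nsec ();
 *   ... fop travels down and back up the graph ...
 *   uint64_t elapsed = now_nsec () - start;
 */
```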

Also, I will send another patch on bringing the io-stats feature to every
translator. Again, that can be debated on the patch itself.


>
>> NOTE: this approach is already working pretty well in the 'experimental'
>> branch, excluding [7]. Depending on feedback, we can improve it further.
>>
>> *c) framework for xlators to provide private metrics*
>>
>> One possible solution is to use the statedump functions. But to cause the
>> least disruption to existing code, I propose two new xlator methods,
>> 'dump_metrics()' and 'reset_metrics()', which can be dlopen()'d into the
>> xlator structure.
>>
>> 'dump_metrics()' dumps the private metrics in the expected format and
>> will be called from the global dump-metrics framework; 'reset_metrics()'
>> would be called from a CLI command when someone wants to restart metrics
>> from 0 to check/validate a few things in a running cluster. It helps
>> debuggability.
>>
>
> Having a "reset" function complicates the data processing and analysis
> because it implies that there is a single consumer of the metrics data. Ok
> for debugging, I guess, but no legitimate consumer should ever use it.
>
>

Ack! I also like the idea proposed by Xavi about having a registered
function. But considering that every translator's exposed functions/symbols
already go through the dlopen() approach today, I will still propose the
same approach. I will send the patch out, and anyone can suggest further
changes on the patch itself.
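
For clarity, the rough shape I have in mind for the two methods; the
signatures and the fd-based dump are illustrative and may change in the
actual patch:

```c
/* Rough shape of the proposed per-xlator metrics methods, resolved via
 * dlsym() at load time like the other xlator entry points. Names and the
 * fd-based dump signature are illustrative only. */
typedef struct xlator xlator_t;         /* stand-in for the real xlator_t */

/* dump this xlator's private metrics as "<key> <value>" lines */
typedef int (*dump_metrics_fn) (xlator_t *this, int fd);

/* reset all private counters back to zero (debug aid only) */
typedef int (*reset_metrics_fn) (xlator_t *this);

/* An xlator that implements these would export:
 *
 *     int dump_metrics (xlator_t *this, int fd) { ... }
 *     int reset_metrics (xlator_t *this) { ... }
 *
 * and the xlator loading code would pick them up with
 * dlsym (handle, "dump_metrics") / dlsym (handle, "reset_metrics"),
 * leaving them NULL for translators that do not provide them. */
```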


>
>> Further feedback welcome.
>>
>> NOTE: sample code is already implemented in the 'experimental' branch;
>> the protocol/server xlator uses this framework to dump metrics from the
>> RPC layer and client connections.
>>
>> *d) format of the 'metrics' file.*
>>
>> If you want plottable data on a graph, you need a key (a string) and a
>> value (a number), collected over time. So this file should output data
>> for monitoring systems, not for debuggability; we have 'statedump' for
>> debuggability.
>>
>> So, I propose a plain-text file where data would be dumped like below.
>>
>> ```
>> # anything starting with # would be treated as a comment.
>> <key><space><value>
>> # anything after the value would be ignored.
>> ```
>> Better solutions are welcome. Ideally, we should keep this friendly for
>> external projects to consume, like Tendrl [8], Graphite, Prometheus, etc.
>> Also note that once we agree on the format, it will be very hard to
>> change, as external projects will depend on it.
>>
>> I would like to hear feedback from people who are experienced with
>> monitoring systems here.
>>
>> NOTE: the above format works fine with the 'glustermetrics' project [9]
>> and is working decently in the 'experimental' branch.
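
To make the format concrete, a tiny sketch of a translator emitting its
counters as '<key> <value>' lines; the key names are made up for
illustration:

```c
/* Sketch: emitting metrics in the proposed "<key> <value>" format.
 * The key names are made up for illustration. */
#include <inttypes.h>
#include <stdio.h>

static void
dump_fop_metrics (FILE *fp, const char *xl_name,
                  uint64_t write_count, uint64_t write_lat_nsec)
{
        /* comments are allowed; anything after the value is ignored */
        fprintf (fp, "# xlator %s\n", xl_name);
        fprintf (fp, "%s.fop.write.count %" PRIu64 "\n", xl_name, write_count);
        fprintf (fp, "%s.fop.write.latency_nsec %" PRIu64 "\n",
                 xl_name, write_lat_nsec);
}

/* would produce, for example:
 *   # xlator vol0-write-behind
 *   vol0-write-behind.fop.write.count 1234
 *   vol0-write-behind.fop.write.latency_nsec 5678901
 */
```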
>>
>>
> As mentioned above, I've previously used the Graphite plain text format:
> <metric_name><space><value><space><timestamp>. The common thing to do
> here is to have the metric name form a hierarchy to add structure to the
> metrics: "system.cpu.user.2" would be user-mode time from cpu core 2, for
> example. Prometheus seems to be slightly different in that it adds labels
> to the metrics [12].
>
> [12] https://prometheus.io/docs/concepts/data_model/
>
>
>> ------
>>
>> ** Discussions:*
>>
>> Let me know how you all would like to take this discussion forward.
>>
>> Should we move to GitHub and discuss each issue there? Should I rebase
>> and send the current patches from 'experimental' to the 'master' branch
>> and discuss them in our review system? Or should we continue over email
>> here?
>>
>> Regards,
>> Amar
>>
>> References:
>>
>> [1] - https://github.com/gluster/glusterfs/issues/137
>> [2] - https://github.com/gluster/glusterfs/issues/141
>> [3] - https://github.com/gluster/glusterfs/issues/275
>> [4] - https://github.com/gluster/glusterfs/issues/168
>> [5] - http://lists.gluster.org/pipermail/maintainers/2017-August/002954.html
>> (last email of the thread).
>> [6] - https://github.com/gluster/glusterfs/issues/303
>> [7] - https://github.com/gluster/glusterfs/issues/304
>> [8] - https://github.com/Tendrl
>> [9] - https://github.com/amarts/glustermetrics
>>
>>
[20] - https://linux.die.net/man/3/clock_gettime

>> -- 
>> Amar Tumballi (amarts)
>>
>
>


-- 
Amar Tumballi (amarts)