[Gluster-Maintainers] [Gluster-devel] Metrics: and how to get them out from gluster

Xavier Hernandez xhernandez at datalab.es
Fri Sep 1 07:23:28 UTC 2017

Hi Amar,

I don't have time to review the changes in the experimental branch yet,
but here are some comments about these ideas...

On 01/09/17 07:27, Amar Tumballi wrote:
> Disclaimer: This email is long, and did take significant time to write. 
> Do take time and read, review and give feedback, so we can have some 
> metrics related tasks done by Gluster 4.0
> ---
> *History:*
> To understand what is happening inside GlusterFS process, over the 
> years, we have opened many bugs and also coded few things with regard to 
> statedump, and did put some effort into io-stats translator to improve 
> the gluster's monitoring capabilities.
> But surely there is more required! And some glimpse of it is captured in 
> [1], [2], [3] & [4]. Also, I did send an email to this group [5] about 
> possibilities of capturing this information.
> *Current problem:*
> When we talk about metrics or monitoring, we have to consider giving out 
> these data to a tool which can preserve the readings in a periodic time, 
> without a time graph, no metrics will make sense! So, the first 
> challenge itself is how to get them out? Should getting the metrics out 
> from each process need 'glusterd' interacting? or should we use signals? 
> Which leads us to *'challenge #1'.*

One problem I see here is that we will have multiple bricks and multiple 
clients (including FUSE and gfapi).

I assume we want to be able to monitor whole volume performance 
(aggregate values of all mount points), specific mount performance, and 
even specific brick performance.

In this case, the signal approach seems quite difficult to me, especially
for gfapi based clients. Even for fuse mounts and brick processes we
would need to connect to each place where one of these processes is and
send the signal there. Some clients may not be prepared to be accessed
remotely in an easy way.

Using glusterd this problem could be minimized, but I'm not sure that
the interface would be easy to implement (basically because we would
need some kind of filtering syntax to avoid huge outputs) and the output
could be complex to parse for other tools, especially considering that
the amount of data could be significant and it can change with the
addition or change of translators.

I propose a third approach. It's based on a virtual directory similar to
/sys and /proc on Linux. We already have /.meta in gluster. We could
extend that in a way that we could have data there from each mount point
(fuse or gfapi), and each brick. Then we could define an API to allow
each xlator to publish information in that directory in a simple way.

Using this approach, monitoring tools can check only the interesting data
by mounting the volume like any other client and reading the desired
files.

To implement this we could centralize all statistics capturing in 
libglusterfs itself, and create a new translator (or reuse meta) to 
gather this information from libglusterfs and publish it into the 
virtual directory (probably we would need a server side and a client 
side xlator to be able to combine data from all mounts and bricks).

> Next is, should we depend on io-stats to do the reporting? If yes, how 
> to get information from between any two layers? Should we provide 
> io-stats in between all the nodes of translator graph?

I wouldn't depend on io-stats for reporting all the data. Monitoring
seems to me a deeper thing than what a single translator can do.

Using the virtual directory approach, io-stats can place its statistics
there, but it doesn't need to be aware of all other possible statistics
from other xlators because each one will report its own statistics.

> or should we 
> utilize STACK_WIND/UNWIND framework to get the details? This is our 
> *'challenge #2'*

I think that gluster core itself (basically libglusterfs) should keep
its own details on global things like this. These details could also be
published in the virtual directory. From my point of view, io-stats
should be left to provide global timings for the fops or be merged with
the STACK_WIND/UNWIND framework and removed as an xlator.

> Once the above decision will be taken, then the question is, "what about 
> 'metrics' from other translators? Who gives it out (ie, dumps it?)? Why 
> do we need something similar to statedump, and can't we read info from 
> statedump itself?".

I think it would be better and easier to move the information from the 
statedump to the virtual directory instead of trying to use the 
statedump to report everything.

> But when we say 'metrics', we should have a key and 
> a number associated with it, statedump has lot more, and no format. If 
> its different from statedump, then what is our answer for translator 
> code to give out metrics? This is our *'challenge #3*'

Using the virtual directory structure, our key would be a specific file
name in some directory that represents the hierarchical structure of the
volume (xlators), and the value would be its contents.
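
A minimal sketch of how a monitoring tool could consume such a
directory, written in Python only for brevity. The directory layout, the
file names and the function name read_metrics are assumptions made for
illustration; nothing like this exists in gluster today. The key is the
path of each file relative to the metrics root, and the value is the
number stored in the file.

```python
import os


def read_metrics(root):
    """Walk a metrics directory and return {relative/path: numeric value}."""
    metrics = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # The key mirrors the directory hierarchy (volume/xlator/metric).
            key = os.path.relpath(path, root).replace(os.sep, "/")
            try:
                with open(path) as f:
                    metrics[key] = float(f.read().strip())
            except (OSError, ValueError):
                # Skip files that vanish or do not hold a single number.
                continue
    return metrics
```

A tool would call it on wherever the metrics end up being exposed below
the mount point, for example a hypothetical subdirectory of /.meta, and
could restrict itself to one brick simply by starting the walk from that
brick's subdirectory.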

Using this approach we could even allow some virtual files to be
writable to trigger some action inside the whole volume, a specific
mount or a brick, but this doesn't need to be considered right now.

> If we get a solution to above challenges, then I guess we are in a 
> decent shape for further development. Lets go through them one by one, 
> in detail.
> *Problems and proposed solutions:*
> *a) how to dump metrics data ?*
> Currently, I propose signal handler way, as it will give control for us 
> to choose what are the processes we need to capture information on, and 
> will be much faster than communicating through another tool. Also 
> considering we need to have these metrics taken every 10sec or so, there 
> will be a need for efficient way to get this out.

Probably this is not enough. One clear example is multiplexed bricks. We
only have a single process, so a signal will dump information about all
of them. How will we be able to get information only from a single
brick? We can process all the output, but this is unnecessary work when
we only want a small piece of information.

> But even there, we have challenges, because we have already chosen both 
> USR1 and USR2 signal handlers, one for statedump, another for toggling 
> latency monitoring respectively. It makes sense to continue to have 
> statedump use USR1, but toggling options should be technically (for 
> correctness too) be handled by glusterd volume set options, and there 
> should be a way to handle it in a better way by our 'reconfigure()' 
> framework in graph-switch. Proposal sent in github issue #303 [6].
> If we are good with above proposal, then we can make use of USR2 for 
> metrics dump. Next issue will be about the format of the file itself, 
> which we will discuss at the end of the email.
> NOTE: Above approach is already implemented in 'experimental' branch, 
> excluding handling of [6].
> *b) where to measure the latency and fops counts?*
> One of the possible way is to load io-stats in between all the nodes, 
> but it has its own limitations. Mainly, how to configure options in each 
> of this translator, will having too many translators slow down operation 
> ? (ie, create one extra 'frame' for every fop, and in a graph of 20 
> xlator, it will be 20 extra frame creates for a single fop).

As I said previously, I don't like this approach either.

> I propose we handle this in 'STACK_WIND/UNWIND' macros itself, and 
> provide a placeholder to store all this data in translator structure 
> itself. This will be more cleaner, and no changes are required in code 
> base, other than in 'stack.h (and some in xlator.h)'.

I agree.

> Also, we can provide 'option monitoring enable' (or disable) option as a 
> default option for every translator, and can handle it at xlator_init() 
> time itself. (This is not a blocker for 4.0, but good to have). Idea 
> proposed @ github #304 [7].

I'm not sure if this is really necessary. As I understand it, monitoring 
will be based exclusively on counters. Updating a counter is really 
fast. Adding an option to disable it will mean that the code will need 
to check if this option is enabled or not before updating the counters, 
which is slower.

One thing we could do, however, is to add options to the xlator that
publishes the data to tell it what statistics to show in the virtual 
directory. This way we can globally ignore statistics reported by some 
xlator if we want, but without needing to put specific code into each 
translator to enable or disable it.
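
As an illustration of that kind of filtering, here is a small Python
sketch. It assumes metrics are identified by slash-separated keys and
that the publishing xlator is given a list of shell-style patterns to
hide; both the key layout and the name visible_metrics are hypothetical.
The counters themselves are always updated, only their visibility
changes.

```python
from fnmatch import fnmatchcase


def visible_metrics(metrics, hidden_patterns):
    """Return the metrics whose key matches none of the hidden patterns."""
    return {
        key: value
        for key, value in metrics.items()
        if not any(fnmatchcase(key, pattern) for pattern in hidden_patterns)
    }
```

With a pattern such as "*/locks/*", everything reported by a locks
xlator would be dropped from the published view without that xlator
knowing anything about it.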

> NOTE: this approach is working pretty good already at 'experimental' 
> branch, excluding [7]. Depending on feedback, we can improve it further.
> *c) framework for xlators to provide private metrics*
> One possible solution is to use statedump functions. But to cause least 
> disruption to an existing code, I propose 2 new methods. 
> 'dump_metrics()', and 'reset_metrics()' to xlator methods, which can be 
> dl_open()'d to xlator structure.

If we create a framework for metrics, I would prefer that each xlator 
registers its metrics with the framework. This way there's no need for 
additional functions to each xlator. Dump and reset will be done based 
on the registered metrics.
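
A rough model of such a registration framework, again in Python for
brevity (a real one would be C code in libglusterfs). The class and
method names are invented for this sketch. Each xlator registers a
counter once and then only updates it; dump and reset are generic and
need no per-xlator code.

```python
import threading


class MetricsRegistry:
    """Central store of counters registered by xlators."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def register(self, xlator, name):
        """Register a counter and return the key used to update it."""
        key = f"{xlator}/{name}"
        with self._lock:
            self._counters.setdefault(key, 0)
        return key

    def add(self, key, amount=1):
        """Increase a registered counter."""
        with self._lock:
            self._counters[key] += amount

    def dump(self):
        """Return a snapshot of all registered counters."""
        with self._lock:
            return dict(self._counters)

    def reset(self):
        """Set every registered counter back to zero."""
        with self._lock:
            for key in self._counters:
                self._counters[key] = 0
```

The snapshot returned by dump() is what a publishing xlator would turn
into files in the virtual directory.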

> 'dump_metrics()' dumps the private metrics in the expected format, and 
> will be called from the global dump-metrics framework, and 
> 'reset_metrics()' would be called from a CLI command when someone wants 
> to restart metrics from 0 to check / validate few things in a running 
> cluster. Helps debug-ability.
> Further feedback welcome.
> NOTE: a sample code is already implemented in 'experimental' branch, and 
> protocol/server xlator uses this framework to dump metrics from rpc 
> layer, and client connections.
> *d) format of the 'metrics' file.*
> If you want any plot-able data on a graph, you need key (should be 
> string), and value (should be a number), collected over time. So, this 
> file should output data for the monitoring systems and not exactly for 
> the debug-ability. We have 'statedump' for debug-ability.
> So, I propose a plain text file, where data would be dumped like below.

I agree. We could easily extract the values we want from the virtual
directory and convert them to a plain text file in the desired form in a
trivial way if necessary.
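
A sketch of that conversion in Python, assuming the
<key><space><value> layout quoted just below and keys that contain no
whitespace. The function names are made up for this example. The parser
follows the quoted rules: lines starting with # are comments and
anything after the value is ignored.

```python
def format_metrics(metrics):
    """Render {key: number} as '<key> <value>' lines, sorted by key."""
    return "".join(f"{key} {value}\n" for key, value in sorted(metrics.items()))


def parse_metrics(text):
    """Parse '<key> <value>' lines back into {key: float}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split()
        if len(fields) < 2:
            continue  # no value on this line
        try:
            # Anything after the second field is ignored.
            metrics[fields[0]] = float(fields[1])
        except ValueError:
            continue  # value is not a number
    return metrics
```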

> ```
> # anything starting from # would be treated as comment.
> <key><space><value>
> # anything after the value would be ignored.
> ```
> Any better solutions are welcome. Ideally, we should keep this friendly 
> for external projects to consume, like tendrl [8] or graphite, 
> prometheus etc. Also note that, once we agree to the format, it would be 
> very hard to change it as external projects would use it.
> I would like to hear the feedback from people who are experienced with 
> monitoring systems here.
> NOTE: the above format works fine with 'glustermetrics' project [9] and 
> is working decently on 'experimental' branch.
> ------
> *Discussions:*
> Let me know how you all want to take the discussion forward?
> Should we get to github, and discuss on each issue? or should I rebase 
> and send the current patches from experimental to 'master' branch and 
> discuss in our review system?  Or should we continue on the email here!
> Regards,
> Amar
> References:
> [1] - https://github.com/gluster/glusterfs/issues/137
> [2] - https://github.com/gluster/glusterfs/issues/141
> [3] - https://github.com/gluster/glusterfs/issues/275
> [4] - https://github.com/gluster/glusterfs/issues/168
> [5] - 
> http://lists.gluster.org/pipermail/maintainers/2017-August/002954.html 
> (last email of the thread).
> [6] - https://github.com/gluster/glusterfs/issues/303
> [7] - https://github.com/gluster/glusterfs/issues/304
> [8] - https://github.com/Tendrl
> [9] - https://github.com/amarts/glustermetrics
> -- 
> Amar Tumballi (amarts)
