[DISCUSSION] The rebalancing process metrics update

Maxim Muzafarov

Currently, from my perspective, Apache Ignite has very raw rebalance
process metrics. Moreover, the most interesting metrics are tied to a
Cache rather than a CacheGroup, and they require enabling cache
statistics, which can affect node performance.

Some of the metrics do not work as expected. For instance, the
`EstimatedRebalancingKeys` metric returns `-1` from time to time due
to internal issues that require investigation (see [1] for details).
Another metric, `rebalanceKeysReceived`, is treated as a CacheMetric
but is in fact calculated for the whole cache group (e.g. during
historical rebalance; see IGNITE-11330 and the code comment in [2]).
This confuses Ignite users.

I think the rebalance process metrics must be reworked and some issues
fixed, and I invite you to take part in this discussion.


I've posted my thoughts in the description of the issue [3]. Here are
some details.

All such metrics (or their analogues) must be available in
CacheGroupMetrics, and I'd like to suggest the following steps:


Phase-1

rebalancingPartitionsLeft long metric
rebalancingReceivedKeys long metric
rebalancingReceivedBytes long metric
rebalancingStartTime long metric
rebalancingFinishTime long metric

It is not possible to get the actual number of rebalanced partitions
from `LocalNodeMovingPartitionsCount`, because when an empty node
joins the cluster, ownership and WAL enabling happen for all the
partitions at once. Partitions have actually been transferred but are
not yet owned. That's why the `rebalancingPartitionsLeft` metric is
needed, from my point of view.
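To illustrate how the counters listed above could behave, here is a
minimal sketch. `GroupRebalanceCounters` and its callback names are
hypothetical and are not part of the Ignite code base; the point is only
that `partitionsLeft` decreases as each partition is transferred, without
waiting for the partition to become owned.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical per-cache-group holder for rebalance counters (not an Ignite class). */
final class GroupRebalanceCounters {
    private final AtomicLong partitionsLeft = new AtomicLong();
    private final LongAdder receivedKeys = new LongAdder();
    private final LongAdder receivedBytes = new LongAdder();
    private volatile long startTime = -1;
    private volatile long finishTime = -1;

    /** Assumed hook: a new rebalance starts for this cache group. */
    void onRebalanceStarted(int partitionsToRebalance, long nowMillis) {
        partitionsLeft.set(partitionsToRebalance);
        receivedKeys.reset();
        receivedBytes.reset();
        startTime = nowMillis;
        finishTime = -1;
    }

    /** Assumed hook: a batch of entries has been received from a supplier node. */
    void onBatchReceived(long keys, long bytes) {
        receivedKeys.add(keys);
        receivedBytes.add(bytes);
    }

    /** Assumed hook: a partition is fully transferred, even if it is not yet owned. */
    void onPartitionTransferred(long nowMillis) {
        if (partitionsLeft.decrementAndGet() == 0)
            finishTime = nowMillis;
    }

    long partitionsLeft() { return partitionsLeft.get(); }
    long receivedKeys() { return receivedKeys.sum(); }
    long receivedBytes() { return receivedBytes.sum(); }
    long startTime() { return startTime; }
    long finishTime() { return finishTime; }
}
```

With counters like these, a monitoring system would see progress per
transferred partition instead of a single jump at the end.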


Phase-2

rebalancingExpectedKeys long metric
rebalancingExpectedBytes long metric
rebalancingEvictedPartitionsLeft long metric

Investigation is needed into the issues with calculating the
estimated rebalancing keys count for full and historical rebalance
and the actual partition sizes. These metrics must be calculated
before a new rebalance starts for each cache group on the rebalancing
node, so that users can see real values of how many keys will be
rebalanced and can estimate the rebalance finish time using whatever
monitoring system they use.
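As an illustration of the estimate a monitoring system could make from
such values, here is a minimal sketch. The class and method names are
hypothetical, and it assumes a roughly constant transfer rate since the
rebalance start, which is a simplification.

```java
/** Hypothetical helper for a monitoring system (not an Ignite class). */
final class RebalanceEta {
    /**
     * Estimates the remaining rebalance time in milliseconds from the expected
     * and received key counts, assuming the average rate so far continues.
     *
     * @return Remaining milliseconds, 0 if everything expected has been received,
     *         or -1 if no estimate can be made yet.
     */
    static long estimateRemainingMillis(long expectedKeys, long receivedKeys,
        long startTimeMillis, long nowMillis) {
        // No estimate without a valid expectation, some progress and elapsed time.
        if (expectedKeys < 0 || receivedKeys <= 0 || nowMillis <= startTimeMillis)
            return -1;

        if (receivedKeys >= expectedKeys)
            return 0;

        double keysPerMilli = (double)receivedKeys / (nowMillis - startTimeMillis);

        return (long)Math.ceil((expectedKeys - receivedKeys) / keysPerMilli);
    }
}
```

The same calculation works with the expected and received byte counts
in place of the key counts.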

Phase-3 (statistics must be enabled)

rebalancingKeysRate HitRate metric
rebalancingBytesRate HitRate metric

Currently, I've observed high CPU consumption (up to 100%) when
calculating this type of metric. I think this should be investigated
too, and these metrics must be disabled by default.


Once the cache-group-level rebalance metrics are implemented, we need
to mark the rebalancing CacheMetrics as deprecated and remove them
from the newly introduced metrics framework [4]. These cache metrics
should be implemented in the old-fashioned way (as they were before
the metrics framework was added) to keep backwards compatibility, and
must be removed in Apache Ignite 3.0.

Any thoughts?

[1] https://issues.apache.org/jira/browse/IGNITE-11330?focusedCommentId=16867537&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16867537
[2] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/GridDhtPartitionDemander.java#L1134
[3] https://issues.apache.org/jira/browse/IGNITE-12183
[4] https://issues.apache.org/jira/browse/IGNITE-11848