Cache Metrics

Вячеслав Коптилин
Hi Experts,

I am working on https://issues.apache.org/jira/browse/IGNITE-3495

A few words about this issue:
The gathering and updating of cache metrics is inconsistent in some cases.
Consider a simple topology of only two nodes: the first is a client node
and the second is a server.
The client node sends requests to the server node, for instance
cache.put(), cache.putAll(), cache.get(), etc.
In that case, the counter metrics (cache hits, cache misses, removals and
puts) are calculated on the server side,
while the time metrics are updated on the client node.
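
The split described above can be pictured with a toy model in plain Java. This is only a sketch: `NodeMetricsSketch` and `SplitMetricsDemo` are hypothetical names, not Ignite classes, and the model assumes nothing beyond the behaviour stated in this message (counters land on the server, elapsed time lands on the client).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-node metrics holder; NOT an Ignite class.
class NodeMetricsSketch {
    final AtomicLong puts = new AtomicLong();         // counter metric
    final AtomicLong putTimeNanos = new AtomicLong(); // time metric

    // Average put time as this node alone could report it; 0 when no puts were counted here.
    double avgPutTimeNanos() {
        long n = puts.get();
        return n == 0 ? 0.0 : (double) putTimeNanos.get() / n;
    }
}

class SplitMetricsDemo {
    // Simulates one put started on the client and executed on the server,
    // following the behaviour described in the message.
    static void clientPut(NodeMetricsSketch client, NodeMetricsSketch server, long elapsedNanos) {
        server.puts.incrementAndGet();               // the counter is updated on the server
        client.putTimeNanos.addAndGet(elapsedNanos); // the time is updated on the client
    }

    public static void main(String[] args) {
        NodeMetricsSketch client = new NodeMetricsSketch();
        NodeMetricsSketch server = new NodeMetricsSketch();
        clientPut(client, server, 1_500_000L);
        System.out.println("server: puts=" + server.puts + " avgPutTime=" + server.avgPutTimeNanos());
        System.out.println("client: puts=" + client.puts + " putTime=" + client.putTimeNanos);
    }
}
```

In this model neither node can compute a meaningful average put time on its own: the server has the count but no time, and the client has the time but no count.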

I think that both kinds of metrics (counters and time) should be calculated
on the same node. So, there are two obvious solutions:

#1 The node that starts an operation is responsible for updating the cache
metrics.
Pro:
- it yields more accurate metrics.
Contra:
- this approach does not work in some cases, for example, a partitioned
  cache with the FULL_ASYNC write synchronization mode.
- the response messages (GridNearAtomicUpdateResponse,
  GridNearGetResponse, etc.) need to be extended to carry additional
  information from the remote node: cache hits, number of removals, etc.
  This puts additional pressure on the communication channel.
  Perhaps the impact will be small: 4 bytes per message or something like
  that.
- backward incompatibility (a consequence of the previous point)
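
To make the message-size and compatibility points of #1 concrete, here is a minimal sketch in plain Java. `UpdateResponseSketch`, its `requestId` field and both wire formats are invented for illustration; the real layout of GridNearAtomicUpdateResponse is not modelled, and Ignite's own message versioning may handle this differently.

```java
import java.nio.ByteBuffer;

// Hypothetical response payload; the real Ignite message classes are not modelled here.
class UpdateResponseSketch {
    final long requestId; // assumed pre-existing field, for illustration only
    final int cacheHits;  // assumed new field: hits counted on the remote node

    UpdateResponseSketch(long requestId, int cacheHits) {
        this.requestId = requestId;
        this.cacheHits = cacheHits;
    }

    // "Old" wire format: request id only.
    ByteBuffer encodeLegacy() {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(requestId);
        buf.flip();
        return buf;
    }

    // "New" wire format: request id plus one int counter, i.e. 4 extra bytes.
    ByteBuffer encodeExtended() {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Integer.BYTES);
        buf.putLong(requestId).putInt(cacheHits);
        buf.flip();
        return buf;
    }

    // Decoder that expects the extended format. Feeding it a legacy message
    // underflows the buffer, which is the compatibility problem in miniature.
    static UpdateResponseSketch decodeExtended(ByteBuffer buf) {
        long id = buf.getLong();
        int hits = buf.getInt();
        return new UpdateResponseSketch(id, hits);
    }
}
```

Each additional int counter costs another 4 bytes per response in this model, and a node that expects the extended format cannot read a response produced by a node that still sends the old one.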

#2 The primary node (the node that actually executes a request) is
responsible for updating the cache metrics.
Pro:
- easy to implement
- backward compatible
Contra:
- time metrics will not include the time of communication between nodes, so
  the results will be less accurate.
- perhaps we need to provide an additional metric that reports the average
  time of communication between nodes.
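
One way to picture the additional metric mentioned under #2 is as the difference between the time seen by the node that started the operation and the time spent executing it on the primary. The sketch below assumes both measurements are available for the same operation, which the thread does not establish; `OpTimingSketch` and `CommunicationTimeSketch` are hypothetical names.

```java
import java.util.List;

// Hypothetical timing record for a single cache operation; not an Ignite type.
class OpTimingSketch {
    final long clientRoundTripNanos; // measured on the node that started the operation
    final long primaryLocalNanos;    // measured on the primary node around local execution

    OpTimingSketch(long clientRoundTripNanos, long primaryLocalNanos) {
        this.clientRoundTripNanos = clientRoundTripNanos;
        this.primaryLocalNanos = primaryLocalNanos;
    }

    // Everything outside local execution: network transfer, marshalling, queuing.
    // Clamped at zero in case the two clocks disagree slightly.
    long communicationNanos() {
        return Math.max(0L, clientRoundTripNanos - primaryLocalNanos);
    }
}

class CommunicationTimeSketch {
    // Average communication time over a batch of operations; 0 for an empty batch.
    static double avgCommunicationNanos(List<OpTimingSketch> ops) {
        if (ops.isEmpty()) return 0.0;
        long total = 0L;
        for (OpTimingSketch op : ops) total += op.communicationNanos();
        return (double) total / ops.size();
    }
}
```

Note that pairing the two measurements would itself require sending one of them to the other node, so this metric is not free under #2 either.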

Please let me know your thoughts.
Perhaps neither alternative is good enough...

Regards,
Slava.

Re: Cache Metrics

Andrey Gura-2
Hi,

I believe that the first solution is better than the second because it
takes into account network communication time. An average time of
communication between nodes doesn't make sense from my point of view.

So I vote for #1.

(In reply to Вячеслав Коптилин's message of Thu, Jul 13, 2017 at 11:52 PM.)

Re: Cache Metrics

Denis Magda-2
Guys,

What if we calculate it on both sides? The client will keep the total time needed to complete an operation, including network hops, while a server (primary or backup) will count only local time.


Denis

(In reply to Andrey Gura's message of Jul 17, 2017, 7:07 AM.)


Re: Cache Metrics

Andrey Gura-2
Den,

that doesn't make sense from my point of view. And we create a new problem:
how should we aggregate these metrics when a user requests metrics for a
cluster group?

(In reply to Denis Magda's message of Mon, Jul 24, 2017 at 8:48 PM.)

Re: Cache Metrics

Denis Magda-2
Andrey,

I would simply take an average if a mixed client-server cluster group is used.

In general, the goal of the ticket was to fix the time-based metrics on the server side. As far as I understand, they are already calculated properly on the client side, including the network contribution, right? So, all that's left to do is to count the same on the servers so that those metrics no longer return 0.
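
As a rough illustration of taking an average over a mixed group, the sketch below weights each node's total put time by its number of puts. `NodePutSnapshot` and `GroupAverageSketch` are hypothetical names, and nothing here reflects how Ignite actually aggregates cluster group metrics.

```java
import java.util.List;

// Hypothetical per-node snapshot: how many puts were timed and how long they took in total.
class NodePutSnapshot {
    final boolean client;     // true for a client node, false for a server
    final long puts;          // number of timed puts on this node
    final long totalPutNanos; // sum of their durations as measured on this node

    NodePutSnapshot(boolean client, long puts, long totalPutNanos) {
        this.client = client;
        this.puts = puts;
        this.totalPutNanos = totalPutNanos;
    }
}

class GroupAverageSketch {
    // Weighted average put time over every node in the group, clients and servers alike.
    static double avgPutNanos(List<NodePutSnapshot> nodes) {
        long puts = 0L;
        long total = 0L;
        for (NodePutSnapshot n : nodes) {
            puts += n.puts;
            total += n.totalPutNanos;
        }
        return puts == 0 ? 0.0 : (double) total / puts;
    }
}
```

With a client that measures round trips and a server that measures only local time, the group value falls somewhere between the two per-node averages.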


Denis
 

(In reply to Andrey Gura's message of Jul 25, 2017, 6:53 AM.)


Re: Cache Metrics

Andrey Gura-2
Den,

I see at least two problems here:

1. The meaning of the metrics for the end user. How should a user
interpret the metrics in this case? Moreover, an average is a bad gauge
for monitoring because it hides actual latencies. Users should be able to
get accurate metrics in order to build monitoring that can, for example,
create percentile-based charts, and accuracy is a very important property
in such cases.

2. It just makes the code more complex, and we will have metrics-related
logic in two places instead of one.
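
The first point, that an average hides actual latencies, can be shown with a small self-contained example. `LatencyStatsSketch` is a hypothetical helper and the sample values are made up; the percentile uses the simple nearest-rank definition.

```java
import java.util.Arrays;

class LatencyStatsSketch {
    // Arithmetic mean of the samples; 0 for an empty array.
    static double mean(long[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samples, int p) {
        if (samples.length == 0) throw new IllegalArgumentException("no samples");
        if (p < 1 || p > 100) throw new IllegalArgumentException("p must be in 1..100");
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Made-up latencies in milliseconds: 98 fast operations and 2 slow ones.
        long[] samples = new long[100];
        Arrays.fill(samples, 1L);
        samples[10] = 500L;
        samples[60] = 500L;
        System.out.println("mean=" + mean(samples));
        System.out.println("p50=" + percentile(samples, 50));
        System.out.println("p99=" + percentile(samples, 99));
    }
}
```

For this sample the mean sits far from both the typical latency and the slow tail, while the 50th and 99th percentiles report each of them directly.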



(In reply to Denis Magda's message of Wed, Jul 26, 2017 at 4:45 AM.)