Idea Details

Change the default handling of SNMP counter rollovers in CAPC back to the way it was pre-2.5.0

Bob Milla
08-27-2015 11:15 AM

[Attached image: gaps.bmp]

 

Shortly after we upgraded CAPC from version 2.4.1 to 2.5.0 we started to see a lot of data gaps in our network data, specifically for Utilization In/Out. At first I thought it was a polling problem, but after opening a support case and digging deeper we found out that CAPC changed its default handling of counter rollovers in 2.5.0. In version 2.4.1 and earlier, if a counter value was less than the previous value, CAPC would just assume the counter rolled over and do the proper math to figure out the correct delta value. Now in 2.5.0 it apparently assumes the counter was reset, waits 2 more polling cycles before it can calculate a delta again, and leaves you with a 10-minute data gap.
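For anyone unfamiliar with the "proper math" for a rollover, here is a minimal sketch of the usual wrap-around delta. It is not CAPC's code; the function name `counter_delta` is made up for illustration, and it assumes the counter wrapped at most once between the two polls.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Delta between two polls of an increasing SNMP counter.

    Assumes at most one rollover happened between the polls. If the
    current value is smaller than the previous one, the modulo adds
    back the counter's full range (2**bits).
    """
    modulus = 1 << bits
    if not (0 <= previous < modulus and 0 <= current < modulus):
        raise ValueError("counter value out of range for its width")
    return (current - previous) % modulus


if __name__ == "__main__":
    # Normal case: counter simply increased.
    print(counter_delta(100, 250))
    # Rollover case: counter wrapped past 4,294,967,295 back to a small value.
    print(counter_delta(4_294_967_000, 500))
```

The catch is that a counter reset also shows up as "current less than previous", so this math alone cannot tell the two cases apart.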

 

CAPC will use the low speed 32-bit counters (ifEntry) instead of the high speed 64-bit counters (ifXEntry) for any interface that has a speed of 20M or less. So now let's do a little math to see how often the 32-bit ifInOctets or ifOutOctets counters could roll over on a 20M interface.

 

Highest value on a 32-bit Octet counter = 4,294,967,295

In bits that's 8 * 4,294,967,295 = 34,359,738,360

 

If your 20M interface is running at 100% your counter will roll over in 34,359,738,360 / 20,000,000 ≈ 1,718 seconds

 

1,718 / 60 ≈ 28.6 minutes

 

Who thinks a 10 minute data gap every 29 minutes is a good idea?

 

Now I'll admit that's an extreme example, but even if our 20M interface was running at 50% there would be a 10 minute data gap every hour.  Even on our lower speed interfaces in the 1.5M-6M range we were seeing far too many gaps in the utilization charts.
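If you want to run the same arithmetic for your own interface speeds, here is a small sketch. The function name `rollover_seconds` is just for illustration, and it uses the same maximum counter value as the figures above.

```python
def rollover_seconds(speed_bps: float, utilization: float = 1.0, bits: int = 32) -> float:
    """Seconds until an octet counter of the given width rolls over.

    speed_bps   -- interface speed in bits per second
    utilization -- fraction of the speed in use, 0 < utilization <= 1
    """
    if speed_bps <= 0 or not 0 < utilization <= 1:
        raise ValueError("speed must be positive and utilization in (0, 1]")
    # Highest counter value in octets, converted to bits (as in the post).
    max_bits = (2**bits - 1) * 8
    return max_bits / (speed_bps * utilization)


if __name__ == "__main__":
    for speed, util in [(20_000_000, 1.0), (20_000_000, 0.5), (6_000_000, 1.0), (1_500_000, 1.0)]:
        minutes = rollover_seconds(speed, util) / 60
        print(f"{speed / 1e6:g}M at {util:.0%}: rollover every {minutes:.1f} minutes")
```

For the 20M interface this gives roughly 28.6 minutes at 100% and 57.3 minutes at 50%, which is where the "every hour" figure comes from.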

 

The good news is that it's easy to put it back to the way it worked in 2.4.1. The bad news is CA omitted this change from the 2.5.0 Release Notes, so we had to find out about it the hard way.

 

Here's how to change it back:

On your Data Collector, create a file called com.ca.im.dm.snmp.collector.SnmpCollector.cfg in the <Data_Collector_installation_directory>/apache-karaf-2.3.0/etc directory.

Add the following line to the file:

showGapsOnCounterRollover=false
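If you have several Data Collectors to update, the step above could be scripted. This is only a sketch: the helper name `set_rollover_gaps` is made up, the `etc` directory path is passed in by you, and it assumes the file is a plain list of key=value lines. It keeps any other lines already in the file and writes the setting exactly once.

```python
from pathlib import Path

CFG_NAME = "com.ca.im.dm.snmp.collector.SnmpCollector.cfg"
SETTING = "showGapsOnCounterRollover"


def set_rollover_gaps(etc_dir: str, show_gaps: bool = False) -> Path:
    """Create or update the collector cfg file in etc_dir (hypothetical helper).

    etc_dir is the apache-karaf etc directory under your own
    Data Collector installation directory.
    """
    cfg = Path(etc_dir) / CFG_NAME
    lines = cfg.read_text().splitlines() if cfg.exists() else []
    # Drop any existing copy of the setting so it is never duplicated.
    kept = [ln for ln in lines if not ln.strip().startswith(SETTING + "=")]
    kept.append(f"{SETTING}={'true' if show_gaps else 'false'}")
    cfg.write_text("\n".join(kept) + "\n")
    return cfg
```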

 

 

I'm told by support that the reason for the change was to avoid the large data spikes that can occur when a counter resets. I would argue that counter rollovers are going to occur a lot more often than counter resets. I also don't think customers should have to choose between data spikes and data gaps. I keep hearing how CAPC is supposed to be a "carrier class" product. I would hope that a carrier class SNMP collector would be able to handle both counter rollovers and counter resets in its normalization calculations.
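To show what handling both cases might look like, here is a rough sketch of one possible heuristic. It is my own assumption, not how CAPC works: treat a backwards counter as a rollover only if the wrapped delta is physically possible at the interface speed, and treat it as a reset if sysUpTime went backwards or the delta is implausibly large. The function name, parameters, and the 10% headroom are all illustrative.

```python
from typing import Optional, Tuple


def classify_delta(
    prev: int,
    curr: int,
    interval_s: float,
    speed_bps: float,
    prev_uptime: Optional[int] = None,
    curr_uptime: Optional[int] = None,
    bits: int = 32,
    headroom: float = 1.1,
) -> Tuple[str, Optional[int]]:
    """Return ("normal" | "rollover" | "reset", delta or None)."""
    # sysUpTime going backwards suggests a reboot, so counters were reset.
    # (sysUpTime itself wraps after about 497 days; not handled here.)
    if prev_uptime is not None and curr_uptime is not None and curr_uptime < prev_uptime:
        return ("reset", None)
    if curr >= prev:
        return ("normal", curr - prev)
    wrapped = (curr - prev) % (1 << bits)
    # Most octets the interface could carry in one interval, with some slack.
    max_octets = speed_bps * interval_s / 8 * headroom
    if wrapped <= max_octets:
        return ("rollover", wrapped)
    return ("reset", None)


if __name__ == "__main__":
    # 20M interface, 5-minute polling.
    print(classify_delta(4_294_967_000, 500, 300, 20_000_000))
    print(classify_delta(3_000_000_000, 1_000, 300, 20_000_000))
```

A reset that happens while the counter is already near its maximum would still look like a rollover here, so this reduces the spikes rather than eliminating them.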


Comments

04-24-2017 04:18 AM

Dan, has there been any update on this?

09-07-2015 08:08 AM

Justin, ...and Dan...,

  

To be clear, such validation rules and expressions are a feature-request change, just a different approach which is independent of sysObject values.

09-07-2015 07:55 AM

Justin, I am familiar with such bugs within agents too. However, my experience is that this can best be dealt with as a 'validation' component of each vendor cert. One or many vendors can commit common errors, and the collection engine can reject the rows as needed when these bad values are detected (e.g. billions pct...).

Each rule/expression determines when to reject a calculation row completely, producing a data gap for all metrics and also bypassing thresholding alarm changes for that item until the next cycle.

I have included examples of such rules below. The first 2 of these would be natural additions for all sites for these standard certs; the latter two examples show from field experience where custom self-certifications can benefit from this approach too. The advantage of supporting 2+ rules per cert is that verbose logging can record which rule expression failed and caused the data gap.

* vendorCert_name: ifEntry/ifXEntry

    ruleDescription: Interface counters reset
    expression: (sysuptime < ifCounterDiscontinuityTime) || ((sysuptime - ifCounterDiscontinuityTime)/100 > duration)

    ruleDescription: Too many OUT octets for interface - possible rollover (i.e. expect pctutil_out always < 1000)
    expression: if(speed != 0) then (outoctets <= (duration*$ifSpeed_out)/8 * 10) else 1

    ruleDescription: Too many IN octets for interface - possible rollover (i.e. expect pctutil_in always < 1000)
    expression: if(speed != 0) then (inoctets <= (duration*$ifSpeed_in)/8 * 10) else 1

** vendorCert_name: CISCO_CLASS_BASED_QOS_MIB.cbQosCMStatsEntry

    ruleDescription: PostUtil Reset   (An improvement on this would be CM "network" util < 100, i.e. always expect that postoctets * 8 < portspeed * duration)
    expression: postutil < 2000

    ruleDescription: Drop Counter Reset
    expression: droppkts < 1000000  and dropbytes < 1000000000

*** vendorCert_name: LOOPRUNNER_MIB.lrDs3NpEntry

    ruleDescription: Util Validation
    expression: rxutil <= 100 and txutil <= 100

**** vendorCert_name: TIMETRA_SAP_MIB.sapIngQosQueueStatsEntry

    ruleDescription: err detect
    expression: ingr_out >= 0 and ingr_out < 5500000000 and egr_in >= 0 and egr_in < 1500000000 and egr_out >= 0 and egr_out < 5500000000 and egrdrop_out < 100000000
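As a rough illustration of how rules like these could be evaluated, here is a sketch that checks a calculated row against a list of named rules and reports which ones failed. It does not parse the expression syntax above; the rules are written as plain Python functions, and the row keys (`speed_out`, `speed_in`, and so on) and the function name `failed_rules` are assumptions for the example.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]
Rule = Tuple[str, Callable[[Row], bool]]

# Python equivalents of the two ifEntry/ifXEntry octet rules: a row passes
# if speed is 0 or the octets imply no more than 1000% utilization.
IF_RULES: List[Rule] = [
    ("Too many OUT octets for interface",
     lambda r: r["speed_out"] == 0
     or r["outoctets"] <= r["duration"] * r["speed_out"] / 8 * 10),
    ("Too many IN octets for interface",
     lambda r: r["speed_in"] == 0
     or r["inoctets"] <= r["duration"] * r["speed_in"] / 8 * 10),
]


def failed_rules(row: Row, rules: List[Rule]) -> List[str]:
    """Descriptions of every rule the row fails; empty means keep the row."""
    return [description for description, passes in rules if not passes(row)]


if __name__ == "__main__":
    row = {"duration": 300, "speed_out": 20e6, "speed_in": 20e6,
           "outoctets": 1e12, "inoctets": 1e6}
    failures = failed_rules(row, IF_RULES)
    if failures:
        # Reject the whole row (data gap) and log which rule caused it.
        print("row rejected:", failures)
```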

09-01-2015 03:22 PM

Dan - Thanks for your prompt attention to this one.  We look forward to seeing what your investigation turns up.

09-01-2015 02:55 PM

You may be able to implement a purple-colored line that is your baseline and have that shown instead of gaps. This of course would need to be explained in some sort of FAQ or in the view.

This at least gives the eye something easier to see. Otherwise, look at pps during that time and compare it to the pps over this counter duration; maybe there is a possibility there.

 

I did not see a baseline in your graph, so it does indeed make it look very scattergram-ish.

09-01-2015 01:28 PM

For us, having this feature enabled was critical as we had a few platforms with SNMP bugs where counters were prematurely rolling over. Performance Manager was happily calculating this as a valid rollover, which resulted in huge percentage utilizations (several billion % utilization).

 

We would rather see no data than bogus data.  With that said, we may not have any devices in our environment that only support 32-bit counters, as I can't say I've seen what the OP is reporting.

 

Perhaps the implementation of 'showGapsOnCounterRollover' could be changed to allow it to be enabled/disabled based on the sysObjectID.
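To make the suggestion concrete, here is a sketch of what a per-sysObjectID override could look like. None of this exists in the product as far as I know; the function name `show_gaps_for`, the override table, and the longest-prefix matching are all assumptions. The default of true mirrors the 2.5.0 behavior described in the original post.

```python
from typing import Dict


def show_gaps_for(sys_object_id: str, overrides: Dict[str, bool], default: bool = True) -> bool:
    """Pick the showGapsOnCounterRollover value for a device.

    overrides maps sysObjectID prefixes to a setting; the longest
    matching prefix wins, otherwise the default applies.
    """
    oid = sys_object_id.strip(".")
    best_len, result = -1, default
    for prefix, value in overrides.items():
        p = prefix.strip(".")
        # Match whole OID components only, so "...1.9" does not match "...1.99".
        if (oid == p or oid.startswith(p + ".")) and len(p) > best_len:
            best_len, result = len(p), value
    return result


if __name__ == "__main__":
    # Example table: turn gaps off for one vendor subtree, back on for one model.
    overrides = {"1.3.6.1.4.1.9": False, "1.3.6.1.4.1.9.1.525": True}
    print(show_gaps_for("1.3.6.1.4.1.9.1.1", overrides))
    print(show_gaps_for("1.3.6.1.4.1.9.1.525", overrides))
```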

09-01-2015 01:05 PM

Absolutely. Depending on your event rule this would have a big impact on thresholds. The first part of our investigation is to compare behavior across NFA, eHealth, and CAPM, and then step back and review the right way to ensure we have the most accurate data recorded in the system for analytics and reporting. I will update this issue when we complete the investigation later this month, and I will also post a blog on the topic to explain what we are planning. Data quality is a very important issue for us.

09-01-2015 01:00 PM

For rollovers it is also very important to be thinking of how this affects alerting and thresholds.

 

I look forward to seeing how you address this. Please comment here on how you plan on getting back to us on this issue.

09-01-2015 12:42 PM

We definitely own the mistake of changing default behavior and not, at a minimum, updating the release notes. On behalf of the R&D team, we really apologize for this.

 

We are shifting some resources to dig into this issue a bit more as we are hearing concerns from multiple customers - thanks for raising and as we consider any improvements / changes I will be sure to update the community!

 

Regarding this issue, it's a great use of the community, so thanks for posting. If we get a strong response from the community we will definitely consider changing the default, but again, I want to see just how we can avoid having bad values or gaps recorded.

08-27-2015 11:33 AM

Maybe a technical document would be better for this case, if it is only a matter of changing one setting in the configuration.