DX NetOps


Help, interpretation and suggestions regarding the exposed behavior.

  • 1.  Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 13, 2021 03:55 PM
    Hello Community.

    I have the following situation in Performance Management.

    On certain occasions I observe that the functionality of the DCs becomes degraded, which causes loss of statistics or long waiting times for them.


    As a result, I am seeing polling loss.

    The current state of the DCs is as follows.

    If someone could help me with some advice to avoid these drops in functionality, I would appreciate it.


    Best Regards,

    Isaac Velasco


  • 2.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 14, 2021 11:59 AM
    Did the brown/green DC's dcmd process restart between 00:10 and 00:20 ?
    That's the only thing that makes sense for the heap to go down that much.


  • 3.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 14, 2021 02:09 PM
    Hello Jeffrey

    No restart was made to the services. That is why I am looking for an answer that can help me get to the root of the problem.
    Greetings. Isaac



  • 4.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 14, 2021 02:57 PM

    Is there a corresponding graph for application pause for the DCs?  Is there a spike in app pause too?

    What does "ps -ef | grep java" show for the karaf process on those 2 DCs that dip?   It should say when the process was started.

    So the last graph, "Polling stopped due to prior timeouts", means that the DC sent 15 requests for 1 Metric Family to a device and got no response from the device.  From the spike, roughly 170k devices had the same thing happen.

    Maybe there was a network issue, or maybe there was a large app pause due to Java garbage collection that caused responses not to be processed.
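
    For reference, one way to confirm the karaf process start time from the command line is below (a rough sketch; it assumes a Linux DC and that the karaf JVM is the only process matching "karaf.main.Main"):

    # Show the karaf java process; the STIME column is when it started
    ps -ef | grep '[k]araf.main.Main'

    # Print the full start timestamp and elapsed uptime for that PID
    ps -o pid,lstart,etime -p "$(pgrep -f karaf.main.Main)"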




  • 5.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 14, 2021 05:17 PM
    Hello Jeffrey

    I am sharing the output of the ps command (ps -fea) for the java process.
    The service was last restarted on October 11.

    root 21014 1 99 Oct11 ? 4-07:47:36 /CA/IMDataCollector/jre/bin/java -Xms2048M -Xmx32769M -server -Xms2048M -Xmx32769M -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass -Dcom.sun.management.jmxremote -XX:NewSize=1535m -XX:NewRatio=3 -XX:SurvivorRatio=6 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:TargetSurvivorRatio=50 -XX:InitialTenuringThreshold=15 -XX:MaxTenuringThreshold=15 -XX:+ScavengeBeforeFullGC -XX:+ExplicitGCInvokesConcurrent -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSClassUnloadingEnabled -Djava.endorsed.dirs=/CA/IMDataCollector/jre/jre/lib/endorsed:/CA/IMDataCollector/jre/lib/endorsed:/CA/IMDataCollector/apache-karaf-2.4.3/lib/endorsed -Djava.ext.dirs=/CA/IMDataCollector/jre/jre/lib/ext:/CA/IMDataCollector/jre/lib/ext:/CA/IMDataCollector/apache-karaf-2.4.3/lib/ext -Dkaraf.instances=/CA/IMDataCollector/apache-karaf-2.4.3/instances -Dkaraf.home=/CA/IMDataCollector/apache-karaf-2.4.3 -Dkaraf.base=/CA/IMDataCollector/apache-karaf-2.4.3 -Dkaraf.data=/CA/IMDataCollector/apache-karaf-2.4.3/data -Dkaraf.etc=/CA/IMDataCollector/apache-karaf-2.4.3/etc -Dda.data.home=/CA/IMDataCollector/apache-karaf-2.4.3/da_data -Dda.version=1.0.0.0 -Djava.io.tmpdir=/CA/IMDataCollector/apache-karaf-2.4.3/data/tmp -Djava.util.logging.config.file=/CA/IMDataCollector/apache-karaf-2.4.3/etc/java.util.logging.properties -XX:+HeapDumpOnOutOfMemoryError -Dorg.apache.activemq.SERIALIZABLE_PACKAGES=* -XX:OnOutOfMemoryError=/CA/IMDataCollector/apache-karaf-2.4.3/bin/restart -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf-jaas-boot.jar:/CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf-wrapper.jar:/CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf.jar org.apache.karaf.main.Main
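
    As an aside, the heap-related flags can be picked out of that long command line with something like this (a sketch, matching the karaf process the same way as in the earlier example):

    # List only the -Xms/-Xmx settings from the running karaf command line
    ps -ef | grep '[k]araf.main.Main' | tr ' ' '\n' | grep -E '^-Xm[sx]'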

    Could it be that the DA is responsible for these drops in the performance of the DCs?

    Best regards,

    Isaac Velasco.



  • 6.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 15, 2021 04:44 PM
    Okay, so not due to DC restart.

    I don't think the issue is with the DA.  The DC handles all the polling.  The "Polling stopped due to prior timeouts" metric is calculated by the DC and sent as a poll response to the internal Device Polling Statistics MF.

    You may see a delay in processing poll responses in the DA due to DA heap/GC, but that would not cause reduced polling or such a big drop in DC memory usage.

    Do you have a graph of the application pause for those 2 DCs in the system health dashboard for the DC?
    What about poll item count and calculated metrics per second for those 2 DCs under the DC Polling system health dashboard?
    Do we see a drop in poll items and calc metrics per second at the same time?
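
    If it is easier than the dashboard, GC behavior on a DC can also be sampled from the command line (a sketch; it assumes the JDK's jstat utility is available on the DC host):

    # Sample heap occupancy and cumulative GC time for the karaf JVM every 10 seconds
    jstat -gcutil "$(pgrep -f karaf.main.Main)" 10000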




  • 7.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 15, 2021 05:10 PM
    Hello Jeffrey.

    Here are the answers to your questions.

    Do you have a graph of the application pause for those 2 DCs in the system health dashboard for the DC?
    I am sharing the dashboards related to item count and calculated metrics.
    DC02

    DC03

    On the other hand, Jeffrey, do you know why DC03 only consumes about 7 or 8 GB of RAM, when at the configuration level it can take these values?
    -Xms = 2049M
    -Xmx = 32769M

    Greetings,

    Isaac Velasco.



  • 8.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 15, 2021 05:16 PM
    So the poll item count remained the same, but the Calc Metrics drop comes from not seeing any poll responses to process metric expressions against.
    With "Polling stopped due to prior timeouts" being high, that aligns with Calc Metrics going down.  If there are timeouts, we're not going to do as many calcs, as there isn't any data to process.

    As for XMX, that's the amount of memory Java COULD use, but it will only allocate an initial amount based on the Xms 2G setting. Then, as it needs more memory, it will ask the OS for more.   The heap usage in our graphs reflects how much memory Java is currently using for our application.  It doesn't use all 32G if it doesn't need to.
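
    A quick way to see that difference on a live DC is something like the following (a sketch; again it assumes the JDK's jstat tool is present and the karaf PID can be found as in the earlier examples):

    # Show current committed vs. maximum capacity for each heap generation (values in KB)
    jstat -gccapacity "$(pgrep -f karaf.main.Main)"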


  • 9.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 15, 2021 05:34 PM
    Thanks, Jeffrey.

    From what I understand, it would help us a lot to remove some of the load from the DCs involved and distribute it correctly.

    Do you recommend increasing the -Xms parameter to about 8 GB so that the application takes more resources, or what would be your recommendation?


    Greetings,
    Isaac Velasco.



  • 10.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 18, 2021 01:37 PM
    Hello Jeffrey.

    Today the incident occurred again. Looking in the karaf.out log, the indication of the failure begins with the following line, which repeats until the degradation clears.

    ERROR
    2021-10-18 01:16:27,661 | ERROR | monSocket-Reader | IcmpDaemonSocket | impl.icmpdaemon.IcmpDaemonSocket 865 | 198 - com.ca.im.data-collection-manager.icmp.icmp_daemon - 3.6.0.RELEASE-283 | | Received IcmpDaemon Response, but no request was found for it: [uid=44064771]. There may have been an IcmpDaemon timeout.

    Could you help me if you know anything about this error?

    Best Regards.

    Isaac Velasco.


  • 11.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 18, 2021 02:46 PM
    I'll let Jeff add additional info, but on the surface that error is known and indicative of network or system issues.

    https://knowledge.broadcom.com/external/article?articleId=118694


  • 12.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 19, 2021 03:23 PM
    First, no, I don't suggest making any modifications to DC XMX/XMS settings.  Java is able to handle heap growth as needed up to XMX.
    The XMS is really just a starting point that tells Java to grab 2 GB of RAM at startup, instead of starting with something like 512 MB and then having to request more from the OS.

    Next, the ICMP error seems to align with the calculated metrics per second and the devices that stopped polling due to prior timeouts.   If we can't ping or poll, or responses are delayed or coming in slower than allowed, we won't do any metric calcs.  That could also explain the memory drop, since there is no data to process in the SNMP and MVEL engines.
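
    One way to confirm that alignment is to count the IcmpDaemon errors per minute in karaf.out and compare the burst window against the dips in the graphs (a sketch; the log path is assumed from the install directory shown earlier and may differ in your environment):

    # Count the "no request was found" IcmpDaemon errors per minute
    grep 'Received IcmpDaemon Response, but no request was found' \
        /CA/IMDataCollector/apache-karaf-2.4.3/data/log/karaf.out \
        | cut -c1-16 | sort | uniq -c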