Okay, so not due to DC restart.
I don't think the issue is with the DA. The DC handles all the polling. The
"Polling stopped due to prior timeouts" metric is calculated by the DC and sent as a poll response to the internal Device Polling Statistics MF.
You may see a delay in processing poll responses in the DA due to DA heap/GC, but that would not cause reduced polling or such a big drop in DC memory usage.
Do you have a graph of the application pause for those 2 DCs in the DC system health dashboard?
What about poll item count and calculated metrics per sec for those 2 DCs under the DC Polling system health dashboard?
Do we see a drop in poll items and calc metrics per sec at the same time?
Original Message:
Sent: 10-14-2021 05:16 PM
From: Isaac Antonio Velasco Sandoval
Subject: Help, interpretation and suggestions regarding the exposed behavior.
Hello Jeffrey
Here is the output of the command ps -fea java.
The service was restarted on October 11:
root 21014 1 99 Oct11 ? 4-07:47:36 /CA/IMDataCollector/jre/bin/java -Xms2048M -Xmx32769M -server -Xms2048M -Xmx32769M -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass -Dcom.sun.management.jmxremote -XX:NewSize=1535m -XX:NewRatio=3 -XX:SurvivorRatio=6 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:TargetSurvivorRatio=50 -XX:InitialTenuringThreshold=15 -XX:MaxTenuringThreshold=15 -XX:+ScavengeBeforeFullGC -XX:+ExplicitGCInvokesConcurrent -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSClassUnloadingEnabled -Djava.endorsed.dirs=/CA/IMDataCollector/jre/jre/lib/endorsed:/CA/IMDataCollector/jre/lib/endorsed:/CA/IMDataCollector/apache-karaf-2.4.3/lib/endorsed -Djava.ext.dirs=/CA/IMDataCollector/jre/jre/lib/ext:/CA/IMDataCollector/jre/lib/ext:/CA/IMDataCollector/apache-karaf-2.4.3/lib/ext -Dkaraf.instances=/CA/IMDataCollector/apache-karaf-2.4.3/instances -Dkaraf.home=/CA/IMDataCollector/apache-karaf-2.4.3 -Dkaraf.base=/CA/IMDataCollector/apache-karaf-2.4.3 -Dkaraf.data=/CA/IMDataCollector/apache-karaf-2.4.3/data -Dkaraf.etc=/CA/IMDataCollector/apache-karaf-2.4.3/etc -Dda.data.home=/CA/IMDataCollector/apache-karaf-2.4.3/da_data -Dda.version=1.0.0.0 -Djava.io.tmpdir=/CA/IMDataCollector/apache-karaf-2.4.3/data/tmp -Djava.util.logging.config.file=/CA/IMDataCollector/apache-karaf-2.4.3/etc/java.util.logging.properties -XX:+HeapDumpOnOutOfMemoryError -Dorg.apache.activemq.SERIALIZABLE_PACKAGES=* -XX:OnOutOfMemoryError=/CA/IMDataCollector/apache-karaf-2.4.3/bin/restart -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf-jaas-boot.jar:/CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf-wrapper.jar:/CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf.jar org.apache.karaf.main.Main
Could the DA be responsible for these drops in the performance of the DCs?
Best regards,
Isaac Velasco.
Original Message:
Sent: 10-14-2021 02:56 PM
From: JEFFREY PINARD
Subject: Help, interpretation and suggestions regarding the exposed behavior.
Is there a corresponding graph for application pause for the DCs? Is there a spike in app pause too?
What does "ps -ef | grep java" show for the karaf process on those 2 DCs that dip? It should say when the process was started.
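As a minimal sketch of how to read that output, the snippet below pulls the start time out of one `ps -ef` line. The sample line is the one posted in the reply above, shortened to its first few arguments. It assumes the standard `ps -ef` column order (UID PID PPID C STIME TTY TIME CMD). The variable names are only illustrative.

```shell
# One line of `ps -ef` output, shortened from the reply in this thread.
line='root 21014 1 99 Oct11 ? 4-07:47:36 /CA/IMDataCollector/jre/bin/java -Xms2048M -Xmx32769M -server'

# Column 5 (STIME) is when the process started.
start=$(printf '%s\n' "$line" | awk '{print $5}')

# Column 7 (TIME) is cumulative CPU time, not elapsed run time.
cpu=$(printf '%s\n' "$line" | awk '{print $7}')

echo "started: $start, cumulative CPU time: $cpu"
```

Note that the TIME column counts CPU time across all threads, so it can exceed the wall-clock time since the start date and should not be read as uptime.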
So the last graph about "Polling stopped due to prior timeouts" means that the DC sent 15 requests for 1 Metric Family to a device without getting any response from the device. Judging from the spike, the same thing happened on 170k devices.
Maybe there was a network issue, or maybe there was a large app pause due to java garbage collection that caused responses to not be processed.
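One way to check for a large GC pause outside the dashboard is to scan a GC log for long stop times. This is only a sketch under stated assumptions: it assumes a Java 8 style JVM with GC logging enabled through `-Xloggc` and `-XX:+PrintGCApplicationStoppedTime`, which is not among the options shown in this thread. The two log lines, the temporary file, and the 5 second limit are made up for illustration.

```shell
# Made-up sample lines; a real log would be the file passed to -Xloggc.
gc_log=$(mktemp)
cat > "$gc_log" <<'EOF'
2021-01-01T00:00:01.000+0000: 1.234: Total time for which application threads were stopped: 0.0123456 seconds, Stopping threads took: 0.0001234 seconds
2021-01-01T00:02:37.000+0000: 157.567: Total time for which application threads were stopped: 42.5000000 seconds, Stopping threads took: 0.0002000 seconds
EOF

# Print the timestamp and duration of every pause longer than `limit` seconds.
long_pauses=$(awk -v limit=5 '
    /Total time for which application threads were stopped/ {
        for (i = 1; i <= NF; i++)
            if ($i == "stopped:" && $(i + 1) + 0 > limit)
                print $1, $(i + 1)
    }' "$gc_log")

echo "$long_pauses"
rm -f "$gc_log"
```

A pause of tens of seconds at the time of the spike would support the GC explanation. No long pauses would point back toward the network.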
Original Message:
Sent: 10-14-2021 02:09 PM
From: Isaac Antonio Velasco Sandoval
Subject: Help, interpretation and suggestions regarding the exposed behavior.
Hello Jeffrey
The services were not restarted. That is why I am looking for an answer that can help me get to the root of the problem.
Greetings. Isaac
Original Message:
Sent: 10-14-2021 11:58 AM
From: JEFFREY PINARD
Subject: Help, interpretation and suggestions regarding the exposed behavior.
Did the dcmd process on the brown/green DCs restart between 00:10 and 00:20?
That's the only thing that makes sense for the heap to go down that much.
Original Message:
Sent: 10-13-2021 03:54 PM
From: Isaac Antonio Velasco Sandoval
Subject: Help, interpretation and suggestions regarding the exposed behavior.
Hello Community.
I have the following situation in Performance Management.
On certain occasions I observe that the DCs' functionality is degraded, which causes loss of statistics or long waiting times for them.
As a result, I have polling loss.
The current state of the DCs is as follows.
I would appreciate any advice on how to avoid these drops in functionality.
Best Regards,
Isaac Velasco