DX NetOps


Help, interpretation and suggestions regarding the exposed behavior.

  • 1.  Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 13, 2021 03:55 PM
    Hello Community.

    I have the following situation in Performance Management.

    On certain occasions I observe that the functionality of the DCs becomes degraded, which causes loss of statistics or long waiting times for them.


    As a result, I am seeing polling loss.

    The current state of the DCs is as follows.

    If someone could help me with some advice to avoid these drops in functionality, I would appreciate it.


    Best Regards,

    Isaac Velasco


  • 2.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 14, 2021 11:59 AM
    Did the brown/green DC's dcmd process restart between 00:10 and 00:20 ?
    That's the only thing that makes sense for the heap to go down that much.


  • 3.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 14, 2021 02:09 PM
    Hello Jeffrey

    No restart was made to the services. That is why I am looking for an answer that can help me get to the root of the problem.
    Greetings. Isaac



  • 4.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 14, 2021 02:57 PM

    Is there a corresponding graph for application pause for the DCs?  Is there a spike in app pause too?

    What does "ps -ef | grep java" show for the karaf process on those 2 DCs that dip?   It should say when the process was started.

    So the last graph, "Polling stopped due to prior timeouts", means that the DC sent 15 requests for 1 Metric Family to a device and got no response from the device.  From the spike, roughly 170k devices had the same thing happen.

    Maybe there was a network issue, or maybe there was a large app pause due to Java garbage collection that caused responses not to be processed.
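
    For reference, one way to confirm the karaf process start time from the command line is below (a rough sketch; it assumes a Linux DC and that the karaf JVM is the only process matching "karaf.main.Main"):

    # Show the karaf java process; the STIME column is when it started
    ps -ef | grep '[k]araf.main.Main'

    # Print the full start timestamp and elapsed uptime for that PID
    ps -o pid,lstart,etime -p "$(pgrep -f karaf.main.Main)"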




  • 5.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 14, 2021 05:17 PM
    Hello Jeffrey

    I am sharing the output of the ps command (ps -fea) for the java process.
    The service was last restarted on October 11.

    root 21014 1 99 Oct11 ? 4-07:47:36 /CA/IMDataCollector/jre/bin/java -Xms2048M -Xmx32769M -server -Xms2048M -Xmx32769M -XX:+UnlockDiagnosticVMOptions -XX:+UnsyncloadClass -Dcom.sun.management.jmxremote -XX:NewSize=1535m -XX:NewRatio=3 -XX:SurvivorRatio=6 -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:TargetSurvivorRatio=50 -XX:InitialTenuringThreshold=15 -XX:MaxTenuringThreshold=15 -XX:+ScavengeBeforeFullGC -XX:+ExplicitGCInvokesConcurrent -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+CMSClassUnloadingEnabled -Djava.endorsed.dirs=/CA/IMDataCollector/jre/jre/lib/endorsed:/CA/IMDataCollector/jre/lib/endorsed:/CA/IMDataCollector/apache-karaf-2.4.3/lib/endorsed -Djava.ext.dirs=/CA/IMDataCollector/jre/jre/lib/ext:/CA/IMDataCollector/jre/lib/ext:/CA/IMDataCollector/apache-karaf-2.4.3/lib/ext -Dkaraf.instances=/CA/IMDataCollector/apache-karaf-2.4.3/instances -Dkaraf.home=/CA/IMDataCollector/apache-karaf-2.4.3 -Dkaraf.base=/CA/IMDataCollector/apache-karaf-2.4.3 -Dkaraf.data=/CA/IMDataCollector/apache-karaf-2.4.3/data -Dkaraf.etc=/CA/IMDataCollector/apache-karaf-2.4.3/etc -Dda.data.home=/CA/IMDataCollector/apache-karaf-2.4.3/da_data -Dda.version=1.0.0.0 -Djava.io.tmpdir=/CA/IMDataCollector/apache-karaf-2.4.3/data/tmp -Djava.util.logging.config.file=/CA/IMDataCollector/apache-karaf-2.4.3/etc/java.util.logging.properties -XX:+HeapDumpOnOutOfMemoryError -Dorg.apache.activemq.SERIALIZABLE_PACKAGES=* -XX:OnOutOfMemoryError=/CA/IMDataCollector/apache-karaf-2.4.3/bin/restart -Dkaraf.startLocalConsole=false -Dkaraf.startRemoteShell=true -classpath /CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf-jaas-boot.jar:/CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf-wrapper.jar:/CA/IMDataCollector/apache-karaf-2.4.3/lib/karaf.jar org.apache.karaf.main.Main
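
    As an aside, the heap-related flags can be picked out of that long command line with something like this (a sketch, matching the karaf process the same way as in the earlier example):

    # List only the -Xms/-Xmx settings from the running karaf command line
    ps -ef | grep '[k]araf.main.Main' | tr ' ' '\n' | grep -E '^-Xm[sx]'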

    Could it be that the DA is responsible for these drops in the performance of the DCs?

    Best regards,

    Isaac Velasco.



  • 6.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 15, 2021 04:44 PM
    Okay, so not due to DC restart.

    I don't think the issue is with the DA.  The DC handles all the polling.  The "Polling stopped due to prior timeouts" metric is calculated by the DC and sent as a poll response to the internal Device Polling Statistics MF.

    You may see a delay in processing poll responses in the DA due to DA heap/GC, but that would not cause reduced polling or such a big drop in DC memory usage.

    Do you have a graph of the application pause for those 2 DCs in the system health dashboard for the DC?
    What about poll item count and calculated metrics per second for those 2 DCs under the DC Polling system health dashboard?
    Do we see a drop in poll items and calc metrics per second at the same time?
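
    If it is easier than the dashboard, GC behavior on a DC can also be sampled from the command line (a sketch; it assumes the JDK's jstat utility is available on the DC host):

    # Sample heap occupancy and cumulative GC time for the karaf JVM every 10 seconds
    jstat -gcutil "$(pgrep -f karaf.main.Main)" 10000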




  • 7.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 15, 2021 05:10 PM
    Hello Jeffrey.

    Here are the answers to your questions.

    Do you have a graph of the application pause for those 2 DCs in the system health dashboard for the DC?
    I am sharing the dashboards related to item count and calculated metrics.
    DC02

    DC03

    On the other hand, Jeffrey, do you know why DC03 only consumes about 7 or 8 GB of RAM, when at the configuration level it can take these values?
    -Xms = 2049M
    -Xmx = 32769M

    Greetings,

    Isaac Velasco.



  • 8.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 15, 2021 05:16 PM
    So the poll item count remained the same, but the Calc Metrics drop comes from not seeing any poll responses to process metric expressions against.
    With "Polling stopped due to prior timeouts" being high, that aligns with Calc Metrics going down.  If there are timeouts, we're not going to do as many calcs, as there isn't any data to process.

    As for XMX, that's the amount of memory Java COULD use, but it will only allocate an initial amount based on the Xms 2G setting. Then, as it needs more memory, it will ask the OS for more.   The heap usage in our graphs reflects how much memory Java is currently using for our application.  It doesn't use all 32G if it doesn't need to.
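
    A quick way to see that difference on a live DC is something like the following (a sketch; again it assumes the JDK's jstat tool is present and the karaf PID can be found as in the earlier examples):

    # Show current committed vs. maximum capacity for each heap generation (values in KB)
    jstat -gccapacity "$(pgrep -f karaf.main.Main)"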


  • 9.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 15, 2021 05:34 PM
    Thanks, Jeffrey.

    From what I understand, it would help us a lot to remove some of the load from the DCs involved and distribute it correctly.

    Do you recommend increasing the -Xms parameter to about 8 GB so that the application takes more resources, or what would be your recommendation?


    Greetings,
    Isaac Velasco.



  • 10.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Posted Oct 18, 2021 01:37 PM
    Hello Jeffrey.

    Today the incident occurred again. Looking in the karaf.out log, the indication of the failure begins with the following line, which repeats until the degradation clears.

    ERROR
    2021-10-18 01:16:27,661 | ERROR | monSocket-Reader | IcmpDaemonSocket | impl.icmpdaemon.IcmpDaemonSocket 865 | 198 - com.ca.im.data-collection-manager.icmp.icmp_daemon - 3.6.0.RELEASE-283 | | Received IcmpDaemon Response, but no request was found for it: [uid=44064771]. There may have been an IcmpDaemon timeout.

    Could you help me if you know anything about this error?

    Best Regards.

    Isaac Velasco.


  • 11.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 18, 2021 02:46 PM
    I'll let Jeff add additional info, but on the surface that error is known and indicative of network or system issues.

    https://knowledge.broadcom.com/external/article?articleId=118694


  • 12.  RE: Help, interpretation and suggestions regarding the exposed behavior.

    Broadcom Employee
    Posted Oct 19, 2021 03:23 PM
    First, no, I don't suggest making any modifications to DC XMX/XMS settings.  Java is able to handle heap growth as needed up to XMX.
    The XMS is really just a starting point that tells Java to grab 2 GB of RAM at startup, instead of starting with something like 512 MB and then having to request more from the OS.

    Next, the ICMP error seems to align with the calculated metrics per second and the devices that stopped polling due to prior timeouts.   If we can't ping or poll, or responses are delayed or coming in slower than allowed, we won't do any metric calcs.  That could also explain the memory drop, since there is no data to process in the SNMP and MVEL engines.
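
    One way to confirm that alignment is to count the IcmpDaemon errors per minute in karaf.out and compare the burst window against the dips in the graphs (a sketch; the log path is assumed from the install directory shown earlier and may differ in your environment):

    # Count the "no request was found" IcmpDaemon errors per minute
    grep 'Received IcmpDaemon Response, but no request was found' \
        /CA/IMDataCollector/apache-karaf-2.4.3/data/log/karaf.out \
        | cut -c1-16 | sort | uniq -c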