DX NetOps


DEGRADED Status doesn't change

  • 1.  DEGRADED Status doesn't change

    Posted May 18, 2021 12:21 PM
    Hello CA team,

    After rebooting the entire solution, the data aggregator doesn't change its status from DEGRADED, and you can see this information in the logs:

    root@VMLPFMPRD12 /opt/CA/IMDataAggregator/apache-karaf-2.4.3/data/log # tail karaf.log
    WARN | Host:VMLPFMPRD16 | 2021-05-18 13:10:38,267 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:04:57.671 was longer then a threshold of 20000 ms.
    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:10:57,722 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to OK
    WARN | Host:VMLPFMPRD14 | 2021-05-18 13:12:33,421 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 successful, but the response time of 0:01:25.694 was longer then a threshold of 20000 ms.
    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:12:33,422 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from OK to DEGRADED
    WARN | Host:VMLPFMPRD14 | 2021-05-18 13:13:15,547 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 successful, but the response time of 0:00:32.121 was longer then a threshold of 20000 ms.
    WARN | Host:VMLPFMPRD16 | 2021-05-18 13:13:15,548 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:02:27.280 was longer then a threshold of 20000 ms.

    Every time I look at these karaf logs the status changes from OK to DEGRADED and back, and it doesn't stop. What do I have to do to fix this error? Does the entire solution take a long time to sync?



    thank you,

    Valéria


  • 2.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 18, 2021 12:37 PM

    DB Node DEGRADED state means it's taking more than 20 secs for the DB heartbeats to return/complete.  After 5 mins of no heartbeat responses, we mark a node down until we get a valid heartbeat.
    One seemed to take 4m57s to complete.

    You should check the DA heap and app pause self monitoring views to see if the DA is above 70% heap and GC is high.  That can cause heartbeats to be slow to be processed, but it should affect all nodes' heartbeat checks, not just 1 or 2, I believe.

    The degraded threshold state means thresholding is taking over 80% of its allotted time for 15 mins (meaning 3 runs of thresholding each took over 80% of their 5 min allocated window).  It could be that the DR is too busy to process them in a timely manner, which may also be why heartbeats are so slow, if there is no DA heap/GC issue.
    No nodes are down? What does CPU and memory usage look like on the DR nodes?
    You can always run as the db admin user in vsql:

       select * from cpu_usage order by start_time desc,node_name limit 50;
       select * from memory_usage order by start_time desc,node_name limit 50;

    The above is Vertica recording what it sees as memory and CPU usage on the box.
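
    If you're not already set up for vsql, a minimal sketch (assuming the default dbadmin user and the drdata database; adjust paths and credentials for your environment):

       # become the Vertica database admin OS user on a DR node
       su - dbadmin
       # open an interactive vsql session against the drdata database
       /opt/vertica/bin/vsql -U dbadmin -d drdata

    Then paste the two queries above at the drdata=> prompt.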

    Also, check the Event Processing self monitoring dashboard to see what % of poll cycle and calc times is for thresholding.




  • 3.  RE: DEGRADED Status doesn't change

    Posted May 18, 2021 12:54 PM
    Hello Jeffrey,

    All the nodes are up, and I can't see the poll cycle as you described:


    How can I run these commands on Vertica as you suggested?

       select * from cpu_usage order by start_time desc,node_name limit 50;
       select * from memory_usage order by start_time desc,node_name limit 50;

    I restarted the entire solution 2 hours ago; I think it takes a long time to sync. Below you can see more logs from karaf on the data aggregator:

    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:48:58,983 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from OK to DEGRADED
    WARN | Host:VMLPFMPRD15 | 2021-05-18 13:48:58,986 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD15 successful, but the response time of 0:01:27.198 was longer then a threshold of 20000 ms.
    WARN | Host:VMLPFMPRD16 | 2021-05-18 13:49:00,119 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:01:29.768 was longer then a threshold of 20000 ms.
    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:49:09,192 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to OK
    INFO | Host:VMLPFMPRD15 | 2021-05-18 13:49:14,537 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD15 changing from DEGRADED to OK

    thank you,

    Valéria


  • 4.  RE: DEGRADED Status doesn't change

    Posted May 18, 2021 01:21 PM
    drdata=> select * from cpu_usage order by start_time desc,node_name limit 50;
    node_name | start_time | end_time | average_cpu_usage_percent
    -------------------+---------------------+---------------------+---------------------------
    v_drdata_node0001 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 8.76
    v_drdata_node0002 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 25.98
    v_drdata_node0003 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 26.3
    v_drdata_node0001 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 3.68
    v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:16:00 | 24.17
    v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 27.6
    v_drdata_node0003 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 24.6
    v_drdata_node0001 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.87
    v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 14.72
    v_drdata_node0001 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 13.28
    v_drdata_node0002 | 2021-05-18 14:14:00 | 2021-05-18 14:16:00 | 20.63
    v_drdata_node0001 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 19.48
    v_drdata_node0002 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 29.79
    v_drdata_node0003 | 2021-05-18 14:13:00 | 2021-05-18 14:15:00 | 21.56
    v_drdata_node0001 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 7.62
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 16.85
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:12:00 | 10.84
    v_drdata_node0003 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 27.98
    v_drdata_node0001 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 5.05
    v_drdata_node0003 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 24.28
    v_drdata_node0001 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 13.05
    v_drdata_node0002 | 2021-05-18 14:10:00 | 2021-05-18 14:12:00 | 23.22
    v_drdata_node0003 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 23.81
    v_drdata_node0001 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 12.13
    v_drdata_node0002 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 22.42
    v_drdata_node0003 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 25.36
    v_drdata_node0001 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 9.5
    v_drdata_node0002 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 22.41
    v_drdata_node0003 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 18.82
    v_drdata_node0001 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 20.46
    v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:07:00 | 26.6
    v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 27.37
    v_drdata_node0003 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 26.09
    v_drdata_node0001 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 10.79
    v_drdata_node0003 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 22.57
    v_drdata_node0001 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 5.76
    v_drdata_node0002 | 2021-05-18 14:05:00 | 2021-05-18 14:07:00 | 24.09
    v_drdata_node0003 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 19.53
    v_drdata_node0001 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 15.17
    v_drdata_node0002 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 25.85
    v_drdata_node0003 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 17.59
    v_drdata_node0001 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 16.34
    v_drdata_node0002 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 24.53
    v_drdata_node0003 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 17.75
    v_drdata_node0001 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 18.32
    v_drdata_node0002 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 22.13
    v_drdata_node0003 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 22.55
    v_drdata_node0001 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 21.63
    v_drdata_node0002 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 29.99
    v_drdata_node0003 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 21.86
    (50 rows)

    drdata=> select * from memory_usage order by start_time desc,node_name limit 50;
    node_name | start_time | end_time | average_memory_usage_percent
    -------------------+---------------------+---------------------+------------------------------
    v_drdata_node0001 | 2021-05-18 14:19:00 | 2021-05-18 14:20:00 | 6.17
    v_drdata_node0001 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.29
    v_drdata_node0003 | 2021-05-18 14:18:00 | 2021-05-18 14:18:00 | 6.16
    v_drdata_node0003 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.15
    v_drdata_node0001 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 6.17
    v_drdata_node0002 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 6.28
    v_drdata_node0001 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 6.17
    v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 6.28
    v_drdata_node0003 | 2021-05-18 14:16:00 | 2021-05-18 14:18:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.28
    v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.16
    v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:15:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 6.28
    v_drdata_node0001 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 6.18
    v_drdata_node0002 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 6.35
    v_drdata_node0003 | 2021-05-18 14:13:00 | 2021-05-18 14:15:00 | 6.17
    v_drdata_node0001 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:12:00 | 6.27
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.27
    v_drdata_node0003 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.15
    v_drdata_node0001 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.28
    v_drdata_node0003 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 6.16
    v_drdata_node0003 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 6.18
    v_drdata_node0002 | 2021-05-18 14:09:00 | 2021-05-18 14:11:00 | 6.33
    v_drdata_node0003 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 6.23
    v_drdata_node0001 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.23
    v_drdata_node0003 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.13
    v_drdata_node0001 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.23
    v_drdata_node0003 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.12
    v_drdata_node0001 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.21
    v_drdata_node0002 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.27
    v_drdata_node0003 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.14
    v_drdata_node0001 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.23
    v_drdata_node0003 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.13
    v_drdata_node0001 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.24
    v_drdata_node0003 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.12
    v_drdata_node0001 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.15
    v_drdata_node0002 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.25
    v_drdata_node0003 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.13
    v_drdata_node0001 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 6.15
    (50 rows)


  • 5.  RE: DEGRADED Status doesn't change

    Posted May 18, 2021 01:59 PM
    ERROR | t Monitor Thread | 2021-05-18 14:44:18,159 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-18 14:44:18,160 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD16 changing from DEGRADED to DOWN
    ERROR | t Monitor Thread | 2021-05-18 14:44:20,915 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD15 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-18 14:44:20,916 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD15 changing from DEGRADED to DOWN
    ERROR | t Monitor Thread | 2021-05-18 14:44:21,371 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-18 14:44:21,371 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to DOWN
    ERROR | nager-thread-502 | 2021-05-18 14:44:21,372 | shutdown | ces.shutdown.ShutdownManagerImpl 131 | ommon.core.services.impl | | Shutting down the data aggregator.It was detected that no data repository nodes were contactable. The uncontactable hosts are:[VMLPFMPRD14, VMLPFMPRD16, VMLPFMPRD15]
    ERROR | nager-thread-495 | 2021-05-18 14:44:21,372 | shutdown | tTolerantDBConnectionManagerImpl 221 | ommon.core.services.impl | | No DB host name available.
    INFO | nager-thread-495 | 2021-05-18 14:44:21,373 | shutdown | tTolerantDBConnectionManagerImpl 376 | ommon.core.services.impl | | The primary host for database transactions is now set to null
    ERROR | nager-thread-495 | 2021-05-18 14:44:21,373 | shutdown | tTolerantDBConnectionManagerImpl 179 | ommon.core.services.impl | | The primary data repository host 'VMLPFMPRD14' is no longer available, and there are no available secondary hosts. Current Host Status: {VMLPFMPRD14=DOWN, VMLPFMPRD16=DOWN, VMLPFMPRD15=DOWN}


  • 6.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 19, 2021 03:09 PM
    Open a support case for the DA going down due to heartbeat.  They are better equipped to debug the issue live if necessary.

    The Vertica CPU/memory usage doesn't seem bad.  So I gotta think something is up with DA heap/GC.

    Support/I need to look at DA health views where it shows application pause and heap usage percentage.
    Especially around 2021-05-18 14:44:18 when all nodes appeared to disappear.
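
    If you want a quick spot-check in the meantime, a rough sketch using the JDK's jstat against the DA's java process (the grep is just one way to find the PID; <DA_java_pid> is a placeholder):

       # find the DA's karaf java process id
       ps -ef | grep java | grep -i karaf
       # sample heap occupancy and GC activity every 5 seconds
       # (O = old-gen % used, FGC/FGCT = full-GC count and total time)
       jstat -gcutil <DA_java_pid> 5000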




  • 7.  RE: DEGRADED Status doesn't change

    Posted May 19, 2021 03:28 PM
    Hello Jeffrey,

    I really appreciate your help and I'll open the case for this problem.
    My last question would be about the lost connection between the DR and the DA: what is the primary reason it happens?

    thank you,

    Valéria


  • 8.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 19, 2021 04:04 PM
    It doesn't seem to be a resource issue on vertica at first glance.

    So that leaves DA GC (Java garbage collection of memory), which can pause the DA for a long time depending on the current heap % usage.
    If the app is paused for 5 mins, then when the app is marked active again by Java, the DA will mark all nodes down and shut down.

    We had another customer where the connections we use to contact the DR for heartbeats were being closed after 1 hr by a firewall, but we aren't notified.  We go and try to use them and never get back a failure; we hang for 5 mins and then shut down the DA.
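
    If you suspect something similar, one thing worth checking on the DA host is the Linux TCP keepalive settings; a sketch below (this only helps if the connections enable SO_KEEPALIVE, and the right values depend on the firewall's idle timeout):

       # current keepalive settings (Linux defaults to 7200s before the first probe)
       sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
       # example: start probing idle connections after 10 mins instead of 2 hrs
       sysctl -w net.ipv4.tcp_keepalive_time=600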


  • 9.  RE: DEGRADED Status doesn't change

    Posted May 26, 2021 11:22 AM
    Hello Jeffrey,

    Is there any way to monitor the disk I/O on the data aggregator and repositories? Something like a script or report?

    thank you,

    Valéria


  • 10.  RE: DEGRADED Status doesn't change

    Posted May 26, 2021 11:31 AM
    I would install SysEDGE on the hosts and monitor parameters there.

    ------------------------------
    Senior Consultant
    SolvIT Networks
    ------------------------------



  • 11.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 26, 2021 11:32 AM
    https://www.networkworld.com/article/3330497/linux-commands-for-measuring-disk-activity.html

    That talks about a few OS tools you can install and use to monitor I/O at the OS level.
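
    As a quick start, a minimal sketch with the usual sysstat tools (availability and package names vary by distro):

       # extended per-device statistics every 5 seconds (watch %util and await)
       iostat -x 5
       # historical per-device activity, if the sysstat collector is enabled
       sar -d -p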


  • 12.  RE: DEGRADED Status doesn't change

    Posted May 28, 2021 02:25 PM
    Hello Jeffrey,

    If I increase the memory on the DA to 64 instead of the 32 I have today, could that end the problem I have with the repositories losing their connection to the DA, as you can see in the logs:

    The primary data repository host 'VMLPFMPRD15' is no longer available, and there are no available secondary hosts. Current Host Status: {VMLPFMPRD14=DOWN, VMLPFMPRD16=DOWN, VMLPFMPRD15=DOWN}
    WARN | anager-thread-17 | 2021-05-28 15:09:10,678 | DataRepositoryNodeManager | y.impl.DataRepositoryNodeManager 195 | ger.core.aggregator.impl | | Unable to find existing item for host null

    Recently I increased the memory in all repositories to 147456. I think it could be a problem, because the other servers, like the DA and DC, have 32 each.

    Thank you

    Valéria


  • 13.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 28, 2021 02:43 PM
    We appear to create an item per Node.  This item is used for logging status changes with the DR nodes.
    It's basically failing to find the item for NULL (hostname).  Maybe because all hosts are down, hostname is null for the changeEvent.

    Increasing memory on the DA should not cause it to fail to connect to the DR nodes.  There must be another reason they aren't connecting.
    This sounds like an S1 down situation, so you should contact support, who can help you live if needed.
    Is the DA no longer coming up?   Does admintools show all nodes UP?
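
    For reference, a quick way to check node state from a DR node (assuming the default install path):

       # show the state of all database nodes as admintools sees them
       /opt/vertica/bin/admintools -t view_cluster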



  • 14.  RE: DEGRADED Status doesn't change

    Posted May 31, 2021 10:36 AM
    Hello Jeffrey,

    I think it could be something like network latency, because in the last few days, when I moved all the CAPC servers to another node inside Hyper-V, CAPC started working again. Today I sent you some logs from the DA, and now I'm restoring a backup to the Vertica repository:

    ERROR | t Monitor Thread | 2021-05-31 10:15:39,572 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-31 10:15:39,573 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to DOWN
    ERROR | ager-thread-6426 | 2021-05-31 10:15:39,574 | shutdown | ces.shutdown.ShutdownManagerImpl 131 | ommon.core.services.impl | | Shutting down the data aggregator.It was detected that no data repository nodes were contactable. The uncontactable hosts are:[VMLPFMPRD14, VMLPFMPRD16, VMLPFMPRD15]
    ERROR | ager-thread-6429 | 2021-05-31 10:15:39,575 | shutdown | tTolerantDBConnectionManagerImpl 221 | ommon.core.services.impl | | No DB host name available.
    INFO | ager-thread-6429 | 2021-05-31 10:15:39,576 | shutdown | tTolerantDBConnectionManagerImpl 376 | ommon.core.services.impl | | The primary host for database transactions is now set to null
    ERROR | ager-thread-6429 | 2021-05-31 10:15:39,576 | shutdown | tTolerantDBConnectionManagerImpl 179 | ommon.core.services.impl | | The primary data repository host 'VMLPFMPRD14' is no longer available, and there are no available secondary hosts. Current Host Status: {VMLPFMPRD14=DOWN, VMLPFMPRD16=DOWN, VMLPFMPRD15=DOWN}
    WARN | anager-thread-43 | 2021-05-31 10:15:39,577 | DataRepositoryNodeManager | y.impl.DataRepositoryNodeManager 195 | ger.core.aggregator.impl | | Unable to find existing item for host null

    thank you,

    Valéria


  • 15.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 02, 2021 09:42 AM
    Yeh, network latency can play a big role in heartbeat and even query speed.

    You can always run the /opt/vertica/bin/vnetperf tool on one of the nodes, specifying all 3 nodes; maybe add the DA to the list as well, and it'll measure DR-to-DA network speed too.
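
    A rough sketch of the invocation (I believe the option is --hosts, but check vnetperf --help on your version; <DA_host> is a placeholder for your DA hostname):

       # run as dbadmin; measures latency/throughput between the listed hosts
       /opt/vertica/bin/vnetperf --hosts VMLPFMPRD14,VMLPFMPRD15,VMLPFMPRD16,<DA_host>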


  • 16.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 09:23 AM
    Hello Jeffrey,

    As of at least 2 days ago, NetFlow doesn't sync with CAPC anymore, and I didn't find any clue in /opt/CA/PerformanceCenter/DM/logs:


    How can I fix this error?

    thank you,

    Valéria


  • 17.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 03, 2021 11:01 AM
    Good to see DA is up and running.

    As for NFA, can you check for any ERROR in the DMService.log when NFA syncs?  What is the complete error?

    If it's a read timeout, it's most likely during the last PUSH stage, where NFA does some post-processing that is taking more than 20 mins to run.
    You can follow the resolution in this KB to extend the timeout to 1 hr to give NFA more time.  The KB applies to all data sources, not just Spectrum.
    https://knowledge.broadcom.com/external/article/137391/spectrum-data-source-is-frequently-faili.html


  • 18.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 12:02 PM
    Hello Jeffrey,

    Thank you for your reply.
    I can't see any errors in DM/logs:


    I've looked for what you told me about the InventoryTimeoutException, but I didn't find anything in these logs.

    thank you,

    Valéria


  • 19.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 12:19 PM
    Hello Jeffrey,

    I found the error:


    How can I fix this error?

    thank you,

    Valéria


  • 20.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 03, 2021 12:24 PM
    Yeh, the article is not 100% correct as to the message to look for.  I've asked support to fix the KB.

    But the real issue is in the latest image: for some reason the SourceGUID product ID changed on the NFA side, and now PC rejects the sync request because it doesn't think it's talking to the right NFA box.

    This has been seen before, but I'm not sure why NFA ends up creating a new SourceGUID.

    I'm gonna send an email to an NFA support engineer who can comment further with a workaround.  I think he has a KB for this.


  • 21.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 03, 2021 12:38 PM
    Valeria,

    To resolve this error please follow these instructions:

    https://knowledge.broadcom.com/external/article/5313/data-source-fails-to-sync-and-receives-t.html

    If you have any issues, let us know.

    Thanks,

    Justin Signa


  • 22.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 01:58 PM
    Hello Justin and Jeffrey,

    NetFlow has started to sync without problems.
    Thank you for your support.

    Valéria