DX NetOps


DEGRADED Status doesn't change

  • 1.  DEGRADED Status doesn't change

    Posted May 18, 2021 12:21 PM
    Hello CA team,

    After rebooting the entire solution, the data aggregator doesn't change its status from DEGRADED, and you can see this information in the logs:

    root@VMLPFMPRD12 /opt/CA/IMDataAggregator/apache-karaf-2.4.3/data/log # tail karaf.log
    WARN | Host:VMLPFMPRD16 | 2021-05-18 13:10:38,267 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:04:57.671 was longer then a threshold of 20000 ms.
    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:10:57,722 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to OK
    WARN | Host:VMLPFMPRD14 | 2021-05-18 13:12:33,421 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 successful, but the response time of 0:01:25.694 was longer then a threshold of 20000 ms.
    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:12:33,422 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from OK to DEGRADED
    WARN | Host:VMLPFMPRD14 | 2021-05-18 13:13:15,547 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 successful, but the response time of 0:00:32.121 was longer then a threshold of 20000 ms.
    WARN | Host:VMLPFMPRD16 | 2021-05-18 13:13:15,548 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:02:27.280 was longer then a threshold of 20000 ms.

    Every time I look at these karaf logs the status changes from OK to DEGRADED and back, and it doesn't stop. What do I have to do to fix this error? Does the entire solution take a long time to sync?



    thank you,

    Valéria


  • 2.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 18, 2021 12:37 PM

    DB Node DEGRADED state means it's taking more than 20 secs for the DB heartbeats to return/complete.  After 5 mins of no heartbeat responses, we mark a node down until we get a valid heartbeat.
    One seemed to take 4m57s to complete.

    You should check the DA heap and app pause self monitoring views to see if the DA is above 70% heap and GC is high.  That can cause heartbeats to be slow to be processed, but it should affect all nodes' heartbeat checks, not just 1 or 2, I believe.

    The degraded threshold state means thresholding is taking over 80% of its allotted time for 15 mins (meaning 3 runs of thresholding each took over 80% of their 5 min allocated window).  It could be that the DR is too busy to process them in a timely manner, which may also be why heartbeats are so slow, if there is no DA heap/GC issue.
    No nodes are down? What does CPU and memory usage look like on the DR nodes?
    You can always run as the db admin user in vsql:

       select * from cpu_usage order by start_time desc,node_name limit 50;
       select * from memory_usage order by start_time desc,node_name limit 50;

    The above is Vertica recording what it sees as memory and CPU usage on the box.
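
    If you're not already set up for vsql, a minimal sketch (assuming the default dbadmin user and the drdata database; adjust paths and credentials for your environment):

       # become the Vertica database admin OS user on a DR node
       su - dbadmin
       # open an interactive vsql session against the drdata database
       /opt/vertica/bin/vsql -U dbadmin -d drdata

    Then paste the two queries above at the drdata=> prompt.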

    Also, check the Event Processing self monitoring dashboard to see what % of poll cycle and calc times is for thresholding.




  • 3.  RE: DEGRADED Status doesn't change

    Posted May 18, 2021 12:54 PM
    Hello Jeffrey,

    All the nodes are up, and I can't see the poll cycle as you described:


    How can I run these commands on Vertica as you suggested?

       select * from cpu_usage order by start_time desc,node_name limit 50;
       select * from memory_usage order by start_time desc,node_name limit 50;

    I restarted the entire solution 2 hours ago; I think it takes a long time to sync. Below you can see more logs from karaf on the data aggregator:

    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:48:58,983 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from OK to DEGRADED
    WARN | Host:VMLPFMPRD15 | 2021-05-18 13:48:58,986 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD15 successful, but the response time of 0:01:27.198 was longer then a threshold of 20000 ms.
    WARN | Host:VMLPFMPRD16 | 2021-05-18 13:49:00,119 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:01:29.768 was longer then a threshold of 20000 ms.
    INFO | Host:VMLPFMPRD14 | 2021-05-18 13:49:09,192 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to OK
    INFO | Host:VMLPFMPRD15 | 2021-05-18 13:49:14,537 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD15 changing from DEGRADED to OK

    thank you,

    Valéria


  • 4.  RE: DEGRADED Status doesn't change

    Posted May 18, 2021 01:21 PM
    drdata=> select * from cpu_usage order by start_time desc,node_name limit 50;
    node_name | start_time | end_time | average_cpu_usage_percent
    -------------------+---------------------+---------------------+---------------------------
    v_drdata_node0001 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 8.76
    v_drdata_node0002 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 25.98
    v_drdata_node0003 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 26.3
    v_drdata_node0001 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 3.68
    v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:16:00 | 24.17
    v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 27.6
    v_drdata_node0003 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 24.6
    v_drdata_node0001 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.87
    v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 14.72
    v_drdata_node0001 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 13.28
    v_drdata_node0002 | 2021-05-18 14:14:00 | 2021-05-18 14:16:00 | 20.63
    v_drdata_node0001 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 19.48
    v_drdata_node0002 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 29.79
    v_drdata_node0003 | 2021-05-18 14:13:00 | 2021-05-18 14:15:00 | 21.56
    v_drdata_node0001 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 7.62
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 16.85
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:12:00 | 10.84
    v_drdata_node0003 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 27.98
    v_drdata_node0001 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 5.05
    v_drdata_node0003 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 24.28
    v_drdata_node0001 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 13.05
    v_drdata_node0002 | 2021-05-18 14:10:00 | 2021-05-18 14:12:00 | 23.22
    v_drdata_node0003 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 23.81
    v_drdata_node0001 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 12.13
    v_drdata_node0002 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 22.42
    v_drdata_node0003 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 25.36
    v_drdata_node0001 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 9.5
    v_drdata_node0002 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 22.41
    v_drdata_node0003 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 18.82
    v_drdata_node0001 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 20.46
    v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:07:00 | 26.6
    v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 27.37
    v_drdata_node0003 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 26.09
    v_drdata_node0001 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 10.79
    v_drdata_node0003 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 22.57
    v_drdata_node0001 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 5.76
    v_drdata_node0002 | 2021-05-18 14:05:00 | 2021-05-18 14:07:00 | 24.09
    v_drdata_node0003 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 19.53
    v_drdata_node0001 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 15.17
    v_drdata_node0002 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 25.85
    v_drdata_node0003 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 17.59
    v_drdata_node0001 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 16.34
    v_drdata_node0002 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 24.53
    v_drdata_node0003 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 17.75
    v_drdata_node0001 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 18.32
    v_drdata_node0002 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 22.13
    v_drdata_node0003 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 22.55
    v_drdata_node0001 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 21.63
    v_drdata_node0002 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 29.99
    v_drdata_node0003 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 21.86
    (50 rows)

    drdata=> select * from memory_usage order by start_time desc,node_name limit 50;
    node_name | start_time | end_time | average_memory_usage_percent
    -------------------+---------------------+---------------------+------------------------------
    v_drdata_node0001 | 2021-05-18 14:19:00 | 2021-05-18 14:20:00 | 6.17
    v_drdata_node0001 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.29
    v_drdata_node0003 | 2021-05-18 14:18:00 | 2021-05-18 14:18:00 | 6.16
    v_drdata_node0003 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.15
    v_drdata_node0001 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 6.17
    v_drdata_node0002 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 6.28
    v_drdata_node0001 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 6.17
    v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 6.28
    v_drdata_node0003 | 2021-05-18 14:16:00 | 2021-05-18 14:18:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.28
    v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.16
    v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:15:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 6.28
    v_drdata_node0001 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 6.18
    v_drdata_node0002 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 6.35
    v_drdata_node0003 | 2021-05-18 14:13:00 | 2021-05-18 14:15:00 | 6.17
    v_drdata_node0001 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:12:00 | 6.27
    v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.27
    v_drdata_node0003 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.15
    v_drdata_node0001 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.16
    v_drdata_node0002 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.28
    v_drdata_node0003 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 6.16
    v_drdata_node0003 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 6.16
    v_drdata_node0001 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 6.18
    v_drdata_node0002 | 2021-05-18 14:09:00 | 2021-05-18 14:11:00 | 6.33
    v_drdata_node0003 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 6.23
    v_drdata_node0001 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.23
    v_drdata_node0003 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.13
    v_drdata_node0001 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.23
    v_drdata_node0003 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.12
    v_drdata_node0001 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.21
    v_drdata_node0002 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.27
    v_drdata_node0003 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.14
    v_drdata_node0001 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.23
    v_drdata_node0003 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.13
    v_drdata_node0001 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.12
    v_drdata_node0002 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.24
    v_drdata_node0003 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.12
    v_drdata_node0001 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.15
    v_drdata_node0002 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.25
    v_drdata_node0003 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.13
    v_drdata_node0001 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 6.15
    (50 rows)


  • 5.  RE: DEGRADED Status doesn't change

    Posted May 18, 2021 01:59 PM
    ERROR | t Monitor Thread | 2021-05-18 14:44:18,159 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-18 14:44:18,160 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD16 changing from DEGRADED to DOWN
    ERROR | t Monitor Thread | 2021-05-18 14:44:20,915 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD15 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-18 14:44:20,916 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD15 changing from DEGRADED to DOWN
    ERROR | t Monitor Thread | 2021-05-18 14:44:21,371 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-18 14:44:21,371 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to DOWN
    ERROR | nager-thread-502 | 2021-05-18 14:44:21,372 | shutdown | ces.shutdown.ShutdownManagerImpl 131 | ommon.core.services.impl | | Shutting down the data aggregator.It was detected that no data repository nodes were contactable. The uncontactable hosts are:[VMLPFMPRD14, VMLPFMPRD16, VMLPFMPRD15]
    ERROR | nager-thread-495 | 2021-05-18 14:44:21,372 | shutdown | tTolerantDBConnectionManagerImpl 221 | ommon.core.services.impl | | No DB host name available.
    INFO | nager-thread-495 | 2021-05-18 14:44:21,373 | shutdown | tTolerantDBConnectionManagerImpl 376 | ommon.core.services.impl | | The primary host for database transactions is now set to null
    ERROR | nager-thread-495 | 2021-05-18 14:44:21,373 | shutdown | tTolerantDBConnectionManagerImpl 179 | ommon.core.services.impl | | The primary data repository host 'VMLPFMPRD14' is no longer available, and there are no available secondary hosts. Current Host Status: {VMLPFMPRD14=DOWN, VMLPFMPRD16=DOWN, VMLPFMPRD15=DOWN}


  • 6.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 19, 2021 03:09 PM
    Open a support case for the DA going down due to heartbeat.  They are better equipped to debug the issue live if necessary.

    The Vertica CPU/memory usage doesn't seem bad.  So I gotta think something is up with DA heap/GC.

    Support/I need to look at DA health views where it shows application pause and heap usage percentage.
    Especially around 2021-05-18 14:44:18 when all nodes appeared to disappear.
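
    If you want a quick spot-check in the meantime, a rough sketch using the JDK's jstat against the DA's java process (the grep is just one way to find the PID; <DA_java_pid> is a placeholder):

       # find the DA's karaf java process id
       ps -ef | grep java | grep -i karaf
       # sample heap occupancy and GC activity every 5 seconds
       # (O = old-gen % used, FGC/FGCT = full-GC count and total time)
       jstat -gcutil <DA_java_pid> 5000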




  • 7.  RE: DEGRADED Status doesn't change

    Posted May 19, 2021 03:28 PM
    Hello Jeffrey,

    I really appreciate your help and I'll open the case for this problem.
    My last question would be about the lost connection between the DR and the DA: what is the primary reason it happens?

    thank you,

    Valéria


  • 8.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 19, 2021 04:04 PM
    It doesn't seem to be a resource issue on vertica at first glance.

    So that leaves DA GC (Java garbage collection of memory), which can pause the DA for a long time depending on the current heap % usage.
    If the app is paused for 5 mins, then when the app is marked active again by Java, the DA will mark all nodes down and shut down.

    We had another customer where the connections we use to contact the DR for heartbeats were being closed after 1 hr by a firewall, but we aren't notified.  We go and try to use them and never get back a failure; we hang for 5 mins and then shut down the DA.
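
    If you suspect something similar, one thing worth checking on the DA host is the Linux TCP keepalive settings; a sketch below (this only helps if the connections enable SO_KEEPALIVE, and the right values depend on the firewall's idle timeout):

       # current keepalive settings (Linux defaults to 7200s before the first probe)
       sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
       # example: start probing idle connections after 10 mins instead of 2 hrs
       sysctl -w net.ipv4.tcp_keepalive_time=600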


  • 9.  RE: DEGRADED Status doesn't change

    Posted May 26, 2021 11:22 AM
    Hello Jeffrey,

    Is there any way to monitor the disk I/O on the data aggregator and repositories? Something like a script or report?

    thank you,

    Valéria


  • 10.  RE: DEGRADED Status doesn't change

    Posted May 26, 2021 11:31 AM
    I would install SysEDGE on the hosts and monitor parameters there.

    ------------------------------
    Senior Consultant
    SolvIT Networks
    ------------------------------



  • 11.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 26, 2021 11:32 AM
    https://www.networkworld.com/article/3330497/linux-commands-for-measuring-disk-activity.html

    That talks about a few OS tools you can install and use to monitor I/O at the OS level.
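
    As a quick start, a minimal sketch with the usual sysstat tools (availability and package names vary by distro):

       # extended per-device statistics every 5 seconds (watch %util and await)
       iostat -x 5
       # historical per-device activity, if the sysstat collector is enabled
       sar -d -p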


  • 12.  RE: DEGRADED Status doesn't change

    Posted May 28, 2021 02:25 PM
    Hello Jeffrey,

    If I increase the memory on the DA to 64 instead of the 32 I have today, could that end the problem I have with the repositories losing their connection to the DA, as you can see in the logs:

    The primary data repository host 'VMLPFMPRD15' is no longer available, and there are no available secondary hosts. Current Host Status: {VMLPFMPRD14=DOWN, VMLPFMPRD16=DOWN, VMLPFMPRD15=DOWN}
    WARN | anager-thread-17 | 2021-05-28 15:09:10,678 | DataRepositoryNodeManager | y.impl.DataRepositoryNodeManager 195 | ger.core.aggregator.impl | | Unable to find existing item for host null

    Recently I increased the memory in all repositories to 147456. I think it could be a problem, because the other servers, like the DA and DC, have 32 each.

    Thank you

    Valéria


  • 13.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted May 28, 2021 02:43 PM
    We appear to create an item per Node.  This item is used for logging status changes with the DR nodes.
    It's basically failing to find the item for NULL (hostname).  Maybe because all hosts are down, hostname is null for the changeEvent.

    Increasing memory on the DA should not cause it to fail to connect to the DR nodes.  There must be another reason they aren't connecting.
    This sounds like an S1 down situation, so you should contact support, who can help you live if needed.
    Is the DA no longer coming up?   Does admintools show all nodes UP?
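
    For reference, a quick way to check node state from a DR node (assuming the default install path):

       # show the state of all database nodes as admintools sees them
       /opt/vertica/bin/admintools -t view_cluster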



  • 14.  RE: DEGRADED Status doesn't change

    Posted May 31, 2021 10:36 AM
    Hello Jeffrey,

    I think it could be something like network latency, because in the last few days, when I moved all the CAPC servers to another node inside Hyper-V, CAPC started working again. Today I sent you some logs from the DA, and now I'm restoring a backup to the Vertica repository:

    ERROR | t Monitor Thread | 2021-05-31 10:15:39,572 | shutdown | ase.heartbeat.DBStateManagerImpl 390 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 execeeded max non-success time of 300000
    WARN | t Monitor Thread | 2021-05-31 10:15:39,573 | shutdown | ase.heartbeat.DBStateManagerImpl 731 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to DOWN
    ERROR | ager-thread-6426 | 2021-05-31 10:15:39,574 | shutdown | ces.shutdown.ShutdownManagerImpl 131 | ommon.core.services.impl | | Shutting down the data aggregator.It was detected that no data repository nodes were contactable. The uncontactable hosts are:[VMLPFMPRD14, VMLPFMPRD16, VMLPFMPRD15]
    ERROR | ager-thread-6429 | 2021-05-31 10:15:39,575 | shutdown | tTolerantDBConnectionManagerImpl 221 | ommon.core.services.impl | | No DB host name available.
    INFO | ager-thread-6429 | 2021-05-31 10:15:39,576 | shutdown | tTolerantDBConnectionManagerImpl 376 | ommon.core.services.impl | | The primary host for database transactions is now set to null
    ERROR | ager-thread-6429 | 2021-05-31 10:15:39,576 | shutdown | tTolerantDBConnectionManagerImpl 179 | ommon.core.services.impl | | The primary data repository host 'VMLPFMPRD14' is no longer available, and there are no available secondary hosts. Current Host Status: {VMLPFMPRD14=DOWN, VMLPFMPRD16=DOWN, VMLPFMPRD15=DOWN}
    WARN | anager-thread-43 | 2021-05-31 10:15:39,577 | DataRepositoryNodeManager | y.impl.DataRepositoryNodeManager 195 | ger.core.aggregator.impl | | Unable to find existing item for host null

    thank you,

    Valéria


  • 15.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 02, 2021 09:42 AM
    Yeh, network latency can play a big role in heartbeat and even query speed.

    You can always run the /opt/vertica/bin/vnetperf tool on one of the nodes, specifying all 3 nodes; maybe add the DA to the list as well, and it'll measure DR-to-DA network speed too.
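
    A rough sketch of the invocation (I believe the option is --hosts, but check vnetperf --help on your version; <DA_host> is a placeholder for your DA hostname):

       # run as dbadmin; measures latency/throughput between the listed hosts
       /opt/vertica/bin/vnetperf --hosts VMLPFMPRD14,VMLPFMPRD15,VMLPFMPRD16,<DA_host>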


  • 16.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 09:23 AM
    Hello Jeffrey,

    As of at least 2 days ago, NetFlow doesn't sync with CAPC anymore, and I didn't find any clue in /opt/CA/PerformanceCenter/DM/logs:


    How can I fix this error?

    thank you,

    Valéria


  • 17.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 03, 2021 11:01 AM
    Good to see DA is up and running.

    As for NFA, can you check for any ERROR in the DMService.log when NFA syncs?  What is the complete error?

    If it's a read timeout, it's most likely during the last PUSH stage, where NFA does some post-processing that is taking more than 20 mins to run.
    You can follow the resolution in this KB to extend the timeout to 1 hr to give NFA more time.  The KB applies to all data sources, not just Spectrum.
    https://knowledge.broadcom.com/external/article/137391/spectrum-data-source-is-frequently-faili.html


  • 18.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 12:02 PM
    Hello Jeffrey,

    Thank you for your reply.
    I can't see any errors in DM/logs:


    I've looked for what you told me about the InventoryTimeoutException, but I didn't find anything in these logs.

    thank you,

    Valéria


  • 19.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 12:19 PM
    Hello Jeffrey,

    I found the error:


    How can I fix this error?

    thank you,

    Valéria


  • 20.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 03, 2021 12:24 PM
    Yeh, the article is not 100% correct as to the message to look for.  I've asked support to fix the KB.

    But the real issue is in the latest image: for some reason the SourceGUID product ID changed on the NFA side, and now PC rejects the sync request because it doesn't think it's talking to the right NFA box.

    This has been seen before, but I'm not sure why NFA ends up creating a new SourceGUID.

    I'm gonna send an email to an NFA support engineer who can comment further with a workaround.  I think he has a KB for this.


  • 21.  RE: DEGRADED Status doesn't change

    Broadcom Employee
    Posted Jun 03, 2021 12:38 PM
    Valeria,

    To resolve this error please follow these instructions:

    https://knowledge.broadcom.com/external/article/5313/data-source-fails-to-sync-and-receives-t.html

    If you have any issues, let us know.

    Thanks,

    Justin Signa


  • 22.  RE: DEGRADED Status doesn't change

    Posted Jun 03, 2021 01:58 PM
    Hello Justin and Jeffrey,

    NetFlow has started to sync without problems.
    Thank you for your support.

    Valéria