drdata=> select * from cpu_usage order by start_time desc,node_name limit 50;
node_name | start_time | end_time | average_cpu_usage_percent
-------------------+---------------------+---------------------+---------------------------
v_drdata_node0001 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 8.76
v_drdata_node0002 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 25.98
v_drdata_node0003 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 26.3
v_drdata_node0001 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 3.68
v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:16:00 | 24.17
v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 27.6
v_drdata_node0003 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 24.6
v_drdata_node0001 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.87
v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 14.72
v_drdata_node0001 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 13.28
v_drdata_node0002 | 2021-05-18 14:14:00 | 2021-05-18 14:16:00 | 20.63
v_drdata_node0001 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 19.48
v_drdata_node0002 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 29.79
v_drdata_node0003 | 2021-05-18 14:13:00 | 2021-05-18 14:15:00 | 21.56
v_drdata_node0001 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 7.62
v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 16.85
v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:12:00 | 10.84
v_drdata_node0003 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 27.98
v_drdata_node0001 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 5.05
v_drdata_node0003 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 24.28
v_drdata_node0001 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 13.05
v_drdata_node0002 | 2021-05-18 14:10:00 | 2021-05-18 14:12:00 | 23.22
v_drdata_node0003 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 23.81
v_drdata_node0001 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 12.13
v_drdata_node0002 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 22.42
v_drdata_node0003 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 25.36
v_drdata_node0001 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 9.5
v_drdata_node0002 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 22.41
v_drdata_node0003 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 18.82
v_drdata_node0001 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 20.46
v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:07:00 | 26.6
v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 27.37
v_drdata_node0003 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 26.09
v_drdata_node0001 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 10.79
v_drdata_node0003 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 22.57
v_drdata_node0001 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 5.76
v_drdata_node0002 | 2021-05-18 14:05:00 | 2021-05-18 14:07:00 | 24.09
v_drdata_node0003 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 19.53
v_drdata_node0001 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 15.17
v_drdata_node0002 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 25.85
v_drdata_node0003 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 17.59
v_drdata_node0001 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 16.34
v_drdata_node0002 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 24.53
v_drdata_node0003 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 17.75
v_drdata_node0001 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 18.32
v_drdata_node0002 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 22.13
v_drdata_node0003 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 22.55
v_drdata_node0001 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 21.63
v_drdata_node0002 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 29.99
v_drdata_node0003 | 2021-05-18 14:01:00 | 2021-05-18 14:02:00 | 21.86
(50 rows)
drdata=> select * from memory_usage order by start_time desc,node_name limit 50;
node_name | start_time | end_time | average_memory_usage_percent
-------------------+---------------------+---------------------+------------------------------
v_drdata_node0001 | 2021-05-18 14:19:00 | 2021-05-18 14:20:00 | 6.17
v_drdata_node0001 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.16
v_drdata_node0002 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.29
v_drdata_node0003 | 2021-05-18 14:18:00 | 2021-05-18 14:18:00 | 6.16
v_drdata_node0003 | 2021-05-18 14:18:00 | 2021-05-18 14:19:00 | 6.15
v_drdata_node0001 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 6.17
v_drdata_node0002 | 2021-05-18 14:17:00 | 2021-05-18 14:18:00 | 6.28
v_drdata_node0001 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 6.17
v_drdata_node0002 | 2021-05-18 14:16:00 | 2021-05-18 14:17:00 | 6.28
v_drdata_node0003 | 2021-05-18 14:16:00 | 2021-05-18 14:18:00 | 6.16
v_drdata_node0001 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.16
v_drdata_node0002 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.28
v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:16:00 | 6.16
v_drdata_node0003 | 2021-05-18 14:15:00 | 2021-05-18 14:15:00 | 6.16
v_drdata_node0001 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 6.16
v_drdata_node0002 | 2021-05-18 14:14:00 | 2021-05-18 14:15:00 | 6.28
v_drdata_node0001 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 6.18
v_drdata_node0002 | 2021-05-18 14:13:00 | 2021-05-18 14:14:00 | 6.35
v_drdata_node0003 | 2021-05-18 14:13:00 | 2021-05-18 14:15:00 | 6.17
v_drdata_node0001 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.16
v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:12:00 | 6.27
v_drdata_node0002 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.27
v_drdata_node0003 | 2021-05-18 14:12:00 | 2021-05-18 14:13:00 | 6.15
v_drdata_node0001 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.16
v_drdata_node0002 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.28
v_drdata_node0003 | 2021-05-18 14:11:00 | 2021-05-18 14:12:00 | 6.16
v_drdata_node0001 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 6.16
v_drdata_node0003 | 2021-05-18 14:10:00 | 2021-05-18 14:11:00 | 6.16
v_drdata_node0001 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 6.18
v_drdata_node0002 | 2021-05-18 14:09:00 | 2021-05-18 14:11:00 | 6.33
v_drdata_node0003 | 2021-05-18 14:09:00 | 2021-05-18 14:10:00 | 6.23
v_drdata_node0001 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.12
v_drdata_node0002 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.23
v_drdata_node0003 | 2021-05-18 14:08:00 | 2021-05-18 14:09:00 | 6.13
v_drdata_node0001 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.12
v_drdata_node0002 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.23
v_drdata_node0003 | 2021-05-18 14:07:00 | 2021-05-18 14:08:00 | 6.12
v_drdata_node0001 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.21
v_drdata_node0002 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.27
v_drdata_node0003 | 2021-05-18 14:06:00 | 2021-05-18 14:07:00 | 6.14
v_drdata_node0001 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.12
v_drdata_node0002 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.23
v_drdata_node0003 | 2021-05-18 14:05:00 | 2021-05-18 14:06:00 | 6.13
v_drdata_node0001 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.12
v_drdata_node0002 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.24
v_drdata_node0003 | 2021-05-18 14:04:00 | 2021-05-18 14:05:00 | 6.12
v_drdata_node0001 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.15
v_drdata_node0002 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.25
v_drdata_node0003 | 2021-05-18 14:03:00 | 2021-05-18 14:04:00 | 6.13
v_drdata_node0001 | 2021-05-18 14:02:00 | 2021-05-18 14:03:00 | 6.15
(50 rows)
Original Message:
Sent: 05-18-2021 12:53 PM
From: Valeria Cunha
Subject: DEGRADED Status doesn't change
Hello Jeffrey,
All the nodes are up, and I can't see the poll cycle as you suggested.
How do I run these commands on Vertica, as you told me?
select * from cpu_usage order by start_time desc,node_name limit 50;
select * from memory_usage order by start_time desc,node_name limit 50;
I restarted the entire solution 2 hours ago; I think it takes a long time to sync. Below you can see more logs from karaf on the Data Aggregator:
INFO | Host:VMLPFMPRD14 | 2021-05-18 13:48:58,983 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from OK to DEGRADED
WARN | Host:VMLPFMPRD15 | 2021-05-18 13:48:58,986 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD15 successful, but the response time of 0:01:27.198 was longer then a threshold of 20000 ms.
WARN | Host:VMLPFMPRD16 | 2021-05-18 13:49:00,119 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:01:29.768 was longer then a threshold of 20000 ms.
INFO | Host:VMLPFMPRD14 | 2021-05-18 13:49:09,192 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to OK
INFO | Host:VMLPFMPRD15 | 2021-05-18 13:49:14,537 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD15 changing from DEGRADED to OK
thank you,
Valéria
Original Message:
Sent: 05-18-2021 12:36 PM
From: JEFFREY PINARD
Subject: DEGRADED Status doesn't change
A DB node's DEGRADED state means it's taking more than 20 secs for the DB heartbeats to return/complete. After 5 mins with no heartbeat responses, we mark the node down until we get a valid heartbeat.
One seemed to take 4m57s to complete.
You should check the DA heap and app pause self monitoring views to see if the DA is above 70% heap and GC is high. That can cause heartbeats to be slow to be processed, but it should affect all nodes' heartbeat checks, not just 1 or 2, I believe.
The degraded threshold state means thresholding is taking over 80% for 15 mins (meaning 3 consecutive runs of thresholding are each taking over 80% of their 5 min allocated time to run). It could be that the DR is too busy to process them in a timely manner, and that may also be why heartbeats are so slow, if there is no DA heap/GC issue.
Are any nodes down? What does CPU and memory usage look like on the DR nodes?
You can always run as the db admin user in vsql:
select * from cpu_usage order by start_time desc,node_name limit 50;
select * from memory_usage order by start_time desc,node_name limit 50;
The above is Vertica recording what it sees as memory and CPU usage on the box.
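A minimal sketch of running those queries from the command line, assuming the default Vertica binary location and the `dbadmin` admin user (the user name and path are assumptions; the `drdata` database name is from your prompt above):

```shell
# Connect to Vertica on a DR node and run the query non-interactively.
# /opt/vertica/bin/vsql and the dbadmin user are the Vertica defaults;
# adjust if your install differs.
/opt/vertica/bin/vsql -U dbadmin -d drdata \
  -c "select * from cpu_usage order by start_time desc, node_name limit 50;"

/opt/vertica/bin/vsql -U dbadmin -d drdata \
  -c "select * from memory_usage order by start_time desc, node_name limit 50;"
```

You can also just run `/opt/vertica/bin/vsql -U dbadmin -d drdata` and paste the selects at the `drdata=>` prompt, as in the output you captured.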
Also, check the Event Processing self monitoring dashboard to see what % of the poll cycle and calc time goes to thresholding.
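To keep an eye on the heartbeat state transitions while you investigate, you can also filter the karaf log on the DA host (log path taken from your earlier message):

```shell
# Follow only the DB heartbeat state-change lines in the DA's karaf log.
# --line-buffered keeps grep from holding output back while following.
tail -f /opt/CA/IMDataAggregator/apache-karaf-2.4.3/data/log/karaf.log \
  | grep --line-buffered "DB state for host"
```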
Original Message:
Sent: 05-18-2021 12:20 PM
From: Valeria Cunha
Subject: DEGRADED Status doesn't change
Hello CA team,
After rebooting the entire solution, the Data Aggregator does not change its DEGRADED status, as you can see in the logs:
root@VMLPFMPRD12 /opt/CA/IMDataAggregator/apache-karaf-2.4.3/data/log # tail karaf.log
WARN | Host:VMLPFMPRD16 | 2021-05-18 13:10:38,267 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:04:57.671 was longer then a threshold of 20000 ms.
INFO | Host:VMLPFMPRD14 | 2021-05-18 13:10:57,722 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from DEGRADED to OK
WARN | Host:VMLPFMPRD14 | 2021-05-18 13:12:33,421 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 successful, but the response time of 0:01:25.694 was longer then a threshold of 20000 ms.
INFO | Host:VMLPFMPRD14 | 2021-05-18 13:12:33,422 | shutdown | ase.heartbeat.DBStateManagerImpl 735 | ommon.core.services.impl | | DB state for host VMLPFMPRD14 changing from OK to DEGRADED
WARN | Host:VMLPFMPRD14 | 2021-05-18 13:13:15,547 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD14 successful, but the response time of 0:00:32.121 was longer then a threshold of 20000 ms.
WARN | Host:VMLPFMPRD16 | 2021-05-18 13:13:15,548 | shutdown | ase.heartbeat.DBStateManagerImpl 792 | ommon.core.services.impl | | DB heartbeat to host VMLPFMPRD16 successful, but the response time of 0:02:27.280 was longer then a threshold of 20000 ms.
Every time I check these karaf logs, the status changes from OK to DEGRADED and it doesn't stop. What do I have to do to fix this error? Does the entire solution take a long time to sync?
thank you,
Valéria