Some of our Linux servers running native Net-SNMP randomly lose SNMP management or can't be modeled at all. There are no firewalls or ACLs in the path between the servers and the landscape, and they are all reachable with ICMP. Any ideas what could be dropping them?
I would start by looking at the tcpdump output.
Syntax: tcpdump -i eth0 -n -s 0 -w device_name.pcap "host 172.20.5.122"
Are you managing using SNMPv3?
Yes. We're getting the same issue while testing with v2 as well.
This may be a known SNMPv3 issue. The 10.02.03.PTF_10.2.321 patch for Spectrum 10.2.3 addresses the following issues:
SNMP engine ID discovery
After reboot device can't be monitored with SNMPv3
Reset for SNMPv3 Authentication Fails.
Updating Profiles hangs.
Fails to process SNMP responses from different IP.
With a changed source IP address, SS crashes on polling after SNMPv3 reset
Symptom: Changes in engineID on the device cause SNMPv3 authentication failures.
Resolution: Spectrum now respects engineID changes on the device.
Management Agent Lost Alarms in 10.2.3
SNMPv3: update engineID in Spectrum when it's changed on Device.
Validate SNMPv3 profile fields during Profile creation.
OK, looks like we're running 10.2.1. I will try to patch it and go from there.
We have had numerous issues with Net-SNMP stability. In several cases, systems running the Net-SNMP agent showed Management Agent Lost, but when we polled them manually, they responded. After running some tests on these servers, we found that Net-SNMP is fine with its default configuration, but almost any change that makes it use more CPU makes it slow to respond. Our first problem came when we also had the SysEdge agent on the servers and told the Net-SNMP agent to point to the SysEdge OIDs as well (while we transitioned off SysEdge). Once we removed that reference, everything was good again for a while.
So, you may want to do some sniffing and see how long the agent takes to respond to requests. You may end up having to raise the timeouts, but be careful: higher timeouts also affect how quickly Spectrum starts up, because it takes Spectrum longer to detect systems that are unavailable. So if timeouts are what you need to change, be selective about which systems you update. Maybe use a policy to define your timeouts.
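To put numbers on "how long the agent takes to respond", here is a minimal Python sketch that pairs requests with responses by SNMP request-id. It assumes you have already pulled (timestamp, kind, request-id) records out of the capture with whatever tool you prefer. The record layout, the function name snmp_latencies, the sample values, and the 3.0 second threshold are all made up for illustration and are not Spectrum or Net-SNMP settings.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (timestamp_seconds, kind, request_id).
# kind is "request" (manager -> agent) or "response" (agent -> manager).
Record = Tuple[float, str, int]


def snmp_latencies(records: Iterable[Record]) -> Tuple[Dict[int, float], List[int]]:
    """Return ({request_id: seconds until first response}, [unanswered request_ids])."""
    first_sent: Dict[int, float] = {}
    latency: Dict[int, float] = {}
    for ts, kind, req_id in sorted(records):
        if kind == "request":
            # Assumption: a retry reuses the same request-id, so keep the
            # first send time and measure the total wait from there.
            first_sent.setdefault(req_id, ts)
        elif kind == "response" and req_id in first_sent and req_id not in latency:
            latency[req_id] = ts - first_sent[req_id]
    unanswered = [r for r in first_sent if r not in latency]
    return latency, unanswered


# Made-up sample data, not taken from a real capture.
sample = [
    (0.00, "request", 101),
    (0.25, "response", 101),
    (1.00, "request", 102),
    (4.00, "request", 102),   # retry of 102
    (4.50, "response", 102),
    (5.00, "request", 103),   # never answered
]

latency, unanswered = snmp_latencies(sample)
# Illustrative threshold only; compare against the timeout you actually use.
slow = {r: t for r, t in latency.items() if t > 3.0}
print("latency:", latency)
print("slow:", slow)
print("unanswered:", unanswered)
```

If most of the slow or unanswered requests come from a handful of hosts, that supports raising timeouts only for those systems rather than globally.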