We are in the process of going live with our new ticketing system, ServiceNow. Everything is working as expected, except for one very important piece: when we Resolve a ticket in ServiceNow, the corresponding alert is not being Acknowledged in the Nim Alert Console.
We have two environments, one for testing and one for production. The alerts that we are sending over to production ServiceNow from the test Nim environment are being Acknowledged when the ticket is Resolved, so everything works as expected there within Test. But when the same alert is sent from the production Nim environment to the production instance of ServiceNow, alarms are NOT being Acknowledged when the ticket is resolved.
After digging through the trace logs of the sdgtw probe, I found the issue, or at least something pointing to what it may be:
Lab/Test environment log file for sdgtw probe, which Acknowledges the alert in Nim properly----------
Oct 19 12:43:22:970 [ServiceNow, sdgtw] user idadministrator
Oct 19 12:43:22:993 [ServiceNow, sdgtw] alarmlist[com.nimsoft.events.api.model.Alarm@7fd43795]
Oct 19 12:43:23:117 [ServiceNow, sdgtw] [clearAlarmsForClosedIncidents ()]: Alarms acknowledgment successfully using AlarmService API aId[DV11899587-80892]
Oct 19 12:43:23:117 [ServiceNow, sdgtw] Clearing the cache for alarm with Id - DV11899587-80892 and incidentId 62d2aa74db55e780995a791c8c9619fd
Production log file for sdgtw probe, which IS NOT Acknowledging the alert in Nim properly----------
Oct 19 14:59:17:981 [ServiceNow, sdgtw] user idadministrator
Oct 19 14:59:18:101 [ServiceNow, sdgtw] alarmlist
Oct 19 14:59:18:101 [ServiceNow, sdgtw] Clearing the cache for alarm with Id - SV16579396-19936 and incidentId 32537abcdb55e780995a791c8c9619ae
As you can see, the alarmlist is empty, and we don't get the successful Acknowledgement message. After digging some more, I found the following Error in the log file for production Nim:
Error while connecting AlarmService API Reason: (11) command not found, Received status (11) on response (for sendRcv) for cmd = 'dispatcher' ST
It's worth noting that I am currently on our Secondary hub within our Nim UIM environment. But the Secondary hub is acting as our primary, and houses the following probes to support the sdgtw-to-ServiceNow integration:
Unfortunately, that did not resolve the issue.
Perhaps there is another probe that is needed for handling dispatching (maybe the UDM_Manager)?
Additionally, I did open a case with CA Support, and they said that, per the documentation, the Secondary hub does not support the sdgtw probe. But since our secondary is acting as the primary, I suspect we may just be missing a probe to facilitate the communications for Acknowledging an alert.
Any assistance would be greatly appreciated!
After a bit more digging, I noticed that the trellis probe is encountering issues upon startup-
Oct 21 14:08:57:615 [main, trellis] Initiator 'com.ca.trellis.persist.relational.DataSourceInitiator' threw an exception during application.
Oct 21 14:08:57:615 [main, trellis] Reason:
Oct 21 14:08:57:616 [main, trellis] com.lift.SystemException: configuration
Caused by: (4) not found, Received status (4) on response (for sendRcv) for cmd = 'nametoip' name = 'data_engine'
Oct 21 14:08:59:004 [main, trellis] Initiator 'com.ca.trellis.persist.relational.PersistenceUnitInitiator' threw an exception during application.
Oct 21 14:08:59:004 [main, trellis] Reason:
Oct 21 14:08:59:004 [main, trellis] com.ca.trellis.spi.deployment.DeploymentException: Referenced object identified by 'tnt2-ds' did not existPlease fix your configuration
Oct 21 14:08:59:980 [main, trellis] Caught exception while trying to start Trellis. The probe should be responsive, but Trellis isn't
Oct 21 14:08:59:980 [main, trellis] java.lang.IllegalStateException: org.springframework.context.annotation.AnnotationConfigApplicationContext@e4f8592 has not been refreshed yet
So to test, I started up the data_engine probe on the Secondary, as it appears to be a requirement per the documentation, and trellis looks better now, with the exception of the ACE probe, which I am not sure is required for sdgtw:
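As a side note, the earlier trellis error referenced the hub's 'nametoip' callback failing for 'data_engine', so you can check that lookup directly with the pu probe utility. This is only a sketch; the domain/hub/robot path and credentials below are placeholders for your own environment:

```shell
# Ask the hub to resolve data_engine's address, using the same 'nametoip'
# callback that trellis was failing on. Replace the address path and
# credentials with your own; a "(4) not found" reply means the hub still
# cannot see a running data_engine.
pu -u administrator -p '<password>' /YourDomain/YourHub/yourrobot/hub nametoip data_engine
```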
Oct 21 14:51:17:308 [main, trellis] Creating Shift Context
Oct 21 14:51:17:376 [main, trellis] Registering service: class com.nimsoft.events.nas.NasAlarmServiceImpl
Oct 21 14:51:18:272 [main, trellis] Creating Shift Context
Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.services.ugs.DefaultGroupService
Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.tnt2.services.DefaultLegacyGroupService
Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.ugs.metadata.FlywayMigrationService
Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.tnt2.services.DefaultComputerSystemService
Oct 21 14:51:18:273 [main, trellis] Registering service: class com.ca.uim.tnt2.services.DefaultConfigurationItemService
Oct 21 14:51:18:925 [taskScheduler-1, trellis] ACE could not be located. Not configuring
Oct 21 14:51:19:120 [main, trellis] ****************[ Starting ]****************
Oct 21 14:51:19:120 [main, trellis] 2.01
Oct 21 14:51:23:748 [main, trellis] Failed to contact ACE. Configuration
After 'Resolving' a ticket within ServiceNow, I am now getting a different message in the trace log of the sdgtw...it appears to be ignoring the incident now:
Oct 21 14:30:18:368 [ServiceNow, sdgtw] responseCode ::  response messege :: [OK]
Oct 21 14:30:18:375 [ServiceNow, sdgtw] Incident found for closing [com.ca.integration.normalization.omodel.Incident@14982fb2]
Oct 21 14:30:18:375 [ServiceNow, sdgtw] Completed executing the filter. Number of records returned - 1
Oct 21 14:30:18:375 [ServiceNow, sdgtw] Ignoring the incidentId '198d782ddb992b80995a791c8c961905' as it is not associated with any Alarm.
But the thing is, there IS an Alarm with that id in the console. Not sure why it is being ignored.
...and now the trellis probe is kicking out some more interesting log messages; it's repeating this:
Oct 21 14:58:57:579 [attach_socket, trellis] Dispatcher caught unchecked service exception. This could be normal behavior, but you may want to examine it anyway
Additionally...while comparing the production Trellis to the test Trellis, both of them log "ACE could not be located. Not configuring", but only the prod Trellis logs "Failed to contact ACE". So I looked at the ACE logs for both prod and test, and they both have:
Oct 21 15:11:24:679 ERROR [attach_socket, com.nimsoft.nimbus.NimServerSession] Exception in NimServerSessionThread.run. Closing session.
Oct 21 15:11:24:680 ERROR [attach_socket, com.nimsoft.nimbus.NimServerSession] (2) communication error, Error when trying to send on session (S) com.nimsoft.nimbus.NimServerSession(Socket[addr=/10.240.135.14,port=56388,localport=48033]): Software caused connection abort: socket write error
...I decided to restart the ACE probe, because why not...and only the production Trellis received the following:
Oct 21 15:05:41:368 [attach_socket, trellis] An exception occurred while processing a message from Socket[addr=/10.240.135.14,port=56171,localport=48043].
Oct 21 15:05:41:368 [attach_socket, trellis] (120) Callback error, Exception in callback for public void com.ca.trellis.shift.core.TrellisDispatchCoordinator.dispatch(com.nimsoft.nimbus.NimSession,com.nimsoft.nimbus.PDS) throws com.nimsoft.nimbus.NimException: No qualifying bean of type [com.ca.trellis.shift.core.ShiftDispatcher] is defined: No qualifying bean of type [com.ca.trellis.shift.core.ShiftDispatcher] is defined
Looks like we have circled back around to this 'dispatcher'. Does anyone have any insight on this one?
One more piece of info-
I was just able to replicate the original issue in the lab by deactivating the Trellis probe. Here are the logs:
Oct 21 23:52:10:141 [ServiceNow, sdgtw] responseCode ::  response messege :: [OK]
Oct 21 23:52:10:145 [ServiceNow, sdgtw] Incident found for closing [com.ca.integration.normalization.omodel.Incident@4a1156ed]
Oct 21 23:52:10:145 [ServiceNow, sdgtw] Completed executing the filter. Number of records returned - 1
Oct 21 23:52:10:145 [ServiceNow, sdgtw] Closing the alarm with Id - CQ03113799-12121 associated with the incidentId 7fcecfe9db196f40be427b668c961900
Oct 21 23:52:10:154 [ServiceNow, sdgtw] user idadministrator
Oct 21 23:52:10:157 [ServiceNow, sdgtw] alarmlist
Oct 21 23:52:10:157 [ServiceNow, sdgtw] Clearing the cache for alarm with Id - CQ03113799-12121 and incidentId 7fcecfe9db196f40be427b668c961900
The empty "alarmlist" is the same thing we were experiencing before moving the 'sdgtw' onto the same hub server as the data_engine. So the Trellis probe definitely has something to do with this issue.
Any assistance on this matter would be greatly appreciated!
Alrighty...got her figured out-
The short of it: we ended up having to relocate our NAS probe back to the primary. Per the documentation, the SDGTW probe only works on the primary server. In our environment, that is not so cut and dry, since we have had to offload probes to other servers in light of breaching subscriber limits.
The probes that are absolutely necessary for the SDGTW probe to operate are:
CA Support is working on getting this info added to the sdgtw probe documentation, which will be helpful for non-typical environments. They will also add how each of these probes works together to facilitate communications.
Figured I should also share the following-
One of the issues we ran into was Priority/Severity Mapping. It turned out that ServiceNow and the SDGTW probe were configured properly; the issue was actually our probes. In our old ticketing system, we used prefixes to identify the Severity, for example:
[P1:S1] Disk Space on device $servername has breached the 10% free space threshold
...where the [P1:S1] would serve as a High level ticket, P0:S0 would be Critical, and so on. So it was never much of a concern for us to make sure our probe message pools had the correct severity configured for each message, since we used that identifier.
But in the new system, this created a problem, being that the actual severity levels are dictated by values:
To fix this, we used the NAS Preprocessing engine to change the severity. For example:
If an inbound alarm had a prefix of P1:S1 but its NIM level was set to Informational, or anything other than 4-Major, the preprocessing rules we now have built force it to the correct NIML value. Here is a really good link on how to accomplish that if severity levels are causing issues for you:
How to change alarm severity using a NAS script - CA Knowledge
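For anyone who wants a feel for what such a rule looks like, here is a rough Lua sketch of a NAS pre-processing script that forces the severity to match a [Pn:Sn] message prefix. The prefix patterns and mapping table are examples from our scheme, not a standard; it assumes the nas pre-processing convention where the inbound alarm is exposed as the `event` table (with `event.message` and `event.level`) and returning the table forwards the alarm:

```lua
-- NAS pre-processing sketch: align the NIM severity level with the
-- [Pn:Sn] prefix carried in the alarm message text.
-- UIM severity values: 0=clear, 1=info, 2=warning, 3=minor, 4=major, 5=critical.
local prefix_to_level = {
    ["P0:S0"] = 5,  -- critical
    ["P1:S1"] = 4,  -- major
    ["P2:S2"] = 3,  -- minor
    ["P3:S3"] = 2,  -- warning
}

-- Pull a leading "[Px:Sx]" prefix out of the message, if present.
local prefix = string.match(event.message, "^%[(P%d:S%d)%]")
if prefix ~= nil and prefix_to_level[prefix] ~= nil then
    event.level = prefix_to_level[prefix]
end

-- Returning the (possibly modified) event lets the alarm continue on;
-- returning nil would discard it instead.
return event
```

With this in place, a message like "[P1:S1] Disk Space on device ... has breached the 10% free space threshold" gets its NIM level forced to 4-Major regardless of what the originating probe's message pool was configured with.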