DX Unified Infrastructure Management


Survey: How do we enrich alarms?

  • 1.  Survey: How do we enrich alarms?

    Posted Jun 09, 2015 11:00 AM

    As part of the improvements we are making to the UIM event handler, I want to reach out to the community and survey how we enrich alarms.  When I ask about enriching an alarm, I am asking how you commonly modify an incoming alarm with data from other sources.

     

    What I would like are specifics on your common use cases. For example, I've detailed one possible use case below. Please respond with your use case in similar detail.

     

    This is your opportunity to influence the capabilities of the product as it is being developed.  So, please, provide as many unique use cases as you can think of.

     

    Use Case:

    My company is a managed service provider (MSP).  As an MSP, we have many thousands of devices that are used by multiple tenant customers.  When we receive an alarm from a device, I need to query a CSV file that lists the devices by customer, and add the customer name as a property of the alarm.  This way, I can search the alarms by customer name to know which alarms pertain to each customer.
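
    For illustration, a lookup like that could be written as a nas pre-processing Lua script roughly along these lines. This is only a sketch: the CSV path, its "hostname,customer" layout, the choice of the custom_1 field, and the convention that the script returns the (possibly modified) event table are assumptions for the example, not a statement of how any particular site does it.

        -- Sketch: tag an incoming alarm with the customer that owns the source
        -- device, based on a CSV of "hostname,customer" rows (path and layout
        -- are made up for this example).
        local CSV_PATH = "/opt/nimsoft/custom/device_customers.csv"

        local function load_customers(path)
            local map = {}
            local f = io.open(path, "r")
            if f == nil then return map end
            for line in f:lines() do
                local host, customer = line:match("^([^,]+),(.+)$")
                if host and customer then
                    map[host:lower()] = customer
                end
            end
            f:close()
            return map
        end

        local customers = load_customers(CSV_PATH)
        local customer = customers[string.lower(event.hostname or "")]
        if customer ~= nil then
            event.custom_1 = customer   -- searchable customer property on the alarm
        end
        return event                    -- pass the (possibly enriched) alarm on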

     

    Jim Perkins

    CA Technologies

    Product Manager, Events



  • 2.  Re: Survey: How do we enrich alarms?

    Posted Jun 09, 2015 03:16 PM

    Hi James,

     

    I've posted some blogs about my integrations, which include alarm enrichment. If you want the details, check out Developing custom UIM integrations part 2: CMDB and enrichment. I'll summarize here for convenience.

     

    Our two primary uses for enriching alarms are:

    1. The ability to create tickets in Service-Now with the correct CIs and Service Offerings linked to them. This means enriching alarms with CI and service offering identifiers.

    2. Enriching alarms with the service hours for the device. We can then use this information to further process the alarms as necessary, for example sending SMS and email notifications to on-call staff only if the device has service hours in effect (a minimal sketch of such a check follows below).
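
    A minimal sketch of that service-hours check, written in the style of a nas pre-processing Lua script: the assumption here is that enrichment has already written the device's service hours into custom_2 as "HH:MM-HH:MM". The field choice, the format, and the notify flag written to custom_3 are all illustrative.

        -- Sketch: decide whether an alarm should page on-call staff, based on
        -- service hours that enrichment has placed in custom_2 ("HH:MM-HH:MM").
        local function within_service_hours(window)
            if window == nil or window == "" then return false end
            local h1, m1, h2, m2 = window:match("^(%d+):(%d+)%-(%d+):(%d+)$")
            if h1 == nil then return false end
            local now   = tonumber(os.date("%H")) * 60 + tonumber(os.date("%M"))
            local start = tonumber(h1) * 60 + tonumber(m1)
            local stop  = tonumber(h2) * 60 + tonumber(m2)
            return now >= start and now <= stop
        end

        if within_service_hours(event.custom_2) then
            -- only here would the SMS/email notification to on-call staff be triggered
            event.custom_3 = "notify_oncall"
        end
        return event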

     

    There are also some other uses; for example, after a ticket is created I put the ticket ID in the alarm, after which I can use URL actions in USM to open the ticket. I guess it's debatable whether that's actually enrichment.

     

    -jon



  • 3.  Re: Survey: How do we enrich alarms?

    Posted Jun 18, 2015 08:57 AM

    Jim,

     

    While we have not done this yet, our enrichment plans cover the primary use case below.

     

    Enrich the alarm using information from CA Cloud Service Management (CA CSM), e.g. the group on call, the importance of the device, and the applications / services potentially impacted. The issue we have today is that, as best I can tell, enrichment can only be done via a JDBC-based SQL query. That is a problem for us because the SaaS-based CA CSM does not support direct JDBC connections, and the information in question is only available via API. This leaves us querying a database extract that is provided by the CA CSM team every day instead of current information.
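
    As a rough sketch of what enrichment from that daily extract might look like in a nas pre-processing Lua script: it assumes the nas database.* helpers, that the extract has been loaded into a table named csm_device_info (a name made up for this example), and that query results come back as rows keyed by column name. The connection string and custom field choices are illustrative only.

        -- Sketch: enrich an alarm from a nightly CA CSM extract loaded into the
        -- UIM database (csm_device_info is a made-up table name).
        database.open("provider=nis;database=nis;user=sa;pwd=secret")   -- illustrative connection string
        local rows = database.query(
            "SELECT oncall_group, importance FROM csm_device_info " ..
            "WHERE hostname = '" .. (event.hostname or "") .. "'")
        database.close()

        if rows ~= nil and rows[1] ~= nil then
            event.custom_1 = rows[1].oncall_group   -- group on call
            event.custom_2 = rows[1].importance     -- device importance
        end
        return event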

     

    -Alquin



  • 4.  Re: Survey: How do we enrich alarms?

    Posted Jun 18, 2015 09:08 AM

    This is also an issue I had when I was planning my enrichment: the data I need is primarily accessible through web service APIs. That is also one of the reasons I synchronize the data to the UIM database locally. It would be convenient if enrichment were able to use such methods, but therein also lies a problem: web services are potentially slow to respond.

     

    Definitely a good point, though I'm not sure it's easy to overcome with any universal method.

     

    -jon



  • 5.  Re: Survey: How do we enrich alarms?

    Posted Jun 18, 2015 11:14 AM

    We do several things - mostly using preprocessor scripts and AO profiles.

     

    Our situation is a little different than most because we essentially manage an application on a specific set of hardware, as opposed to traditional monitoring of a server. It's sort of like running a DBA team for a large organization where your SLA to your database users is availability of the database software, not the server itself.

     

    And because we manage multiple products that may reside on a single robot, there are some housekeeping things we need to do.

     

    Consider what happens with CDM disk usage alarms in the scenario where you have a Sybase install on drive D:, Oracle on drive E:, and the OS on drive C:. CDM issues related to drive D: go to the Sybase team, E: goes to the Oracle team, and C: goes to your server team. We place a product indicator in each message so that our ticketing system can route the message to the correct support desk. This is part of the message setup for the drive and works well. The problem is the close: since there's only a single default closure message, there's no way to modify it to include the product. So we use preprocessor scripts to look up the correct product for probes where there's not full control over the message and correct it there.
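
    A minimal sketch of that kind of lookup as a pre-processing Lua script, assuming the alarm is exposed as the event table: the drive-to-product map, the message pattern, and the use of custom_4 for the product indicator are all assumptions for the example.

        -- Sketch: stamp a product indicator onto cdm disk alarms (including the
        -- closes) based on the drive letter found in the message, so the
        -- ticketing system can route them to the right desk.
        local drive_to_product = {
            ["C:"] = "OS",
            ["D:"] = "SYBASE",
            ["E:"] = "ORACLE",
        }

        if event.prid == "cdm" then
            local drive = string.match(event.message or "", "(%u:)")   -- e.g. "D:"
            local product = drive and drive_to_product[drive]
            if product ~= nil then
                event.custom_4 = product
                event.message  = event.message .. " [product=" .. product .. "]"
            end
        end
        return event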

     

    We have almost 4000 monitored systems and dictate a uniform monitoring profile. The problem is that some customers disregard best practices. So, again with CDM, we dictate a 95% low threshold, but we have customers who intentionally violate that. In these known cases, we also use preprocessor scripts to throw these alerts out when they arrive. There is no value in knowing that an error exists and will continue to exist indefinitely. Yes, we could manage this by disabling the alert on the specific CDM probe, but putting it in a preprocessor script lets us manage all the exceptions in one place.
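
    Discarding such known exceptions could look roughly like this in a pre-processing script, assuming the convention that returning nil drops the alarm; the robot names in the exception list are placeholders.

        -- Sketch: silently drop cdm threshold alarms from robots that are known,
        -- accepted exceptions to the standard monitoring profile.
        local exceptions = {
            ["custrobot01"] = true,   -- placeholder robot names
            ["custrobot02"] = true,
        }

        if event.prid == "cdm" and exceptions[string.lower(event.robot or "")] then
            return nil   -- discard: the condition is known and will not be fixed
        end
        return event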

     

    Because customers come and go, and the "going" isn't always amicable, it is often the case that we lose access to a customer hub system but they don't disable the Nimsoft software. We also prevent the creation of cases for the known list of inactive customers even though those systems continue to send us data. Presumably maintenance mode could be used for this too, but I have never had success getting that process to work as expected. Neither could CA, so I don't think it was just me.

     

    There's also a weird case where we might have something like payroll and inventory systems using the same instance of Oracle. In that case we'll take it one step further, figure out which product the Oracle message belongs to, and further adjust the message.

     

    For probes like syslog and net_connect, the messaging reflects the robot name of the system running the probe, not the location where the actual event is occurring. We resolve the correct robot name in the preprocessing script and then let the event be stored so that it is associated with the correct system.
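
    A sketch of that re-pointing for net_connect, with the caveat that the message pattern used to pull the real target host out of the alarm text is purely illustrative and would need to match how your messages are actually worded.

        -- Sketch: re-point a net_connect alarm at the host that is actually
        -- unreachable rather than the robot running the probe.
        if event.prid == "net_connect" then
            local target = string.match(event.message or "", "host ([%w%.%-]+)")
            if target ~= nil then
                event.hostname = target
                event.source   = target   -- associate the event with the affected system
            end
        end
        return event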

     

    Because of the limitations of what you can do in preprocessing, there are a couple of things that are handled as AO profiles instead.

     

    Our product relies on network-attached storage, and in some cases a single server might have 400+ filesystem mounts. If you have a switch failure, you then get a CDM event for each unreachable filesystem and one from net_connect saying it can't connect to the storage server. So we have a fairly complicated Lua script that evaluates the outstanding CDM and net_connect alerts and figures out if there are "too many"; if so, it rolls them all into a single alert, mostly by setting the unnecessary alerts invisible and creating a meta alert that indicates something major is going on. This avoids the situation of getting 401 email pages at 3:00 AM on a Sunday morning.
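
    The counting half of that logic might look something like the sketch below. It assumes the nas Lua alarm.list(field, pattern) filter form and a made-up threshold; the actual suppression (setting individual alarms invisible and raising the meta alert) is left as comments rather than guessed at.

        -- Sketch: count outstanding cdm alarms per origin and flag an alarm storm.
        local THRESHOLD = 50   -- "too many" is a made-up number here

        local alarms = alarm.list("prid", "cdm") or {}
        local per_origin = {}
        for _, a in pairs(alarms) do
            per_origin[a.origin] = (per_origin[a.origin] or 0) + 1
        end

        for origin, count in pairs(per_origin) do
            if count > THRESHOLD then
                -- Storm for this origin: the real script would set the individual
                -- alarms invisible and raise a single meta alert along the lines of
                -- "<count> filesystem alarms for <origin>: probable storage/switch failure".
                print("alarm storm detected for origin " .. origin .. " (" .. count .. " alarms)")
            end
        end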

     

    We also introduce a problem or trouble ID into the message, either in the probe config or in the preprocessor script if the probe config doesn't allow modifying the messages generated. We use that problem ID to query the alarm history table to check if this is a repeat event. There's a different level of attention to be placed on a net_connect failure that has never been recorded before versus one that has happened 20 times in the last 24 hours. So, if it meets the criteria of "too often", the priority of the event is artificially increased and some additional text is added to the message to indicate that this is a repeat offender and needs additional attention.
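
    A sketch of that repeat-offender check, with plenty of assumptions: the problem ID is read from custom_5, the history is queried from a table called NAS_TRANSACTION_SUMMARY with a created column (names may differ in your schema), the SQL is SQL Server flavored, and the escalation simply bumps the level and annotates the message.

        -- Sketch: escalate an alarm whose problem ID has fired "too often" recently.
        local problem_id = event.custom_5
        if problem_id ~= nil and problem_id ~= "" then
            database.open("provider=nis;database=nis;user=sa;pwd=secret")   -- illustrative
            local rows = database.query(
                "SELECT COUNT(*) AS hits FROM NAS_TRANSACTION_SUMMARY " ..
                "WHERE custom_5 = '" .. problem_id .. "' " ..
                "AND created > DATEADD(hour, -24, GETDATE())")              -- SQL Server syntax
            database.close()

            local hits = (rows and rows[1] and tonumber(rows[1].hits)) or 0
            if hits >= 20 then
                event.level   = 5   -- artificially raise the priority
                event.message = event.message ..
                    " [repeat offender: " .. hits .. " occurrences in 24h]"
            end
        end
        return event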

     

    At this point we probably have an event that is usable. All of our events are sent to Salesforce because we use that CRM system for support ticketing. We use an AO with a Lua script to craft that message, and at that point of generation we look up any additional support information that might be relevant to the event based on the problem ID. This might include a lookup of KB articles, suggested troubleshooting steps, some canned steps that are traditionally always done (like df -k on a Linux system), etc.

     

    Bah - that's a lot of crafting....

     

    -Garin.



  • 6.  Re: Survey: How do we enrich alarms?

    Posted Jun 18, 2015 11:22 AM

    Oh, and it is true that some of this can be done in the existing infrastructure using triggers and whatnot, but almost everything native to the product today is aimed at individual configurations. There's no way, for instance, to create a CDM trigger that's specific to a single robot automatically. So it's easy to create a trigger to track whether a filesystem is unavailable. It is impossible to easily create a trigger that represents how many filesystems are unavailable per origin for a thousand origins, or for an arbitrary undefined set of origins. Or, to think of it like a database, there's no "group by" functionality in the product.

     

    So everything listed above is done in Lua. We preprocess in Lua, we AO in Lua, we send email in Lua, we report in Lua, etc.

     

    The only "enrichment" thing that we use canned functionality for is closing old events based on age.

     

    -Garin



  • 7.  Re: Survey: How do we enrich alarms?

    Posted Jun 18, 2015 10:14 PM

    Community, can anyone else contribute to this request for feedback? Please reply to Jim via this thread. 



  • 8.  Re: Survey: How do we enrich alarms?

    Posted Jun 19, 2015 11:48 AM

    Previously we had used alarm enrichment to populate the alarm source's OS information in a User Tag (for both robots/hubs and non-robot/non-hub systems) so that we could then use the Auto-Operator to email the alarms to the appropriate support groups, which is a basic out-of-the-box requirement that does not exist in UIM. However, this proved to be unreliable, since we experienced issues with the discovery server removing OS information from CIs in UIM.



  • 9.  Re: Survey: How do we enrich alarms?

    Posted Jun 19, 2015 12:28 PM

    I have always used the nas pre-processing rules for enriching the alarms via the custom 1-5 alarm fields or laundering the user tags. Also available today is the alarm_enrichment probe for enriching alarms before they are processed by the nas. However, the big drawback of using the alarm_enrichment probe is that you can only match on one alarm field in order to trigger enrichment.

    My experience has been that in large, diverse environments you need the ability to match on multiple alarm fields, since the systems may not be easily classified by one specific field. As an example, you may have several different applications running on one server and you want to enrich those alarm messages differently to enhance the information contained in the alarm, e.g. for Spectrum integration event models, or for a Service Desk (Remedy, CA, Service Manager) to identify the support group, etc.

    A drawback in the nas pre-processing enrichment of alarms is that you only have the ability to match one rule and there is no priority in the rule order. When creating pre-processing rules you have to be very careful about how your rules are written so that they match what you want and don't step on some other rule. The request that I hear over and over again is that the alarm enrichment process needs to include the ability to match on multiple rules, with a priority, and to be performed before nas processing and before sending on to integrated tools such as Spectrum, etc.
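
    One way to work around the single-field matching today is to do the multi-field test yourself inside a pre-processing Lua script, along these lines; the field names, patterns, and the support-group value written to custom_1 are illustrative.

        -- Sketch: match on several alarm fields at once inside one script.
        local function matches(e, criteria)
            for field, pattern in pairs(criteria) do
                if string.match(tostring(e[field] or ""), pattern) == nil then
                    return false
                end
            end
            return true
        end

        -- e.g. "Oracle tablespace alarms from payroll servers go to the payroll DBA group"
        if matches(event, { prid = "^oracle$", hostname = "^payroll", subsys = "tablespace" }) then
            event.custom_1 = "PAYROLL-DBA"
        end
        return event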



  • 10.  Re: Survey: How do we enrich alarms?

    Posted Jun 19, 2015 12:36 PM

    I agree with the summary of the limitations of the preprocessing rules. The documentation is really deficient too with regard to how they actually work. The matching limitations are fairly easy to get around. What we do is have two monolithic preprocessing scripts, production and test, and that's the only criterion matched at the nas level (all our robots are named consistently, so if the robot is named LAB* it's a test robot; otherwise it's production). Then, within the script itself, lives all the logic for what should be several sequenced preprocessing profiles (a skeleton of that layout is sketched below). It makes testing a little more difficult, but on the other hand, there's no question about which script might have made the change to your data.
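
    A skeleton of that layout, for illustration: an ordered list of rule functions applied in sequence inside the one monolithic script, so rule priority lives in the script rather than in separate pre-processing profiles. The rule bodies are placeholders.

        -- Sketch: a monolithic pre-processing script that applies its rules in order.
        local rules = {
            function (e)   -- 1: product tagging
                -- ... set e.custom_4 from probe/drive/application ...
                return e
            end,
            function (e)   -- 2: known-exception filtering
                -- ... return nil here to drop the alarm ...
                return e
            end,
            function (e)   -- 3: repeat-offender escalation
                -- ... bump e.level and annotate e.message ...
                return e
            end,
        }

        for _, rule in ipairs(rules) do
            event = rule(event)
            if event == nil then
                return nil   -- a rule decided to discard the alarm
            end
        end
        return event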

     

    -Garin



  • 11.  Re: Survey: How do we enrich alarms?

    Posted Mar 15, 2016 08:48 PM

    Jim,

     

    Could you confirm whether it's correct that you can only match on one alarm field (lookup_by_alarm_field)?

    Otherwise, could you provide an example of matching on multiple fields? We can't get past this.

     

    Kind Regards.

     

    Rob