CA Service Management


Timing of Initial SLA Events

  • 1.  Timing of Initial SLA Events

    Posted 03-19-2019 09:36 PM

    We have quite a busy system and occasionally have Attached Events firing late.

     

    We are running r12.6

     

    We already have in place the configuration changes mentioned in CA Support Document TEC434997 (giving Animator its own domsrvr process and the Anima record its own Virtual DB Agent).

     

    That document also suggests "It is beneficial to add a minimum of a 1 minute delay time to each event as opposed to anything smaller." Does anyone know what is so magical about the first minute, as surely by deferring the Attached Events they may just coincide with other tickets logged in the next minute? Is it just to allow the save of the ticket to complete, or something else?

     

     

    Thanks,

     

    Alan



  • 2.  Re: Timing of Initial SLA Events

    Posted 03-20-2019 02:04 AM

    Hello Alan,

     

    > We have quite a busy system and occasionally have Attached Events firing late.

    > We are running r12.6

     

    Consider upgrading. Depending on how busy "busy" is, and if it is warranted by the load, you may see some substantial benefits from switching to ITSM 17.2 and using the "Advanced Availability" mode. This gives each Application server its own virtual database, and its own channel to the database, resulting in much better load spreading of the SDM processes. The Animator process is still a singleton process, so it doesn't benefit from having multiple copies of itself, but it does mean that a lot of the rest of the load is moved away, which frees resources for the Animator process.

    CA Service Desk Manager Considerations - CA Service Management - 17.2 - CA Technologies Documentation 

    Besides that, it gets you an updated technology stack and you'll probably update hardware to support the new configuration.

     

    I no longer have access to that document. It's got one of our old "TEC" numbers, and I can't even see a copy in the Google Cache. Is that information about a one minute grace period still in any of the current DocOps or "KB" prefix knowledge documents? A lot of the old TECs were retired if they were not applicable to the current releases.

     

    Still, the reason for the suggestion is to avoid a race condition between the Animator entries and other housekeeping on the ticket when it is first saved. I can't recall if there is anything hardwired about how often the Animator first checks in, but there is definitely housekeeping on the ticket that needs to complete first. A specific example is the Affected End User field, which kicks off a bunch of checks for other fields on the ticket. As Animator processing often includes conditional checks on field values, those values should be present first, and issues have arisen where the Animator runs its checks before other fields have finished populating. The "one minute" recommendation is probably just a "common sense" value that someone decided on, rather than a strict programming rule that "the Animator has a one minute limit before it can attach or fire."
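
    To make the race concrete, here is a toy Python model. It is not SDM code: the Ticket class, the "priority" field, the timings and every function name are hypothetical, chosen only to show why an event condition evaluated before housekeeping completes can give the wrong answer.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    # Fields present as soon as the ticket is saved (hypothetical names).
    fields: dict = field(default_factory=dict)


def run_housekeeping(ticket):
    """Stand-in for post-save housekeeping, e.g. values derived from the Affected End User."""
    ticket.fields["priority"] = "1"


def event_condition(ticket):
    """Hypothetical event condition: only act on priority 1 tickets."""
    return ticket.fields.get("priority") == "1"


def simulate(event_delay_s, housekeeping_done_at_s):
    """Return what the condition evaluates to when the event fires."""
    ticket = Ticket({"affected_end_user": "alan"})
    # Housekeeping only counts if it finished before the event fires.
    if housekeeping_done_at_s <= event_delay_s:
        run_housekeeping(ticket)
    return event_condition(ticket)


print(simulate(event_delay_s=0, housekeeping_done_at_s=5))   # False: fired too early
print(simulate(event_delay_s=60, housekeeping_done_at_s=5))  # True: fields were ready
```

    In this model any delay longer than the housekeeping time would do; a one minute delay is simply a comfortable margin.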

     

    The genuine underlying issue is the Animator firing late, and this can be from any number of causes:

     

    * Too many Events for the system to handle.

    * Configuration not appropriate for the number of Events.

    * Hardware not appropriate for the system load.

     

    These all tie into each other of course, but often you can find that one is more of a limiting factor than the others.

     

    The Animator is often the visible sign of a performance bottleneck, simply because it is one of the most used processes on the system. It can be the workhorse of the system, and so delays are seen and felt here first. There may not be an Animator issue (although there could be), but rather a performance issue elsewhere that is having an impact.

    This is where a general performance review should come in.

     

    Here are some common things that we see with Animator.

     

    * Called too frequently.

    - Do you really need to check conditions every minute or 10 minutes, or can once an hour or once a day suffice?

    - Too many Events for the reality of the business needs.

     

    * Called when not needed.

    - Would the functionality be better served by dedicated SPL code, by changing the business process so that it does not use an Event, or by adding a Workflow or an email?

     

    * System overloaded.

    - Are there additional domsrvr/webengine pairs to handle web client load?

    - Are there secondary servers for web client load, knowledge, attachments, Web Services etc?

    - Do other SDM processes need their own agents?

    - Do the pdm_vdbinfo and dbagent commands reveal system stress?

    - Is hardware sufficient for needs? (One CPU per domsrvr/webengine pair, for example).

    - Where is the bottleneck? SDM process, database, network, CPU, memory, SQL query format etc.

     

    * Are the Events efficient?

    - Custom code and Events/Macros can slow processing if there are faults.

     

    * Is the database overloaded?

    - A busy system on SDM 12.6 may have accumulated a large number of entries in tables like session_log, not_log_header, call_req etc. that simply aren't needed and that draw unnecessary resources from the virtual database. Archive and Purge can free this up.
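
    To put the "Called too frequently" point above into numbers, here is a rough back-of-the-envelope sketch in Python. The figure of 5000 open tickets is made up for illustration, and the function assumes one repeating event per ticket and a repeat interval that divides evenly into a day.

```python
MINUTES_PER_DAY = 24 * 60


def firings_per_day(open_tickets, repeat_minutes):
    """Event evaluations per day if every open ticket carries one repeating event."""
    return open_tickets * (MINUTES_PER_DAY // repeat_minutes)


# Hypothetical load: 5000 open tickets.
for repeat in (1, 10, 60, MINUTES_PER_DAY):
    print(f"repeat every {repeat:>4} min -> {firings_per_day(5000, repeat):>9,} firings/day")
```

    Moving from a one minute repeat to an hourly one cuts the number of evaluations by a factor of 60 in this model, which is why the repeat interval is usually the first thing worth questioning.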

     

     

    Really, a good review of the system is the only way to understand what is actually going on with performance delays. You'll also find guides to tuning performance in the DocOps for SDM. But any system from that time is likely to have outgrown its original planned size. What was a good system setup then may not match what is asked of it now. Or there may have been setup choices made then which were okay when the system was small, but which become an issue as the system gets larger, such as Tomcat memory allocation, number of DB agents, monitor_joins etc.

     

    Thanks, Kyle_R.



  • 3.  Re: Timing of Initial SLA Events

    Posted 03-20-2019 09:12 AM

    Adding my 2 cents here...

    The doc also mentions "Be mindful of potential impact to performance and Event processing when specifying a "Repeat Delay Time". Be sure this is not too frequent". This is something you may want to take a look at. Having many repeating events will certainly overload the app, and the records pushed back to the Anima table will make the situation worse. Also make sure the Anima table has its own DB agent; that is, in the NX.env file there should be an entry like this:

    @NX_VIRTDB_AGENTxx=Anima

    where xx is an integer (the agent number).
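
    If it helps, a quick way to confirm the entry is present is to scan the NX.env text for it. The Python sketch below is only an illustration: the regular expression is inferred from the example line above, the case-insensitive comparison is an assumption, and NX_SOME_OTHER_SETTING in the sample is a made-up placeholder rather than a real option.

```python
import re

# Matches lines such as "@NX_VIRTDB_AGENT01=Anima".
# The pattern is an assumption based on the example entry, not on SDM documentation.
AGENT_LINE = re.compile(r"^@NX_VIRTDB_AGENT(\d+)=(\S+)$")


def dedicated_agents(nx_env_text):
    """Return {agent_number: table_name} for every dedicated-agent line found."""
    agents = {}
    for line in nx_env_text.splitlines():
        match = AGENT_LINE.match(line.strip())
        if match:
            agents[int(match.group(1))] = match.group(2)
    return agents


def anima_has_own_agent(nx_env_text):
    """True if any dedicated-agent line names the Anima table (case-insensitive)."""
    return any(name.lower() == "anima" for name in dedicated_agents(nx_env_text).values())


# Hypothetical file content for demonstration only.
sample = """\
@NX_SOME_OTHER_SETTING=1
@NX_VIRTDB_AGENT01=Anima
"""
print(dedicated_agents(sample))
print(anima_has_own_agent(sample))
```

    In practice you would read the real NX.env from your SDM installation and pass its contents to anima_has_own_agent.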