Idea Details

Fail Over - Protect the MoM, the Collectors can deal

Last activity 06-13-2019 10:04 AM
Billy Cole
02-17-2015 01:08 PM

Problem Space:

With the agents set up to fail over to a new Collector (with the help of the MoM) when a Collector fails, the question becomes how to fail over the MoM itself.  Without the MoM, end users cannot get to the metrics that might help them determine how deep and how far-reaching an outage or critical failure has become.

 

For those with CEM, fail-over for the APM database (postgresql) would also be helpful.

 

Problem Description

There are two base failure cases, internal and external.

In the internal failure case, the MoM has stopped functioning and end users are unable to start their Workstations or log into WebView.  By the time they have called the CA APM admin, the admin has remoted into the network to find out that, why yes, the MoM is down, and has tried CPR on the MoM (Clue, Problem, Restart), the critical event is either (hopefully) over, so the data is no longer useful, or it is the very thing preventing the MoM from functioning.  An active, automatic failover MoM would hopefully cover 80% of the internal case.

 

In the external case, the OS/server hosting the MoM decides that 3:00 am is a good time to wake up your CA APM admin and tell him/her how much you adore him/her, and could they please fix what is broken.  Here, a failover MoM on a different host, and not just a watchdog process that restarts the MoM at the first sign of problems, might help prevent the adoring fans from calling your CA APM admin's estate and waking up the missus.

 

This topic was covered in the 11/2013 webcast, but I haven't seen much on the topic since then.

 

References:

November 2013 - Webcast Replay - High Availability


Comments

03-27-2015 07:06 AM

Thanks, Haruhiko, for your suggestion.  I posted another message, referenced below, to find out about the cluster solution for PostgreSQL.

 

But what about CA APM DB?  Failover with Postgres

 

The PostgreSQL clustering solution is not tested, certified or supported by CA so it is not an option.

02-27-2015 07:10 AM

I fully agree with Fred.K that we, as the customer, require a solid and supported out-of-the-box solution.  Continuing with the jiffy-bake-your-own approach is not only insulting but also far more difficult to correct within your customers' environments than having the vendor repair the fault within the enterprise software.

 

WebView should be aware of the supported MoM failover strategies and behave accordingly, instead of turning the already overburdened CA APM administrators into a CA development and support team.

02-26-2015 03:15 PM

Large enterprises have, over time, ended up with multiple teams (some outsourced) and a separation of duties that makes what was a simple task for one person with access quasi-impossible without substantial effort.

It is much simpler for the WebView designer to provide a proper down status when the backend is down (status 5xx). What I'm proposing is a simple feature to make WV more useful now.

In theory, I would argue that WV1 should connect to MoM x when it knows that MoM y is down.

I am pretty sure that cloudification will happen and that you will have as many Collectors, MoMs, and WVs as needed, wherever needed ... software driven

02-26-2015 02:27 PM

Depends on how robust you want it. A lot of companies want a solution they can let their network teams manage so my solution would work for those folks.

I worked for a company that had three datacenters in different states where all applications were load balanced based upon the customer's IP address.

 

A vendor often is not going to care what is happening in front of their app, and why should they? Customers can still get to the app by always using the same address.

A DNS alias is easy and doesn't require a lot of maintenance.

02-26-2015 02:18 PM

You would have to implement the LB yourself instead of letting Apache do it out of the box == adds another point of failure, extra cost to write and maintain, the next release might not conform to the past log format ... and it is not vendor supported.

02-26-2015 12:47 PM

Could you write a healthcheck script to read the WV log? I did something like that in another lifetime with a different product.

If a particular condition occurred, that indicated to the LB that we should send traffic to the next server.
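A minimal sketch of the kind of log-based health check described above, written in Python. The marker strings and the idea that WebView logs a recognizable line when it loses or regains the MoM are assumptions for illustration only; the real log wording would have to be checked against your own WV log.

```python
from pathlib import Path

# Hypothetical marker strings: replace with whatever your WebView log
# actually prints when it loses or regains its MoM connection.
DOWN_MARKERS = ("Lost connection to Enterprise Manager",)
UP_MARKERS = ("Connected to Enterprise Manager",)


def mom_is_up(log_lines):
    """Return False if the most recent marker seen in the log is a 'down' marker."""
    state = True  # with no markers at all, assume the connection is fine
    for line in log_lines:
        if any(marker in line for marker in DOWN_MARKERS):
            state = False
        elif any(marker in line for marker in UP_MARKERS):
            state = True
    return state


def check(log_path):
    """Exit-code style result for a load balancer probe: 0 = healthy, 1 = unhealthy."""
    lines = Path(log_path).read_text(errors="replace").splitlines()
    return 0 if mom_is_up(lines) else 1
```

A load balancer that supports external health-check scripts could call `check()` on the WV host and stop sending traffic there when it returns 1.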

02-26-2015 11:31 AM

When the MoM is "inactive", WebView is happy to make you log in and show you a blank screen or an empty Investigator with HTTP status 200, which means all is well == which clearly it is not.

I am not fond of having the LB log in to the application (good user access management is not cheap) and look for a missing element - seems overkill for a simple up/down

02-26-2015 10:04 AM

Fred.K

 

Instead of relying on WV to report that the MOM/EM is not available, why not use a DNS alias/LB in front of your MOM and point WV at that? Or did I misunderstand your configuration?

02-26-2015 09:58 AM

The MoM in our case is quite critical, as it is the manager sending out the alerts. So failover is a topic of great interest to me - hence my vote 'up'.

Before v9.5 we also needed the MoM for the agents to reconnect, but that is no longer as critical because the agents have cached the Collectors.

Starting with v9.6 the failover between two active MoMs in two datacenters (w/o lockfile) appears to work.  A few more basic things are needed:

- we use an Apache reverse proxy in front of WebView - what I need now is for WebView to return a proper failed status (500?) when the MoM is not active, so we can load balance the WebViews.

- sync the key configurations between the two moms (dynamic domains.xml in v9.8 will be cool :-)

  (a long term solution is of course to manage the managers config from the APM Control Center:-)

- maybe a failback when the primary MoM is back.

- an alert that MoM failed over.

     - what I am not sure of is whether, when the MoM wakes up, it also re-evaluates all the alerts (like a restart) and resends them all over again
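Until WebView returns a failed status by itself, one workaround for the first item is a small side-car health endpoint that the reverse proxy can probe. The Python sketch below is only an illustration under stated assumptions: the host name `mom.example.com`, the probe port 8099, and the use of 5001 as the Enterprise Manager port are placeholders, and it answers 503 rather than 500 because that is the usual "backend unavailable" code. An open TCP port also does not prove the MoM is the active one, so treat this as a starting point.

```python
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

MOM_HOST = "mom.example.com"  # hypothetical host name
MOM_PORT = 5001               # assumed Enterprise Manager port; adjust to your setup


def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def status_for(mom_reachable):
    """Map MoM reachability to the HTTP status the load balancer should see."""
    return 200 if mom_reachable else 503


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        code = status_for(tcp_reachable(MOM_HOST, MOM_PORT))
        body = b"MoM reachable\n" if code == 200 else b"MoM unreachable\n"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Hypothetical probe port on the WebView host.
    HTTPServer(("0.0.0.0", 8099), HealthHandler).serve_forever()
```

The reverse proxy would then health-check port 8099 on each WebView host and drop a WebView from rotation whenever its probe returns 503.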

02-26-2015 08:13 AM

Sorry to say, yes, I've been called early Sunday mornings, on Saturday night because someone couldn't get to the monitoring system, and once at 11 pm on a Thursday because there was an alert that was red but they didn't get an email.  Most of the time it is to help figure out what is going on or to get things back online, so it might not be the same SLA as production, but we (the APM admins) are on a 24/7 on-call rotation.

 

The 11/2013 webcast that discusses this topic pointed out that there is no automatic MOM failover, and I wanted to see if there was an update on this.

November 2013 - Webcast Replay - High Availability

Slide 12 of this presentation listed "MOM Something much better, someday" as the alternative to the MOM shared file system or the MOM lock file.

 

Automatic MOM failover was also covered on slide 19 of 20, which lists "MOM (automatic) Doesn't exist, 5-20 minutes loss of alerting, workstation connectivity, additional failure of the file system possible."  With these options, losing the monitoring system for 5 to 20 minutes isn't really a failover that keeps the monitoring system accessible and usable during a critical event.

 

Also, within the failover cycle, what is the behavior of alerts/actions whose threshold periods are shorter than the 20 to 80 periods (at 15 seconds per period) that such an outage spans?  The documentation describes a polling cycle for the secondary (introscope.enterprisemanager.failover.interval) that defaults to 2 minutes.  What are the drawbacks of setting this to a minute?  How does the secondary check the primary for control?
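For anyone checking the arithmetic behind the "20 to 80 periods" figure, here is a quick Python calculation. It only converts outage minutes into 15-second metric periods; the function name is made up for this post.

```python
PERIOD_SECONDS = 15  # one metric period, as stated above


def periods_lost(outage_minutes, period_seconds=PERIOD_SECONDS):
    """Number of whole metric periods that fall inside an outage."""
    return outage_minutes * 60 // period_seconds


print(periods_lost(5))   # 20 periods for a 5-minute outage
print(periods_lost(20))  # 80 periods for a 20-minute outage
print(periods_lost(2))   # 8 periods for the default 2-minute polling interval
```

So even the default polling interval alone covers 8 periods, which is longer than many alert thresholds.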

 

And I just love the "It is up to you to restart the secondary Enterprise Manager."  So now we need to put a watchdog process in place to do this.  Since we are hosting our EMs on SUSE Linux, does CA have watchdog shell scripts that will watch for this and restart the secondary EM, or is that also up to the Unix admins to figure out and support?  In the docs there is a Windows example but not a Linux example.
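To make the question concrete, this is roughly what a home-grown Linux watchdog would look like, sketched in Python rather than shell. Nothing here comes from CA: the process pattern, the install path, and the start command are assumptions, and a supported script might look quite different.

```python
import subprocess
import time

# Hypothetical values: adjust to however the secondary EM shows up in `ps`
# and however it is started on your hosts.
PROCESS_PATTERN = "Introscope_Enterprise_Manager"
START_COMMAND = ["/opt/introscope/bin/EMCtrl.sh", "start"]


def is_running(pattern=PROCESS_PATTERN):
    """pgrep -f exits with 0 when at least one process command line matches."""
    result = subprocess.run(["pgrep", "-f", pattern], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def start_em():
    """Run the (assumed) start command for the secondary EM."""
    subprocess.run(START_COMMAND, check=False)


def watch(check=is_running, start=start_em, sleep=time.sleep,
          interval=30, max_cycles=None):
    """Poll `check`; call `start` whenever it reports the EM is down.

    Returns the number of restarts attempted (only when max_cycles is set).
    """
    restarts = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        if not check():
            start()
            restarts += 1
        sleep(interval)
        cycles += 1
    return restarts


if __name__ == "__main__":
    watch()
```

The check and start functions are passed in as parameters so the loop can be exercised without a real EM. Even so, this only restarts a process on the same host, which is exactly the limitation raised in the original idea.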

 

On the DB failover for PostgreSQL, what has CA tested and certified, and what will it support when things fail?  Without support, it isn't an option.

02-26-2015 01:41 AM

Hi,

 

  • Agents fail over between Collectors
  • Automatic MOM failover has been supported in the product for a long time (see Use Enterprise Manager Failover in the Config and Admin Guide)
  • DB failover is supported by both Oracle and PostgreSQL

What is missing?

BTW, do you really have the same SLAs for APM as for your production systems, i.e. waking the APM Admin at 3am? Wow!

02-17-2015 01:18 PM

Here is the vendor supported clustering solution from PostgreSQL: PostgreSQL: Documentation: 9.2: Creating a Database Cluster