Automic Workload Automation

  • 1.  Automic DR creation checklist

    Posted Jun 23, 2020 03:32 PM
    Hi All,

    Thanks for taking time to read this. 

    We are planning to create a disaster recovery (DR) option for the Automic application servers (Windows), the database, and the AWI. Our DB team and our Windows team (which manages the application servers and the AWI) are setting up a DR site at a different location.

    From the application point of view, what steps do we need to follow before and after these servers are built, essentially to check their functionality?

    1. Do we need to disable, stop, or check anything in the AWI or on the application servers while they migrate the services to the new DR site?
    2. Is there a specific set of tasks we should follow to check functionality from the AWI once the new DR servers are in place? For example, running a test job or checking that the agents are up.

    Can someone please shed some light on this? We are only used to job and schedule management, and this is something new for us.

    Any help would be highly appreciated. Thanks.

    ------------------------------
    Regards,
    Gokul Krishnan
    ------------------------------


  • 2.  RE: Automic DR creation checklist

    Posted Jun 24, 2020 03:47 AM
    Edited by Carsten Schmitz Jun 24, 2020 03:53 AM
    Hi.

    I believe you don't need anything beyond what you'd need for a regular clustered installation, just that one cluster member (or a bunch of them) is at a remote location.

    So the engine runs in two places (all nodes run WP, CP, and the Java stuff as in a regular cluster; only your primary site runs the PWP), and the AWI is installed in both locations, with traffic distributed to it by a load balancer that prioritizes your primary site. For the database, you are free to replicate it to both sites as you see fit, with the options provided by your DB vendor.

    I'd have just one concern though:

    Since the cluster with standby nodes went away (the so-called "non-stop" or "active-passive" cluster), the only available model is the active-active cluster, where your standby site works as part of the engine all the time as well. To my surprise, I have not actually heard of anyone operating this partly on a truly remote site. If you have a bad ping between your sites, this may seriously degrade engine performance and possibly cause timing bugs to surface as well. You'd need to test that and see how your particular setup affects engine performance.
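
    As a rough way to put a number on the "ping" between the sites, here is a minimal Python sketch that times TCP connects to a server at the other site. It is only an illustration: the host name and port in the usage comment are hypothetical, and connect time is merely a stand-in for round-trip latency, not a measure of engine performance.

```python
import socket
import statistics
import time


def measure_connect_latency(host, port, samples=5, timeout=2.0):
    """Return TCP connect times in milliseconds for `samples` attempts."""
    latencies_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        # A completed TCP handshake takes roughly one network round trip.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Condense raw samples into min / median / max for easy comparison."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }


# Example usage (hypothetical host and port; use a DR-site server and a
# port that is actually listening there):
#   print(summarize(measure_connect_latency("dr-engine.example.com", 1521)))
```

    Running it from a primary-site engine server against the DR site, and again against a local server, gives two summaries to compare before deciding whether the remote nodes can stay active.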

    If performance is not satisfactory, you'd likely need to stop the processes at the remote site (e.g. by stopping Service Manager on the engine servers as a whole), and only start your DR site's engine back up when you're actually moving to the DR site exclusively. You'd then need sufficient hardware power on either side to run your business on it.

    Hth,




  • 3.  RE: Automic DR creation checklist

    Posted Jun 30, 2020 02:50 PM
    Thanks a lot Carsten for the details.

    Our team is currently planning the DR creation, and it is likely to be tested once the components have been created.

    From the Automic user perspective, they would need step-by-step action items before and after the activity in order to fail over and test it.

    These are the points we have added so far:

    1. The best time for the activity has been shared: it is when the fewest jobs are being executed in Automic.
    2. Before the start of the activity, we will disable all working queues from the Automic AWI console.
    3. We will log in to the Automic web server and application servers and shut down the UC4 services.
    4. Once the activity is over, we will restart the UC4 services and enable the queues for processing.
    5. A test workflow and several test jobs will be created and manually submitted to verify that we get the desired results.
    6. We will monitor the performance of the Automic engine for 24 hours and provide feedback.

    Are there any prerequisites or critical steps that we need to perform before and after the activity?
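
    For illustration only, the numbered points above could be tracked as an ordered runbook that stops at the first failed step. This is a Python sketch under assumptions: the `Step` type, `run_checklist`, and the `placeholder` action are all hypothetical names, and each placeholder would be replaced by the real manual confirmation or scripted action.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_checklist(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; after the first failure, mark the rest skipped."""
    results = []
    failed = False
    for step in steps:
        if failed:
            results.append((step.name, "skipped"))
            continue
        ok = step.action()
        results.append((step.name, "ok" if ok else "failed"))
        failed = not ok
    return results


def placeholder() -> bool:
    # Hypothetical stand-in: replace with the real manual or scripted action.
    return True


CHECKLIST = [
    Step("confirm low-load time window", placeholder),
    Step("disable working queues in AWI", placeholder),
    Step("shut down UC4 services", placeholder),
    Step("restart UC4 services and enable queues", placeholder),
    Step("submit test workflow and test jobs", placeholder),
    Step("monitor engine performance for 24 hours", placeholder),
]
```

    Calling `run_checklist(CHECKLIST)` returns one `(name, status)` pair per step, which can double as the activity log.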

    Thanks for the help.


  • 4.  RE: Automic DR creation checklist

    Posted Jul 01, 2020 04:53 AM
    Hi,

    Sounds mostly good to me - I assume what you mean is shutting down the main site, but then letting the DR site take over for a while between steps 2 and 4, including starting the queues at the DR site for a test, right? I mean - otherwise, what would be the point of the test, methinks ...

    Three small notes:

    #1 we have occasionally seen scenarios where organizational mechanisms (such as queues) didn't end all jobs as desired by us or didn't prevent them from starting - either because jobs were stuck, long-running, or because of supposed bugs. So you may want to double-check that no jobs of truly high importance are running before shutting down either site. While jobs on most Automic agents should continue even without an engine as disowned OS processes, and report back to the engine when it (i.e. the engine) comes back, you never know.
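
    A minimal Python sketch of such a double-check, assuming you have already obtained a list of currently active tasks by whatever means you normally use. The dictionary fields `name` and `status`, the status value `"active"`, and the job names are hypothetical and not an Automic API:

```python
def blocking_tasks(active_tasks, critical_names):
    """Return the tasks that are still active and on the critical list."""
    critical = set(critical_names)
    return [
        task
        for task in active_tasks
        if task["status"] == "active" and task["name"] in critical
    ]


# Hypothetical sample data for illustration.
tasks = [
    {"name": "JOBS.PAYROLL.RUN", "status": "active"},
    {"name": "JOBS.CLEANUP.TMP", "status": "active"},
    {"name": "JOBS.BILLING.EXPORT", "status": "ended"},
]
still_running = blocking_tasks(tasks, ["JOBS.PAYROLL.RUN", "JOBS.BILLING.EXPORT"])
# Only shut down once this list is empty.
```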

    #2 also think about how you want to deal with attached systems, at least the important ones. The point of a DR setup (and DR test) with a remote site is usually to be operational when your main site gets flooded / set on fire / invaded by Ninjas at an inconvenient moment. How about your agent systems, do they get mirrored at the DR site as well? If not, you might have an engine with no agents and nothing to do when disaster strikes. If so - well, then a full DR test suddenly gets more complicated.

    #3 think about how you replicate your database. It's part of the engine for all intents and purposes. If your engine database is on site, and the site is on fire, the db needs to have been replicated to the DR site. Depending on your DBMS and vendor, it needs to be brought up, or it may need time to switch. This should also be tested in a full DR test. And never ever have Automic servers of the same engine connect against databases in different or inconsistent states if at all avoidable (like, using an old snapshot at the DR site instead of the actual, current, fully replicated DB). It might work but it might just as well spell havoc, so it's best not to.
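
    To illustrate the "no stale snapshot" rule, here is a small Python sketch of a guard you might run before starting the DR engine during a planned test, while the primary is still reachable. It assumes you can obtain the primary's last commit time and the replica's last applied time from your DBMS; how to get those values is vendor-specific, and the function name and the five-second threshold are hypothetical.

```python
from datetime import datetime, timedelta


def replica_is_current(primary_last_commit, replica_last_applied,
                       max_lag=timedelta(seconds=5)):
    """True when the replica has applied changes to within max_lag of the primary."""
    lag = primary_last_commit - replica_last_applied
    return lag <= max_lag


# Hypothetical timestamps for illustration.
primary = datetime(2020, 7, 1, 4, 0, 0)
fresh_replica = datetime(2020, 7, 1, 3, 59, 58)
old_snapshot = datetime(2020, 6, 30, 22, 0, 0)
```

    Here `replica_is_current(primary, fresh_replica)` passes, while `replica_is_current(primary, old_snapshot)` does not, which is the case where the DR engine should stay down.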

    Also, of course a most realistic DR test involves yanking out the main server's power cord during the time of heaviest load with no stopping of queues or other preparation - train as you fight, fight as you train, as they say - but maybe don't do that too often / without a good backup / without ample warning :)

    Best,
    Carsten