Automic Workload Automation


AE windows server patching: routine maintenance issues

  • 1.  AE windows server patching: routine maintenance issues

    Posted Jan 31, 2020 05:50 PM
    Edited by Pete Wirfs Jan 31, 2020 05:50 PM
    We are running AE V12.3.0 on Windows servers, with SQL Server on the same servers.  We patch the OS on these servers as often as monthly, which typically requires a reboot, so we schedule a weekend outage, pause operations, and stop the client beforehand.

    However, after the AE comes back up, some of the agents sometimes will not connect, even though ServiceManagerDialog shows them as running.  The first time this happened I contacted support, and they instructed me to delete an orphaned communication row from the mqsrv table with this statement:

         delete from mqsrv where mqsrv_name = '<agentname>'

    We have confirmed that deleting this orphaned row allows the agent to connect.  Support also recommended that we do a better job of shutting down the AE prior to maintenance.  But this suggestion ignores the fact that the problem would be unavoidable after an accidental reboot.  It has also bitten us during DR testing, when we restore the server from a point in time at which it was active.
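    For anyone scripting this workaround, here is a minimal sketch of running the delete with a bound parameter instead of a hand-quoted agent name. The table name mqsrv and the column mqsrv_name come from the statement above; everything else is an assumption. The demo uses an in-memory SQLite table with a single column as a stand-in for the real AE database, and the agent name WIN01 is made up. Against SQL Server you would pass a connection from a DB-API driver that also uses "?" placeholders (pyodbc does).

```python
import sqlite3


def delete_orphaned_mqsrv_row(conn, agent_name):
    """Delete the mqsrv row for one agent; return the number of rows removed."""
    cur = conn.cursor()
    # Bind the agent name as a parameter so no manual quoting is needed.
    cur.execute("DELETE FROM mqsrv WHERE mqsrv_name = ?", (agent_name,))
    deleted = cur.rowcount
    conn.commit()
    return deleted


if __name__ == "__main__":
    # Stand-in table: the real mqsrv table has more columns than this.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE mqsrv (mqsrv_name TEXT)")
    demo.executemany("INSERT INTO mqsrv VALUES (?)", [("MSSQL",), ("WIN01",)])
    print(delete_orphaned_mqsrv_row(demo, "MSSQL"))  # rows deleted for MSSQL
```

    Returning the row count makes it easy to log whether an orphaned row was actually present before restarting the agent.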

    WHY HAS THIS CHANGED?
    This was never a problem for us under V11.  So in our view, V12 broke something.

    REAL BUSINESS IMPACT
    In our view, V11 was more resilient to accidental reboots than V12.  (Sadly, we've had two accidental reboots in the last five years when our machine room lost power, so this is not hypothetical for us.)

    WHAT ABOUT YOUR STORY?
    I'm curious to know if anyone else has encountered this as a new-to-V12 issue?

    I'm not ruling out that it could be unique to our installation.  Not everyone uses Windows, not everyone uses SQL Server, and not everyone installs both on the same server.

    ------------------------------
    Pete
    ------------------------------


  • 2.  RE: AE windows server patching: routine maintenance issues

    Posted Jan 31, 2020 05:59 PM
    Edited by Pete Wirfs Jan 31, 2020 06:09 PM
    Well, I see I posted about this same problem back in November 2019 too:

    https://community.broadcom.com/enterprisesoftware/communities/community-home/digestviewer/viewthread?MessageKey=d83f936a-9cf3-4aee-a98d-8138a827fdf8&CommunityKey=2e1b01c9-f310-4635-829f-aead2f6587c4&tab=digestviewer#bmd83f936a-9cf3-4aee-a98d-8138a827fdf8

    One of the suggestions there is to start the AE with a cold start.  We have not tried that.  But what are the negative impacts of doing so?  Is anything else truncated besides this communication table?

    ------------------------------
    Pete
    ------------------------------



  • 3.  RE: AE windows server patching: routine maintenance issues

    Posted Feb 10, 2020 04:22 AM
    > We have not tried that.  But what are the negative impacts of doing so?

    We had a different failure where the PWP would hang on server reboot and would only come up eventually after being killed and relaunched repeatedly. We also solved this with a cold start (or at least we believe that did the trick once, and now the issue has returned; since the associated Support ticket has been open for many weeks, we're just speculating here).

    So I asked in a recent thread what a cold start means for actual operation, but I am still not fully satisfied. Beyond technical descriptions about purging certain tables, and some input to the effect that tasks in "preparing" may be reset, there still does not seem to be a full list of the effects it has on active objects in their various states. I'd really like to see Automic provide this.

    Until then, the risk of something screwing up over a cold start is probably plainly on the client, since the manual says that a cold start is only ever to be attempted if directed to do so by Automic Support.

    Br,


  • 4.  RE: AE windows server patching: routine maintenance issues
    Best Answer

    Posted Feb 03, 2020 05:06 AM
    Hi Pete, 

    I think my post 

    https://community.broadcom.com/enterprisesoftware/communities/community-home/digestviewer/viewthread?MessageKey=24b5dcbe-1091-4792-b868-6c04db16d4fa&CommunityKey=2e1b01c9-f310-4635-829f-aead2f6587c4&tab=digestviewer#bm24b5dcbe-1091-4792-b868-6c04db16d4fa

    relates to the exact same problem you have. Some of my agents couldn't reconnect to the system, even though they were active. I got them reconnected by shutting down all CPs (the JCP as well) except CP2.

    "20200113/063349.431 - U00003366 Connection to agent 'MSSQL' already exists (old connection '*CP002#00000007', new connection '*CP001#00000430')"

    I think this issue really needs to be fixed, because running a delete statement just to get an agent reconnected shouldn't be a "normal" procedure. But thanks for the delete statement anyway!
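    The U00003366 message quoted above has a regular shape, so affected agents can be picked out of a server log automatically. The following is only a sketch: the pattern is derived from that single quoted line, and other AE versions or languages may word the message differently.

```python
import re

# Pattern built from the one U00003366 line quoted in this thread.
DUPLICATE_CONNECTION = re.compile(
    r"U00003366 Connection to agent '(?P<agent>[^']+)' already exists "
    r"\(old connection '(?P<old>[^']+)', new connection '(?P<new>[^']+)'\)"
)


def find_duplicate_connections(lines):
    """Return (agent, old_connection, new_connection) for each matching log line."""
    hits = []
    for line in lines:
        match = DUPLICATE_CONNECTION.search(line)
        if match:
            hits.append((match.group("agent"), match.group("old"), match.group("new")))
    return hits


if __name__ == "__main__":
    sample = [
        "20200113/063349.431 - U00003366 Connection to agent 'MSSQL' already exists "
        "(old connection '*CP002#00000007', new connection '*CP001#00000430')"
    ]
    for agent, old, new in find_duplicate_connections(sample):
        print(f"{agent}: stale connection {old} is blocking {new}")
```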

    Cheers
    Christoph 







  • 5.  RE: AE windows server patching: routine maintenance issues

    Posted Feb 10, 2020 04:13 AM
    > I'm curious to know if anyone else has encountered this as a new-to-V12 issue?

    I have yet to see this, but I am adding a +1.

    This is of concern to us because, like many shops, we reboot our servers for automated updates (for us, that's once every month). There isn't anyone available at night to manually coax the Engine into shutting down cleanly; it will basically have to honor the shutdown signal and then come up clean afterwards.

    Br,


  • 6.  RE: AE windows server patching: routine maintenance issues

    Posted Feb 14, 2020 11:00 AM
    I've just encountered this same error.  Our v12.3.1.HF1 POC system was rebooted for Windows patching.  The AE app server and its associated agent returned as operational after the reboot.  One agent box did not.  I ran the command provided by Pete (THANKS!) and it resumed operation.

    Yes, this is definitely bothersome and needs to be addressed quickly.


  • 7.  RE: AE windows server patching: routine maintenance issues

    Posted Feb 14, 2020 11:29 AM
    Our most recent patch cycle did not require any human intervention.   I suspect this problem depends upon what activities are in flight at the time of the reboot. 

    Because of this problem, we now run a script periodically to validate that our agents are connected.
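    The core of such a check can be sketched as follows. This is not our actual script, just an illustration: the agent names are made up, and how you obtain the list of currently connected agents (a database query, a report, or something else) depends on your site and is deliberately left out.

```python
import sys


def missing_agents(expected, connected):
    """Return the expected agents that are not currently connected, sorted by name."""
    return sorted(set(expected) - set(connected))


if __name__ == "__main__":
    # Hypothetical data: replace both lists with values from your own environment.
    expected = ["MSSQL", "WIN01", "UNIX01"]
    connected = ["WIN01", "UNIX01"]

    missing = missing_agents(expected, connected)
    if missing:
        print("Agents not connected: " + ", ".join(missing))
        sys.exit(1)  # non-zero exit so a scheduler or monitor can raise an alert
    print("All expected agents are connected.")
```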

    ------------------------------
    Pete
    ------------------------------