Automic Workload Automation


AE windows server patching: routine maintenance issues

  • 1.  AE windows server patching: routine maintenance issues

    Posted 24 days ago
    Edited by Pete Wirfs 24 days ago
    We are running AE V12.3.0 on Windows servers, with SQL Server on the same servers.  We patch the OS on these servers as often as every month, which typically requires a reboot, so we schedule a weekend outage to pause operations and stop the client for this.

    However, after the AE comes back up, sometimes some of the agents will not connect, even though they show as running in ServiceManagerDialog.  The first time this happened I contacted support, and they instructed me to delete an orphaned communication row from the mqsrv table with this statement:

         delete from mqsrv where mqsrv_name='<agentname>'

    We have confirmed that deleting this orphaned row allows the agent to connect.  Support also recommended we should do a better job of shutting down the AE prior to maintenance.  But this suggestion ignores the fact this problem would be unavoidable if one suffered an accidental reboot.  We've also had it bite us during DR testing when we restore the server from an active point in time.
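
    The cleanup described above can be sketched as a small script. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the real SQL Server database, the table is reduced to the single mqsrv_name column mentioned above, and the function name delete_orphaned_row and the agent names are hypothetical. It checks that a row exists before deleting and passes the agent name as a bound parameter rather than pasting it into the SQL text.

```python
import sqlite3


def delete_orphaned_row(conn, agent_name):
    """Delete the mqsrv row(s) for one agent and return how many were removed."""
    cur = conn.cursor()
    # Look first, so a mistyped agent name is reported as "nothing to delete".
    cur.execute("SELECT COUNT(*) FROM mqsrv WHERE mqsrv_name = ?", (agent_name,))
    if cur.fetchone()[0] == 0:
        return 0
    # Bound parameter instead of string concatenation.
    cur.execute("DELETE FROM mqsrv WHERE mqsrv_name = ?", (agent_name,))
    conn.commit()
    return cur.rowcount


def demo():
    # Stand-in database: the real table lives in the AE database and has
    # more columns than the one assumed here.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mqsrv (mqsrv_name TEXT)")
    conn.executemany("INSERT INTO mqsrv VALUES (?)", [("AGENT01",), ("AGENT02",)])
    removed = delete_orphaned_row(conn, "AGENT01")
    remaining = [row[0] for row in conn.execute("SELECT mqsrv_name FROM mqsrv")]
    return removed, remaining


if __name__ == "__main__":
    print(demo())
```

    Against a production database the connection would come from a SQL Server driver instead of sqlite3, and the statement should only be run as directed by support.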

    WHY HAS THIS CHANGED?
    This was never a problem for us under V11.  So in our view, V12 broke something.

    REAL BUSINESS IMPACT
    From our view, V11 was more resilient regarding accidental reboots than V12.  (Sadly, we've had 2 accidental reboots in the last 5 years when our machine room lost power, so this is not a hypothetical for us.)

    WHAT ABOUT YOUR STORY?
    I'm curious to know if anyone else has encountered this as a new-to-V12 issue?

    I'm not ruling out that it could be unique to how we are installed....  Not everyone uses Windows, not everyone uses SQLServer, and not everyone installs both of them onto the same server.

    ------------------------------
    Pete
    ------------------------------


  • 2.  RE: AE windows server patching: routine maintenance issues

    Posted 24 days ago
    Edited by Pete Wirfs 24 days ago
    Well, I see I posted this same problem back in November 2019 too:

    https://community.broadcom.com/enterprisesoftware/communities/community-home/digestviewer/viewthread?MessageKey=d83f936a-9cf3-4aee-a98d-8138a827fdf8&CommunityKey=2e1b01c9-f310-4635-829f-aead2f6587c4&tab=digestviewer#bmd83f936a-9cf3-4aee-a98d-8138a827fdf8

    One of the suggestions is that AE should be started with cold-start.  We have not tried that.  But what are the negative impacts of doing so?  Are other things truncated as well as this communication table?

    ------------------------------
    Pete
    ------------------------------



  • 3.  RE: AE windows server patching: routine maintenance issues

    Posted 14 days ago
    > ​We have not tried that.  But what are the negative impacts of doing so?

    We had a different failure where the PWP would hang on server reboot and would only come up eventually after being killed and relaunched repeatedly. We also solved this with a cold start (or at least we believe that did the trick once; the issue has now returned, and since the associated Support ticket has been open for many weeks, we're just speculating here).

    So I inquired into what cold start means for the actual operation in a recent thread, but I am still not fully satisfied. Beyond technical descriptions about purging certain tables and some input to the effect of stuff in "preparing" being possibly reset, there still does not seem to exist a full list of effects this has on active objects in the various states. I'd really like to see Automic provide this.

    Until then, the risk of something screwing up over a cold start is probably plainly on the client, since the manual says that a cold start is only ever to be attempted if directed to do so by Automic Support.

    Br,

    ------------------------------
    These contain very good advice on asking questions and describing supposed bugs (no, you do not need to go to StackExchange for Automic questions, but yes, the parts on asking detailed, useful questions ARE usually relevant):

    http://www.catb.org/~esr/faqs/smart-questions.html

    https://www.chiark.greenend.org.uk/~sgtatham/bugs.html

    I will not respond to PM asking for help unless there's an actual reason to keep the discussion off of the public forums.
    ------------------------------



  • 4.  RE: AE windows server patching: routine maintenance issues
    Best Answer

    Posted 21 days ago
    Hi Pete,

    I think my post

    https://community.broadcom.com/enterprisesoftware/communities/community-home/digestviewer/viewthread?MessageKey=24b5dcbe-1091-4792-b868-6c04db16d4fa&CommunityKey=2e1b01c9-f310-4635-829f-aead2f6587c4&tab=digestviewer#bm24b5dcbe-1091-4792-b868-6c04db16d4fa

    relates to the exact same problem you have. Some of my agents couldn't reconnect to the system, even though they were active. I got them reconnected by shutting down all CPs (the JCP as well) except CP2.

    "20200113/063349.431 - U00003366 Connection to agent 'MSSQL' already exists (old connection '*CP002#00000007', new connection '*CP001#00000430')"
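
    A log message like the one quoted above can be picked out of a server log automatically. The following is a minimal sketch, assuming the message text always has the shape of that single quoted line; the pattern is derived from that one example only, and the function name find_stale_connections is hypothetical.

```python
import re

# Pattern built from the single U00003366 line quoted in this thread;
# other AE versions or languages may word the message differently.
STALE_CONNECTION = re.compile(
    r"U00003366 Connection to agent '(?P<agent>[^']+)' already exists "
    r"\(old connection '(?P<old>[^']+)', new connection '(?P<new>[^']+)'\)"
)


def find_stale_connections(lines):
    """Return (agent, old_connection, new_connection) for each matching line."""
    hits = []
    for line in lines:
        match = STALE_CONNECTION.search(line)
        if match:
            hits.append((match.group("agent"), match.group("old"), match.group("new")))
    return hits


if __name__ == "__main__":
    sample = [
        "20200113/063349.431 - U00003366 Connection to agent 'MSSQL' already "
        "exists (old connection '*CP002#00000007', new connection '*CP001#00000430')",
    ]
    for agent, old, new in find_stale_connections(sample):
        print(agent, old, new)
```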

    I think this issue really needs to be fixed, because it shouldn't be a "normal" procedure to run a delete statement just to get an agent reconnected. But thanks for the delete statement anyway!

    Cheers
    Christoph







  • 5.  RE: AE windows server patching: routine maintenance issues

    Posted 14 days ago
    > ​I'm curious to know if anyone else has encountered this as a new-to-V12 issue?

    I have yet to see this, but I am adding a +1.

    This is of concern to us because, like many shops, we reboot our servers for automated updates (for us, that's once every month). There isn't anyone available at night to manually coax the Engine into shutting down cleanly; it will basically have to honor the shutdown signal and then come up clean afterwards.

    Br,




  • 6.  RE: AE windows server patching: routine maintenance issues

    Posted 10 days ago
    I've just encountered this same error.  Our v12.3.1.HF1 POC system was rebooted for Windows patching.  The AE app server and its associated agent returned as operational after the reboot.  One agent box did not.  I ran the command provided by Pete (THANKS!) and it resumed operation.

    Yes this is definitely bothersome and needs to be addressed quickly.


  • 7.  RE: AE windows server patching: routine maintenance issues

    Posted 10 days ago
    Our most recent patch cycle did not require any human intervention.   I suspect this problem depends upon what activities are in flight at the time of the reboot.

    Because of this problem, we now run a script periodically to validate that our agents are connected.
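
    The core of such a check can be sketched as follows. This is not the script referred to above, only an illustration: it assumes a list of agents that should be connected and a list of agents currently reported as connected, and how the second list is obtained (database query, API call, or log parsing) is site-specific and left out. The function name find_disconnected and the agent names are hypothetical.

```python
def find_disconnected(expected_agents, connected_agents):
    """Return the expected agents that are missing from the connected list, sorted."""
    connected = set(connected_agents)
    return sorted(name for name in set(expected_agents) if name not in connected)


if __name__ == "__main__":
    # Hypothetical agent names, for illustration only.
    expected = ["WIN01", "WIN02", "MSSQL"]
    connected = ["WIN01", "MSSQL"]
    missing = find_disconnected(expected, connected)
    if missing:
        print("Agents not connected:", ", ".join(missing))
```

    Run from a scheduler outside the AE itself, a check like this can raise an alert after a reboot instead of waiting for a job to fail.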

    ------------------------------
    Pete
    ------------------------------