Automic Workload Automation

That bug in 12.3.6+hf.2.build.1628511942582 and your general approach to releases (still).

  • 1.  That bug in 12.3.6+hf.2.build.1628511942582 and your general approach to releases (still).

    Posted Oct 22, 2021 06:17 PM
    Edited by Carsten Schmitz Oct 25, 2021 07:38 AM

    tl;dr: Don't upgrade your Linux Agent to 12.3.6+hf.2.build.1628511942582, at least not until someone provides answers.



    Hey Broadcom, all.

    I usually try to refrain from posting here these days, but it's that time of the year again.

    Yes, like most people, I tend to post when something puts me into an especially grumpy mood, and my not posting doesn't mean that I struggle with your product day to day. I usually don't, and that's the nice thing I can say. You are certainly free to delete any post you don't like, including this one. It's your house. But I try to keep it as constructive as possible. Also, honest. And I'm sure other customers will appreciate an advance warning about the bug I'm going to discuss here.

    I would also really like product management to take note and maybe take a lesson from this. And I would also really appreciate it if someone with deeper insights into this particular bug (like, ideally, an Automic developer) could possibly share some insights with us, as to why my experience with this bug went in the curious ways it did.

    So let's go:

    Today, I noticed that a Linux job of mine was failing. The job is quite simple, it has been unaltered for months, and had always worked. It runs on an entire host group of agents, daily, and it suddenly began to fail on only one of them, for no discernible reason.

    I looked at the job log for that host, and saw this:

    --- snip ---
    /opt/ae/current/agents/unix/temp/JBHPXAVE.TXT: line 2: syntax error near unexpected token `$'\024\374\001\177''
    /opt/ae/current/agents/unix/temp/JBHPXAVE.TXT: line 2: `€þ(?ü'
    --- snip ---

    Hm. This seems like a curious case of b�d encoding, no? To be honest, this was especially curious to me, because as the author of the idea to support UTF-8 fully across your software stack, which to my knowledge is still lingering, I kind of assumed you mostly do US-ASCII in these spots.

    I did some research, during which I found that simple test case jobs of mine (literally "hello world!" jobs) also failed. Then I discovered that someone (man or machine, I'm still trying to find out) had recently updated that agent from 12.3.6+build.1623930631118 to 12.3.6+hf.2.build.1628511942582 using CAU.

    But curiously, the problem didn't start with that upgrade. However, further investigation would show that it's a problem starting with 12.3.6+hf.2.build.1628511942582 nonetheless. So there's our first major gripe with the software: Despite being upgraded via CAU on 2021-10-15, the problem only started appearing from 2021-10-18 onward. CAU, it seems, isn't as stateful as it should be.

    By this time, I also discovered that ALL jobs (all JOBS, that is) on that host were failing, because I got support tickets from more clients. Curiously, we have OTHER hosts with the same OS and LC_ configuration, and the problem appeared to be limited to THIS host only. Why? I wouldn't have the slightest clue!

    To establish whether this is a problem with 12.3.6+hf.2.build.1628511942582, I downgraded the host to 12.3.6+build.1623930631118, and the problem disappeared. Then, to confirm, I upgraded again to 12.3.6+hf.2.build.1628511942582. Keep in mind that because the issue was limited to one host among a pool of seemingly identically configured hosts, there was no point in doing this on a dev system. I needed to do it on the production system, with all the impact on the users (but then, eh, they were impacted anyway by jobs failing with weird reports).

    After re-upgrading, to my surprise, the JOBS that had previously failed now worked - kind of. They still had a weird encoding error in the report, but they ran through:

    /opt/ae/current/agents/unix/temp/JBHPWWNM.TXT: line 2: ™:þÿ: command not found            <--- ATTN: Broadcom
    ***************************************************************************
    ** ucxjlx6m version 12.3.6+hf.2.build.1628511942582 changelist 1626867156 **
    ** JOB 0399341942 (ProcID:0000022832) START AT 19.10.2021 / 11:22:33 **
    ** UTC TIME 19.10.2021 / 09:22:33 **
    ** TEXT=" Job started " **
    ***************************************************************************
    hello world.
    ***************************************************************************
    ** ucxjlx6m version 12.3.6+hf.2.build.1628511942582 changelist 1626867156 **
    ** JOB 0399341942 (ProcID:0000022832) ENDED AT 19.10.2021 / 11:22:33 **
    ** UTC TIME 19.10.2021 / 09:22:33 **
    ** TEXT=" Job ended " RETCODE=00 **
    ***************************************************************************

    Now this is exciting! Downgrade and re-upgrade makes the problem less severe, but not go away!
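
    For anyone who wants to check their own reports for the same symptom, here is a minimal sketch. It assumes the report layout shown above (an all-asterisk banner line followed by a "version ... changelist ..." line); the function name, the regular expression and the abridged sample are illustrative assumptions, not anything provided by the product. It extracts the agent version and collects any output that appears before the first banner, which is where the stray "command not found" line showed up.

```python
import re

# Banner layout copied from the report above; the regular expression is an assumption.
VERSION_LINE = re.compile(r"^\*\* (\S+) version (\S+) changelist (\d+) \*\*$")

# Abridged copy of the report; the garbled bytes are replaced by a placeholder.
SAMPLE_REPORT = """\
/opt/ae/current/agents/unix/temp/JBHPWWNM.TXT: line 2: <garbage>: command not found
***************************************************************************
** ucxjlx6m version 12.3.6+hf.2.build.1628511942582 changelist 1626867156 **
** JOB 0399341942 (ProcID:0000022832) START AT 19.10.2021 / 11:22:33 **
"""


def inspect_report(text):
    """Return (agent_version, stray_lines) for a job report given as text."""
    version = None
    stray = []
    seen_banner = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not seen_banner:
            if set(line) == {"*"}:
                seen_banner = True   # first all-asterisk line opens the header
            else:
                stray.append(line)   # output that precedes the header
        elif version is None:
            match = VERSION_LINE.match(line)
            if match:
                version = match.group(2)
    return version, stray


if __name__ == "__main__":
    print(inspect_report(SAMPLE_REPORT))
```

    A non-empty list of stray lines does not prove this particular bug, since a job can legitimately fail before the header is written, but together with the version string it is a cheap first filter.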

    Just to confirm what's going on, I introduced a "sleep" statement, and this is actually what the agent writes into the text file that it fills with shell code:

    #! /bin/bash
    p���
    /opt/ae/current/agents/unix/bin/ucxj???m IPA=10.202.12.27 PNR=2300 MNR=0100 JNR=0399359050 TYP=S TXT=" Job started" TRC=0
    echo "hello world."
    sleep 500
    /opt/ae/current/agents/unix/bin/ucxj???m MNR=0100 IPA=10.202.12.27 JNR=0399359050 PNR=2300 TYP=E RET=$? TXT=" Job ended" TRC=0

    (curiously, that unused shebang is broken also, but note especially the encoding mess ...).
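
    A rough way to spot this kind of corruption in a generated job file is to look for bytes outside printable ASCII. The sketch below assumes that a healthy job script is plain ASCII (a script that legitimately contains UTF-8 text would produce false positives); the function name and the sample are hypothetical. The four sample bytes are the ones bash quoted in the first error message (`$'\024\374\001\177'`), and whether they were the entire corrupted line is an assumption.

```python
# Sample job file: line 2 carries the bytes from the bash error message
# ($'\024\374\001\177' in octal is 0x14 0xFC 0x01 0x7F). Treating them as the
# complete line is an assumption made for illustration.
SAMPLE_JOB = b'#! /bin/bash\n\x14\xfc\x01\x7f\necho "hello world."\n'


def find_suspicious_bytes(data: bytes):
    """Return (line_number, offending_bytes) for every line that contains
    bytes other than tab or printable ASCII (0x20 to 0x7E)."""
    findings = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        bad = bytes(b for b in line if not (b == 0x09 or 0x20 <= b <= 0x7E))
        if bad:
            findings.append((lineno, bad))
    return findings


if __name__ == "__main__":
    for lineno, bad in find_suspicious_bytes(SAMPLE_JOB):
        print(f"line {lineno}: unexpected bytes {bad!r}")
```

    Reading the file in binary mode (`open(path, "rb").read()`) before passing it in avoids any decoding step that could hide or alter the offending bytes.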

    I reported this to a support partner, and he told me this is a known bug that'll be fixed in 12.3.7.

    What I wonder in particular is this:

    1. What a weird bug! Why do I only see it days after CAU, and only on one of many seemingly identically configured hosts? Why did it get better by down- and re-upgrading?


    2. Why, seemingly, didn't clients get notified about such a critical and apparently known bug? If it got introduced with a hotfix (+HF2, it seems), wouldn't that also warrant a speedy hotfix to fix that new bug? There is 12.3.6 +HF3 by now, but at least according to my support partner AND the change logs, that doesn't appear to contain the bug fix. If so, why are 12.3.6 HF2 and 12.3.6 HF3 available for download without a big, glaring warning about the known issue?

    Thanks!



  • 2.  RE: That bug in 12.3.6+hf.2.build.1628511942582 and your general approach to releases (still).

    Broadcom Employee
    Posted Oct 25, 2021 03:42 AM
    Hello Carsten,
    please note that this topic was raised to Product Management and Engineering and a hotfix was requested so that we could have a bug fix sooner than usual.
    Since 12.3.7 already contained the fix when the issue was opened (a developer had already fixed it for v21), and a simple workaround had been found, namely restoring the previous version, we (and the customers waiting for the bug fix) did not consider it necessary to release another 12.3.6 hotfix before the 12.3.7 release, which is happening today.

    This is the knowledge article listing the affected versions and the fixes for the 12.2, 12.3 and 21 versions:
    https://knowledge.broadcom.com/external/article?articleId=220215 

    We are very sorry for the frustration this may have caused you, and we will try to do better next time.
    Thanks for your understanding,
    Adrian


  • 3.  RE: That bug in 12.3.6+hf.2.build.1628511942582 and your general approach to releases (still).

    Broadcom Employee
    Posted Oct 25, 2021 09:34 AM
    Hi @Carsten Schmitz

    First and foremost we want to thank you for your post and are happy to provide you with answers to your questions as well as give more insights into what happened in that special case.
    1.) Q: Why does the error happen on only one host, and rather randomly?
    Answer: A character array was not initialised. As a result, it contains "something" (whatever happens to be in RAM at that location). Depending on those contents, which are random and differ from machine to machine, the bug may or may not appear. That is why it happened only sporadically and not everywhere.
    2.) Q: Why didn't clients get notified and why wasn't there a Hotfix?
    Answer: As you can see from the answer to your first question, this is a bug that happens sporadically and randomly. Also, you observed it on just one host out of many. Because of that, the severity of this bug was not recognised immediately.
    Internally we realised that we had a severe bug and on that day the bug ticket was raised to a blocker ticket. By that point, 12.3.6 +HF3 was already released and that is the reason why the fix is not in there.
    We were able to identify the issue and find a solution for it on October 13. On October 14 the decision was made to release the fix with 12.3.7, which is just around the corner. 
    You are correct with your statement that (at the latest on October 14) a notification should have gone out and the Hotfix should have been taken down from the DLC and that was the intention. For some reason, the HF was not taken down as intended and we are currently investigating where our internal processes broke down to avoid such a situation in the future.
    Thank you again for your post and your constructive feedback.
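
    The "uninitialised character array" explanation in answer 1 can be illustrated with a small simulation. This is only a sketch of the general failure mode, not the agent's actual code: Python has no uninitialised memory, so a `bytearray` with pre-existing contents stands in for whatever happened to be in RAM, and the assumption that the text was copied in without a terminating NUL is purely hypothetical. All names are invented for illustration.

```python
def read_c_string(buffer: bytearray) -> bytes:
    """Read bytes up to (not including) the first NUL, as a C string read would."""
    end = buffer.find(0)
    return bytes(buffer) if end == -1 else bytes(buffer[:end])


def write_without_terminator(buffer: bytearray, text: bytes, initialise: bool) -> bytes:
    """Copy text to the start of buffer (no trailing NUL) and read it back.

    Assumes len(text) <= len(buffer). With initialise=True the buffer is
    zeroed first, which is the simulated fix.
    """
    if initialise:
        buffer[:] = bytes(len(buffer))   # zero the whole array
    buffer[:len(text)] = text            # overwrite only the leading bytes
    return read_c_string(buffer)


if __name__ == "__main__":
    leftover = b"\x14\xfc\x01\x7f" * 4   # stand-in for stale memory contents
    # "Lucky" host: the memory happened to be zero, so nothing looks wrong.
    print(write_without_terminator(bytearray(16), b"echo", initialise=False))
    # "Unlucky" host: stale bytes follow the intended text.
    print(write_without_terminator(bytearray(leftover), b"echo", initialise=False))
    # With the array initialised first, both hosts behave the same.
    print(write_without_terminator(bytearray(leftover), b"echo", initialise=True))
```

    The same code path yields clean output on one host and garbage on another purely because of what the memory held beforehand, which is consistent with the bug appearing on one host out of many and changing its behaviour after a downgrade and re-upgrade.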