tl;dr: Don't upgrade your Linux Agent to 12.3.6+hf.2.build.1628511942582, at least not until someone provides answers.
Hey Broadcom, all.
I usually try to refrain from posting here these days, but it's that time of the year again.
Yes, like most people, I tend to post when someone puts me into an especially grumpy mood, and the fact that I only post when grumpy doesn't mean that I struggle day to day with your product. I usually don't, and that's the nice thing I can say. You are certainly free to delete any post you don't like, like this one. It's your house. But I try to keep it as constructive as possible. Also, honest. And I'm sure other customers will appreciate an advance warning on the bug I'm going to discuss here as well.
I would also really like product management to take note and maybe take a lesson from this. And I would also really appreciate it if someone with deeper insights into this particular bug (like, ideally, an Automic developer) could possibly share some insights with us, as to why my experience with this bug went in the curious ways it did.
So let's go:
Today, I noticed that a Linux job of mine was failing. The job is quite simple, it has been unaltered for months, and had always worked. It runs on an entire host group of agents, daily, and it suddenly began to fail on only one of them, for no discernible reason.
I looked at the job log for that host, and saw this:
--- snip ---
/opt/ae/current/agents/unix/temp/JBHPXAVE.TXT: line 2: syntax error near unexpected token `$'\024\374\001\177''
/opt/ae/current/agents/unix/temp/JBHPXAVE.TXT: line 2: `€þ(?ü'
--- snip ---
Hm. This seems like a curious case of b�d encoding, no? To be honest, this was especially curious to me, because as the author of the idea to support UTF-8 fully across your software stack, which to my knowledge is still lingering, I kind of assumed you mostly do US-ASCII in these spots.
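For anyone who wants to see what bash is complaining about in that error: the token is printed in bash's $'...' quoting, with each byte as an octal escape. A minimal sketch that turns those four escapes into hex follows. The function name token_hex is made up for illustration, and it assumes printf and od behave as they do on a typical GNU/Linux box.

```shell
#!/bin/bash
# Hypothetical helper: show the four bytes from the job log's error token
# ($'\024\374\001\177') as hex. The octal escapes are copied from the log.
token_hex() {
    # printf expands the \NNN octal escapes in its format string; od prints
    # one hex value per byte. The command substitution is left unquoted on
    # purpose so that word splitting collapses od's padding.
    echo $(printf '\024\374\001\177' | od -An -tx1)
}

token_hex   # prints: 14 fc 01 7f
```

The byte 0xfc can never occur in well-formed UTF-8 (RFC 3629 rules out 0xF5 through 0xFF), and the other three are control characters. So this does not look like text in some unexpected encoding; it looks more like stray binary data ended up in the script.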
I did some research, during which I found that simple test case jobs of mine (literally "hello world!" jobs) also failed. Then I discovered that someone (man or machine, I'm still trying to find out) had recently updated that agent from 12.3.6+build.1623930631118 to 12.3.6+hf.2.build.1628511942582 using CAU.
But curiously, the problem didn't start with that upgrade. However, further investigation would show that it's a problem starting with 12.3.6+hf.2.build.1628511942582 nonetheless. So there's our first major gripe with the software: Despite being upgraded via CAU on 2021-10-15, the problem only started appearing from 2021-10-18 onward. CAU, it seems, isn't as stateful as it should be.
By this time, I also discovered that ALL jobs (all JOBS, that is) on that host were failing, because I got support tickets from more clients. Curiously, we have OTHER hosts with the same OS and LC_ configuration, and the problem appeared to be limited to THIS host only. Why? I wouldn't have the slightest clue!
To establish whether this is a problem with 12.3.6+hf.2.build.1628511942582, I downgraded the host to 12.3.6+build.1623930631118, and the problem disappeared. Then, to confirm, I upgraded again to 12.3.6+hf.2.build.1628511942582. Keep in mind that because the issue was limited to one host among a pool of seemingly identically configured hosts, there was no point in doing this on a dev system. I needed to do it on the production system, with all the impact on the users (but then, eh, they were impacted anyway by jobs failing with weird reports).
After re-upgrading, to my surprise, the JOBS that previously failed worked now - kind of. They still had a weird encoding error in the report, but they ran through:
/opt/ae/current/agents/unix/temp/JBHPWWNM.TXT: line 2: ™:þÿ: command not found <--- ATTN: Broadcom
***************************************************************************
** ucxjlx6m version 12.3.6+hf.2.build.1628511942582 changelist 1626867156 **
** JOB 0399341942 (ProcID:0000022832) START AT 19.10.2021 / 11:22:33 **
** UTC TIME 19.10.2021 / 09:22:33 **
** TEXT=" Job started " **
***************************************************************************
hello world.
***************************************************************************
** ucxjlx6m version 12.3.6+hf.2.build.1628511942582 changelist 1626867156 **
** JOB 0399341942 (ProcID:0000022832) ENDED AT 19.10.2021 / 11:22:33 **
** UTC TIME 19.10.2021 / 09:22:33 **
** TEXT=" Job ended " RETCODE=00 **
***************************************************************************
Now this is exciting! Downgrading and re-upgrading makes the problem less severe, but doesn't make it go away!
Just to confirm what's going on, I introduced a "sleep" statement, and this is actually what the agent writes into the text file that it fills with shell code:
#! /bin/bash
p���
/opt/ae/current/agents/unix/bin/ucxj???m IPA=10.202.12.27 PNR=2300 MNR=0100 JNR=0399359050 TYP=S TXT=" Job started" TRC=0
echo "hello world."
sleep 500
/opt/ae/current/agents/unix/bin/ucxj???m MNR=0100 IPA=10.202.12.27 JNR=0399359050 PNR=2300 TYP=E RET=$? TXT=" Job ended" TRC=0
(Curiously, that unused shebang is also broken, but note especially the encoding mess ...)
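If you want to check the generated job scripts on your own agents for the same kind of damage, the following is a rough sketch and not an official diagnostic. The function name suspicious_lines is made up, and it assumes a grep that supports -a (GNU grep does). It lists the line numbers that contain any byte other than printable ASCII or whitespace. The demo file is fabricated for illustration; its corrupted bytes are NOT the ones from my agent.

```shell
#!/bin/bash
# Hypothetical helper: print the line numbers of a file that contain bytes
# outside printable ASCII and whitespace. Runs in a subshell with LC_ALL=C
# so every tool treats the input as raw bytes.
suspicious_lines() (
    export LC_ALL=C
    # -a: treat binary data as text; -n: prefix each match with its line number.
    grep -a -n '[^[:print:][:space:]]' "$1" | cut -d: -f1
)

# Demo on a throwaway file with a deliberately corrupted second line.
demo=$(mktemp)
printf '#! /bin/bash\np\377\376\375\necho "hello world."\n' > "$demo"
suspicious_lines "$demo"   # prints: 2
rm -f "$demo"
```

On an affected host you would point it at a file in the agent's temp directory while the job is still running (hence the "sleep" trick above), for example suspicious_lines /opt/ae/current/agents/unix/temp/JBHPWWNM.TXT.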
I reported this to a support partner, and he told me this is a known bug that'll be fixed in 12.3.7.
What I wonder in particular is this:
1. What a weird bug! Why do I only see it days after CAU, and only on one of many seemingly identically configured hosts? Why did it get better by down- and re-upgrading?
2. Why didn't clients seemingly get notified about such a critical, seemingly known bug? If it got introduced with a Hotfix (+HF2 it seems), wouldn't that also warrant a speedy Hotfix to fix that new bug? There is 12.3.6 +HF3 by now, but at least according to my support partner AND the change logs, that doesn't appear to contain the bug fix. If so, why are 12.3.6 HF2 and 12.3.6 HF3 available for download without a big, glaring warning about the known issue?
Thanks!