Automic Workload Automation

 View Only
  • 1.  PSA: Code shipped with Automic can devastate your Oracle performance (and Automic)

    Posted Jul 07, 2022 05:56 AM

    Hi.

    How time flies when you're (not) having fun ... we haven't had one from my "PSA" series in years!

    But here's one now. But first:

    Disclaimers:

    1. I am not an Oracle DBA. I may get even basic concepts wrong.
    2. I don't even particularly like Oracle
    3. Everything I researched ("googled") is "best effort". I tried to find confirmations from Oracle, but Oracle being Oracle, Oracle is ... well ... Oracle.

    Primer (skip this if you know your DB lingo):

    1. Database indexes are rad. DB fast they make. Indexes are like floors in an office building: When looking for room number 420, there's no need to search floors one, two, three and five. Just use the index ("floor four") and rest assured that all rooms starting with four will be on that floor. Always keep your indexes up to date. *)

    *) Automic has two pages of documentation on the topic of how to best rebuild indexes. They say totally different things. Make sure to read them both, then roll a die.

    2. BUT - the Oracle DBA says (seriously, he said that recently!): What about "Statistics"? Statistics are vital, too! In my simple mind, statistics are what tells you the fastest way to office 420. Walk the fourth floor? Yes, that's an option. But there's a fire ladder from the parking lot that ends right next to office 420. Using that saves valuable seconds! The DBA, with a much less simple mind than mine, would say that statistics tell the DB optimizer how to optimize queries. The optimizer learns from analyzing statistics, which are based on real or even simulated workloads, which index or other operation is the fastest. Always keep your statistics up to date, too!

    *) just to make all of our lives 2% more miserable: there are even edge cases where "no statistics" performs BETTER than "totally current statistics", but these are edge cases that we won't look at here. Just assume proper statistics are always good. It's the "proper" part that deserves attention.


    The ugly reality here, and maybe even there

    We recently remembered that we didn't rebuild DB statistics in quite some time. No biggie, eh? I'll just do that right now!

    We had a job to do this for us. As with many of these, I suspect that it was installed by a consultant, and doesn't come with Automic, but I'm not sure. However, that job merely calls some SQL code that IS shipped with Automic. It's called "utility/db/oracle/12.3/upd_stat.sql. Splendid! Vendor-supplied solution. LOVING IT!

    The SQL file (did I mention it's shipped with Automic?) contains loads of statements like these:

    analyze table XAO estimate statistics;
    analyze table XREQ estimate statistics;
    analyze table XRO estimate statistics;


    So I ran that SQL (that came with Automic) through sqlplus64, as one is wont to do.

    The plot thickens

    And then my entire AWI stopped. Automic was still processing SOME jobs, but I could not bring up any "executions" in AWI anymore. A few moments later, I could not open my JOBS or other objects anymore. Everything ran into timeouts. And we all know, when things run into timeouts in AWI, Automic does NOT terminate the SQL queries that it has sent to the DB server in the background. So now my users were also clicking buttons, making more SQL queries happen, and it all piles up into a delicious mess of time outs and cookie dough. Ouch!

    No biggie, I'll have Oracle restarted, and I restart Automic. Rebooting always helps, eh? Not so.

    The analysis

    I conferencialized with the Oracle specialists, they tried a bunch of things. Meanwhile, I enabled DB traces (via ini file, since AWI was beyond useless now). From the packet storm, I fished, for example, this beauty of a query:

    SELECT AH.*, COALESCE(EH.EH_Status, AH.AH_Status) AS EH_Status FROM AH LEFT JOIN EH ON EH.EH_AH_Idnr = AH.AH_Idnr WHERE ah_idnr IN ( SELECT * FROM ( SELECT ah_idnr FROM AH WHERE AH_Client = 100 AND
    AH_OH_Idnr = 1337 ORDER BY AH_TimeStamp1 DESC) WHERE rownum <= 50);


    This is the query for bringing up the "executions" of a job. This used to run in seconds. Now? Now, this took six to seven minutes to complete (of course, it says "No rows found" because 1337 isn't a real OH id, but merely something I used for a test in place of the actual prepared statement parameters, but that's not the point. Also, that rownum junk? Ignore that. That's just a complicated way for Oracle people to say "limit 50;").

    But at the same time, I could do "select count(*) from AH;" and "select * from AH where rownum <= 50);" in mere milliseconds. What gives!?


    To the Gooooooogle-Mobile, Bingman!

    After some googling, I found that "estimate statistics" does just that - it estimates statistics. How does it do that? Simple. Like a Gallup poll, it asks exactly 1064 rows of each database table for their preferred tooth paste (or actually, their composition and usage patterns). Then, Oracle says, modern statistics theory says that this is usually enough and can usually be extrapolated for the entire table.

    Guess what? That's why "estimate statistics" is extremely fast, and also totally wrong.


    A solution (of 30% Automic, 20% Oracle, 15% SQL and 35% Chlorine Tri-Fluoride)

    You want to generate ACTUAL statistics for your tables. For ALL the rows, not just a supposedly representative sample. You do this by using "compute", NOT "estimate":

    analyze table XAO compute statistics;
    analyze table XREQ compute statistics;
    analyze table XRO compute statistics;


    This takes a lot longer than "estimate". A few seconds for most tables, but about 10-15 minutes for my "AH" table with seven million or so rows. That's just how it is - can't have your Automic and eat it too, as grandma always said. However, after doing "compute" over all tables, the executions fetch query runs in mere milliseconds - faster than before, but MUCH, MUCH faster than the six to seven minutes that borked our Automic system to smithereens.

    I do believe that "compute" is non-blocking, and as such, should not block your Automic system itself (except of course if your system is CPU or IO-bound to a degree where either the compute statement can use the resource or Automic, and it all goes downhill). But unless your Oracle server is loaded 100% all of the time, it SHOULD be okay, and the statement non-blocking. Any actual Oracle DBA: please correct or confirm this, thanks.


    TL;DR:

    Analyze. Don't estimate.


    It wouldn't be the same without a plea to Broadcom

    @Shannon Hebert et. al. I strongly believe your fine folks should look into this and strongly consider changing that SQL file that you ship, as outlined above. Technically, we're not your direct customer anymore; CA and then Broadcom made sure of that. And the reseller has no leverage in these things either. So I can't (won't) be filing any bug reports, and certainly no "ideas". However, I strongly believe you should follow this up with the internal experts nonetheless, and make it an action item, so this doesn't brick other customer's installations by equal chance. Thanks!



  • 2.  RE: PSA: Code shipped with Automic can devastate your Oracle performance (and Automic)

    Broadcom Employee
    Posted Jul 07, 2022 06:00 AM

    Carsten, thanks for your post, I have had someone look into this as a matter of urgency.

    We have left an outdated file within our installation media (utility/db/oracle/12.3/upd_stat.sql). I will arrange removal of this file from the download bundles.

    We would always recommend running administration database scripts past your DBA before execution, not just for content and site standards, but also timing, as Oracle admin tasks can have an adverse impact on performance.

    Our online documentation has a specific section, "Recommendations for Oracle", which explains how to properly maintain statistics.



    ------------------------------
    Kaj Wierda
    Sr. Product Line Manager | Automation

    Broadcom Software
    ------------------------------



  • 3.  RE: PSA: Code shipped with Automic can devastate your Oracle performance (and Automic)

    Posted Jul 07, 2022 06:19 AM
    Hey @Kaj Wierda.

    Thanks for the response and the quick action.

    > ​We would always recommend running administration database scripts past your DBA before execution

    This a generally a very good recommendation with a gigantic BUT (no pun intended): One, our core DBA team didn't catch this. I initially didn't catch it either. We (well, me, put on track by a remark a backup DBA made in discussion) only catched it AFTER the fact, once we wondered why the software was down and our monitoring freaked out.

    Also, quite frankly, our DBA would probably tell me off in short time if I asked them to validate every SQL we execute that comes from a vendor.

    (Also, and sorry, I can't spare you this anecdote either: I once received validated SQL from an Automic support entity that would have wiped my production database clean, if I had executed it. So we DO catch things. But not this thing).

    So in short, I think you're under-estimating the level of care (where ever such care is possible) that we, but also likely other clients, expect from this particular vendor construct.

    > Our online documentation has a specific section, "Recommendations for Oracle", which explains how to properly maintain statistics.

    This documentation calls dbms_stats.gather_schema_stats with the option "GATHER AUTO" and "estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE".

    In my understanding, this basically turns "dbms_stats.gather_schema_stats" into "estimate statistics", NOT "compute statistics", because you are leaving it to an Oracle algorithm to determine the correct guess for "estimate_percent". That algorithm is a black box. So the documentation you refer to might just be as dangerous as the original premise.

    Kind regards.





  • 4.  RE: PSA: Code shipped with Automic can devastate your Oracle performance (and Automic)

    Posted Jul 07, 2022 07:31 AM
    Oh, also:

    I always think these things are implicit. But many competing or otherwise similar products, when they include a database loader and a schema, many of them ship a proper DB maintenance tool specific to that application. A well documented, modern tool with possibly a few options that either is a closed binary, or (more preferable to me due to it's openness) a tool that iterates over vendor-supplied SQL files. Some vendors may not, but in my experience, most set themselves apart from the competition by providing this convenience. In fact, you even have such a tool in the form of ucybdbld (albeit sans the "modern" part). In fact, you have multiple (somewhat fragmented into multiple cleanup and export tools).

    You (i.e. Broadcom) removing that SQL file does (kind of, see above) solve the immediate problem. But it also just pushes the burden more to the customer. That is the opposite route of great customer care.

    Lastly, some may think I'm being harsh on Broadcom or Automic here. Not so. Yes, I don't hold back opinions much anymore. But I am selectively harsh on everyone in this industry where I perceive a gap between how the product is advertised, or it's cost (mercantile and otherwise), and the inside perception I get of the product itself. But that is ultimately a subjective opinion.

    Best regards,