DX Unified Infrastructure Management


 Max. restarts reached for probe 'qos_processor' (command =

Kevin Nelson posted May 03, 2023 06:38 AM

Hello, I have UIM 20.3.3 (due to be upgraded in a couple of months). No changes have been made, but qos_processor (version 20.10) has failed with the error "Max. restarts reached for probe 'qos_processor' (command = <startup java>)".

I increased the logging and I have the following entries in the log;

I have tried restarting the probe and also re-deploying but it still fails. Any suggestions on this issue appreciated.
Thank you
Kevin

Kevin Nelson

Looking at one of the archived qos_processor log files, I see the following errors from when the probe first went offline. They show that it reached the configured number of connection attempts and then failed:



Broadcom Employee Jason Allen

Are there any issues with data_engine?  Can you successfully test the data_engine connection to the database, and/or are there any errors in the data_engine logs?

The qos_processor asks the data_engine for the connection string to reach the database and it appears that there is either a problem with this connection string, or a problem connecting to the database.  You may be able to get more clues from the data_engine's logs/behavior.
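One quick way to separate "the connection string is wrong" from "the database is unreachable" is a plain TCP check from the hub to each candidate database host. The following is a minimal sketch, not part of UIM: the host names in the usage comment are hypothetical, and 1433 is only the SQL Server default port, so substitute whatever the data_engine is actually configured with. A successful TCP connect does not prove that the credentials or the database name are valid.

```python
import socket


def can_connect(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example usage (hypothetical host names, run from the primary hub):
#   for target in ("uim-ag-listener", "sqlnode1", "sqlnode2"):
#       print(target, can_connect(target))
```

If the availability group listener and the individual nodes give different results, that narrows the problem to name resolution or routing rather than to the probe itself.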

Kevin Nelson

Thanks for the answer, Jason. The data_engine connection works fine and there are no errors in the data_engine logs. What I have noticed is that when I open IM, I have the following messages in the messages list (even though the data_engine is working and connecting):

I have seen this before but was told not to worry about these messages if the data_engine is connecting to our DB server okay.

Thank you
   

Kevin Nelson

Just found this post with a very similar error message;

UIM v20.0 installation | DX Unified Infrastructure Management (broadcom.com)
But it does not look like there is an answer to it.

Broadcom Employee Stephen Danseglio
Hi Kevin,

What java_jre version is deployed on the Primary hub?
 
Try qos_processor 20.4.4 but keep the current version in the archive in case you need to downgrade.
 
1. Right-click and deactivate qos_processor
2. Right-click and delete it
3. Delete the qos_processor folder from the filesystem (you can back up the folder if you want to)
4. Deploy qos_processor 20.4.4 (attached)
 
Set the loglevel to 5 and logsize to 50000.
 
Set the java min/max under startup->opt to at least 2048 and 4096 respectively.
 
Steve
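The min/max values above are heap sizes in megabytes, which the JVM takes as the standard `-Xms` (initial heap) and `-Xmx` (maximum heap) flags. The sketch below only shows that mapping and a basic sanity check. The key names `java_mem_init` and `java_mem_max` are an assumption about how the startup options are labelled, so check them against your own probe configuration before editing anything.

```python
def jvm_heap_options(min_mb: int, max_mb: int) -> dict[str, str]:
    """Map heap sizes in MB to JVM flags, keyed by assumed config key names."""
    if min_mb <= 0:
        raise ValueError("minimum heap must be positive")
    if max_mb < min_mb:
        raise ValueError("maximum heap must not be smaller than the minimum")
    return {
        "java_mem_init": f"-Xms{min_mb}m",  # assumed key name; -Xms is a real JVM flag
        "java_mem_max": f"-Xmx{max_mb}m",   # assumed key name; -Xmx is a real JVM flag
    }


# Example usage with the values suggested in the post:
#   for key, value in jvm_heap_options(2048, 4096).items():
#       print(f"{key} = {value}")
```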

Broadcom Employee Stephen Danseglio

Here is the attachment.

Kevin Nelson

Thanks Steve, I will give that a go. Java version as follows;


Thanks for the help.
Kevin

Kevin Nelson

Steve, I have done as you suggested but it still fails with the same error. Interestingly, the last qos_processor log entry before the probe fails shows a database connection string pointing to one of the nodes of our MS SQL cluster, whereas the data_engine probe's data source is the SQL cluster availability group.
Kevin
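The mismatch described above (one side pointing at a cluster node, the other at the availability group) can be checked mechanically by pulling the server name out of each connection string and comparing them. This is a minimal sketch that assumes semicolon-separated `key=value` strings with a `Data Source` or `Server` field; the sample strings and host names are made up for illustration and are not taken from this environment. It does not handle quoted values that contain semicolons.

```python
def parse_connection_string(conn: str) -> dict[str, str]:
    """Split a 'key=value;key=value' string into a dict with lower-cased keys."""
    fields = {}
    for part in conn.split(";"):
        if "=" not in part:
            continue
        key, _, value = part.partition("=")
        fields[key.strip().lower()] = value.strip()
    return fields


def data_source_host(conn: str) -> str:
    """Return the bare host name, without 'tcp:' prefix, port or instance name."""
    fields = parse_connection_string(conn)
    source = fields.get("data source") or fields.get("server") or ""
    if source.lower().startswith("tcp:"):
        source = source[4:]
    # 'host,1433' carries a port; 'host\\INSTANCE' carries a named instance.
    return source.split(",")[0].split("\\")[0].strip().lower()


def same_target(conn_a: str, conn_b: str) -> bool:
    """True if both connection strings point at the same host name."""
    return data_source_host(conn_a) == data_source_host(conn_b)


# Hypothetical strings: one as configured in data_engine, one as logged by qos_processor.
configured = "Provider=SQLOLEDB;Initial Catalog=CA_UIM;Data Source=uim-ag-listener,1433;User ID=uim"
logged = "Provider=SQLOLEDB;Initial Catalog=CA_UIM;Data Source=sqlnode1,1433;User ID=uim"
print(data_source_host(configured), data_source_host(logged), same_target(configured, logged))
```

A `False` result here would match what the logs suggest: qos_processor is working from a different target than the one currently configured in data_engine.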

Britta Hoffner

Hi Kevin, the Java reported by java -version might not be the one used by the probe. Nimsoft usually configures its own Java environment. You can check this in the controller configuration of the robot where the qos_processor probe is installed.

 
Kind regards,
Britta

Kevin Nelson

Thanks Britta, yes I forgot that :) it is;

Britta Hoffner

Hi Kevin,

Which DB provider is configured in the data_engine probe, and which DB server and release do you use?

I ask because right after the DB connect, qos_processor tries to execute a SQL script, which seems to fail in your environment.


Kind regards,
Britta


Kevin Nelson

Hi Britta, details as follows;


Using Microsoft SQL 2019

Where does the probe pull the DB name and other details from? If it is the data_engine, I am not sure where it is getting the data source from.

Britta Hoffner

Hi Kevin, the qos_processor probe uses the data_engine callback get_connection_string. You can test this callback in the data_engine probe utility.


Kind regards,
Britta

Kevin Nelson

Hi Britta, I have this;

Britta Hoffner

Hi Kevin, not sure if this matters, but in my output there is one more driver:

Kind regards,
Britta

Broadcom Employee Jason Allen

I think I would try the following steps in this circumstance:
1. Deactivate qos_processor, then right-click and 'Delete'.

2. On the filesystem there will be a leftover folder at /Nimsoft/probes/slm/qos_processor/. Delete this folder and its contents.

3. Deactivate data_engine, wait for it to lose its port/PID, then activate it again.

4. Now deploy qos_processor from the archive and check the result.

If this still doesn't help, my next thought would be:

1. Determine which node of the database cluster is currently active.

2. Reconfigure data_engine so that its connection info points to this cluster node specifically, restart it, and confirm that it can connect.

3. Now restart qos_processor and check the result.

If this resolves the issue, I would then put data_engine back to the availability group and try again. Perhaps this "reset" will get everyone on the same page...

Kevin Nelson

Hi Jason, thank you for the information. I had already tried the first suggestion, which did not work. However, the second option has worked: I pointed data_engine directly at the active node, saved, and then restarted qos_processor. After doing this I reset data_engine to point to the availability group, and qos_processor continued working. I still need to test whether restarting the qos_processor probe breaks it again (because data_engine is now pointing to the AG), but at least qos_processor is now working.

Thank you
Kevin