Some of you have helped me recently in writing a Lua script to pull CDM thresholds/parameters from all hosts that we are monitoring. It works in our development environment, but with some strange results. So let me break down my results between development and production; keep in mind that both run the same CDM and NAS probe versions:
Attached is the script and a screenshot of the error. Does anyone have any ideas what could be causing the error or the inconsistency between the NMS installs?
After you call the nimbus.request() function, you should really check the value of rc. If it is 0, everything is fine and you can move on. If it is any other value, it contains the error code returned by the request. If you can find out which error the request returns, that will help a lot. There is a list of error code values and their translations here:
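A minimal sketch of that check, following the nimbus.request() call pattern used in the script in this thread (the addr and args values here are placeholders, and process() stands in for whatever your script does with the response):

```lua
-- Issue the request and capture both the response table and the return code.
local resp, rc = nimbus.request(addr, "probe_config_get", args)

if rc == 0 then
   -- Success: resp holds the probe's configuration data.
   process(resp)
else
   -- Non-zero rc is the error code for this request; log it so you
   -- can look it up in the error code table.
   print("probe_config_get failed, rc = " .. rc)
end
```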
Let us know what you get for an error code, and we can help figure out what is wrong.
Ok, let's start with my development environment. This morning I ran the script and it failed with the same error. Like I said before, it was only failing once, and then I couldn't repeat the error. To account for this inconsistency, I decided to reboot the box where the NAS probe was running. After rebooting, I added a few "print(rc)" statements for debugging and reran the script. The strange thing is that I am now getting a new error, and it happens after successfully outputting thresholds from certain boxes and then fails mid-script.
Error getrobots:2Error in line 14: attempt to index local 'r_resp' (a nil value)
So an error 2 is related to the connection. I am just lost as to why there is no consistency with the errors... Attached is the revised script with "print(rc)" and the output from the script. One thing I did notice after reviewing the log was that it ran successfully in the security zones but nowhere else...
More to come when I run the script in production.
Ok, so I made a rookie mistake: the reason for the error 2 was because I rebooted the hub and it needed time to reconnect to everyone. So I'm going to continue trying to recreate the error in dev and capture the error code. This evening I will be able to run the script in production.
You might want to print the robot name right before you try to reach each robot, rather than after. That would probably make it easier to correlate the error codes to the robots that are causing them. The error you were getting most recently (error 2, communication error) is very dependent on which robot you are trying to reach, as you noticed by the fact that it worked for some before giving the error. That might be the case with the other issues you have been seeing too.
Rather than only printing the return codes, I recommend having the script check whether each request was successful. If you do that, your script will move on to the next attempt rather than dying with a script error. That might help you identify which robots work and which do not. Otherwise, if the script just dies when there is an issue, you never know what would have happened with the rest of the robots in the list.
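Putting both suggestions together, the loop might look something like this sketch. The r_entry.name and r_entry.addr fields follow the code posted later in this thread; how you build the robots list (e.g. from getrobots) depends on your setup:

```lua
for _, r_entry in pairs(robots) do
   -- Print the robot name BEFORE the request, so any error code that
   -- follows is easy to correlate with the robot that caused it.
   print("Querying robot: " .. r_entry.name)

   local cfg, rc = nimbus.request(r_entry.addr, "probe_config_get", args)
   if rc == 0 then
      -- ... main search block: walk cfg for the CDM thresholds ...
   else
      -- Log the error and keep going instead of dying mid-script.
      print("  request failed, rc = " .. rc)
   end
end
```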
I added the following error-checking routine and it seems to work, although I still haven't run it in production yet.
cfg,rc = nimbus.request ( r_entry.addr,"probe_config_get",args)
if rc == 0 then
   -- Main Search Block
else
   print ("\nCDM Failed on ",r_entry.name, ", Error=", rc,"\n")
   io.write ("\n\tCDM Failed on ",r_entry.name, ", Error=", rc)
end
I attached the full script with the error checking.
On a side note, what is the risk in running this in a production environment with 500+ robots? More specifically, is it going to hog memory or anything like that? I have done performance tests in my development environment and saw no loss of performance, but I have fewer than 50 robots in development.
Are you running the script in the NAS? I think when you kick off a script like this, it runs in a separate thread. In that case, it should not affect the performance of the NAS even if it takes a long time to get the data from all of the robots.
While the script is running, you can watch the nas queue on your hub and make sure the NAS is able to keep up with the incoming messages. If it is, you probably have nothing to worry about. I do not think you can cancel a running script, but in the event of a problem that requires it to be killed, you could stop and start the NAS.
As Keith points out, the NAS runs the script in its own thread. Since your script mostly does its work "outside" the NAS data area, it won't degrade the NAS's performance. The Lua interpreter built into the NAS keeps track of non-Lua data structures (like PDS, sessions, etc.) and removes them upon completion (unless the user has already done so). As a final control mechanism, the NAS will kill the script if an uncontrolled loop takes place, by counting the number of executed lines of code. If you exceed 10 million executed lines, it will simply kill the script.