Hoang,
I think you have two issues here. First, you are sending alarm messages before you can be sure the restart actually failed, because there are still more lines of output to process. Second, your clear message is not being suppressed into the same alarm as the failure message.
Alarm messages are only suppressed into a single alarm when they have either identical message text or a common suppression key. Since your script does not use a suppression key, the two messages would need identical text to be suppressed together, and they do not match because of the "OK" in the clear message. Automatic suppression based on message text is not very useful for clear messages anyway, since you normally want the clear to read at least a little differently. I would recommend using a suppression key in your script. In that case, the script would look more like this:
local svr_name = "PLSPS5_1"
-- Run the restart over SSH and capture the output (2>&1 includes stderr).
buf = action.command("D:\\WIT\\Commands\\Restart\\plink.exe -v -i "..
    "D:\\WIT\\jobs\\wit\\.ssh2\\nanimbus5-rsa-key-1024-20080513-x86.ppk "..
    "-batch aiadm@ds11 /www/PLSPS5_1/BOUNCESERVER 2>&1")
-- One suppression key shared by the OK and FAILED messages.
supp_key = "nasscript/bounceserver/"..svr_name
for i=1,#buf do
    -- Compare each output line, not the whole buffer.
    if buf[i] == "Server sent command exit status 0" then
        nimbus.alarm(1, "AutoRestart "..svr_name.." OK", supp_key)
        printf("Successful Restart of "..svr_name)
    else
        nimbus.alarm(4, "AutoRestart "..svr_name.." FAILED", supp_key)
        printf("Failed to Restart "..svr_name)
        printf("Take a look into this Restart ASAP")
    end
end
Of course, this would not completely solve your problem yet. It would still generate two alarms, because every line of output sends a message. The first alarm would clear after several repeats, once the loop reaches the success line. The second would never clear, because there is a non-matching line after the matching line. I think you are looking for something more like this:
local svr_name = "PLSPS5_1"
-- Run the restart over SSH and capture the output (2>&1 includes stderr).
buf = action.command("D:\\WIT\\Commands\\Restart\\plink.exe -v -i "..
    "D:\\WIT\\jobs\\wit\\.ssh2\\nanimbus5-rsa-key-1024-20080513-x86.ppk "..
    "-batch aiadm@ds11 /www/PLSPS5_1/BOUNCESERVER 2>&1")
successful = false
supp_key = "nasscript/bounceserver/"..svr_name
for i=1,#buf do
    -- Stop at the first line that reports success.
    if buf[i] == "Server sent command exit status 0" then
        nimbus.alarm(1, "AutoRestart "..svr_name.." OK", supp_key)
        printf("Successful Restart of "..svr_name)
        successful = true
        break
    end
end
-- Only report a failure after every line has been checked.
if not successful then
    nimbus.alarm(4, "AutoRestart "..svr_name.." FAILED", supp_key)
    printf("Failed to Restart "..svr_name)
    printf("Take a look into this Restart ASAP")
end
This should only generate a failure alarm if the success message is completely missing from the output. Another alternative is to generate an alarm every time the script needs to restart the server, and then clear that alarm when the restart succeeds. This could be useful if you want the option of finding these restarts in the alarm history later. That could be done like this:
local svr_name = "PLSPS5_1"
supp_key = "nasscript/bounceserver/"..svr_name
-- Raise the alarm up front; it is cleared below only if the restart succeeds.
nimbus.alarm(4, svr_name.." requires restart", supp_key)
buf = action.command("D:\\WIT\\Commands\\Restart\\plink.exe -v -i "..
    "D:\\WIT\\jobs\\wit\\.ssh2\\nanimbus5-rsa-key-1024-20080513-x86.ppk "..
    "-batch aiadm@ds11 /www/PLSPS5_1/BOUNCESERVER 2>&1")
successful = false
for i=1,#buf do
    -- Stop at the first line that reports success.
    if buf[i] == "Server sent command exit status 0" then
        nimbus.alarm(1, "AutoRestart "..svr_name.." OK", supp_key)
        printf("Successful Restart of "..svr_name)
        successful = true
        break
    end
end
Either of these two options should leave you with an open alarm if the restart fails, but they get there in different ways. The second option would be a bad idea if the alarm message would prompt someone to take immediate action, since the script is still attempting the restart for a little while after creating the alarm.
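If you want to try the line-scanning logic outside the NAS before deploying it, here is a minimal sketch in plain Lua. It assumes, as the scripts above do, that the command output arrives as a table with one line per entry. The helper name restart_succeeded and the sample output tables are made up for illustration (they are not real plink output), and nothing here calls nimbus.alarm or action.command.

```lua
-- Success marker taken from the scripts above.
SUCCESS_LINE = "Server sent command exit status 0"

-- Hypothetical helper: true if any line of the output equals the marker.
function restart_succeeded(lines)
    for i = 1, #lines do
        if lines[i] == SUCCESS_LINE then
            return true
        end
    end
    return false
end

-- Made-up sample outputs; note the extra line after the success line.
ok_output = {
    "some line before",
    "Server sent command exit status 0",
    "some line after",
}
bad_output = {
    "some line before",
    "Server sent command exit status 1",
}

print(restart_succeeded(ok_output))   --> true
print(restart_succeeded(bad_output))  --> false
```

Because the helper only returns after it has either found the success line or checked every line, a trailing non-matching line cannot turn a successful restart into a failure.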
Regards,
Keith