DX NetOps

  • 1.  Automate the stopSS.pl script at reboot on Linux/Unix based OS

    Posted Dec 17, 2015 01:39 PM

    All, we had patches installed on a subset of our landscapes the other day that required a planned reboot.

     

    What happened is that out of the 12 landscapes that were rebooted, 10 had corrupted databases at startup, and we had to restore the last db backup.

     

    I believe that if we could somehow run the stopSS.pl script as part of the /etc/init.d/processd script to gracefully shut down the SpectroSERVER processes, we could prevent the database corruption.

     

    In our environment we created a wrapper script to stop/start/status the SpectroSERVER processes as the root user, and I may be able to leverage that, or at least the commands that run the stopSS.pl script as the spectrum user from the root account. But I wonder if anyone has already worked on this and has a solution that works for them?

     

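    Our script (for the stop option) looks like this:

    stop () {
      # $myhost (the hostname) is assumed to be set earlier in the wrapper script
      case "$myhost" in
        *ocm*)
          # OneClick server: stop Tomcat
          /bin/su - spectrum -c "/opt/ca/spectrum/src/tomcat/bin/stopTomcat.sh"
          ;;
        *)
          # Any other host: stop the SpectroSERVER
          /bin/su - spectrum -c "/opt/ca/spectrum/src/bin/stopSS.pl"
          ;;
      esac
      # Then stop processd itself
      /sbin/service processd stop
    }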

     

    Basically, it determines whether the server is the OneClick server or not and runs the appropriate stop commands. I'm wondering if I could add this line to the /etc/init.d/processd script:

    /bin/su - spectrum -c "/opt/ca/spectrum/src/bin/stopSS.pl"

    so that the SpectroSERVER processes shut down gracefully during a reboot, to prevent the DB corruption.

     

    I'd appreciate any ideas or suggestions.

    Regards,

    SteveT
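
    The idea in the post above, running stopSS.pl automatically at shutdown, could also be done from a small init script of its own rather than by editing /etc/init.d/processd. The following is only a sketch under assumptions that are not from the thread: a Red Hat-style SysV init managed by chkconfig, a stop priority of 01 that sorts ahead of processd's own kill script, and the hypothetical service name spectrum-graceful. On such systems the rc script typically calls stop at shutdown only for services that have a lock file under /var/lock/subsys, which is why start creates one.

```shell
#!/bin/sh
# chkconfig: 345 99 01
# description: Gracefully stops SpectroSERVER before processd at shutdown.
#
# Sketch only. The service name "spectrum-graceful" and the priorities
# 99/01 are illustrative assumptions; check them against processd's own
# start/kill priorities on your system.

SPEC_USER="${SPEC_USER:-spectrum}"
STOP_SS="${STOP_SS:-/opt/ca/spectrum/src/bin/stopSS.pl}"
LOCK="${LOCK:-/var/lock/subsys/spectrum-graceful}"

spectrum_graceful() {
    case "$1" in
        start)
            # SpectroSERVER itself is started by processd. Here we only
            # record that the service is "up" so stop runs at shutdown.
            touch "$LOCK"
            ;;
        stop)
            # Same command as in the wrapper script from the post.
            rc=0
            /bin/su - "$SPEC_USER" -c "$STOP_SS" || rc=$?
            rm -f "$LOCK"
            return "$rc"
            ;;
        *)
            echo "Usage: $0 {start|stop}" >&2
            return 2
            ;;
    esac
}

# Dispatch only when an action was given on the command line.
[ $# -eq 0 ] || spectrum_graceful "$@"
```

    Saved as /etc/init.d/spectrum-graceful and made executable, it would be registered with chkconfig --add spectrum-graceful. Whether stopSS.pl finishes before the rest of the shutdown continues still has to be verified separately.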



  • 2.  Re: Automate the stopSS.pl script at reboot on Linux/Unix based OS

    Posted Dec 17, 2015 07:44 PM

    I have to admit I've never had any issues like this.

     

    If you just use processd and configure the SpectroSERVER and ArchiveManager processes to start up and shut down through it, this shouldn't happen (at least it hasn't for me).

     

    You also need to configure the Archive Manager to shut down first (give it a higher metric). One thing I did notice is that if you use other tools dependent on Spectrum, you need to set them up so that they shut down before Spectrum; otherwise this can cause corruption issues. E.g., we used processd for all Spectrum-related processes and scripts, so they had to shut down before Spectrum.

     

    The only time I see DB corruption is if you switch off at the power or someone just kills spectrum without stopping it first.


    Regards,

     

    Frank



  • 3.  Re: Automate the stopSS.pl script at reboot on Linux/Unix based OS

    Posted Dec 18, 2015 08:30 AM

    Your idea with your own init.d script to stop the servers is probably the best one.

     

    The SpectroSERVER will stop gracefully when given a standard stop signal. So what probably happened during your reboots was that the SpectroSERVERs were attempting to stop but didn't complete the shutdown process before the system went down. We have had major problems with how long it takes the SpectroSERVERs to stop because of slow CORBA shutdowns (we have seen one sit at the stage where it is shutting down CORBA communications for 30 minutes or more, which is very aggravating when the secondary doesn't take over until this is complete).

     

    This also leads to a problem with using processd to shut down the SpectroSERVER. Processd has a timeout after which it will complete even if not all of the processes have stopped. I would assume you would want to know for sure that the SpectroSERVER is down before completing your system shutdown or reboot, so processd is not ideal for that.

     

    The one drawback of your script is the human piece. Many people get frustrated when a system shutdown takes a long time and will end up forcing a power cycle (physical or VM). So you would also need to educate the people performing the shutdown to be patient. One suggestion would be to display the output from VNM.OUT during the shutdown so that the person can see that it is actually stopping.
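
    One way to show the SpectroSERVER log while the stop command runs is to follow the file in the background and stop following once the command returns. This is a minimal sketch: the function name run_with_log_tail is made up for illustration, and the log location used in the usage line ($SPECROOT/SS/VNM.OUT under the install root from the original post) is an assumption to check on your system.

```shell
#!/bin/sh
# Run a command while echoing new lines from a log file to the terminal.
# Usage: run_with_log_tail LOGFILE COMMAND [ARGS...]
run_with_log_tail() {
    log="$1"
    shift
    # -n 0: skip what is already in the file, -f: follow new lines.
    tail -n 0 -f "$log" &
    tail_pid=$!
    status=0
    "$@" || status=$?
    # The command has returned, so stop following the log.
    kill "$tail_pid" 2>/dev/null
    wait "$tail_pid" 2>/dev/null
    return "$status"
}
```

    For example: run_with_log_tail /opt/ca/spectrum/src/SS/VNM.OUT /bin/su - spectrum -c /opt/ca/spectrum/src/bin/stopSS.pl. If stopSS.pl returns before the server has fully stopped, the log output ends at that point too, so this only helps with visibility and does not prove the shutdown is complete.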

     

    Due to that long delay, we have instituted our own procedures. While manual, they have eliminated the crashes caused by accidental system shutdowns. We document that people are to run stopSS.pl and then processd.pl stop, then do a ps -ef | grep {spectrum user} to verify that all Spectrum-related processes are down. We do this because sometimes mysql may hang, or someone may have a script on the server that is not letting things stop. So we make sure all processes owned by the Spectrum user are down before we complete any shutdown or reboot. Painful as it may be, it has been the safest way to reboot.
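
    The final check in that procedure, confirming that nothing owned by the Spectrum user is still running, could be scripted as a polling loop with a timeout. This is a rough sketch, not a tested procedure: the function names, the 5-second default poll interval, and the use of ps -u in place of ps -ef | grep are illustrative choices.

```shell
#!/bin/sh
# Print "PID command" for every process owned by user $1 (empty if none).
list_user_procs() {
    ps -u "$1" -o pid= -o args= 2>/dev/null
}

# Wait until user $1 owns no processes, or give up after $2 seconds.
# Optional $3 is the poll interval in seconds (default 5).
# Returns 0 when everything is down, 1 on timeout.
wait_for_user_procs() {
    user="$1"
    timeout="$2"
    interval="${3:-5}"
    waited=0
    while [ -n "$(list_user_procs "$user")" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Still running as $user after ${timeout}s:" >&2
            list_user_procs "$user" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

    Run as root after stopSS.pl and the processd stop, for example wait_for_user_procs spectrum 1800 && reboot, so that the reboot only proceeds once the process list is empty. The 1800-second value simply mirrors the 30-minute CORBA shutdowns mentioned above.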



  • 4.  Re: Automate the stopSS.pl script at reboot on Linux/Unix based OS

    Posted Dec 18, 2015 09:15 AM

    Thank you for the replies.

    After doing some additional troubleshooting with our Unix team, it turns out that these servers are VM slices and that HA and DRS were acting up, which caused the reboot. So I'm no longer sure that these were "graceful" reboots as originally thought.

     

    It now looks more likely that the servers went down due to VM issues.

     

    I will test the suggestions above in our QA environment, which we can reboot at will, and see what the impact is.