Philadelphia Security User Group

  • 1.  Reboot to production failing on some machines (post MP1)

    Trusted Advisor
    Posted Nov 20, 2012 08:03 AM

    On a few machines, I can't reboot to production, which is breaking imaging.  On other machines (same model) the same reboot to production task works fine.

    On the computers with the issue, the task completes OK, but the machine boots to PXE instead.  On the affected machines the behavior is consistent, and a reimage does not fix it.  Once in PXE (when not imaging), if I run a reboot to production while nothing else is happening, it still boots to PXE.

    I have a ticket in but am still waiting for help there.  This never happened prior to MP1, which may be a coincidence, who knows.

    Has anyone else seen this?  Any suggestions of what to try?  

    Thanks.



  • 2.  RE: Reboot to production failing on some machines (post MP1)

    Trusted Advisor
    Posted Nov 21, 2012 09:09 AM

    For anyone else who happens to see this, support helped me figure out that deleting the computer record from the console, waiting an hour (or clearing SBSStore), and reimaging as a new computer solves the issue.

    Apparently it's a known issue that is fixed in 7.5.



  • 3.  RE: Reboot to production failing on some machines (post MP1)

    Posted Nov 21, 2012 12:46 PM

    Well, sort of.

    Depending on what is happening here in the background, this is not 100% correct.

    First off, clearing the SBS Store folder makes no difference to any process at all (other than being able to find files in there more easily).  It is historical only, which is why it's 100% safe to clear at any time.

    Second, imaging a computer makes no difference to how it responds when in automation, because we're not booting to the production image.  It could be a Linux image, a Windows image, or simply a big virus in production, and it makes no difference at all while in automation.

    Third, the SBS updates come almost immediately, because they are sent as tasks, so we don't have to wait an hour.  The processes involved are not policy based.  What IS policy based are updates to WinPE if you have to add drivers or whatever.  When you click on a preboot environment and tell it to rebuild, then you have to wait for the PXE server to check in and get the policy update, which by default may take up to an hour.  It's actually possibly a bit longer because you also have to wait for the Delta update to run and THEN the PXE server to check in.  Or you do it manually.

    But again, this is for rebuilding automation environments, which has nothing to do with this.  For this, we're talking about SBS files, and those are sent immediately by Task Server.

     

    As for this issue, there are essentially 3 possible reasons a system doesn't boot to production:

    • If the task is assigned to the wrong computer.  For instance, maybe we THINK the computer is called DougAlways but it is really known as MiniNT-5647 in the console.  If you assign the task to DougAlways, obviously the MiniNT system is not going to get the task.  This is unlikely to be the scenario here, because the system DOES reboot, but goes the wrong direction.
    • "Race to Reboot".  This is a scenario in which the computer being rebooted gets instructions to reboot, but the PXE services do NOT get the instructions fast enough.  Literally, this task is actually 2 tasks: one sent to the client, and one sent to the PXE server.  If the one sent to the PXE server is slower than the one sent to the client (hence "race"), then the client boots to automation again.  A manual reboot once back in WinPE often "resolves" this issue, but not really.  It just gets the system back into production, because by the time you manually reboot, the SBS file has arrived and PXE now knows what to do.  THIS issue IS resolved in 7.5.
    • SBS Replication can cause this.  By default, we only send the SBS file with "marching orders" to one SBS server, which may not in fact be the PXE server to which you're rebooting.  In this scenario, one PXE server knows what to do, but another does not, and the one that does not happens to be the one to which you're booting.  This is ALSO fixed in 7.5.  There is a one-off fix for this issue we're working on a KB for.
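
    For anyone who finds a model easier to follow than prose, here is a rough sketch of the second and third scenarios.  It is a toy simulation only, not how the product is implemented: the class, the function, the timing numbers, and the "no orders means automation" default are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PxeServer:
    """Toy PXE server: it only knows the boot orders it has been sent."""
    orders: dict = field(default_factory=dict)

    def receive_sbs(self, client: str, target: str) -> None:
        self.orders[client] = target

    def answer_boot_request(self, client: str) -> str:
        # Assumption: with no orders on file, the client is sent to automation.
        return self.orders.get(client, "automation")


def reboot_to_production(client, sbs_arrives_at, client_boots_at,
                         sbs_target, boot_server):
    """Model the two halves of the task and return where the client lands.

    sbs_target is the server that receives the SBS file; boot_server is the
    server the client actually PXE boots from. Times are arbitrary units.
    """
    if sbs_arrives_at <= client_boots_at:
        sbs_target.receive_sbs(client, "production")
        return boot_server.answer_boot_request(client)
    # The client won the race: it boots before the orders exist anywhere.
    landed = boot_server.answer_boot_request(client)
    sbs_target.receive_sbs(client, "production")
    return landed


# Race to reboot: the SBS file arrives after the client has already rebooted.
pxe = PxeServer()
first_boot = reboot_to_production("PC1", 5, 2, pxe, pxe)
manual_reboot = pxe.answer_boot_request("PC1")

# Replication: orders go to one server, the client boots from another.
server_a, server_b = PxeServer(), PxeServer()
replication_boot = reboot_to_production("PC2", 1, 2, server_a, server_b)
```

    In this model the first boot lands in automation and the later manual reboot lands in production, which matches the "resolved, but not really" behavior described in the list.  The replication case lands in automation even though the SBS file arrived on time, because it arrived at the wrong server.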

    So, in case anyone else runs into this, please be aware of what is actually happening before you delete the record and hope for perfect results.  As an item of note, generally speaking, there's only one reason to delete a record, and that's if you WANT us to treat this as a new computer even though it is not.

     

    Good luck!



  • 4.  RE: Reboot to production failing on some machines (post MP1)

    Trusted Advisor
    Posted Nov 21, 2012 01:00 PM

    I am not sure my case falls into any of the 3 scenarios above (it was the right machine, and we only have one PXE server, which is our NS server).  Maybe I misunderstood, but I thought my issue was related to a known issue with the machine hash: when the problematic machines rebooted, the (corrupted?) hash caused them to be treated like 'new' machines every time, and my new machines are set to boot to PXE by default.  So they were stuck in that loop until I deleted them from the console.  It's certainly possible I misunderstood what the tech was explaining yesterday.

    Ideally I'd like to not delete them from console, but not sure I have a choice.

    Interesting what you say about not having to wait an hour.  I'll definitely try that.  Until now, support has told me that if I have to delete a machine for whatever reason (we almost never do), we have to restart the PXE services in order for that machine to be treated like a new machine.  The hour wait came up because I didn't want to have to restart PXE every time my techs deleted a machine if the above issue grows in scale for us.

    Thanks, as always, Thomas



  • 5.  RE: Reboot to production failing on some machines (post MP1)

    Posted Nov 21, 2012 04:21 PM

    If you don't restart the services, the system is NEVER removed from the PXE Server Service.  The ONLY way to make a known system be treated as unknown is to delete it from the console (now it's unknown to NS) and then restart the PXE services (now it's unknown to the services as well).  Otherwise, it's remembered by the service ... forever.  When the service restarts, it clears its RAM "memory", and when it comes back up it knows NO systems.  Then it downloads a list of "known" clients, and if your system is NOT in the DB, it will not be in the "known" list, and thus the service has "forgotten" the system.
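
    A minimal sketch of that behavior, as described in this post.  It is a toy model with made-up names (PxeService, console_db), not the real service, and it only covers the delete-then-restart case.

```python
class PxeService:
    """Toy PXE service that keeps its list of known systems in memory only."""

    def __init__(self, console_db: set):
        self.console_db = console_db   # stands in for the NS database
        self.known = set()
        self.restart()

    def restart(self) -> None:
        # A restart wipes the in-memory list, then reloads it from the database.
        self.known = set(self.console_db)

    def is_known(self, name: str) -> bool:
        return name in self.known


console_db = {"PC1", "PC2"}
service = PxeService(console_db)

console_db.discard("PC1")                  # delete the record from the console
before_restart = service.is_known("PC1")   # still remembered by the service
service.restart()
after_restart = service.is_known("PC1")    # now forgotten
```

    Deleting the record alone changes nothing in the model, because the service never looks at the database again until it restarts.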

    To me, this looks like either a race to reboot, or something as yet undiscovered if you have no site servers, but I don't know all the details either...



  • 6.  RE: Reboot to production failing on some machines (post MP1)

    Trusted Advisor
    Posted Nov 27, 2012 03:14 PM

    Just wanted to update the thread - just had this on another machine.  Seems like it's going to be a common post-MP1 issue for us (we never saw this before MP1, not once).  Machine booted to PXE, I imaged it, and it booted to PXE instead of production, which is the task I sent.

    Deleted the machine out of the console, didn't touch services or SBS.  Rebooted the machine, and it booted to the initial deployment window (as if it were a new machine).  The first/default choice for me in initial deployment is reboot to production; I chose that, and it is now booting to production OK.



  • 7.  RE: Reboot to production failing on some machines (post MP1)

    Trusted Advisor
    Posted Nov 29, 2012 12:35 PM

    For the machine above that I deleted 2 days ago to make it stop booting to PXE, today I realized that reboot to PXE isn't working on that same machine.  I can't create an image if the machine can't successfully boot to PXE without me deleting it first.



  • 8.  RE: Reboot to production failing on some machines (post MP1)

    Posted Dec 10, 2012 07:00 AM

    Hm .. sounds to me like your firm.exe is missing. 

    I had (if I understood your problem correctly) the same issue after I upgraded also.

    Firm.exe starts the DeployAnywhere thingy, and it seems to fail after the image has been ghosted over.

    Check to see if your firm.exe is under ..Altiris\Altiris Agent\Agents\Deployment\Task Handler\
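
    If you want to check a batch of machines rather than browse by hand, something like this could do it.  It's only a sketch: firm_exe_present and its altiris_dir argument are names I made up, and you have to pass in wherever the Altiris folder actually lives on your systems.

```python
from pathlib import Path

# Folder names below the Altiris directory, taken from the path above.
TASK_HANDLER_PARTS = ("Altiris Agent", "Agents", "Deployment", "Task Handler")


def firm_exe_present(altiris_dir) -> bool:
    """Return True if firm.exe exists in the Task Handler folder."""
    handler_dir = Path(altiris_dir).joinpath(*TASK_HANDLER_PARTS)
    if not handler_dir.is_dir():
        return False
    # Compare names case-insensitively, as Windows does for file names.
    return any(p.name.lower() == "firm.exe" for p in handler_dir.iterdir())
```

    It returns False both when the file is missing and when the Task Handler folder itself is not there.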

    I might be off on this, but it does sound very similar to a problem I had =)