We had an issue with our FT SS setup that we got resolved, and we now see TCP 14002 listening on the FT SS. However, the Landscape view on the OC Admin page shows the Secondary status as NOT READY. We revalidated the .hostrc and .locrc file setup on both hosts. SSdb synchronization does work. During failover of the primary, the OneClick client shows neither a red border nor a yellow border. The connection status icon shows green, not yellow or red, although the OC client cannot ping any device. So it looks like the OC client is in limbo.
We see that OC does talk to the FT SS on 14004 when we shut down the primary SS (SpectroSERVER and processd down), but no connection is initiated to 14002.
The question is: what is required from OC so that it will consider the FT SS READY for failover in the Landscape view?
Check whether hostname resolution between the OneClick server and the secondary SpectroSERVER is working. Try adding them to the local hosts file and see if that helps.
Do you have a firewall between both SS? Usually all ports are from OC to SS, but one is from SS to OC (14001/TCP).
Does the Secondary SpectroSERVER have the .vnmrc setting secondary_polling=yes included so that it functions as a hot standby?
If this is not the case I would suggest to open a support issue so that we can check further.
When you run $SPECROOT/SS/MapUpdate -view on both Primary and Secondary what do you see?
Do you have OneClick Server hostname on both Primary and Secondary's .hostrc file?
Looks more like a host resolution issue between the servers. Try pinging the servers and see if you are getting a response. If not, adding the host name to "hosts" file will help.
Maybe Todd_Kornely could give a detailed overview of that process?
No firewall, thankfully, on same subnet.
No, does it have to? We are trying for warm standby. Doc says hot, warm, cold modes. Already have a case open also for 2nd week now.
We have wait_active=no on the secondary SS. Right now only 1 model is not activated, and we see TCP 14002 listening.
Just don't know the exact steps OC does to determine secondary status is READY so I can debug those specific steps, same question in case too.
We did not have OC server in .hostrc file but I added OC server on both now. Instead we had + in both .hostrc files.
From Secondary FT SS
> ./MapUpdate.exe -v
Connecting to Location Server running at: ITGC2W000212 port 0xdaff
Landscape  Service Type Name  Hostname                       Port   Precedence Notes
---------- ------------------ ------------------------------ ------ ---------- ------------------
0x500000   Landscape Default  itgc2w000145                   0xbeef 10         1M model mask
0x500000   Landscape          itgc2w000145                   0xbeef 10         1M model mask
0x500000   Landscape Default  itgc2w000212                   0xbeef 20         1M model mask
0x500000   Landscape          itgc2w000212                   0xbeef 20         1M model mask
0x500000   Events             itgc2w000212                   0xbafe 20         1M model mask
0x500000   Events             itgc2w000145                   0xbafe 10         1M model mask
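Since Wireshark keeps coming up in this thread: the ports in the MapUpdate output above are printed in hex, while Wireshark filters are usually written in decimal. Here is a small Python sketch that converts them and builds a display filter. The function names are mine, not anything from Spectrum, and the port labels simply follow the output above.

```python
# Hex port values copied from the MapUpdate output above.
MAPUPDATE_PORTS = {
    "Location Server": "0xdaff",
    "Landscape": "0xbeef",
    "Events": "0xbafe",
}


def to_decimal(hex_port: str) -> int:
    """Convert a hex port string such as '0xdaff' to its decimal value."""
    return int(hex_port, 16)


def wireshark_filter(hex_ports) -> str:
    """Build a Wireshark display filter that matches any of the given ports."""
    return " || ".join(f"tcp.port == {to_decimal(p)}" for p in hex_ports)


if __name__ == "__main__":
    for name, hex_port in MAPUPDATE_PORTS.items():
        print(f"{name}: {hex_port} = {to_decimal(hex_port)}")
    print(wireshark_filter(MAPUPDATE_PORTS.values()))
```

Paste the printed filter into the Wireshark display filter bar to narrow a capture to traffic on those ports.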
It's not case sensitive, is it? DNS seems to resolve whether the name is upper or lower case. I assume case in .hostrc should not matter.
Already had local hosts file on OC server and already has FT SS in it. Resolution is good. Wireshark showed OC server talking to FT SS on 14004 but never on 14002 so resolution must have worked for that communication at least from local hosts.
If the problem was occurring when a + sign was the only thing in the .hostrc file, then that itself is not the issue anyway.
On 9 March 2018 at 14:30, JonV <email@example.com>
The OneClick web server machine should be connected to both SpectroSERVERs. Here is the list of established connections:
OC Port <Random Port> --------> secondary SS Port 14002 (SpectroSERVER)
OC Port <Random Port> --------> secondary SS Port 14003 (Archive Manager)
OC Port 14001 <-------- primary SS Port <Random Port> (SpectroSERVER to push alarm to OC)
OC Port <Random Port> --------> primary SS Port 14002 (SpectroSERVER)
OC Port <Random Port> --------> primary SS Port 14003 (Archive Manager)
OC Port <Random Port> --------> primary SS Port 14004 (Location Server)
From the OneClick web server machine, what happens if you run telnet?
C:\> telnet itgc2w000212 14002
Was the connection successful?
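If telnet is not installed on the OneClick server, the same check can be sketched in Python with only the standard library. This is a rough equivalent of the telnet test, not a Spectrum tool: it only proves that a TCP connection can be opened, nothing about what OneClick does afterwards. The hostnames and ports are the ones from this thread; port 14001 is left out because that connection goes from the SS to OC.

```python
import socket

# OC -> SS ports from the connection list above.
PORTS = (14002, 14003, 14004)


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_host(host: str, ports=PORTS) -> dict:
    """Map each port to True (reachable) or False (not reachable)."""
    return {port: port_open(host, port) for port in ports}


if __name__ == "__main__":
    for host in ("itgc2w000145", "itgc2w000212"):
        for port, ok in check_host(host).items():
            print(f"{host}:{port} {'open' if ok else 'unreachable'}")
```

Run it from the OneClick web server machine so the result reflects the path OneClick itself would use.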
Just asking because it's not mentioned here: is port 14006 also open from OneClick to the secondary SS?
Did you try to take a look at the tomcat log file before and after switching to secondary SS to see if there are any errors written there?
Was a Tomcat restart also tried?
Would you mind trying the MapUpdate command on the other server? Just to see if that works in both direction? To me, the "NOT READY" status is something preventing the communication between both SS, nothing related to OC.
Yes, the secondary is listening on 14002, 14003. Here's a wireshark from the OC server. Used Putty to generate a connection to 14002. The port is responding to connection requests.
What I really need is the step-by-step of how OC determines that the secondary SS status is READY, and the steps OC takes when deciding to connect to the SS.
The instructions in the doc tell you how to configure it, but the actual steps the code takes are not explained. I think I need those steps so I can use Wireshark to see which step is not working, beyond "failover is not working" in general. I can connect to port 14002 from the OC server. But what happens before that as a prerequisite, or after that, so I can validate whether those steps are occurring? We just upgraded to 10.2.3 but have the same issue, so it must be somewhere in the setup; some step or steps are failing.
I do see 14006 listening on the secondary SS. We have restarted Tomcat for numerous other reasons but the NOT READY status doesn't change. Good idea on the Tomcat log before and after; I can do that test. I took Wireshark traces yesterday on the OC server, primary and secondary SS after restarting Tomcat, to try to find the communication that would be the OC server checking the status of the secondary SS. But I don't know the exact steps of that check, so I don't know what I'm looking for. Hence the question to support and in this thread: what exact steps does the OC server take to decide whether the secondary SS is READY or NOT READY?
One thing is that our secondary SS has never gone to 100% models activated. On 10.2.1 it was one model that had voice that was our slowest model to activate. It always activates on the primary. Now on 10.2.3 there's always some different model on the secondary that doesn't activate completely, but activation is always 100% on the primary. We have wait_active=no on the secondary to make 14002 go to the listening state.
The NOT READY message is not because of this is it?
ITGC2W000145%/d/CA_Spectrum/SS-Tools> ./MapUpdate.exe -v
Connecting to Location Server running at: ITGC2W000145 port 0xdaff
Secondary Fault Tolerant Spectroserver
ITGC2W000212%/d/CA_Spectrum/SS-Tools> ./MapUpdate.exe -v
Connecting to Location Server running at: ITGC2W000212 port 0xdaff
Do you mind running this syntax on the OC host?
It is not clear the way you tested the connection using Putty.
It's possible the 99% activation is also contributing to the NOT READY state of the Secondary. There are other symptoms that occur if the models are not 100% activated, such as the VNM showing the "initial" state and polling being paused until activation is complete.
If you are not sure which device is causing the activating delay, you could enter this line in the .vnmrc and restart the server, it should log model activation status in the VNM.OUT
Telnet is not something that is on my Windows server. So I have Putty.exe client. Simply use SSH option and change the port to 14002 from 22 then enter the hostname or IP that you want to connect to. Same concept as telnet hostname port#. Works the same.
Thanks Jay! We do have mdlact_debug=true on both SSs. It is very helpful for seeing whether the same model is getting stuck, and it has helped in the past. What's interesting is that we can stop the secondary SS and it always goes inactive; it does not hang on "waiting on model activates to complete". We just changed to wait_active=yes and the secondary SS went from Starting into the RUNNING state. I do see 14002 listening now, along with the other required ports.
VNM model on the primary always shows 100% models activated so not sure if VNM model tracks the secondary SS model activation.
We never make it to 100% model activation on secondary SS. However, primary SS is always 100% active.
Today this is what we see from the Spectrum stop/start on the secondary SS.
It's basically a different model each time. I wonder if this is just how it is and the VNM.OUT on the secondary never shows 100% model activation. Not sure about this.
Mar 14 13:35:35 Model activation status - 429639/429640 activated( 99% ), 1 problematic.
currently processing 1 model activate triggers:
Model 0x50ec18 "SWHQRTR2.vfc.com"/Rtr_Cisco - class CsIHCiscoIfConfig - 749 seconds
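To see at a glance how far from 100% the secondary really is, the status line above can be parsed with a few lines of Python. This is only a sketch: the regular expression is written against the single log line quoted here, so it is an assumption that other VNM.OUT status lines follow the same layout.

```python
import re

# Pattern based on the one status line quoted in this thread (an assumption).
STATUS_RE = re.compile(
    r"(\d+)/(\d+) activated\(\s*(\d+)%\s*\),\s*(\d+) problematic"
)


def parse_activation(line: str):
    """Extract activation counts from a status line, or return None."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    activated, total, shown_pct, problematic = map(int, m.groups())
    return {
        "activated": activated,
        "total": total,
        "remaining": total - activated,
        "shown_pct": shown_pct,
        "exact_pct": 100.0 * activated / total,
        "problematic": problematic,
    }


if __name__ == "__main__":
    sample = ("Mar 14 13:35:35 Model activation status - "
              "429639/429640 activated( 99% ), 1 problematic.")
    info = parse_activation(sample)
    print(f"{info['remaining']} model(s) remaining, "
          f"{info['exact_pct']:.4f}% activated (log shows {info['shown_pct']}%)")
```

For the line quoted above, exactly one model out of 429640 is outstanding, so the "99%" in the log corresponds to a single stuck model.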
Agreed. If Putty works, that basically means the TCP connection is allowed.
Can you also check the DNS resolution? What if you do a nslookup on the name? What if you do a nslookup on the IP address? (I know /etc/hosts is filled-in, but if a dns resolution is attempted, sometimes, that might play a role).
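The forward and reverse lookups can also be scripted. A minimal Python sketch follows, with one caveat: unlike nslookup, `socket.gethostbyname` and `socket.gethostbyaddr` go through the operating system resolver, so entries in the local hosts file are normally taken into account. The helper names and the "consistent" check are my own.

```python
import socket


def forward_lookup(name: str):
    """Resolve a hostname to an IPv4 address, or return None on failure."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def reverse_lookup(ip: str):
    """Return all names (primary plus aliases) for an IP, or an empty list."""
    try:
        hostname, aliases, _ = socket.gethostbyaddr(ip)
        return [hostname] + list(aliases)
    except OSError:
        return []


def round_trip(name, forward=forward_lookup, reverse=reverse_lookup):
    """Check that name -> IP -> names leads back to the same short name."""
    ip = forward(name)
    names = reverse(ip) if ip else []
    short = name.split(".")[0].lower()
    consistent = any(n.split(".")[0].lower() == short for n in names)
    return {"name": name, "ip": ip, "reverse": names, "consistent": consistent}


if __name__ == "__main__":
    for host in ("itgc2w000144", "itgc2w000145", "itgc2w000212"):
        print(round_trip(host))
```

Run it on the primary SS, the secondary SS and the OC server. A result with `consistent` set to False is only a hint to look closer, since a host with several aliases may legitimately reverse-resolve to a different name.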
But did you supply the hostname or the IP address?
We found the issue! In the Spectrum admin page there is a "Spectrum Configuration" section. That section has fields for the "Main Location Server" name and the "Backup Location Server" name. The MLS field held the hostname of the OC server, not of the SpectroSERVER: it should have been itgc2acaspectrum2 but it was itgc2acaspectrum1 (the OC server itself). After correcting this hostname and bouncing Tomcat, OC found the MLS, I suppose, and our status changed from NOT READY to READY. I do not recall going into this "Spectrum Configuration" section before, so I am guessing this may have been a typo from our setup by the CA consultant, but it is hard to know. At any rate, failover works and everything works as it should now. Honestly, I cannot find anything in the Spectrum wiki doc on this "Spectrum Configuration" page.
All the nslookup by name and IP on primary SS, secondary SS, and OC server resolve correctly. We do have etc hosts on each machine so I don't see any typos in them.
It would be nice to know exact steps that OC server is performing to determine secondary SS is "READY" versus "NOT READY". So, example, every 120 seconds OC server does X, then Y, so I can figure out what debug to turn on or how to look at Wireshark to determine if each step is being done.
We see the OC server talking to TCP 14004 but never to TCP 14002. If it needs to talk to TCP 14002 as part of counting the server as ready, what steps does it perform before that which we can look at to see if they're failing?
I used the hostname for the Putty test.
My manager found this. This is our issue. How do we check #3 in this tech article on the primary?
Spectrum OneClick Administration -> Landscapes pag - CA Knowledge
This is from Secondary, but primary shows same. Is this significant?
> ./HostUpdate.exe -v
Landscape  Service Type Name  Hostname                       Port   Precedence Notes
---------- ------------------ ------------------------------ ------ ---------- ------------------
0          Process Daemon     itgc2w000145                   0xfeeb 10         1M model mask
0          Process Daemon SS  itgc2w000145                   0xfeeb 10         1M model mask
0          Process Daemon     itgc2w000212                   0xfeeb 10         1M model mask
0          Process Daemon SS  itgc2w000212                   0xfeeb 10         1M model mask
ITGC2W000212%/d/CA_Spectrum/SS-Tools
#3 is about the result of "MapUpdate -v" on the MLS. I believe you got it right that the Secondary SpectroSERVER is listed there. Another implication is that, on your OneClick server, your MLS hostname is set in the locServerName parameter of the bottom Resource tag in the $SPECROOT/tomcat/webapps/spectrum/META-INF/context.xml file. On the same Resource tag, I would also check that the adminUserName parameter is set correctly to the 'spectrum' install owner user.
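To read those two values without scrolling through the file, something like the Python sketch below could be used. It assumes locServerName and adminUserName are plain XML attributes on a Resource element, which is how I read the description above; I have not verified that against an actual context.xml, so adjust if the file is laid out differently.

```python
import os
import xml.etree.ElementTree as ET


def location_server_settings(xml_text: str):
    """Return locServerName/adminUserName from every Resource that has them.

    Assumes both values are attributes of a <Resource> element.
    """
    root = ET.fromstring(xml_text)
    found = []
    for res in root.iter("Resource"):
        if "locServerName" in res.attrib:
            found.append({
                "locServerName": res.get("locServerName"),
                "adminUserName": res.get("adminUserName"),
            })
    return found


if __name__ == "__main__":
    specroot = os.environ.get("SPECROOT")
    if specroot:
        path = os.path.join(specroot, "tomcat", "webapps", "spectrum",
                            "META-INF", "context.xml")
        with open(path, encoding="utf-8") as fh:
            for entry in location_server_settings(fh.read()):
                print(entry)
    else:
        print("SPECROOT is not set")
```

The printed locServerName should be the hostname of the main location server (a SpectroSERVER host), not the OneClick server's own name.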
Oh yes, the 'spectrum' install owner user should be common to both Primary and Secondary SpectroSERVERs.
Here are details of our setup plus info on context.xml.
The adminUserName is the Spectrum owner account which is correct.
All nslookup of hostnames resolve correctly. Pinging all hostnames from command prompt resolve correctly.
On primary and fault tolerant SS in the hosts there is plus sign as well as hostnames.
Primary SS Host Security Button from Spectrum Control Panel
Secondary SS Host Security Button from Spectrum Control Panel
Primary SS - itgc2w000145 itgc2acaspectrum2 <--DNS hostnames itgc2caspectrum2 is host we installed SS with
Secondary SS - itgc2w000212 <-- DNS hostname
OC Server - itgc2w000144 itgc2acaspectrum1 <--DNS hostnames itgc2caspectrum1 is host we installed OC with
Primary SS etc/hosts
10.1.9.141 itgc2acaspectrum2 itgc2acaspectrum2.vfc.com itgc2w000145 itgc2w000145.vfc.com
10.1.9.140 itgc2acaspectrum1 itgc2acaspectrum1.vfc.com itgc2w000144 itgc2w000144.vfc.com
10.1.9.60 itgc2w000212 itgc2w000212.vfc.com
OC Server etc/hosts
10.1.9.140 itgc2w000144 itgc2acaspectrum1 itgc2acaspectrum1.vfc.com itgc2w000144.vfc.com
10.1.9.141 itgc2acaspectrum2 itgc2acaspectrum2.vfc.com itgc2w000145 itgc2w000145.vfc.com
10.1.9.16 itgc2acacabi itgc2acacabi.vfc.com
10.1.9.60 itgc2w000212 itgc2w000212.vfc.com
Primary SS .locrc
Secondary SS .locrc
Primary SS .locregfile
LocationServiceHost=ITGC2W000145
LocationServicePort=0xdaff
ApplicationLocServiceHost=
ApplicationLocServicePort=
BACKUP_MAIN_LOCATION_HOST_NAME=ITGC2W000212 <--- CA Chat case support instructed me to put FT SS here
BACKUP_MAIN_LOCATION_SOCKET_NUMBER=0xdaff
Secondary SS .locregfile