VMware GemFire

 View Only

 Gemfire server can't join the locator

Amar Das's profile image
Amar Das posted Jun 13, 2019 07:35 PM

Hi,

 

I am trying out the instruction for gemfire quickstart at https://gemfire.docs.pivotal.io/96/gemfire/getting_started/15_minute_quickstart_gfsh.html. My locator starts fine. But start server --name=server1 --server-port=40411 fails with the following message

The Cache Server process terminated unexpectedly with exit status 1. Please refer to the log file in /Users/username/gemfire1/server1 for full details.\\n \\nException in thread "main" org.apache.geode.SystemConnectException: Unable to join the distributed system in 60054ms\\n \\n at org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.join(GMSJoinLeave.java:404)\\n \\n at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.join(GMSMembershipManager.java:659)

The server1.log does not have any error either. Any idea what's stopping the server joining the lcoator?

[info 2019/06/13 11:37:35.276 EDT <main> tid=0x1] received FindCoordinatorResponse(coordinator=10.1.30.142(locator1:4867:locator)<ec><v0>:41000, fromView=true, viewId=2, registrants=[10.1.30.142(server1:4896):41001], senderId=10.1.30.142(locator1:4867:locator)<ec><v0>:41000, network partition detection enabled=true, locators preferred as coordinators=true)\\n \\n[info 2019/06/13 11:37:35.276 EDT <main> tid=0x1] Locator's address indicates it is part of a distributed system so I will not become membership coordinator on this attempt to join\\n \\n[info 2019/06/13 11:37:35.276 EDT <main> tid=0x1] findCoordinator chose 10.1.30.142(locator1:4867:locator)<ec><v0>:41000 out of these possible coordinators: [10.1.30.142(locator1:4867:locator)<ec><v0>:41000]\\n \\n[info 2019/06/13 11:37:35.276 EDT <main> tid=0x1] Discovery state after looking for membership coordinator is locatorsContacted=1; findInViewResponses=0; alreadyTried=[10.1.30.142(locator1:4867:locator)<ec><v0>:41000]; registrants=[10.1.30.142(server1:4896):41001]; possibleCoordinator=10.1.30.142(locator1:4867:locator)<ec><v0>:41000; viewId=2; hasContactedAJoinedLocator=true; view=View[10.1.30.142(locator1:4867:locator)<ec><v0>:41000|2] members: [10.1.30.142(locator1:4867:locator)<ec><v0>:41000] crashed: [10.1.30.142(server1:4896)<v1>:41001]; responses=[]\\n \\n[info 2019/06/13 11:37:35.276 EDT <main> tid=0x1] found possible coordinator 10.1.30.142(locator1:4867:locator)<ec><v0>:41000\\n \\n[info 2019/06/13 11:37:35.277 EDT <main> tid=0x1] Probable coordinator is still 10.1.30.142(locator1:4867:locator)<ec><v0>:41000 - waiting for a join-response\\n \\n[info 2019/06/13 11:37:47.282 EDT <main> tid=0x1] Stopping membership services\\n \\n[info 2019/06/13 11:37:47.283 EDT <Geode Failure Detection Server thread 1> tid=0x1f] GMSHealthMonitor server thread exiting

 

 

Juan Cassella's profile image
Juan Cassella

Hello Amar,

 

Thanks for contacting the Pivotal Support Community!.

This is certainly weird, I've tried the steps on my local machine and it works just fine. Do you have any firewall or something that might be limiting the communication between the locator and the server?. Maybe starting your server with --log-level=fine will give you more information as to why it can't join the distributed system?.

Best regards.

Amar Das's profile image
Amar Das

Thanks Juan. Let me try --log option.

 

These commands used to work few days ago. As you pointed out there might be some network configuration changes. While server1 comes up, I see these messages in the locator log. Most interesting line is

 

[warn 2019/06/13 13:41:06.746 EDT <TransferQueueBundler,MacBook-Pro-11919> tid=0x1b] JGRP000034: 10.1.30.142<v0>:41000: failure sending message to /10.1.30.142: java.io.IOException: Operation not permitted (sendto failed)

 

[info 2019/06/13 13:41:06.531 EDT <locator request thread 1> tid=0x23] Peer locator: coordinator from view is 10.1.30.142(locator1:6068:locator)<ec><v0>:41000   [info 2019/06/13 13:41:06.544 EDT <unicast receiver,MacBook-Pro-11919> tid=0x1c] Received a join request from 10.1.30.142(server1:6183):41001   [warn 2019/06/13 13:41:06.746 EDT <TransferQueueBundler,MacBook-Pro-11919> tid=0x1b] JGRP000034: 10.1.30.142<v0>:41000: failure sending message to /10.1.30.142: java.io.IOException: Operation not permitted (sendto failed)   [info 2019/06/13 13:41:06.848 EDT <Geode Membership View Creator> tid=0x24] View Creator is processing 1 requests for the next membership view ([JoinRequestMessage(10.1.30.142(server1:6183):41001) failureDetectionPort:44906])   [info 2019/06/13 13:41:06.849 EDT <Geode Membership View Creator> tid=0x24] preparing new view View[10.1.30.142(locator1:6068:locator)<ec><v0>:41000|1] members: [10.1.30.142(locator1:6068:locator)<ec><v0>:41000, 10.1.30.142(server1:6183)<v1>:41001{lead}]   [info 2019/06/13 13:41:18.542 EDT <locator request thread 1> tid=0x23] Peer locator: coordinator from view is null   [info 2019/06/13 13:41:18.543 EDT <locator request thread 1> tid=0x23] Peer locator: coordinator from registrations is 10.1.30.142(locator1:6068:locator)<ec><v0>:41000   [info 2019/06/13 13:41:19.889 EDT <Geode Membership View Creator> tid=0x24] finished waiting for responses to view preparation   [warn 2019/06/13 13:41:19.890 EDT <Geode Membership View Creator> tid=0x24] these members failed to respond to the view change: [10.1.30.142(server1:6183)<v1>:41001]   [info 2019/06/13 13:41:19.891 EDT <Geode View Creator verification thread 1> tid=0x4b] checking state of member 10.1.30.142(server1:6183)<v1>:41001   [info 2019/06/13 13:41:19.891 EDT <Geode View Creator verification thread 1> tid=0x4b] member 10.1.30.142(server1:6183)<v1>:41001 failed availability check   [info 2019/06/13 13:41:30.554 EDT <locator request thread 1> tid=0x23] Peer locator: coordinator from view is null   [info 2019/06/13 13:41:30.554 EDT <locator request thread 1> tid=0x23] Peer locator: coordinator from registrations is 10.1.30.142(locator1:6068:locator)<ec><v0>:41000   [info 2019/06/13 13:41:32.331 EDT <Geode Membership View Creator> tid=0x24] adding these unresponsive members to the crash-set for the next view: [10.1.30.142(server1:6183)<v1>:41001]   [info 2019/06/13 13:41:32.332 EDT <Geode Membership View Creator> tid=0x24] preparing new view View[10.1.30.142(locator1:6068:locator)<ec><v0>:41000|2] members: [10.1.30.142(locator1:6068:locator)<ec><v0>:41000] crashed: [10.1.30.142(server1:6183)<v1>:41001]

 

Amar Das's profile image
Amar Das

Juan,

 

Turning on --log-level=fine generated the following DEBUG information. I checked my firewall. It is off. So I am not sure why TcpClient calls are rejected.

 

[debug 2019/06/13 15:38:49.132 EDT <main> tid=0x1] received no join response   [debug 2019/06/13 15:38:49.132 EDT <main> tid=0x1] searching for the membership coordinator   [debug 2019/06/13 15:38:49.132 EDT <main> tid=0x1] sending FindCoordinatorRequest(memberID=10.1.30.142(server1:6429):41001, rejected=[10.1.30.142(locator1:6401:locator)<ec><v0>:41000], lastViewId=2) to [LocatorAddress [socketInetAddress=/10.1.30.142:10334, hostname=10.1.30.142, isIpString=true]]   [debug 2019/06/13 15:38:49.132 EDT <main> tid=0x1] TcpClient sending FindCoordinatorRequest(memberID=10.1.30.142(server1:6429):41001, rejected=[10.1.30.142(locator1:6401:locator)<ec><v0>:41000], lastViewId=2) to /10.1.30.142:10334   [debug 2019/06/13 15:38:49.134 EDT <main> tid=0x1] received response: FindCoordinatorResponse(coordinator=10.1.30.142(locator1:6401:locator)<ec><v0>:41000, fromView=true, viewId=2, registrants=[10.1.30.142(server1:6429):41001], senderId=10.1.30.142(locator1:6401:locator)<ec><v0>:41000, network partition detection enabled=true, locators preferred as coordinators=true)   [info 2019/06/13 15:38:49.135 EDT <main> tid=0x1] received FindCoordinatorResponse(coordinator=10.1.30.142(locator1:6401:locator)<ec><v0>:41000, fromView=true, viewId=2, registrants=[10.1.30.142(server1:6429):41001], senderId=10.1.30.142(locator1:6401:locator)<ec><v0>:41000, network partition detection enabled=true, locators preferred as coordinators=true)

 

Juan Cassella's profile image
Juan Cassella

Hello Amar,

 

There's certainly something weird going on at the network level.

Please go ahead an open a Support Ticket through the Pivotal Support Network [1] so one of our engineers can assist you further.

Best regards.

 

[1]: https://pivotal.io/support

Vishal Naik's profile image
Vishal Naik

Hello,

I'm also getting similar exception

org.apache.geode.SystemConnectException: Unable to join the distributed system in 60250ms

 

My locator is running on my windows machine and server on docker container\

 

Can someone please help me out here

 

 

My locator logs says

[info 2020/06/08 17:36:38.100 IST locator <unicast receiver,WSG2108013-47087> tid=0x1d] Received a join request from 172.17.0.2(server:1)<ec>:32768

[info 2020/06/08 17:36:38.401 IST locator <Geode Membership View Creator> tid=0x25] View Creator is processing 1 requests for the next membership view

[info 2020/06/08 17:36:38.401 IST locator <Geode Membership View Creator> tid=0x25] preparing new view View[172.31.112.45(locator:15708:locator)<ec><v0>:1024|1] members: [172.31.112.45(locator:15708:locator)<ec><v0>:1024, 172.17.0.2(server:1)<ec><v1>:32768{lead}]

[info 2020/06/08 17:36:50.073 IST locator <locator request thread[1]> tid=0x24] This member is becoming coordinator

[info 2020/06/08 17:36:51.402 IST locator <Geode Membership View Creator> tid=0x25] finished waiting for responses to view preparation

[warning 2020/06/08 17:36:51.402 IST locator <Geode Membership View Creator> tid=0x25] these members failed to respond to the view change: [172.17.0.2(server:1)<ec><v1>:32768]

[info 2020/06/08 17:36:51.403 IST locator <Geode View Creator verification thread 1> tid=0x59] checking state of member 172.17.0.2(server:1)<ec><v1>:32768

[info 2020/06/08 17:36:51.403 IST locator <Geode View Creator verification thread 1> tid=0x59] member 172.17.0.2(server:1)<ec><v1>:32768 failed availability check