API Portal: 5.0.2
Database: MySQL 5.7
Gateways: 10.1.00_20220503
Platform: EKS
Hi Community, I have an issue with the Portal: it is not connecting to the proxy I have set up. This had been working fine for a few months, but we noticed last month that the Portal stopped connecting to the proxy.
I have checked networking, and both sides can reach each other with curl. I have also checked the certificates, and they look good.
On the API Gateways I get this error:
2022-11-09T10:51:12.579+0000 SEVERE 2558 com.l7tech.external.assertions.portaldeployer.server.client.PortalDeployerClient: {"message":"Failed connecting to Broker: wss://dev-portal-broker.portal.whapi.aws-eu-west-1.nonprod.williamhill.plc:443/"}
MqttException (0) - java.net.SocketException: Connection reset
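For reference, a sketch of the reachability checks from the gateway side (the broker hostname is taken from the log above; the proxy URL is a placeholder, and note that curl succeeding only proves TCP/TLS, not that the MQTT-over-WebSocket upgrade works):

```shell
#!/bin/sh
# Probe the Portal broker endpoint from the gateway host/pod.
BROKER="dev-portal-broker.portal.whapi.aws-eu-west-1.nonprod.williamhill.plc"

# Direct TLS probe (look for handshake details or a reset):
curl -m 10 -sv "https://${BROKER}:443/" -o /dev/null 2>&1 | grep -Ei 'ssl|http|reset' || true

# Same probe via the configured proxy (placeholder proxy address):
# curl -m 10 -sv --proxy "http://my-proxy.example:3128" "https://${BROKER}:443/" -o /dev/null

echo "probe finished for ${BROKER}"
```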
So I went to check the broker StatefulSet, and I see it failing to fetch a value from the coordinator StatefulSet:
2022-11-09T10:38:35,679 WARN [main] org.apache.druid.java.util.common.RetryUtils - Retrying (1 of 2) in 944ms.
org.apache.druid.java.util.common.IOE: Retries exhausted, couldn't fulfill request to [http://coordinator-1.coordinator.dev-portal.svc.cluster.local:8081/druid/coordinator/v1/lookups/config/__default?detailed=true].
    at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:251) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.discovery.DruidLeaderClient.go(DruidLeaderClient.java:145) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.query.lookup.LookupReferencesManager.fetchLookupsForTier(LookupReferencesManager.java:569) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.query.lookup.LookupReferencesManager.tryGetLookupListFromCoordinator(LookupReferencesManager.java:422) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.query.lookup.LookupReferencesManager.lambda$getLookupListFromCoordinator$4(LookupReferencesManager.java:400) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:86) [druid-core-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:114) [druid-core-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.java.util.common.RetryUtils.retry(RetryUtils.java:104) [druid-core-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.query.lookup.LookupReferencesManager.getLookupListFromCoordinator(LookupReferencesManager.java:390) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.query.lookup.LookupReferencesManager.getLookupsList(LookupReferencesManager.java:367) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.query.lookup.LookupReferencesManager.loadAllLookupsAndInitStateRef(LookupReferencesManager.java:350) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.query.lookup.LookupReferencesManager.start(LookupReferencesManager.java:156) [druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_212]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_212]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_212]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_212]
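The failing lookup request can be reproduced by hand from inside the cluster (the URL is taken verbatim from the log), which helps separate a dead coordinator from a broker-side problem:

```shell
#!/bin/sh
# Reproduce the broker's lookup call, e.g. after kubectl exec into a pod
# in the dev-portal namespace. URL is the one from the log above.
COORD="http://coordinator-1.coordinator.dev-portal.svc.cluster.local:8081"

# Is the coordinator process answering at all? (All Druid processes
# expose GET /status.)
curl -m 10 -s "${COORD}/status" || echo "coordinator not reachable"

# The exact request the broker keeps retrying:
curl -m 10 -s "${COORD}/druid/coordinator/v1/lookups/config/__default?detailed=true" || true

echo "lookup probe done"
```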
In the coordinator there is something related to "apim_metrics_hour":
2022-11-09T10:40:24,348 WARN [KafkaSupervisor-apim_metrics_hour] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Could not fetch partitions for topic/stream [apim_metrics]: java.lang.IllegalStateException: No entry found for connection 1
2022-11-09T10:40:25,438 WARN [KafkaSupervisor-apim_metrics] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Could not fetch partitions for topic/stream [apim_metrics]: java.lang.IllegalStateException: No entry found for connection 1
2022-11-09T10:40:54,404 WARN [KafkaSupervisor-apim_metrics_hour] org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor - Exception in supervisor run loop for dataSource [apim_metrics_hour]
org.apache.druid.indexing.seekablestream.common.StreamException: org.apache.druid.java.util.common.ISE: Previous sequenceNumber [125] is no longer available for partition [0] - automatically resetting sequence
    at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.getOffsetFromStorageForPartition(SeekableStreamSupervisor.java:2526) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.generateStartingSequencesForPartitionGroup(SeekableStreamSupervisor.java:2499) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.createNewTasks(SeekableStreamSupervisor.java:2397) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.runInternal(SeekableStreamSupervisor.java:1066) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor$RunNotice.handle(SeekableStreamSupervisor.java:293) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.indexing.seekablestream.supervisor.SeekableStreamSupervisor.lambda$tryInit$3(SeekableStreamSupervisor.java:749) ~[druid-indexing-service-0.16.0-incubating.jar:0.16.0-incubating]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.apache.druid.java.util.common.ISE: Previous sequenceNumber [125] is no longer available for partition [0] - automatically resetting sequence
    ... 11 more
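This particular error usually means Kafka retention has aged out the offsets Druid stored, so the supervisor loops until its offsets are reset. Druid's Overlord exposes a hard-reset endpoint for exactly this; a sketch only (the Overlord service name and port 8090 are assumptions for this cluster, and a hard reset discards the stored offsets, so some metrics in the gap may be skipped):

```shell
#!/bin/sh
# Hard-reset the Kafka supervisors whose stored offsets Kafka no longer
# has. Overlord address is an assumption; adjust to your deployment.
OVERLORD="http://overlord.dev-portal.svc.cluster.local:8090"

for SUP in apim_metrics apim_metrics_hour; do
  # POST /druid/indexer/v1/supervisor/{id}/reset clears saved offsets;
  # ingestion resumes per the supervisor spec's offset policy.
  curl -m 10 -s -X POST "${OVERLORD}/druid/indexer/v1/supervisor/${SUP}/reset" || true
  echo "reset requested for ${SUP}"
done
```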
And the historical is trying to load data that is no longer on disk. I went to the path, and the file that should be under druid/segments/apim_metrics_hour/ (which should be saved on a PV) is not there, even though I can see an entry with that file name in the DB:
2022-11-09T11:02:32,808 ERROR [ZKCoordinator--0] org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager - Failed to load segment in current location [/var/druid/segments], try next location if any: {class=org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager, exceptionType=class org.apache.druid.segment.loading.SegmentLoadingException, exceptionMessage=IndexFile[s3://api-metrics/druid/segments/apim_metrics_hour/2022-03-08T00:00:00.000Z_2022-03-09T00:00:00.000Z/2022-03-08T14:20:07.097Z/0/2203fa7d-6dff-4ac7-a66f-6335e3966a6e/index.zip] does not exist., location=/var/druid/segments}
org.apache.druid.segment.loading.SegmentLoadingException: IndexFile[s3://api-metrics/druid/segments/apim_metrics_hour/2022-03-08T00:00:00.000Z_2022-03-09T00:00:00.000Z/2022-03-08T14:20:07.097Z/0/2203fa7d-6dff-4ac7-a66f-6335e3966a6e/index.zip] does not exist.
    at org.apache.druid.storage.s3.S3DataSegmentPuller.getSegmentFiles(S3DataSegmentPuller.java:81) ~[?:?]
    at org.apache.druid.storage.s3.S3LoadSpec.loadSegment(S3LoadSpec.java:60) ~[?:?]
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.loadInLocation(SegmentLoaderLocalCacheManager.java:236) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.loadInLocationWithStartMarker(SegmentLoaderLocalCacheManager.java:224) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.loadSegmentWithRetry(SegmentLoaderLocalCacheManager.java:185) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegmentFiles(SegmentLoaderLocalCacheManager.java:164) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.segment.loading.SegmentLoaderLocalCacheManager.getSegment(SegmentLoaderLocalCacheManager.java:131) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.server.SegmentManager.getAdapter(SegmentManager.java:196) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.server.SegmentManager.loadSegment(SegmentManager.java:155) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.server.coordination.SegmentLoadDropHandler.loadSegment(SegmentLoadDropHandler.java:259) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.server.coordination.SegmentLoadDropHandler.addSegment(SegmentLoadDropHandler.java:307) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.server.coordination.SegmentChangeRequestLoad.go(SegmentChangeRequestLoad.java:49) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at org.apache.druid.server.coordination.ZkCoordinator.lambda$childAdded$2(ZkCoordinator.java:148) ~[druid-server-0.16.0-incubating.jar:0.16.0-incubating]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_212]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
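Since the metadata store still references a segment that deep storage no longer has, the mismatch can be confirmed directly. A sketch only: the `druid_segments` table name is Druid's default, the DB host and credentials below are placeholders, and the S3 bucket/prefix come from the log above.

```shell
#!/bin/sh
# Compare segments the metadata store still marks as used against what
# actually exists in S3 deep storage.
DB_HOST="portal-db.example.internal"   # placeholder: your external MySQL host
export MYSQL_PWD="changeme"            # placeholder credential

# druid_segments is Druid's default metadata table; used=1 means the
# coordinator will keep asking historicals to load the segment.
mysql -h "${DB_HOST}" -u druid -D druid \
  -e "SELECT id, used FROM druid_segments WHERE dataSource='apim_metrics_hour' LIMIT 20;" \
  2>/dev/null || echo "mysql query failed (check host/credentials)"

# What deep storage actually holds (bucket/prefix from the log):
aws s3 ls --recursive "s3://api-metrics/druid/segments/apim_metrics_hour/" 2>/dev/null | head -n 20 \
  || echo "s3 listing failed (check AWS credentials)"

echo "segment audit done"
```

If the files really are gone from S3, the corresponding metadata rows would presumably need to be marked unused so the historicals stop retrying the load.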
All pods seem to be running, and I can access the web Portal, but I cannot sync any keys with the API Gateways.
I have redeployed, removing all components and keeping only the same external DB, and the issue persists.
I would rather not wipe the DB at this point, so that I can find out the cause of this issue. The only thing I can think of is that some PV was removed during the EKS AMI upgrade, done as part of EKS platform maintenance, but shouldn't the DB keep that info?
Any hint would be great.
Thanks
------------------------------
Carlos Alberto Pimentel Navarro
Cloud/System Engineer
William Hill
https://www.williamhill.com
------------------------------