Recently we have seen an increasing number of deployments failing in the Pre-Deployment, Distribute to execution server stage. When a deployment fails, the error message displayed looks like "Failed to distribute artifact [artifact_store/releaseId_955482/zip/<path_to_artifact_file>] to execution server [es_<exec_server_hostname>]" (see screenshot).
When we look at the execution server that the artifact could not be distributed to, we see the following error in execution.log:
2019-02-06 07:54:01,234 [http-nio-8443-exec-10] ERROR (com.nolio.platform.server.execution.ExecutionEngineImpl:929) - Error while distributing artifact [artifact_store/releaseId_954895/zip/<path_to_artifact_file>]
com.nolio.platform.shared.communication.postoffice.FileDownloadException: Can not download file to [artifact_store\releaseId_954895\zip\<path_to_artifact_file>]. could not get file [fid:36CB6B0F4BE764D79E7433001860D592] : Did not find any source.
    at com.nolio.nimi.NimiPostOffice.downloadFile(NimiPostOffice.java:118)
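For anyone who wants to collect the affected file ids from execution.log, here is a minimal Python sketch. It assumes the id always appears as [fid:<32 hex characters>], as in the error above; the function name extract_fids is just illustrative.

```python
import re

# Matches the file id as it appears in the error above: [fid:<32 hex characters>].
FID_PATTERN = re.compile(r"\[fid:([0-9A-Fa-f]{32})\]")

def extract_fids(log_text):
    """Return the distinct file ids found in log_text, in order of first appearance."""
    seen = []
    for fid in FID_PATTERN.findall(log_text):
        if fid not in seen:
            seen.append(fid)
    return seen

# Shortened copy of the error message from execution.log.
sample = (
    "FileDownloadException: Can not download file to [...]. "
    "could not get file [fid:36CB6B0F4BE764D79E7433001860D592] : Did not find any source."
)
print(extract_fids(sample))
```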
When this error appears on a deployment, re-running the artifact distribution doesn't seem to clear it. On the execution server, the file 36CB6B0F4BE764D79E7433001860D592 is not present in either the files_cache folder or the files_registry folder. One workaround we've found is to copy the file (in its MD5-renamed state) from somewhere else into the files_cache folder (for example, from the retrieval agent, which still has the file in its files_cache) and then re-run the artifact distribution. The deployment then gets past this error and progresses.
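The check and the workaround above could be scripted roughly as follows. This is only a sketch: it assumes the cached file is stored under the file id as its name with no extension (which is what we see on disk), and the folder paths passed in are whatever applies in your environment. The function names are made up for illustration.

```python
import shutil
from pathlib import Path

def find_cached_file(fid, folders):
    """Return the first existing file named fid in the given folders, or None."""
    for folder in folders:
        candidate = Path(folder) / fid
        if candidate.is_file():
            return candidate
    return None

def restore_to_cache(fid, source_folders, target_cache):
    """Copy fid into target_cache from the first source folder that holds it.

    Returns the destination path, or None if no source folder has the file.
    """
    source = find_cached_file(fid, source_folders)
    if source is None:
        return None
    target = Path(target_cache)
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(source, target / fid))
```

For example, find_cached_file(fid, [files_cache, files_registry]) returning None on the execution server matches the situation described above, and restore_to_cache would then be pointed at the retrieval agent's files_cache as the source.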
I'm not able to replicate this error in testing, so I'm unsure what the root cause is or what the error means.
As part of our standard restart process we delete the following folders under the execution server root directory: activemq-data\nes\LevelDB, persistency, files_cache, temp. The files_cache folder is therefore cleared out frequently.
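To make the restart cleanup concrete, here is a rough Python equivalent of what our process does, with an optional keep argument so that a folder such as files_cache can be left in place. The function name and the keep parameter are hypothetical; our real job is not written this way.

```python
import shutil
from pathlib import Path

# Folders removed by the restart process, relative to the execution server root.
RESTART_CLEANUP = ["activemq-data/nes/LevelDB", "persistency", "files_cache", "temp"]

def clean_for_restart(server_root, keep=()):
    """Delete the restart cleanup folders under server_root, skipping any listed in keep.

    Returns the relative paths that were actually removed.
    """
    removed = []
    for rel in RESTART_CLEANUP:
        if rel in keep:
            continue
        target = Path(server_root) / rel
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(rel)
    return removed
```

Calling clean_for_restart(root, keep=("files_cache",)) would preserve the artifact cache across restarts while still clearing the other three folders.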
We also have a setup customised from the default for our execution servers, so that a different folder called files_action_cache is used for the other cache, which holds the action lib content. This is done by changing <execution_server_root>\webapps\execution\WEB-INF\execution-servlet.xml and setting <constructor-arg value="files_action_cache"/> for <bean id="fileCacheFile" class="java.io.File">. We have had things set up this way for years, and we originally did this following the advice of CA.
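To confirm which folder a given execution server is actually configured with, the bean value can be read back from execution-servlet.xml. The Python sketch below assumes the bean and constructor-arg look exactly as quoted above; the surrounding <beans> element in the sample is my guess at the file layout, not copied from a real server.

```python
import xml.etree.ElementTree as ET

def file_cache_folder(xml_text):
    """Return the constructor-arg value of the fileCacheFile bean, or None if absent."""
    root = ET.fromstring(xml_text)
    for bean in root.iter():
        # Compare local names so the lookup works with or without an XML namespace.
        if bean.tag.split("}")[-1] == "bean" and bean.get("id") == "fileCacheFile":
            for child in bean:
                if child.tag.split("}")[-1] == "constructor-arg":
                    return child.get("value")
    return None

# Assumed minimal layout; only the bean and constructor-arg come from our config.
sample = """<beans>
  <bean id="fileCacheFile" class="java.io.File">
    <constructor-arg value="files_action_cache"/>
  </bean>
</beans>"""
print(file_cache_folder(sample))
```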
As I understand it, files are deleted from files_registry by a cleanup job which runs every 30 minutes and deletes any files older than 30 minutes. Is this customisable in config? I tried to update it using JMX on a NES, but the property value change didn't seem to be applied. If we kept files longer in files_registry then we might see fewer occurrences of this issue.
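To watch what the cleanup job would remove, something like the following Python sketch can list the files in a folder that are past the age limit. It reflects only my understanding of the 30-minute rule and assumes age is judged by file modification time; it is not the product's cleanup code.

```python
import time
from pathlib import Path

def files_older_than(folder, max_age_minutes=30, now=None):
    """Return the sorted names of files in folder last modified more than max_age_minutes ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_minutes * 60
    return sorted(
        p.name
        for p in Path(folder).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )
```

Running this against files_registry shortly before and after a cleanup cycle should show whether files disappear at the age I expect.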
Should we remove the cleanup of files_cache from the jobs which do janitor/housekeeping as part of our restart process?
Should we also increase the size of the files_cache limit? I think this is done in <execution_server_root>\conf\nimi_config.xml? Is there a downside to doing this, maybe on performance?
Also, is there any way to interrogate an execution server's cache other than looking at files_cache on the server (JMX command or otherwise)?
We are running CARA 22.214.171.12409.
I have also looked at the post "Unable to distribute Artifact to Execution Server", but it didn't contain any helpful information on how to resolve the issue.