I am going to try to provide as much info as I can, as well as what we have already tried thus far.
I have a Sandbox environment, version 13.1 sp8. The server specifications are as below:
App: Windows Server 2008 R2 Standard Virtual Machine SP1
RAM: 12GB (Allocated 4GB to the app in the properties.xml)
CPU0: 4 cores, 4 processors
CPU1: 4 cores, 4 processors
C: 40GB (10GB Free)
D: 40GB (26GB Free - Clarity is installed on D:)
DB: Windows Server 2008 R2 Standard Virtual Machine SP1
RAM: 32GB (In SQL Studio, I have set the minimum server memory to 10GB and the max server memory to 28GB)
C: 40GB (19GB Free)
D: 700GB (60GB Free - contains DB files and log file)
N: 2GB (1.77GB Free - used for SSIS)
Usually our Sandbox is just used for testing fixes and small pieces of Development. But we recently upgraded our Development environment to 14.2. We cannot develop in 14.2 version and promote to Production, as Production is still 13.1. So we have been using our Sandbox for Development work.
It is true that more development work is taking place, but to me it looks like the environment should be adequately resourced.
However the development team have reported the following:
Intermittently the system is responding normally, but most of the time we are seeing the below issues:
Jobs are taking longer than normal to run e.g. the 'Annuities Marketing - Disable Unrequired Notification for all users' job normally takes 3 seconds. It took 1 minute 56 seconds to run last night.
Database updates are taking longer than normal to run, e.g. an update statement to a trigger normally takes 1 second, but yesterday one ran for 8 minutes before being cancelled.
General navigation is very slow. e.g. logging in, navigating to projects, opening resources on the admin side, checking the jobs log or process engine
As of this morning the process engine has become stuck. No new processes are starting and the queue length is not reducing - (I have restarted bg service)
Steps I have taken so far:
Increased the memory allocation available to the app from 2.5GB to 4GB (<applicationServerInstance id="app" serviceName="Niku Server" rmiPort="23791" jvmParameters="-Xms4096m -Xmx4096m")
Increased the RAM on the DB server from 16GB to 32GB.
Also applied some of the tips suggested in the performance tuning webinar.
Is there anything else I can try?
I should also add that occasionally when users try to access the url they get a "page cannot be displayed" type of error.
I checked, and all the services were up.
Usually I need to reboot the app VM and it's ok again.
From the architecture it doesn't look to be an issue. If the slowness is specific to processes and jobs, we will have to look for orphans, as we had a known bug there. Also, I am assuming you have trace disabled, as tracing hits the database heavily:
<logger alternateDirectory="/opt/ca/clarity/logs" dynamicConfigurationEnabled="true" traceEnabled="false" traceJDBCEnabled="false"/>
Ensure you add traceEnabled="false" and traceJDBCEnabled="false" so that traces are turned off.
Do let me know how it goes.
Thank you Suman.
So I stopped all services - service stop all.
Opened properties.xml in edit mode and the config now looks like this:
<logger alternateDirectory="" dynamicConfigurationEnabled="true" multitenantErrorReportingEnabled="false" traceEnabled="false" traceJDBCEnabled="false"/>
Then ran admin general upload-config
Restarted services again - service start all.
I have handed back to the Development Team now for further monitoring. Hopefully that will do the trick.
PS We also checked for orphans but none were present.
Thanks again for your help.
Was it a non-Tomcat instance? admin general upload-config is not required for Tomcat.
Can you send me your properties.xml so that I can take a look?
Just emailed you the properties.xml file.
Added the .hprof extension to the heap dump setting so that the heap dump is generated properly.
After making and saving the change, I restarted the services.
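For reference, the usual HotSpot JVM options for producing a heap dump with a proper .hprof extension look like the fragment below. The path shown is purely illustrative; in Clarity these would go into the jvmParameters attribute in properties.xml:

```
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=D:\clarity\logs\heapdump.hprof
```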
Logging in is taking a very long time.
Is this expected?
For the first time after restart it is, but it will settle down.
Logins are timing out for most of our developers. Is there anything else we can try?
Ah, I just got logged in.
Logged out and back in again. Looks ok now thanks!
Can you try hitting the application server directly and see if you can log in? This change doesn't impact login.
OK, so now logins appear to be fine.
Users can log in no problem, but navigation is quite slow. I tried opening timesheets and then moving to the dashboards, but it is taking a long time.
Suman - just opened a case as our Developers are currently unable to do any work.
Another idea we had was to check the bg logs for any errors or jobs that were failing and retrying over and over:
Yesterday 4 bg.log files were created and I see the following error occurring repeatedly:
ERROR 2015-07-23 18:31:25,203 [Dispatch pool-4-thread-7 : bg@SERVER (tenant=clarity)] xql2.xbl (clarity:process_admin:199902280__BD3CCFF4-46D4-4119-B0A6-453C2AF32827:Import Financial Actuals) ****IMPORT WIP ACTUALS: Failed to create assignment for WIP record ID = 23
ERROR 2015-07-23 18:32:41,019 [Post Condition Transition Pipeline 0 (tenant=clarity)] bpm.engine (clarity:process_admin:199899659__EC83C8FF-9981-4531-9B58-0052069CEFAB:none) Error (will retry) caused by Step Instance: com.niku.bpm.engine.objects.StepInstance@501df7fd [Id: 12973610 Process Instance Id: 7373537 Step Id: 5106919 State: BPM_SIS_READY_TO_TRANSITION Step Name: null Start Date: 2014-12-23 14:32:22.113 Expected End Date: null Percent Complete: 0.25 Warned: false Retry Count: 106 No of Pre Conditions: -1 No of Post Conditions: -1 Last Condition Eval Time: 1437658361162 Pre Condition Wait Events: null Post Condition Wait Events: null Pass Conditions:  Error Id: -1
Process Thread: com.niku.bpm.engine.objects.ProcessThread@5f842c22 [Id: 7385671 Parent Step Instance Id: -1 Join Step Instance Id: -1]
Split Threads: null
All four log files are filled with this error.
But when I check the performance dashboard it does look like the job eventually completes. I'm just wondering if all the failures and retries could be leading to the performance problem.
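To gauge how widespread these retries are, here is a small sketch (Python; the "Step Id" field comes from the error lines above, while the file handling and sample lines are illustrative stand-ins) that tallies how often each step ID appears in the bg logs:

```python
import re
from collections import Counter

# Matches the "Step Id: NNN" field in the bg.log error lines quoted above.
STEP_RE = re.compile(r"Step Id: (\d+)")

def tally_steps(lines):
    """Count how many error lines mention each BPM step id."""
    counts = Counter()
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Abbreviated stand-ins for the real log lines; in practice you would
# iterate over every rotated bg log file in the logs directory.
sample = [
    "ERROR ... bpm.engine ... Step Id: 5106919 ... Retry Count: 106 ...",
    "ERROR ... bpm.engine ... Step Id: 5106919 ... Retry Count: 50 ...",
]
for step_id, hits in tally_steps(sample).most_common():
    print(step_id, hits)  # the most frequent step ids are the first suspects
```

A step ID that dominates the tally is the one worth tracing back to its process definition.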
Just a small correction for future clarification: if you modify properties.xml directly, then running admin general upload-config is still advised, even for Tomcat environments.
Essentially it syncs the changes made to the file with the copy held in the CMN_CONFIG table in Clarity. Although this table and its contents were primarily added for Websphere and Weblogic, there are instances where common code that executes regardless of the app server vendor may pick up the data from the table instead of the file. So keeping them in sync is advisable, even if most of Clarity appears to work as expected without doing this.
What is the performance dashboard? It could be some issue with the assignment for WIP record ID = 23 which got fixed, and then it processed.
with regard to our virtual environment and the tuning document:
Dedicate CPU and memory resources to the VMs running CA Clarity PPM.
We have dedicated resources to the VM running Clarity and our performance is still poor.
As there is only 1 app server in the environment the other tips are not applicable to us.
Is there anything else we could try?
Should we also dedicate resources to the DB server?
As this is a development box with 1 app server, I need to know what kind of activities are performed. Your database looks quite good. Also, please describe what sort of performance issue you are facing.
According to our Developers:
If the jobs are taking longer, you could try adding another server and deploying the BG there. But if the updates at the database level are taking longer, you should consult your DBA and do a health check to see if any improvements can be made. Also check the I/O stats between the app and the database; if I/O is slow, transactions do take longer.
as this is a Sandbox environment that we manage ourselves, technically I am the DBA!
So I have checked the SQL Error logs and can see a lot of instances of the error below:
2015-07-29 01:15:38.86 spid4s SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [D:\SQL_Database\niku_2.ndf] in database [niku] (7). The OS file handle is 0x000000000000080C. The offset of the latest long I/O is: 0x000008a7a4e000
These errors point to problems with disk I/O. So we are taking a closer look at how the storage and disk are configured on the database server. I will update further with any progress we make this afternoon hopefully.
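As a quick way to quantify these warnings, here is a sketch (Python; the regex follows the SQL Server message text quoted above, and the sample line is abbreviated) that sums the reported slow-I/O occurrences per database file:

```python
import re
from collections import Counter

# Based on the SQL Server "I/O requests taking longer than 15 seconds"
# message text quoted above.
IO_RE = re.compile(
    r"encountered (\d+) occurrence\(s\) of I/O requests taking longer"
    r".*?on file \[([^\]]+)\]"
)

def slow_io_by_file(log_lines):
    """Sum the reported slow-I/O occurrences per database file."""
    totals = Counter()
    for line in log_lines:
        match = IO_RE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals

sample = [
    "2015-07-29 01:15:38.86 spid4s SQL Server has encountered 2 "
    "occurrence(s) of I/O requests taking longer than 15 seconds to "
    "complete on file [D:\\SQL_Database\\niku_2.ndf] in database [niku] (7)."
]
print(slow_io_by_file(sample))
```

Files that accumulate the most occurrences point to the volumes whose storage configuration should be examined first.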
Thanks Colin - slow I/O really brings performance down drastically.
Just to update you. We previously had 1 large drive to host the database files as well as the log.
Now we have created separate .vmdk(s), one for data and another for logs. We hope that this will allow sequential writes to run quicker and thus boost I/O performance.
We also had another look through the bg logs.
We see the following error or similar occurring repeatedly. Can you please help us to analyse?
WARN 2015-07-31 08:49:00,130 [Post Condition Transition Pipeline 0 (tenant=clarity)] bpm.engine (clarity:process_admin:199923659__78EA89F4-90A1-4996-B49A-FFA0A42462B8:none) Step Instance has be retried 50 times. Step: com.niku.bpm.engine.objects.StepInstance@5c953e9e [Id: 12974738 Process Instance Id: 7374675 Step Id: 5106919 State: BPM_SIS_READY_TO_TRANSITION Step Name: null Start Date: 2014-12-23 15:24:07.08 Expected End Date: null Percent Complete: 0.25 Warned: false Retry Count: 50 No of Pre Conditions: -1 No of Post Conditions: -1 Last Condition Eval Time: 1438332235933 Pre Condition Wait Events: null Post Condition Wait Events: null Pass Conditions:  Error Id: -1
Process Thread: com.niku.bpm.engine.objects.ProcessThread@3cf15ab8 [Id: 7386718 Parent Step Instance Id: -1 Join Step Instance Id: -1]
ERROR 2015-07-31 08:49:00,134 [Post Condition Transition Pipeline 0 (tenant=clarity)] bpm.engine (clarity:process_admin:199923659__78EA89F4-90A1-4996-B49A-FFA0A42462B8:none) Error (will retry) caused by Step Instance: com.niku.bpm.engine.objects.StepInstance@19e7f5f3 [Id: 13040868 Process Instance Id: 7398100 Step Id: 5106919 State: BPM_SIS_READY_TO_TRANSITION Step Name: null Start Date: 2015-01-12 23:34:27.2 Expected End Date: null Percent Complete: 0.25 Warned: false Retry Count: 50 No of Pre Conditions: -1 No of Post Conditions: -1 Last Condition Eval Time: 1438332016388 Pre Condition Wait Events: null Post Condition Wait Events: null Pass Conditions:  Error Id: -1
Process Thread: com.niku.bpm.engine.objects.ProcessThread@4bc431ce [Id: 7411170 Parent Step Instance Id: -1 Join Step Instance Id: -1]
After the addition of hardware, did you check the I/O stats to see if they had improved? This error in the process shouldn't cause navigation issues.
That log entry looks very similar to my post - see "How do you read a bg log error".
I should be very interested to hear the interpretation and the cause.
We have slow performance when the log gets filled with that.
If you feel that this is causing the performance problem, can you please stop the BG service so the process engine doesn't interfere, and test whether performance improves? The interpretation of the above error is that there is a process with step ID 5106919: go to BPM_RUN_STEPS, get the process ID from there, and see exactly what the step is doing and why it is retrying 50 times.
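That lookup can be sketched as follows. The table and column names here are assumptions for illustration, run against an in-memory SQLite stand-in rather than the real Clarity schema - check the actual BPM_RUN_STEPS columns in your own database:

```python
import sqlite3

# Toy in-memory stand-in for Clarity's BPM_RUN_STEPS table; the column
# names are assumptions for illustration, not the real schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bpm_run_steps (id INTEGER, process_id INTEGER)")
con.execute("INSERT INTO bpm_run_steps VALUES (5106919, 7374675)")

# Given the step id from the bg.log error, look up the owning process id,
# then inspect that process definition to see what the step actually does.
row = con.execute(
    "SELECT process_id FROM bpm_run_steps WHERE id = ?", (5106919,)
).fetchone()
print(row[0])
```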
Hope that helps, and have a nice weekend.
How many of those do you have? Tens, hundreds, thousands, tens of thousands?
How big is your bg log? Megs, tens of megs, hundreds of megs? Gigs?
We do restart the bg and move the log to another folder.
That helps until, at some point, the errors start coming again and the situation reoccurs.
We maintain our bg logs at the default setting, so once they grow over 5MB a new log file is created (e.g. bg-ca.log1, bg-ca.log2, etc.).
Currently we have 4 bg files, and each contains the error messages described above.
We were able to identify the process that step 5106919 belongs to. Although there are no active instances of this process in the troublesome environment, there were some old instances that had errored; these instances were aborted, so the engine should not be retrying them. I have deleted these old instances from the environment, so BPM_RUN_STEPS is now clear of step ID 5106919.
We will monitor the logs and performance over the next few hours to see if there is any improvement.
Thanks for the tip!
Just curious. I did further analysis of my case - see the thread referenced above. I found the process instances in the BPM_RUN tables, but the number of initiated process instances shown did not match.
Did you see those processes in the GUI in the initiated instances?
What is your policy on deleting initiated instances?