Hello guys, need help to understand this
We are seeing stall counts on a method we instrumented in our custom PBD, but the response times for that method (we checked max/min) are never greater than 15k ms.
Here is what our metrics look like:
We have a max of 8 threads that could invoke this class/method, so we believe the stall count topping out at 7 is not a coincidence, but we just do not understand why the stall is showing up on the method that is invoked and not on the caller.
We are also not seeing a stall error (error tab) for this method/class. Is there any way to force it to show up there?
Are you sure that the call actually ends? I mean, Avg response time shows up when the call ends, but stall count reports when transactions are active. So, if there is anything that just kills the stalled transaction, but does not let it end correctly, we might lose contact with the call, and might not see a response time.
It's a guess, I don't know if this sounds logical to your situation
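To make the guess above a bit more concrete, here is a minimal Java sketch of the idea. This is a toy model, not Introscope's actual implementation: the class name StallSketch, its methods, and the 15,000 ms threshold are all made up for illustration. The point is only that a call contributes to the average response time when it finishes, while a stall check looks at calls that are still in flight.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only (hypothetical names); this is not how the agent is actually written. */
public class StallSketch {
    private final long stallThresholdMs;                            // assumed threshold, for illustration
    private final Map<Long, Long> activeStartMs = new HashMap<>();  // invocation id -> start time
    private long finishedCount = 0;
    private long finishedTotalMs = 0;

    public StallSketch(long stallThresholdMs) {
        this.stallThresholdMs = stallThresholdMs;
    }

    /** Record that an invocation has started. */
    public void start(long invocationId, long nowMs) {
        activeStartMs.put(invocationId, nowMs);
    }

    /** Normal completion: only here does the call contribute to the response time. */
    public void finish(long invocationId, long nowMs) {
        Long started = activeStartMs.remove(invocationId);
        if (started == null) {
            return; // unknown or already finished invocation
        }
        finishedCount++;
        finishedTotalMs += nowMs - started;
    }

    /** Count invocations that are still in flight and older than the threshold. */
    public int stallCount(long nowMs) {
        int stalled = 0;
        for (long started : activeStartMs.values()) {
            if (nowMs - started > stallThresholdMs) {
                stalled++;
            }
        }
        return stalled;
    }

    /** Average over finished invocations only; in-flight calls are invisible here. */
    public long averageResponseMs() {
        return finishedCount == 0 ? 0 : finishedTotalMs / finishedCount;
    }

    public static void main(String[] args) {
        StallSketch s = new StallSketch(15_000);
        s.start(1, 0);
        s.finish(1, 200);   // a normal, fast call
        s.start(2, 0);      // a call that never reports its end
        System.out.println(s.stallCount(20_000));   // prints 1
        System.out.println(s.averageResponseMs());  // prints 200
    }
}
```

In this toy model the call that never ends shows up as a stall, yet the average response time stays low because only the finished call is counted.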
You must create an entry in your PBD using the tracer type "ExceptionErrorReporter" in order to increment "Errors Per Interval". Otherwise, EPI will always report zero.
Does the target application use AJAX or reverse proxy AJAX (comet)?
In either case, the AJAX call will return a response to the request, thus allowing APM to register the Average Response Time, but the thread is still open and could be forming additional responses/data, or handling browser callbacks along the AJAX path. This would keep the thread alive, which would trigger the stall counts.
Hope this helps,
Thanks for the insights,
This is basically how the app is structured:
There is a pool of threads that are always up (sleeping or running as appropriate) while the app is running, and each of these threads might call the class/method we are instrumenting.
The threads are also being monitored and actually show the endless stalls and huge response times. But I still don't understand why the method that is called on another class behaves like that.
There should not be anything killing the transaction (at least not that we know of so far).
We are actually already using the ExceptionErrorReporter, we do see the stall errors from the threads we are monitoring, but none from this class/method.
We do not have AJAX/Comet running, but we do have some threads that we keep alive. The weird thing we see in the graphs is that the stalls come and go, so I expected that at some point the response time would be computed. If we look at the threads we are monitoring, they show a constant number of stalls (as expected).
I'm on shaky ground on this but from what I know, the average response time is the java agent/auto probe listening for the request, method call, and the response or return of the method call.
The Stall metric is based on the active thread id and monitors threads within the thread pool of the JVM. Autoprobe/java agent will capture the thread ids within the thread pool and compare the active thread ids to the previously captured thread ids.
This would be like a J2EE data source or EJB pool.
I'm going to try to provide an example of what I think I know...again shaky ground ahead.
1. Request is sent from client to server
2. Server uses a process thread, web container thread from the web container pool. Thread 1234
3. Autoprobe - stall captures that thread 1234 is within the current metric measurement cycle
4. Autoprobe - average response time clocks request on 1234
5. Web container thread processes the request which includes a call to your process thread pool
6. Process thread pool currently has no threads within the thread pool so creates thread 5678
7. The process thread pool assigns thread 5678 to the thread 1234 transactional context
8. Autoprobe - average response time clocks request on 5678
9. Autoprobe - stall captures that thread 5678 is within the current metric measurement cycle
10. Thread 5678 processes the request and returns the data response
11. Autoprobe - clocks response on 5678, determines average response time on thread 5678
12. Process thread pool returns response to thread 1234
13. At this point, the thread 5678 is returned to the process thread pool and still listed within the threads of the JVM. Since your custom thread pool isn't one of the standard JVM thread pools, auto-probe would not recognize that the thread is a thread in a pool and monitors it like any other JVM thread.
14. Web Container thread 1234, returns response to original client thread
15. Autoprobe - clocks response to 1234 determines average response time on thread 1234
16. Autoprobe enters next measurement cycle
17. There are no client threads, but the thread enlisted at step 11 is still active. Stall captures that thread 5678 is still active
18. Now, if that thread ages out of the pool or gets used by another request, then the stall capture, I'm guessing, would understand that the thread is processing a different request and reset the stall marker. If not, each cycle that the pool exists becomes another stall tick
19. If the thread is still marked as present within the JVM thread pool for one more cycle, then it is reported as a stall.
Hoping Hiko_Davis will correct me so I can understand how the ART and stall stuff works.
Stalls come from BlamePointTracer.
Incrementing EPI comes from ExceptionErrorReporter.
Make sure you have both in your PBD.