[Author's Note: While I use APM examples below, this approach can apply to any system or application software.]
It was 2 p.m. on an overbearingly hot August day. I had been roused out of a wonderful dream. It appeared that there was a matter needing serious investigation. There had been a heinous crime and the likely culprit was being detained. Within twenty minutes my eyeballs were staring hard at the tearful, youthful offender. Their crime? Crashing a fellow eight-year-old's birthday party and snatching a piece of cake without permission. Oh, what an egregious deed! I could see the signs of their nefarious activity all over this person, including cake crumbs on the chin, smushed icing on one cheek, and much more that I refrain from telling. With as icy a glare as I could muster, I coldly spoke. "Well, young Tommy Sherman, things look bad for you. Perhaps I can get you a deal if you confess. Let's cut to the chase: what took place this afternoon at the birthday party that you were not authorized to attend? Spill it."
"Why, mister, nothing. Nothing happened."
While not as dramatic as the scene above, the response "nothing changed" comes up a good deal when first looking into a case. This blog discusses why that is and what to do to counter it.
Why Do People Say Nothing Changed?
There are a variety of reasons for this:
- Customer sites have many networks and applications, so it can be hard to determine what was done and which networks or applications were impacted.
- People have limited scope. Someone in a network group may only know about some networks and nothing about applications, so they may not be aware of other changes.
- A complete analysis has not been done to date, so not all changes and errors may have been uncovered.
- Frog in boiling water. A rapidly growing environment that is not thoroughly monitored may have finally hit some capacity or metric limit.
- Understanding the impact of a single change. Someone may have been aware of a change but not of its impact.
How Does One Counter "Nothing Changed"?
- Take a larger view and look at change control approvals and results for that time period.
- Review operating system logs and events to see what was taking place. Perhaps it was an operating system error or third-party software was installed at that time.
- Use metric graphs to see what capacity and throughput looked like from the time of the issue back through the previous thirty days. Is there a sudden spike at the time of the issue, or is the level elevated throughout? In some cases, such as MTP logs, you can total the number of packets across seven days to see whether traffic suddenly increased or decreased.
- See if new errors appear in the log only during the time period of interest.
- Check when configuration files were last updated. If that was around the time of the issue of concern, there may be something worth investigating.
- Implement a health check to assess the APM cluster health.
- Implement after-action reviews (AARs) to limit the negative impact of configuration changes.
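Several of the checks above (configuration file dates, new errors in the log, and spikes against a thirty-day history) lend themselves to a quick script. What follows is only a minimal sketch in Python, not a feature of any APM product. The function names, the "timestamp LEVEL message" log layout, and the three-standard-deviation threshold are all assumptions made for illustration, so adjust them to whatever your logs and metrics actually look like.

```python
from datetime import datetime, timedelta
from pathlib import Path
from statistics import mean, stdev


def files_changed_near(paths, incident, window=timedelta(hours=24)):
    """Return (path, mtime) pairs for files last modified within `window` of `incident`.

    `incident` is a naive datetime in local time, matching datetime.fromtimestamp().
    """
    hits = []
    for p in map(Path, paths):
        mtime = datetime.fromtimestamp(p.stat().st_mtime)
        if abs(mtime - incident) <= window:
            hits.append((str(p), mtime))
    return hits


def new_error_lines(baseline_lines, incident_lines, marker="ERROR"):
    """Return error messages seen in the incident period but not in the baseline."""
    def messages(lines):
        # Assumed layout: "<timestamp> ERROR <message>". Keeping only the text
        # after the marker stops timestamps from making every line look new.
        return {ln.split(marker, 1)[1].strip() for ln in lines if marker in ln}
    return sorted(messages(incident_lines) - messages(baseline_lines))


def is_spike(history, current, threshold=3.0):
    """True if `current` is more than `threshold` standard deviations from the history mean.

    `history` might be thirty daily throughput values or seven daily packet totals.
    """
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


if __name__ == "__main__":
    baseline = ["2024-08-01 10:00 ERROR disk full", "2024-08-01 10:05 INFO ok"]
    incident = ["2024-08-14 14:00 ERROR disk full",
                "2024-08-14 14:01 ERROR connection reset"]
    print(new_error_lines(baseline, incident))
    print(is_spike([100, 102, 98, 101, 99], 180))
```

None of these results proves a root cause. They simply turn "nothing changed" into a short list of things that did, which is a better place to start the conversation.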
In conclusion, while investigating, do not jump to conclusions about whether something changed or not. Do your due diligence by analyzing logs, metric graphs, traffic, and so on, and you will probably reach a clearer determination of the root cause and the likely sequence of events.
As for young Tommy Sherman, he got off with a stiff warning. As far as I know, he never crashed another birthday party.