
CA APM Upgrade - How we successfully upgraded Huge APM Production Infrastructure

By Srikant_Noorani posted Mar 26, 2015 02:34 PM

  



I want to share some experience from when I was a customer of CA APM and was required to upgrade a fairly large deployment. It was the beginning of 2013 when we were getting indications that we would finally get approval for the much-awaited upgrade project. We were both excited and a bit nervous: we had been working with various teams for some time to communicate the value and benefits, but we were also under no illusion that this was a small task, as we were looking at around 20 APM clusters with about 200 EMs spread across multiple geographies, monitoring business-critical and mission-critical services.

 

All this started around the second half of 2012. We were growing rapidly and deploying services everywhere, and it was hard to keep up with the demand. At the same time we were running into some major product limitations with the release we were on, an 8.x version that was already a few years old; we were stretching the product to its limits. Given the scale of our deployment it wasn't surprising that we dragged our feet on the upgrade, but we were definitely not blind to the fact that we were running into major performance and stability issues because of it. There were capabilities that would only be available in the newer release, and then there was the question of support and end of life.

 

Why Upgrade

As the monitoring team it was pretty clear to us that to overcome some of the product limitations and to scale we needed to upgrade, but our first challenge was to convince our management and other key stakeholders. We spent the next few months articulating the benefits and importance of the upgrade, along with the challenges and limitations we faced without it. As part of that effort we also had to allay fears over service impact and clearly communicate the backup strategy. The important thing here was to quantify the cost of not upgrading and its business impact. After numerous interactions with various stakeholders we began to get some traction.

 

The Planning And Task List

Now fast-forward to 2013, and as mentioned, things were starting to take shape. We were given the go-ahead to start planning the upgrade of the complete APM infrastructure of roughly 200 EMs. The first step was to build an exhaustive list of all the clusters, a single source of truth. For each cluster we identified the following:

  • All the APM components deployed (JavaScript calculators, extensions, typeviews, etc.)
  • All the services it monitors
  • Any quiet periods/blackout dates, both company-wide and service-specific
  • All the APM admins, business owners, service owners, and other stakeholders
  • Upstream and downstream dependencies (whether it feeds into a third-party tool such as a management console or pulls data from other places)
  • Any special configuration
  • The hardware and OS configuration, along with any external ports
  • The cluster's priority from a business perspective

 

I called this the cluster dependency doc and it became the "go to" doc for everything related to the upgrade.
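Just to make the shape of that doc concrete, here is a minimal sketch in Python of the kind of per-cluster record we tracked. The field names are illustrative only; the real thing was a document, not code.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClusterRecord:
    """One entry in the cluster dependency doc (field names are illustrative)."""
    name: str
    components: List[str]          # JS calculators, extensions, typeviews, etc.
    services: List[str]            # services the cluster monitors
    blackout_dates: List[str]      # company-wide and service-specific quiet periods
    stakeholders: List[str]        # APM admins, business/service owners, etc.
    dependencies: List[str]        # upstream/downstream feeds (3rd-party consoles, data sources)
    special_config: List[str] = field(default_factory=list)
    hardware_os: str = ""          # hardware/OS details plus any external ports
    business_priority: int = 0     # position in the prioritized upgrade list
```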

 

Our next task was to come up with a priority list of these clusters, i.e. which ones needed to be upgraded first and why. A number of factors went into determining that, including how critical the upgrade was to the service (stability, leveraging new APM features, etc.), blackout dates (whether we could upgrade immediately or not), and how quickly the APM team would be ready for the upgrade (if they had to get approvals for new ports, which takes time, they got pushed back in the list). The outcome was a complete timeline based on the prioritized list, which was communicated to the stakeholders. It also included timelines for raising tickets for any network ports to be opened (believe me, this was the worst part and took the most time), any scheduling-related tickets, and so on.
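Purely as an illustration of how those factors interact (the real ranking was a judgment call, and the weights and names below are made up), a priority score could look something like this:

```python
from datetime import date
from typing import Optional

def upgrade_priority(criticality: int,
                     next_free_window: Optional[date],
                     weeks_until_ready: int) -> float:
    """Hypothetical scoring of a cluster for the upgrade queue.

    criticality       -- how badly the service needs the upgrade (1 = low, 5 = high)
    next_free_window  -- first Sunday outside any blackout period, or None if unknown
    weeks_until_ready -- lead time the cluster's team needs (port approvals, etc.)
    """
    # Critical services bubble up; clusters still waiting on approvals or
    # without a usable window drop down the list.
    window_penalty = 10.0 if next_free_window is None else 0.0
    return criticality * 5.0 - weeks_until_ready - window_penalty
```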

 

Testing

In parallel we started setting up testbeds and coming up with a complete test plan. Testing was broken down into two parts: functional and performance. The functional tests covered all the components deployed in the various clusters (from the cluster dependency doc). Because it was an upgrade from 8.x to 9.x we had to make provision for testing additional components like the application map and the APM database, and had to request additional hardware, additional ports to be opened, and so on. This was a huge task and we ran into some challenges early on. During our testing we quickly realized that certain configurations we had in 8.x, like active-active MOM, would no longer be supported. There were a few other changes, for example the config file was broken into various XML files and load balancing worked differently now, but for the most part CA had done a good job of backward compatibility and that helped us a lot.

 

Testing took just over a month to complete, and the results were presented to the key stakeholders.

 

Automation

During the testing phase, and looking at the timeline, we came to the realization that an upgrade of this scale would not be possible without automation. We are talking about hundreds of EMs across multiple data centers. Most of the clusters were at full capacity, i.e. 10 collectors with multiple components deployed. The upgrade window itself was only a few hours on Sundays (barring blackout Sundays), and it was important that we either complete the upgrade or revert back to the old working state. In either case the cluster had to be in a working state at the end of the window. Doing this manually would have been error prone and could have led to service outages.

 

A separate effort was put into automation. We found out that CA APM is very friendly to automated upgrades: the silent install feature, management module portability, SS, and so on allowed us to write simple scripts that could upgrade any complex deployment. The upgrade included stopping agent connections at the iptables level (simply shutting down the EM doesn't cut it, as agents will keep hammering the network), taking a backup of key components, deploying the APM database, converting the configuration to the new format, upgrading the EMs, re-enabling the service, and testing for any failures along the way.
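To give a feel for the flow, here is a heavily simplified sketch in Python. The ports, paths, and installer invocation are placeholders rather than our actual scripts, and the real automation also handled the APM database deployment and config conversion between the backup and install steps.

```python
import subprocess

def run(host: str, cmd: str) -> None:
    """Run a command on a remote EM host over SSH (keys already set up)."""
    subprocess.run(["ssh", host, cmd], check=True)

def upgrade_em(host: str) -> None:
    # 1. Block agent traffic at the network level -- just stopping the EM is not
    #    enough, the agents keep hammering the port (9001 is only an example).
    run(host, "iptables -A INPUT -p tcp --dport 9001 -j DROP")

    # 2. Back up key components before touching anything.
    run(host, "tar czf /backup/em-pre-upgrade.tgz /opt/apm/config /opt/apm/modules")

    # 3. Silent install of the new EM release from a prepared response file
    #    (installer path and flag are placeholders).
    run(host, "/install/apm-installer.bin -f /install/responsefile.txt")

    # 4. Start the upgraded EM, then let the agents back in.
    run(host, "/opt/apm/bin/EMCtrl.sh start")
    run(host, "iptables -D INPUT -p tcp --dport 9001 -j DROP")

    # 5. Smoke test: fail loudly if the EM process is not running.
    run(host, "pgrep -f Introscope_Enterprise_Manager")
```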

 

 

Communication Protocol

Another important aspect of the upgrade was coming up with a communication protocol. This included:

  • Listing the steps that needed to happen before, during, and after the upgrade, along with when each should happen
  • Who to communicate with and what to communicate
  • The situations (success, failure, network issue, hardware issue, etc.) and the persons/groups who needed to be contacted in each
  • A clearly defined escalation path

 

 

 

Pre-Upgrade

The entire prep lasted about a month to a month and a half. We tested and re-tested every aspect multiple times, and all the pieces were falling into place; we were feeling much more comfortable. By now:

  • We had our complete cluster dependency doc with a clear, prioritized cluster upgrade list
  • We had a clear timeline of which tickets to raise and when, and we had already started raising them
  • We had our automated upgrade script tested and ready to go
  • We had our complete communication plan in place

 

 

Upgrade

The actual upgrade was broken down into two parts:

     1. Pre-upgrade tasks that we would complete a day or two prior to the actual upgrade

     2. The upgrade itself

As part of the pre-upgrade tasks we completed everything that could be done without service disruption: backups (management modules, configs, etc.), ensuring our SSH keys were properly set up for automated login, ensuring sufficient disk space, reconfirming all the components, and so on. In other words, we had a pre-upgrade checklist of items to complete in preparation for the real upgrade. Once everything was set and the last checkbox was ticked, we were ready for the upgrade.
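A rough sketch of what such a pre-upgrade check could look like (the paths, thresholds, and backup file name below are illustrative, not our actual checklist):

```python
import subprocess
from typing import List

def preflight(host: str, min_free_gb: int = 20) -> List[str]:
    """Run the day-before checks against one EM host; returns any problems found."""
    problems: List[str] = []

    # 1. Passwordless SSH must work, or the automated upgrade cannot drive the host.
    if subprocess.run(["ssh", "-o", "BatchMode=yes", host, "true"]).returncode != 0:
        return [f"{host}: SSH key login failed"]

    # 2. Enough free disk for the backup plus the new install image.
    df = subprocess.run(["ssh", host, "df", "-BG", "--output=avail", "/opt/apm"],
                        capture_output=True, text=True)
    free_gb = int(df.stdout.splitlines()[-1].strip().rstrip("G"))
    if free_gb < min_free_gb:
        problems.append(f"{host}: only {free_gb} GB free, need {min_free_gb}")

    # 3. The backup of management modules and configs should already exist.
    if subprocess.run(["ssh", host, "test", "-s", "/backup/em-pre-upgrade.tgz"]).returncode != 0:
        problems.append(f"{host}: pre-upgrade backup missing or empty")

    return problems
```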

 

With everything ready to go, we just waited for the first in a series of D-Days. For the first upgrade we focused on a cluster that was considered non-critical (we had put it at the top of our priority list as an exception). It was a production cluster, but the services deployed on it were considered non-essential at that point, and it was relatively small: just two collectors. We were ready with our automated script as well as a manual checklist in case something went wrong. We had gotten the go-ahead from the APM admins, business owners, service owners, and other stakeholders. We pushed the automated script "button" and things started rolling. I could see it spitting out logs with lots of success messages; I was particularly looking for any failures. For the most part it wasn't bad. We did identify a few issues with the script itself, especially around the Postgres DB, which were quickly fixed.

 

With every upgrade we learned a few new things and found opportunities for improvement. The whole upgrade process took us a couple of months, mainly because we were allowed only a few hours on Sundays and there were a few Sundays we could not work. For the most part it was smooth, but we did run into some issues that really put us in a spot for a bit. One such issue was related to the config XML and how it behaved in case of an error: it would ignore all the values without providing a proper message. We were all scratching our heads in the middle of a production upgrade when things would not work as expected; we were able to fix it quickly and get past it. All in all, I think it was a smooth and successful upgrade that was done on time and without any major issue.

 

Lessons Learned

When I look back, I think there were a number of factors that contributed to the success:

  • A clear communication strategy and proper planning
  • Buy-in from the key stakeholders and partnering with them; you need to clearly outline the benefit and value of the upgrade
  • Keeping the stakeholders informed and up to date
  • Test, test, test, and automate (the big ones). I can't emphasize this enough; it is one of the most critical steps. We identified quite a few technical issues, such as:
    • Agents have to be stopped at the network level (iptables), otherwise you are cooked
    • Older config files and baseline strategies have changed
    • MOM active-active config is no longer valid
  • A backup strategy

 

Conclusion

As much as it was important to plan and bring various teams together, I would say the tool itself, CA APM, made our lives much easier. It enabled an easy upgrade in a zero-touch, automated manner. The components, including the management module, SS, configuration, etc., are portable. Apart from the technology, I think setting up a process and partnering with all the stakeholders went a long way in bringing this to fruition.
