Introduction
What does APM database corruption really mean and what can I do about it?
Database corruption is a concern for APM Administrators. It can range from the entire database being inaccessible to a single record being partially destroyed. This is a brief introduction to the topic and what to do about it.
Overall
APM databases can run on various releases of Postgres and Oracle, and both can be subject to database corruption. Typical reasons include:
- Hardware failures including disks, RAID controllers, CPU, etc.
- Operating System issues including bugs or corruption.
- Third-party software such as virus checkers, security hardening, etc.
- Database misconfiguration or reaching database limits
- Other factors, such as operator mistakes
Troubleshooting
Step 1: Determine Scope and Extent of Problem with APM Database
Before contacting Support, it is always helpful to better understand what happened. Questions to answer include:
- When did the issue first happen?
- The scope of the issue -- Is it the entire database, an index, or one or more tables impacted?
- Did the database run out of disk or table space?
- Are there errors or Java exceptions in the database and APM logs that may help explain the issue?
- When was the last database backup or Configuration/Business Transaction export taken?
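The log and disk-space checks above can be partly scripted. The following is a minimal sketch, not an APM-supplied tool: it assumes each log line starts with a timestamp in the form "2024-01-31 23:59:59", and the error keywords are illustrative guesses. Adjust both patterns to what your database and APM logs actually contain.

```python
import re
import shutil
from datetime import datetime

# Hypothetical patterns: tune them to the messages your logs really use.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|PANIC)\b|Exception\b")
# Assumption: a log line begins with a timestamp like "2024-01-31 23:59:59".
TIMESTAMP_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def find_error_lines(lines):
    """Return (timestamp or None, line) for every line that looks like an error."""
    hits = []
    for line in lines:
        if ERROR_PATTERN.search(line):
            match = TIMESTAMP_PATTERN.match(line)
            stamp = (
                datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
                if match
                else None
            )
            hits.append((stamp, line.rstrip("\n")))
    return hits


def first_occurrence(hits):
    """Earliest timestamp among the hits, or None if no hit carries one."""
    stamps = [stamp for stamp, _ in hits if stamp is not None]
    return min(stamps) if stamps else None


def disk_free_fraction(path):
    """Fraction of free space (0.0 to 1.0) on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total
```

To use it, open a log file and pass the file object to find_error_lines, then call disk_free_fraction on the database data directory. Stack-trace continuation lines usually carry no timestamp, so they are reported with None and ignored when looking for the first occurrence.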
Step 2: Back up the database directory and do a Configuration/Business Transaction export to save what you can
- Grab a copy of the database and configuration as soon as you can. This includes:
* APM database backup via command line or pgbackup
* APM Configuration Export
* Screenshots of APM CE Configuration
* APM Business Transaction Export
* Screenshots of Appmap in Investigator
* Save individual APM database tables
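For a Postgres-backed APM database, the command-line backup in the list above could be wrapped as follows. This is only a sketch built on the standard pg_dump utility. The database name, user, host, port, and output directory in the usage note are placeholders, not confirmed APM defaults. Note that pg_dump may fail part-way on a damaged table, which is one reason to also keep a file-level copy of the database directory.

```python
import subprocess
from datetime import datetime


def build_pg_dump_command(db_name, user, host, port, out_dir, now=None):
    """Build a pg_dump command that writes a timestamped custom-format archive."""
    now = now or datetime.now()
    out_file = f"{out_dir}/{db_name}-{now:%Y%m%d-%H%M%S}.dump"
    return [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--format", "custom",  # compressed archive that pg_restore can read
        "--file", out_file,
        db_name,
    ]


def run_backup(command):
    """Run the command; raises CalledProcessError if pg_dump exits non-zero."""
    subprocess.run(command, check=True)
```

A call might look like run_backup(build_pg_dump_command("cemdb", "admin", "localhost", 5432, "/backups")), with every argument replaced by the values for your installation. Building the command separately from running it makes it easy to print and review before anything touches the database.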
Step 3: Open a Support Case
- Open a support case summarizing what was discovered in Steps 1 and 2
- Upload a copy of the database so recovery scripts can be created by Engineering.
- Test the recovery scripts on a non-production database and assess results
- Then deploy recovery scripts on production database and assess results
- Use database recovery tools as directed by APM Support
- Run APM vacuuming as needed
Step 4: Proactive Steps
In the future consider doing the following:
- Evaluate all the applications, defects, and metrics being monitored by APM to see whether each is really needed
- Raise defect thresholds
- APM vacuuming runs each night, but run additional vacuuming as needed
- Periodically check APM database size and investigate reasons for increase
- Perform frequent APM backups and exports
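The periodic size check above can be turned into a simple growth alert. The sketch below assumes you record one (date, size in bytes) sample per check. On Postgres a sample can be taken with the built-in pg_database_size function, and Oracle needs a different query. The 5% per day threshold is an arbitrary illustration, not an APM recommendation.

```python
from datetime import date

# Postgres only: returns the current database's size in bytes.
# Run it through psql or whichever driver you already use.
SIZE_QUERY = "SELECT pg_database_size(current_database());"


def growth_alerts(samples, max_daily_growth=0.05):
    """Return (day, daily_rate) for each sample whose average daily growth
    since the previous sample exceeds max_daily_growth (0.05 means 5% a day)."""
    alerts = []
    ordered = sorted(samples)
    for (prev_day, prev_size), (day, size) in zip(ordered, ordered[1:]):
        days = (day - prev_day).days
        if days <= 0 or prev_size <= 0:
            continue  # skip duplicate days and empty baselines
        rate = (size - prev_size) / prev_size / days
        if rate > max_daily_growth:
            alerts.append((day, rate))
    return alerts
```

Feeding it the samples collected so far returns the dates on which growth deserves a closer look, for example after a new application or a large batch of defects started being monitored.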
Additional Information:
Database Maintenance and Administration References
https://communities.ca.com/thread/116071656
Postgres
http://thebuild.com/presentations/worst-day-fosdem-2014.pdf Presentation on Postgres database corruption (Non-APM).
https://wiki.postgresql.org/wiki/Corruption Postgres & Corruption
https://bucardo.org/wiki/Check_postgres Nice free utility to check Postgres.
Oracle
https://oracle-base.com/articles/misc/detect-and-correct-corruption Tools to repair Oracle databases
Discussion Questions:
1. Are there other steps that should be added?
2. Are you using Database Failover/High Availability solutions? If so, please share your experience
3. What other topics should I cover?