DX Application Performance Management


APM Tech Tip: APM Database Corruption


    Broadcom Employee
    Posted Sep 10, 2015 09:17 AM



    What does APM database corruption really mean and what can I do about it?


    Database corruption is a concern for APM administrators. It can range from the entire database being inaccessible to a single record being partially damaged. This post is a brief introduction to the topic and to what you can do about it.




    The APM database can run on various releases of PostgreSQL or Oracle, and both can suffer database corruption. Typical causes include:


    - Hardware failures, including disks, RAID controllers, CPUs, etc.
    - Operating system issues, including bugs or corruption.
    - Third-party software such as virus checkers, security hardening tools, etc.
    - Database misconfiguration or reaching database limits.
    - Other factors, such as operator mistakes.




    Step 1: Determine the Scope and Extent of the Problem with the APM Database


    Before contacting Support, it helps to understand what happened. Questions to answer include:


    - When did the issue first happen?
    - What is the scope of the issue? Is the entire database impacted, or only an index or one or more tables?
    - Did the database run out of disk or table space?
    - Are there errors or Java exceptions in the database and APM logs that may help explain the issue?
    - When was the last database backup or Configuration/Business Transaction export taken?
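
    As a rough aid for the log review above, here is a minimal Python sketch that counts suspicious lines in a log file and records the first hit for each pattern. The pattern names and the `scan_log_lines` / `scan_log_file` helpers are illustrative assumptions, not part of APM. The message strings are examples of what PostgreSQL and the operating system may log, and the exact wording varies by release, so adjust them to what your own logs show.

```python
import re
from collections import Counter

# Illustrative patterns only (assumptions): real messages depend on the
# database release and the APM version in use.
PATTERNS = {
    "java_exception": re.compile(r"\b[\w.]+(?:Exception|Error)\b"),
    "pg_invalid_page": re.compile(r"invalid page in block", re.IGNORECASE),
    "pg_read_failure": re.compile(r"could not read block", re.IGNORECASE),
    "disk_full": re.compile(r"No space left on device", re.IGNORECASE),
}


def scan_log_lines(lines):
    """Return (counts, first_hit) for each pattern over an iterable of lines.

    first_hit maps a pattern name to (line_number, line_text) for the
    earliest matching line, which helps date the start of the problem.
    """
    counts = Counter()
    first_hit = {}
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                first_hit.setdefault(name, (number, line.rstrip()))
    return counts, first_hit


def scan_log_file(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return scan_log_lines(handle)
```

    The counts and first-hit lines give a quick summary to include in the support case; they do not replace reading the logs themselves.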


    Step 2: Back Up the Database Directory and Do a Configuration/Business Transaction Export to Save What You Can


    - Grab a copy of the database and configuration as soon as you can. This includes:
       * APM database backup via command line or pgbackup
       * APM Configuration Export
       * Screenshots of APM CE Configuration
       * APM Business Transaction Export
       * Screenshots of Appmap in Investigator
       * Save individual APM database tables
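
    For the command-line backup on PostgreSQL, one possible approach is the standard `pg_dump` utility. The sketch below only builds the argument list and leaves running it as a separate step. The database name, user, host, and output directory are placeholders you must replace with your own values, and `build_pg_dump_command` and `run_backup` are hypothetical helper names. Oracle installations would use Oracle's own tools instead.

```python
import datetime
import subprocess


def build_pg_dump_command(db_name, user, host="localhost", port=5432,
                          out_dir=".", when=None):
    """Build a pg_dump argument list for a timestamped custom-format dump.

    All argument values are placeholders supplied by the caller; nothing
    here is specific to APM.
    """
    when = when or datetime.datetime.now()
    out_file = f"{out_dir}/{db_name}-{when:%Y%m%d-%H%M%S}.dump"
    return [
        "pg_dump",
        "--host", host,
        "--port", str(port),
        "--username", user,
        "--format", "custom",   # compressed archive, restorable with pg_restore
        "--file", out_file,
        db_name,
    ]


def run_backup(command):
    """Run the dump; raises CalledProcessError if pg_dump exits non-zero.

    The password is expected to come from the PGPASSWORD environment
    variable or a ~/.pgpass file rather than the command line.
    """
    subprocess.run(command, check=True)
```

    If the database is already damaged, the dump itself may fail partway through. In that case the error output is useful evidence for the support case, and a file-level copy of the database directory (with the database stopped) is still worth taking.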


    Step 3: Open a Support Case
    - Open a support case summarizing what was discovered in Steps 1 and 2.
    - Upload a copy of the database so that Engineering can create recovery scripts.
    - Test the recovery scripts on a non-production database and assess the results.
    - Then deploy the recovery scripts on the production database and assess the results.
    - Use database recovery tools as directed by APM Support.
    - Run an APM database vacuum as needed.


    Step 4: Proactive Steps
    In the future, consider doing the following:


    - Evaluate all the applications, defects, and metrics being monitored by APM to see whether they are really needed.
    - Raise defect thresholds.
    - APM vacuuming runs each night, but do additional vacuuming as needed.
    - Periodically check the APM database size and investigate the reasons for any increase.
    - Perform frequent APM backups and exports.
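
    To make the periodic size check concrete, here is a small sketch that flags unusual growth between recorded size samples. It assumes you record the database size in bytes on a regular schedule, for example with the PostgreSQL query shown in `SIZE_QUERY`. The `flag_growth` helper and the 10% default threshold are illustrative choices, not APM recommendations.

```python
# PostgreSQL query that returns the current database size in bytes.
# Run it on a schedule (for example with psql) and keep the results.
SIZE_QUERY = "SELECT pg_database_size(current_database());"


def flag_growth(samples, max_growth_pct=10.0):
    """Return (label, growth_pct) for samples that grew more than the limit.

    samples is a list of (label, size_bytes) tuples ordered oldest first.
    Growth is measured against the immediately preceding sample.
    """
    flagged = []
    for (_, prev_size), (label, size) in zip(samples, samples[1:]):
        if prev_size <= 0:
            continue  # cannot compute a percentage against an empty baseline
        growth_pct = (size - prev_size) / prev_size * 100.0
        if growth_pct > max_growth_pct:
            flagged.append((label, round(growth_pct, 1)))
    return flagged
```

    A flagged sample is only a prompt to investigate, for instance by checking whether new applications, defects, or metrics were added around that date.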


    Additional Information:


    Database Maintenance and Administration References

    http://thebuild.com/presentations/worst-day-fosdem-2014.pdf Presentation on Postgres database corruption (Non-APM).
    https://wiki.postgresql.org/wiki/Corruption  Postgres & Corruption
    https://bucardo.org/wiki/Check_postgres  Nice free utility to check Postgres.


    https://oracle-base.com/articles/misc/detect-and-correct-corruption Tools to repair Oracle databases


    Discussion Questions:

    1. Are there other things that should be added as steps to do?

    2. Are you using Database Failover/High Availability solutions? If so, please share your experience

    3. What other topics should I cover?