DX NetOps

Lost data for period of DR backup

  • 1.  Lost data for period of DR backup

    Posted Jul 09, 2013 07:21 PM
    I was under the impression that the Data Collector would continue to collect data and cache it locally when the Data Repository or the Data Aggregator was down.

    However, I have no data in any graphs for the period during which Vertica is backing up its database.

    My backup starts at 3:00am and finishes at 4:16am. That in itself I thought was slow. That time frame matches exactly the period missing from the graphs each day.

    version 2.2.2

    I was under the impression it worked as follows:

    - if the DR is down, the DA will shut itself down after a period of not being able to access the DR. A cron job on the DA tries to start the DA every x minutes if it is not already running
    - if the DA is down, the DC will continue polling and cache the data locally until the DA comes back online. When the DA is back online, the DC will forward its backlog.

    So my questions, asked of CA directly, are:

    - do you expect a loss of data during DR backup?
    - is an hour and 15 minutes an excessive amount of time for a DR backup on a system with only 58,000 polled items, with 5-minute polling for everything except about 5 interfaces?
    - should the DC cache the data and, if so, for how long? Free disk space is not an issue.
    - does CA consider losing 1.25 hours of data per 24 hours acceptable?
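
    For scale, the size of the daily gap can be estimated from the figures in this post. The sketch below is a rough calculation only: it assumes all 58000 items are polled every 5 minutes (the post notes about 5 interfaces are not) and that only whole poll cycles inside the 3:00am to 4:16am window are lost. The function names are illustrative.

```python
from datetime import datetime

POLL_INTERVAL_MIN = 5     # from the post: 5-minute polling
POLLED_ITEMS = 58000      # from the post: roughly 58000 polling items


def backup_gap_minutes(start: str, end: str) -> int:
    """Length of the backup window in whole minutes (same-day HH:MM times)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def missed_samples(gap_minutes: int, items: int, interval_minutes: int) -> int:
    """Samples lost if every item misses each full poll cycle in the gap."""
    return (gap_minutes // interval_minutes) * items


gap = backup_gap_minutes("03:00", "04:16")
print(gap)                                                   # 76
print(missed_samples(gap, POLLED_ITEMS, POLL_INTERVAL_MIN))  # 870000
print(f"{gap / (24 * 60):.1%} of each day")                  # 5.3% of each day
```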


  • 2.  RE: Lost data for period of DR backup

    Posted Jul 09, 2013 10:05 PM
    Looking at the Data Aggregator logs, it seems to just keep slamming transactions across to Vertica no matter how often they fail. There is one VJDBC error after another while the Vertica backup is taking place.

    Partial error message:
    org.springframework.transaction.TransactionSystemException: Could not commit JDBC transaction; nested exception is java.sql.SQLException: [Vertica][VJDBC](5157) ROLLBACK: Unavailable: [Txn 0xa000000160068d] X lock Global Catalog - timeout error Timed out X locking Global Catalog. S held by [user ****** (Object Snapshot)]. Your current transaction isolation level is READ COMMITTED
    So why doesn't the DA shut down until Vertica becomes available again? I am assuming the Vertica backup brings the database down to back it up. Note this is a single-node DR.


  • 3.  RE: Lost data for period of DR backup

    Posted Jul 11, 2013 09:07 PM
    The DA should attempt to shut itself down if it can't communicate with Vertica. Check the DA's shutdown.log file to see status messages regarding its connection with Vertica.

    But why is the DR being shut down? You shouldn't need to shut it down to back up the data. Or is the DR staying up but just getting incredibly slow during the backup period?


  • 4.  RE: Lost data for period of DR backup

    Posted Jul 11, 2013 09:44 PM
    Unfortunately it's not shutting itself down. When I looked through the DA logs for the period of the Vertica backup, I see failed transactions where the DA is trying to post to the database for the whole 1 hour 15 minutes that the backup takes. What method does the DA use to detect whether Vertica is available? This is obviously where the problem is.

    When the DR does its backup, does it halt the database? I am using the standard script that comes with the product, following the instructions in the DR manual. Isn't this script really just doing an rsync? I would have thought that could be done with the DB staying live. If that is the case, then something else is preventing the DA from posting during the backup.

    So in summary I guess my questions are:

    - when the backup script runs, does it stop the database?
    - if it does stop the database, why doesn't the DA recognise this and act appropriately?
    - if it doesn't stop the database, why does the DA have problems posting?

    The way I would have expected it to work (assuming there is a requirement to stop the db to do a backup):

    - backup starts
    - stops the database
    - DA detects this and goes to sleep
    - DC detects the DA has gone to sleep, so it caches its data locally
    - DR backup finishes and starts the DB as its last step
    - DA wakes up (as per the cron job that continually tries to start dadaemon)
    - DC detects the DA is available and sends off its cached data
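
    The caching behaviour expected of the DC in the steps above can be modelled in a few lines. This is only a toy sketch of a store-and-forward buffer, not the product's actual implementation: `CachingCollector` and `FakeAggregator` are hypothetical names, and samples are plain strings.

```python
from collections import deque
from typing import Callable, Deque, List


class FakeAggregator:
    """Hypothetical stand-in for the DA: accepts samples only while 'up'."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[str] = []

    def accept(self, sample: str) -> bool:
        if not self.up:
            return False
        self.received.append(sample)
        return True


class CachingCollector:
    """Hypothetical collector that buffers samples while delivery fails."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self._send = send
        self._backlog: Deque[str] = deque()

    def submit(self, sample: str) -> None:
        # Queue first so ordering is preserved, then try to drain.
        self._backlog.append(sample)
        self.flush()

    def flush(self) -> None:
        # Forward oldest-first; stop at the first failure and keep the rest.
        while self._backlog:
            if not self._send(self._backlog[0]):
                return
            self._backlog.popleft()

    @property
    def backlog_size(self) -> int:
        return len(self._backlog)


da = FakeAggregator()
dc = CachingCollector(da.accept)
dc.submit("poll-1")        # delivered immediately
da.up = False              # DA goes to sleep during the backup
dc.submit("poll-2")
dc.submit("poll-3")
print(dc.backlog_size)     # 2
da.up = True               # DA restarted after the backup
dc.flush()
print(da.received)         # ['poll-1', 'poll-2', 'poll-3']
```

    The point of the model is that nothing is lost as long as the sender can tell that delivery failed. In the logs described in this thread the DA stayed up and kept failing its own database commits, so the DC would have had no reason to start caching.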

    I do have a support ticket in as this is a major issue for us. I could accept a five minute loss of data a day but an hour and a quarter across all devices every day is a bit much.


  • 5.  RE: Lost data for period of DR backup

    Posted Jul 11, 2013 09:58 PM
    Will get some details from developers tomorrow on how this should work, but I don't think what you're seeing is normal. The DB should stay live and there shouldn't be any problems continuing to log data while the backup occurs.


  • 6.  RE: Lost data for period of DR backup

    Posted Jul 11, 2013 10:02 PM
    If possible, please email me the support case #. kyle.pause@ca.com


  • 7.  RE: Lost data for period of DR backup

    Posted Jul 12, 2013 04:15 PM
    I confirmed that the DB should be live while the backup is running. So you are experiencing some sort of performance issue on the DR that is causing this problem. A workaround would be to shut the DA down before the backup begins, which will cause the DC(s) to cache their data, and then restart the DA once the backup completes.
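
    The ordering in that workaround could be sketched as below. This is an assumption-laden outline rather than a supported procedure: `stop_da`, `run_backup` and `start_da` are hypothetical callables standing in for whatever commands a site actually uses, and the sketch says nothing about how the DR host would trigger actions on the DA host.

```python
from typing import Callable, List


def backup_with_da_paused(
    stop_da: Callable[[], None],
    run_backup: Callable[[], None],
    start_da: Callable[[], None],
) -> None:
    """Stop the DA, run the backup, and always restart the DA afterwards."""
    stop_da()
    try:
        run_backup()
    finally:
        # Restart even if the backup raised, so polling data is not
        # left cached on the DC indefinitely.
        start_da()


# Demonstration with placeholder actions that only record their order.
events: List[str] = []
backup_with_da_paused(
    lambda: events.append("stop DA"),
    lambda: events.append("backup"),
    lambda: events.append("start DA"),
)
print(events)  # ['stop DA', 'backup', 'start DA']
```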


  • 8.  RE: Lost data for period of DR backup

    Posted Jul 14, 2013 06:30 PM
    Definitely the Vertica backup causing the problems. I moved the time it runs and the data loss follows it.

    Our backup runs on a schedule in the middle of the night. How is the DA going to know when to shut down or start up? We certainly aren't going to do a remote login as root from the DR to tell it to do that. There are security implications.

    The performance issues, as you believe they are, have only come about since CA decided to move to Vertica 6. We are not monitoring a large environment compared to what a lot of organisations are running. I don't see why something like a backup process (which is really just a glorified rsync) should cause performance problems. Note there are no database posting errors at any time other than when the Vertica backup occurs.


  • 9.  RE: Lost data for period of DR backup

    Posted Jul 15, 2013 10:07 PM
    I wonder if your backup config needs tweaking. Not sure if the config needs to be regenerated after upgrading to Vertica 6. Will check with the team tomorrow.


  • 10.  RE: Lost data for period of DR backup

    Posted Aug 01, 2013 11:47 PM
    Hi Andrew,

    Now that my team has been working with you on this issue, I wanted to check in to see how it's going. Are you still experiencing the gaps in data when the DR backup runs? I believe that we identified a problem in the docs regarding the backup config, and that since we rectified the issue the backups are running faster. What else was learned about the product or your system/SAN configuration?

    Regards,
    Kyle


  • 11.  RE: Lost data for period of DR backup

    Posted Aug 02, 2013 01:27 AM
    One thing we also identified is that there is a gap in the documentation in relation to the installation of the database, e.g. the catalog and database should be on separate file systems.

    What we have done so far to fix the problem can be summarised as follows:

    - removed the 'Objects=***' line from the backup config. That reduced the backup time from 1.5 hours to 15 minutes.
    - ran the post_install scripts (these will be made generally available in the next release; I was given an advance copy). This helps set the disk scheduler and some other settings. Running the vioperf script before and after shows an increase in speed with the new settings.
    - the backup location is on the same file system as the database. I am in the process of moving it; however, it doesn't seem to be having any impact right now.
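
    The first change above (dropping the 'Objects=' line) could be scripted roughly as follows. This sketch assumes the backup config is a plain key = value text file; the section name, the other setting and the placeholder value in the sample are made up for illustration and are not taken from a real config.

```python
def drop_objects_setting(config_text: str) -> str:
    """Return the config text without any 'Objects = ...' setting line."""
    kept = []
    for line in config_text.splitlines(keepends=True):
        key, sep, _value = line.partition("=")
        # Match the key case-insensitively; lines without '=' are kept as-is.
        if sep and key.strip().lower() == "objects":
            continue
        kept.append(line)
    return "".join(kept)


# Illustrative sample only (hypothetical section and values).
sample = (
    "[Misc]\n"
    "snapshotName = nightly\n"
    "Objects = placeholder_value\n"
)
print(drop_objects_setting(sample), end="")
```

    Working on a copy of the file and comparing it with the original before replacing anything would be the cautious way to apply a change like this.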

    I was advised at this stage not to worry about the catalog and data being on the same file system. I'm not sure what the impact of that is, but it doesn't seem to be having an impact right now. That will be a killer for customers, as the only solution is a reinstall. I am still waiting on answers to some questions in case we do need to go to the extreme of a reinstall. These are from the email I didn't get answers to:

    Questions:
    - can Vertica be installed to /opt/CA/vertica or does it have to be under /opt/vertica?
    The /opt/vertica file system is my large file system that was set up for the Vertica database.
    My proposed file system layout would be:
    /opt/CA - this would have 2 subfolders: IMDataAggregator and vertica. IMDataAggregator contains just the installs and half a dozen utilities, as it does now. The vertica directory would be the Vertica home, containing all the normal Vertica folders except the database and the backup. I would like to locate the catalog in the /opt/CA/vertica directory as well.
    /opt/vertica - this is where the database is stored: just the data folder, not the catalog
    /opt/vertica_backup - this is where the backup will be stored.

    - does this require a reinstall of the DA?
    - if it requires a reinstall of the DA, I assume we would also have to reinstall the DC?
    - if it requires a reinstall of the DA, can we detach the DA from CAPC and simply add the new DA as a data source?
    - if the old DA is removed from CAPC and the new DA added, what is the impact?

    I ask the questions about CAPC because a ton of work has gone into dashboards, group creation and design collections. Losing that would mean weeks of work reproducing it.

    - Most important: how do I specify where the catalog and data should be located when the Vertica install script only asks for a data location?
    I even looked at the DA install, thinking maybe it's specified at database creation time; however, in the DA install you just specify a database name. The installer seems to just create a folder named after the database under the folder you specified as the data folder in the Vertica install. Under that folder it creates the v_{dbname}_node001_data and v_{dbname}_node001_catalog directories.
    So how can I put them on separate file systems?
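
    Whether two existing directories already sit on separate file systems can be checked from Python. The sketch below does not answer how to make the installer place them separately; it only compares device IDs on a POSIX host, and the function name is illustrative. Setups such as bind mounts or subvolumes may need a closer look than this simple comparison gives.

```python
import os


def on_separate_filesystems(path_a: str, path_b: str) -> bool:
    """True if the two existing paths report different device IDs (st_dev)."""
    return os.stat(path_a).st_dev != os.stat(path_b).st_dev


# A directory compared with itself is always on the same file system.
here = os.getcwd()
print(on_separate_filesystems(here, here))  # False
```

    On the DR host, the two arguments would be the v_{dbname}_node001_data and v_{dbname}_node001_catalog directories mentioned above.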