VMware vSphere


linux postmaster high cpu

  • 1.  linux postmaster high cpu

    Posted Jun 05, 2007 10:04 PM
    I have a postmaster proc on a linux server that starts using 100% of the cpu within a few hours after a reboot.

    Is this normal?

    I am running 2 GB of memory with 2 HT Intel CPUs, but the charts seem to take 1 - 2 minutes to generate.

    --Thanks Garry


  • 2.  RE: linux postmaster high cpu

    Posted Jun 05, 2007 10:17 PM
    I am running the internal DB and HQ v3.0.4 (build #389 - Apr 27, 2007 - Release Build)
    --Thanks


  • 3.  RE: linux postmaster high cpu

    Posted Jun 06, 2007 05:28 PM
    Hey Garry,

    Does the high load ever subside? It's possible that HQ is performing
    a vacuuming operation, which usually porks the DB pretty hard.
    Eventually, though, it should go back down.

    -- Jon









  • 4.  RE: linux postmaster high cpu

    Posted Jun 06, 2007 05:59 PM
    Jon,
    No, it has been running hard for days. The load average is never below 1.25.

    The offending postmaster process has used 32668:25 (22 days) of CPU in 22 days of system uptime.
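
The CPU figure can be checked with a little arithmetic. This is a minimal sketch, assuming the 32668:25 value is in the minutes:seconds form that top typically shows for accumulated CPU time; the function name is made up for illustration.

```python
def cpu_time_to_days(cpu_time: str) -> float:
    """Convert an accumulated CPU time in 'minutes:seconds' form to days."""
    minutes, seconds = cpu_time.split(":")
    total_seconds = int(minutes) * 60 + int(seconds)
    return total_seconds / 86400  # seconds per day


days = cpu_time_to_days("32668:25")
print(f"{days:.1f} days of CPU time")  # prints "22.7 days of CPU time"
```

Against 22 days of uptime, that works out to roughly one CPU kept fully busy the whole time.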

    --thanks Garry


  • 5.  RE: linux postmaster high cpu

    Posted Jun 07, 2007 12:07 AM
    Weird. Normal operation shouldn't give you a heavy load unless
    you're monitoring a huge number of things. How many platforms are
    you monitoring? I assume that you're running the DB and HQ on the
    same box?

    -- Jon







  • 6.  RE: linux postmaster high cpu

    Posted Jun 11, 2007 04:06 PM
    Jon,
    Thanks. Yes, HQ and the DB are on the same box. I have about 21 servers with 803 services.
    --Thanks Garry


  • 7.  RE: linux postmaster high cpu

    Posted Jun 11, 2007 08:26 PM
    Maybe there is a query that is spinning? You can check the currently running queries by running:

    bash> server-3.0.x/bin/db-psql.sh
    hqdb> select * from pg_stat_activity where current_query not like '%IDLE%';

    Attach the output from that command here. You can also check the hqdb.log file for anything suspicious. What effect does this have on your monitored systems in HQ? Does everything appear normal from the UI?

    -Ryan


  • 8.  RE: linux postmaster high cpu

    Posted Jun 11, 2007 09:31 PM
    Ryan,
    Thanks, I have attached the output from the sql.

    The only thing in the logs is:


    [2007-06-11 15:20:52.410 MDT] ERROR: duplicate key violates unique constraint "eam_measurement_data_pkey"
    [2007-06-11 15:20:52.419 MDT] ERROR: duplicate key violates unique constraint "eam_measurement_data_pkey"


    but I think this is normal; I seem to recall coming across it in another forum.

    Thanks!
    --Garry


  • 9.  RE: linux postmaster high cpu

    Posted Jun 11, 2007 09:33 PM
    Sorry, I forgot to attach the file to the last reply.


  • 10.  RE: linux postmaster high cpu

    Posted Jun 11, 2007 10:16 PM
    Also, the UI is very slow; it takes 2-4 minutes to bring up basic graphs for the last 24 hours. The monitored systems don't seem to be affected and seem to be getting all their data.
    --Thanks Garry


  • 11.  RE: linux postmaster high cpu

    Posted Jun 11, 2007 10:24 PM
    Thanks for the additional info. I need you to do one more thing though to make it useful. Can you edit: hqdb/data/postgresql.conf

    And uncomment the line that says:

    # stats_command_string = on

    Then restart the HQ server? This will enable the statement stats so we can see them from the above command. Once the server starts spinning again, rerun the SQL above and attach it to this thread.

    -Ryan


  • 12.  RE: linux postmaster high cpu

    Posted Jun 12, 2007 02:35 PM
    Thanks, here is the query. It has been running since a few minutes after I restarted the HQ server and has used 14 hours of CPU since then, which is about how long the server has been running.

    --Thanks Garry


    hqdb=# select * from pg_stat_activity where current_query not like '%IDLE%';
    datid | datname | procpid | usesysid | usename | current_query | query_start | backend_start | client_addr | client_port
    -------+---------+---------+----------+---------+------------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+------------------+-------------
    16384 | hqdb | 4071 | 16385 | hqadmin | BEGIN;DELETE FROM EAM_MEASUREMENT_DATA WHERE timestamp BETWEEN 1178496000000 AND 1178499600000 | 2007-06-11 17:36:44.903943-06 | 2007-06-11 16:32:11.340559-06 | ::ffff:127.0.0.1 | 50046
    (1 row)


  • 13.  RE: linux postmaster high cpu

    Posted Jun 12, 2007 03:37 PM
    That query is what is run when the metric compression routine runs each hour. It's trying to delete 1 hour's worth of data from your detailed measurement table. Has your HQ installation ever run without the high CPU?
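
The one-hour window can be read straight off the two bounds in the DELETE. This is a minimal sketch, assuming the timestamp column holds milliseconds since the Unix epoch; the helper name is illustrative only.

```python
from datetime import datetime, timezone


def ms_to_utc(ms: int) -> datetime:
    """Interpret an epoch-milliseconds value as a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Bounds copied from the BETWEEN clause of the running DELETE
start = ms_to_utc(1178496000000)
end = ms_to_utc(1178499600000)

print(start.isoformat(), "to", end.isoformat())
print("window length:", end - start)
```

Under that assumption the bounds are 2007-05-07 00:00 and 01:00 UTC, so the statement covers exactly one hour of data from about five weeks before it was started.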

    I'm wondering if somehow that table has become corrupted. The delete should not be taking so long. What type of storage is backing the database?

    If you just want to get things up and functional again, it's pretty easy to just truncate that table:

    hqdb=# truncate table eam_measurement_data;
    hqdb=# vacuum analyze eam_measurement_data;
    hqdb=# reindex table eam_measurement_data;

    -Ryan


  • 14.  RE: linux postmaster high cpu

    Posted Jun 12, 2007 06:26 PM
    Hmmm, I haven't had any issues, but I tried the following:

    hqdb=# select count(*) from eam_measurement_data;
    ERROR: could not access status of transaction 576717946
    DETAIL: could not open file "pg_clog/0226": No such file or directory

    Sure enough, that file is not to be found.

    The disk is a RAID 5 array on a Dell PERC 5/i controller. I haven't seen any errors in the logs.

    Should I still run the commands to truncate that table?

    I wonder if there are any other tables that are corrupt, assuming that the above error is "corruption".

    I really haven't done anything with this server; it has not crashed and the RAID is nominal. I have upgraded the HQ server 4 times and that has all gone OK.

    There is plenty of disk space.

    --Thanks Garry


  • 15.  RE: linux postmaster high cpu
    Best Answer

    Posted Jun 12, 2007 06:52 PM
    > hqdb=# select count(*) from eam_measurement_data;
    > ERROR: could not access status of transaction
    > 576717946
    > DETAIL: could not open file "pg_clog/0226": No such
    > file or directory
    >

    That looks like the problem. I wonder if maybe it's just the index that's out of sync? Before we go down the path of truncating, let's try to reindex that table to see if that fixes the problem:

    hqdb=# reindex table eam_measurement_data;
    hqdb=# select count(*) from eam_measurement_data;

    If you continue to have problems, we'll have to do some surgery on your DB. (Assuming you don't have a recent backup).

    I just ran across this thread on the PG user forums with a similar issue:

    http://archives.postgresql.org/pgsql-general/2006-07/msg01061.php

    I'll do some more digging.

    Thanks Garry,
    -Ryan


  • 16.  RE: linux postmaster high cpu

    Posted Jun 12, 2007 07:23 PM
    Thanks.

    Funny thing is that the files in pg_clog look sequential, with the highest number being 0050 (hex sequence), and this file is 0226. Why such a big gap?
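
The gap follows from how a transaction ID maps to a pg_clog segment file. This is a sketch of that mapping, assuming the default 8 kB block size, two status bits per transaction, and 32 pages per segment; the constant and function names are illustrative.

```python
BLOCK_SIZE = 8192        # assumed default PostgreSQL block size, in bytes
XACTS_PER_BYTE = 4       # two status bits per transaction
PAGES_PER_SEGMENT = 32   # pages in one pg_clog segment file
XACTS_PER_SEGMENT = BLOCK_SIZE * XACTS_PER_BYTE * PAGES_PER_SEGMENT


def clog_segment(xid: int) -> str:
    """Name of the pg_clog file holding the status of transaction xid."""
    return f"{xid // XACTS_PER_SEGMENT:04X}"


def last_xid_in(segment: str) -> int:
    """Highest transaction ID whose status lives in the given segment."""
    return (int(segment, 16) + 1) * XACTS_PER_SEGMENT - 1


print(clog_segment(576717946))  # segment for the xid in the error message
print(last_xid_in("0050"))      # highest xid the existing files can cover
```

With those assumptions, transaction 576717946 lands in segment 0226, while files up to 0050 only cover IDs below roughly 85 million. A row that refers to a transaction ID that far ahead of the existing files would be consistent with a damaged row header rather than with a log file that simply went missing, though that is an inference and not something the error message confirms.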

    I started the reindex.... it has been running over an hour... wonder if it will finish!

    --Thanks Garry


  • 17.  RE: linux postmaster high cpu

    Posted Jun 12, 2007 10:45 PM
    Ryan,
    Thank you for all your help. I followed the link that you gave and it said to create empty files for the missing pg_clog files.

    I had 4 missing. Seems to be running a lot better, but still wonder if I should just rebuild the DB.
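
For reference, a placeholder segment of this kind can be sketched as follows. It assumes that "empty" means a zero-filled file of one full segment (32 pages of 8 kB, i.e. 256 kB), which makes every transaction looked up in it read as not committed; the function name is hypothetical and the demonstration writes to a scratch directory, not a real data directory.

```python
import os
import tempfile

SEGMENT_BYTES = 32 * 8192  # assumed size of one pg_clog segment (256 kB)


def create_zeroed_segment(clog_dir: str, name: str) -> str:
    """Create a zero-filled placeholder segment; never overwrite a real one."""
    path = os.path.join(clog_dir, name)
    # Mode "xb" raises FileExistsError if the segment already exists
    with open(path, "xb") as f:
        f.write(b"\x00" * SEGMENT_BYTES)
    return path


# Demonstration against a temporary directory only
with tempfile.TemporaryDirectory() as scratch:
    created = create_zeroed_segment(scratch, "0226")
    size = os.path.getsize(created)
    print(os.path.basename(created), size)
```

On a real system this would only be done with the database stopped and a copy of the data directory set aside first, since rows that depend on the missing status information are effectively written off.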

    The system seems much better, and the vacuuming seems to be much better.
    Deleted some 25 million rows, still have 8 million; the GUI is faster.

    I have one row with timestamp < 0 ... should I delete it?

    I am missing data for all the servers, but don't really care.


    I was talking to a co-worker and he reminded me that the prior owner of this project "accidentally" deleted some data... that must have been what happened.

    --Thanks everyone for all your help

    --Garry


  • 18.  RE: linux postmaster high cpu

    Posted Jun 12, 2007 11:35 PM
    Great! Glad you were able to get to the bottom of the problem.

    I would continue to monitor it to make sure everything is working properly. Rebuilding the DB can take some time, so I would only resort to that if it's really necessary.

    Curious how that row with a negative timestamp got in there, but it should be safe to remove it. :)

    -Ryan