|
|
@ -93,14 +93,14 @@ Tasks |
|
|
|
to kill idle statements |
|
|
|
|
|
|
|
- problem is that pg does not recognize blocked processes like the |
|
|
|
ones which are iddle in transaction. an iddle in transaction proces |
|
|
|
ones which are idle in transaction. an idle in transaction process |
|
|
|
blocks the whole db (no selects, no updates, no nothing). the only way |
|
|
|
out of it is to kill -9 the process |
|
|
|
|
|
|
|
- killing postgresql processes |
|
|
|
- postgresql has 2 kinds of processes (the parent postmaster |
|
|
|
process which is always active and child postmaster processes |
|
|
|
which are born from the master when it receives a statement |
|
|
|
which are born from the master when it receives a statement) |
|
|
|
- problem is that children are forks so whenever i kill a child, |
|
|
|
all children die, and that sucks somehow |
|
|
|
- so instead of killing children we usually restart the postmaster but |
|
|
@ -126,12 +126,39 @@ Backup System |
|
|
|
|
|
|
|
- backup is done by shell scripts and cron (change? not likely) |
|
|
|
- backups on other machines (over scp for other linux hosts, samba |
|
|
|
for 'secure' Windows2000???, eventually a pro backup app although i |
|
|
|
don't think it is necessary, maybe for whole system recovery) |
|
|
|
for `secure` Windows2000???, eventually a pro backup app although i |
|
|
|
don`t think it is necessary, maybe for whole system recovery) |
|
|
|
|
|
|
|
System Monitoring and Notification |
|
|
|
---------------------------------- |
|
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
|
|
|
|
|
Abstract |
|
|
|
^^^^^^^^ |
|
|
|
|
|
|
|
in order to prevent catastrophic events we need proper information |
|
|
|
about the state of our servers. |
|
|
|
our catastrophic events fall into three categories: |
|
|
|
|
|
|
|
- hardware failures (for which we need early warning), especially hard drive |
|
|
|
failures (RAM testing would not be a bad idea, but i think we must pass on |
|
|
|
it due to the horrific time that a full RAM test takes). |
|
|
|
- OS related failures (which are not failures per se, but states of incapacity |
|
|
|
to fulfill the demand), in this category we need to monitor the hard drive |
|
|
|
usage of our backups, postgres runaway processes, establish CPU and RAM usage |
|
|
|
patterns (aka CPU and RAM usage peak points and their reasons). |
|
|
|
- postgres related failures, in which category we include unpredictable DB |
|
|
|
behaviours like shutdowns, inability to serve certain queries, and most |
|
|
|
important queries (aka processes) in dubious state (aka IDLE IN TRANSACTION). |
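
A minimal sketch of two of the checks listed above, assuming the PostgreSQL backends advertise their state in the process title (so that `ps` shows the text "idle in transaction") and that a plain disk-usage percentage is enough for the backup partition. The function names and the `ps -eo pid,args` column layout are choices made for this illustration, not part of the existing scripts.

```python
import shutil
import subprocess

# Text assumed to appear in the process title of a stuck backend.
IDLE_MARKER = "idle in transaction"


def find_idle_in_transaction(ps_output):
    """Return (pid, command) pairs for postgres backends idle in transaction.

    Expects text in the layout of `ps -eo pid,args`: a PID column followed
    by the full command line.
    """
    hits = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header line or anything that does not start with a PID
        pid, command = int(parts[0]), parts[1]
        if "postgres" in command and IDLE_MARKER in command:
            hits.append((pid, command))
    return hits


def disk_usage_percent(path):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def read_ps():
    """Return the current process list as text (PID plus command line)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

A cron job could call `find_idle_in_transaction(read_ps())` and `disk_usage_percent()` on the backup directory, then pass anything suspicious on to the report. The sketch only detects; what to do with a stuck backend is left to the operator.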
|
|
|
|
|
|
|
our monitoring solution must gather information about all these issues, plus |
|
|
|
as a bonus we would like to gather some info about DB patterns, like most used |
|
|
|
queries, longest running sql statements, some info about connections IP, time |
|
|
|
and so on. Once this info is gathered it must be delivered in the form of a |
|
|
|
report via mail or other means in a prioritized fashion (in order to easily |
|
|
|
recognize the most important ones). On the other hand we want that info stored |
|
|
|
locally on the server too in a DB for profiling reasons. |
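
The reporting step could look roughly like the sketch below: findings are sorted by severity for the mail body and also written to a local table for later profiling. The severity names, the `findings` table layout and the use of SQLite are assumptions made only for this illustration; the notes do not say which DB or schema will hold the data.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical severity levels; a lower number sorts earlier in the report.
SEVERITY_ORDER = {"critical": 0, "warning": 1, "info": 2}


def build_report(findings):
    """Return a plain-text report with the most severe findings first.

    Each finding is a (severity, source, message) tuple. Unknown severities
    sort after the known ones.
    """
    ordered = sorted(
        findings, key=lambda f: SEVERITY_ORDER.get(f[0], len(SEVERITY_ORDER))
    )
    return "\n".join(
        "[%s] %s: %s" % (severity.upper(), source, message)
        for severity, source, message in ordered
    )


def store_findings(conn, findings, recorded_at=None):
    """Append the findings to a local table so usage patterns can be profiled."""
    if recorded_at is None:
        recorded_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS findings "
        "(recorded_at TEXT, severity TEXT, source TEXT, message TEXT)"
    )
    conn.executemany(
        "INSERT INTO findings VALUES (?, ?, ?, ?)",
        [(recorded_at, sev, src, msg) for sev, src, msg in findings],
    )
    conn.commit()
```

The text returned by `build_report` can be used as a mail body as it is. SQLite stands in here only to keep the sketch self-contained; the same two statements would work against any SQL database after adjusting the placeholder style.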
|
|
|
|
|
|
|
|
|
|
|
- monitoring disk space, backup file sizes, excerpts from |
|
|
|
certain log files, system status and stats (cpu usage, ram etc), monitoring |
|
|
|
pg sessions because the internal pg monitor sucks, we need a good policy on |
|
|
|