
* added initial info about the monitoring requirements of TarDiBa

git-svn-id: svn+ssh://svn.opensde.net/home/users/karasz/x5/tardiba/trunk@22 471cc25b-571d-0410-8e29-1f4f9b506cef
master
Nagy Karoly Gabriel 18 years ago
parent commit 0b12dcebc5
1 changed file with 32 additions and 5 deletions:
  doc/tardiba.txt  (+32, -5)
@@ -93,14 +93,14 @@ Tasks
   to kill idle statements
   - problem is that pg does not recognize blocked processes like the
-    ones which are iddle in transaction. an iddle in transaction proces
+    ones which are idle in transaction. an idle-in-transaction process
     blocks the whole db (no selects, no updates, no nothing). the only way
     out of it is to kill -9 the process
 - killing postgresql processes
   - postgresql has 2 kinds of processes (the parent postmaster
     process which is always active and child postmaster processes
-    which are born from the master when it receives a statement
+    which are born from the master when it receives a statement)
   - problem is that the children are forks, so whenever one child is
     killed all children die, which is unfortunate
   - so instead of killing children we usually restart the postmaster but
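The idle-in-transaction watchdog described above could poll `pg_stat_activity` and kill the offenders. A minimal sketch of the selection and kill logic, assuming the session rows (pid, state, seconds idle) have already been fetched from postgres; the column layout, the 300-second threshold and the function names are illustrative, not TarDiBa's actual script:

```python
import os
import signal
from typing import List, Tuple

# each session: (pid, state, seconds_idle) as it could be pulled from
# pg_stat_activity via psql; shape and threshold are assumptions
Session = Tuple[int, str, int]

def pids_to_kill(sessions: List[Session], max_idle: int = 300) -> List[int]:
    """Return backend pids that have sat 'idle in transaction' too long."""
    return [pid for pid, state, idle in sessions
            if state == "idle in transaction" and idle > max_idle]

def kill_backends(pids, dry_run=True):
    """kill -9 the offending backends.  Note: SIGKILL on one backend makes
    the postmaster reinitialize and restart *all* children, which matches
    the 'all children die' behaviour described above."""
    for pid in pids:
        if not dry_run:
            os.kill(pid, signal.SIGKILL)
    return list(pids)
```

Running such a check from cron every few minutes would cover the "kill idle statements" task without manual intervention.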
@@ -126,12 +126,39 @@ Backup System
 - backup is done by shell scripts and cron (change? not likely)
 - backups on other machines (over scp for other linux hosts, samba
-   for 'secure' Windows2000???, eventually a pro backup app although i
-   don't think it is necessary, maybe for whole system recovery)
+   for 'secure' Windows 2000 shares, eventually a pro backup app
+   although I don't think it is necessary, maybe for whole system
+   recovery)
 System Monitoring and Notification
-----------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Abstract
+^^^^^^^^
+In order to prevent catastrophic events we need proper information
+about the state of our servers.
+Our catastrophic events fall into three categories:
+- hardware failures (for which we need early warning), especially hard
+  drive failures (RAM testing would not be a bad idea, but I think we
+  must pass on it due to the horrific time that a full RAM test takes).
+- OS related failures (which are not failures per se, but states of
+  incapacity to fulfil the demand); in this category we need to monitor
+  the hard drive usage of our backups and postgres runaway processes,
+  and establish CPU and RAM usage patterns (i.e. CPU and RAM usage peak
+  points and their reasons).
+- postgres related failures, in which category we include unpredictable
+  DB behaviours like shutdowns, incapacity to serve certain queries and,
+  most important, queries (i.e. processes) in a dubious state (i.e.
+  IDLE IN TRANSACTION).
+Our monitoring solution must gather information about all these issues,
+and as a bonus we would like to gather some info about DB patterns: the
+most used queries, the longest running sql statements, and some info
+about connections (IP, time and so on). Once gathered, this info must be
+delivered as a report, via mail or other means, in a prioritized fashion
+(in order to easily recognize the most important items). On the other
+hand we want that info stored locally on the server too, in a DB, for
+profiling reasons.
+- monitoring disk space, backup file sizes, excerpts from certain log
+  files, system status and stats (cpu usage, ram etc), monitoring pg
+  sessions because the internal pg monitor sucks; we need a good policy on
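The prioritized mail report described in this hunk could be assembled from the individual checks like the sketch below; the severity levels, check names and thresholds are invented for illustration:

```python
# severity order for the prioritized report; the levels are assumptions,
# not something the document defines
SEVERITY = {"critical": 0, "warning": 1, "info": 2}

def build_report(findings):
    """findings: list of (severity, message) tuples gathered by the
    individual checks (disk space, idle sessions, log excerpts...).
    Returns the mail body with the most important items first."""
    ordered = sorted(findings, key=lambda f: SEVERITY[f[0]])
    return "\n".join(f"[{sev.upper()}] {msg}" for sev, msg in ordered)

def disk_check(used_pct, threshold=90):
    """One example check: flag the backup partition when it fills up."""
    if used_pct >= threshold:
        return ("critical", f"backup disk at {used_pct}% (>= {threshold}%)")
    return ("info", f"backup disk at {used_pct}%")
```

The same tuples could be written to a local DB table for the profiling mentioned above, so the mail report and the stored history come from one pass.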

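The scp-based backup mentioned in the Backup System hunk above might be generated like this; the paths, host names, date-stamped file name and pg_dump flags are all assumptions, sketched only to show the shape of the cron-driven script:

```python
import datetime

def backup_commands(db: str, host: str, dest: str,
                    when: datetime.date) -> list:
    """Build the pg_dump + scp command lines for one database.
    Everything here (paths, flags, naming) is illustrative, not
    TarDiBa's actual shell script."""
    dump = f"/var/backups/{db}-{when.isoformat()}.sql.gz"
    return [
        f"pg_dump {db} | gzip > {dump}",  # local dump, compressed
        f"scp {dump} {host}:{dest}",      # copy to another linux host
    ]

# a crontab line (here: daily at 02:30) would then run the real script:
#   30 2 * * *  /usr/local/sbin/tardiba-backup.sh
```

Keeping the date in the file name makes the "backup file sizes" monitoring check above easy to implement, since each day's dump can be compared against the previous one.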