Tardiba - The Tasmanian Devil Database Server

TarDiBa - Database Server
=========================

Operating System
----------------

Kernel
~~~~~~

Patches
^^^^^^^

- linux-vserver (included in vserver package and applied automatically)
- grsecurity (we have to use the stable Linux-VServer + grsecurity patch)

Filesystem
~~~~~~~~~~

Filesystem Layout
^^^^^^^^^^^^^^^^^

==========  ======  ====
mountpoint  fstype  size
==========  ======  ====
/
/boot
/usr
/home
/data
==========  ======  ====

- filesystem, probably reiserfs (fast)
- tweaked boot for high performance
- tweaked kernel, highmem, disk access and stuff

Database System
---------------

- various maintenance scripts (for now using them for backups
  and vacuuming)
- jump to pg8

Autovacuuming
~~~~~~~~~~~~~

- autovacuuming has to be disabled in postgresql.conf

Logging
~~~~~~~

Abstract
^^^^^^^^

postgresql has a logging facility that writes every statement to the log
before it gets executed.
Under certain conditions those logs are of very great interest:
legal issues and profiling issues.

Problem
^^^^^^^

The problem is that if I enable it, the system gets 10 times slower because
of the huge statements that have to be written to disk.
On a busy server the logs can easily reach tens of MB per day:
30-40 MB happens, but that is quite rare;
usually it is under 10 MB.

Solution
^^^^^^^^

- implementing a logging facility to RAM, flushing logs in the
  background to disk
- usually I am only interested in those statements

We perhaps also have to consider flushing the logs from the ramdisk when we
restart pg.
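
The RAM-then-disk idea above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name ``flush_ram_log`` and both paths are hypothetical, the server is assumed to write its statement log to a file on a ramdisk, and the function is assumed to be called periodically (for example from cron) and once more before pg is restarted.

```python
import os


def flush_ram_log(ram_path: str, disk_path: str) -> int:
    """Append the RAM-backed log to the on-disk log, then empty the RAM copy.

    Returns the number of bytes moved. Both paths are hypothetical examples.
    """
    if not os.path.exists(ram_path):
        return 0
    with open(ram_path, "r+b") as src:
        data = src.read()
        if not data:
            return 0
        with open(disk_path, "ab") as dst:
            dst.write(data)
            dst.flush()
            os.fsync(dst.fileno())  # make sure the data really reached the disk
        # Drop the RAM copy only after the disk copy has been written.
        src.seek(0)
        src.truncate()
    return len(data)
```

Note that anything the server writes between the read and the truncate would be lost in this sketch; a real setup would more likely rotate the log file first and flush the rotated copy.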

Tasks
^^^^^

- we need to test a postgresql.conf option that only logs statements
  that take more time than a specified period, e.g. 5 min.
  (presumably ``log_min_duration_statement``)

  - problem is that that option doesn't work too well, like the option
    to kill idle statements
  - problem is that pg does not recognize blocked processes like the
    ones which are idle in transaction. An idle in transaction process
    blocks the whole db (no selects, no updates, no nothing). The only way
    out of it is to kill -9 the process

- killing postgresql processes

  - postgresql has 2 kinds of processes (the parent postmaster
    process which is always active and child postmaster processes
    which are forked from the master when it accepts a client connection)
  - problem is that the children are forks, so whenever I kill a child,
    all children die, which sucks somehow
  - so instead of killing children we usually restart the postmaster, but
    there has to be another solution too, I just was unable to
    find it
  - is there any possibility to kill a forked child without killing all
    children? (maybe there are some kill options for that)

    - normally each child should have its own PID
    - I use kill -9 $PID
    - most probably the parent "thinks" there are problems if one
      child dies and kills them all
    - next thing is that only the master postmaster is alive. I
      couldn't get a conclusive answer from #postgresql. All that I
      got was 'Don't kill -9 postmaster.'

  - maybe instead of trying to kill processes we should add pgpool
    capabilities to TarDiBa, in this case harvesting other benefits
    too, like a fixed number of connections and the possibility to play
    with replication and load balancing.

- we have to implement somehow in the sv interface the ability to
  send the SIGINT signal to the backend. The SIGINT signal forces a fast
  shutdown, even if clients are connected.
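
On the question of killing a single child: ``kill -9`` gives a backend no chance to clean up, so the postmaster has to assume shared memory may be inconsistent and resets every other backend, which matches the behaviour described above. A gentler policy can be sketched as follows. This is a rough sketch, not a tested procedure: the session tuples are assumed to come from ``pg_stat_activity``, where older releases report such sessions with the query text ``<IDLE> in transaction``; the function names are hypothetical; and whether SIGTERM to a single backend is safe should be checked for the PostgreSQL version in use (releases that provide ``pg_terminate_backend()`` offer it as the supported route).

```python
import os
import signal
from datetime import datetime, timedelta
from typing import Callable, Iterable, List, Tuple

# (backend PID, current query text, time the session entered that state)
Session = Tuple[int, str, datetime]

# Marker that older pg_stat_activity views show for such sessions (assumption).
IDLE_IN_TRANSACTION = "<IDLE> in transaction"


def stale_backends(sessions: Iterable[Session], now: datetime,
                   max_idle: timedelta) -> List[int]:
    """Return PIDs of sessions idle in transaction for longer than max_idle."""
    return [pid for pid, query, since in sessions
            if query == IDLE_IN_TRANSACTION and now - since > max_idle]


def terminate_backends(pids: Iterable[int],
                       send: Callable[[int, int], None] = os.kill) -> List[int]:
    """Send SIGTERM (never SIGKILL) to each backend PID and return the list."""
    done = []
    for pid in pids:
        send(pid, signal.SIGTERM)  # lets the backend roll back and exit cleanly
        done.append(pid)
    return done
```

The ``send`` parameter exists only so the policy can be tried out with a fake sender before it is pointed at real processes.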

Backup System
-------------

- backup is done by shell scripts and cron (change? not likely)
- backups on other machines (over scp for other linux hosts, samba
  for `secure` Windows2000???, possibly a professional backup application,
  although I don't think it is necessary, maybe for whole system recovery)

  - we should investigate synbak, which seems nice, and even if it
    doesn't have PostgreSQL capability that doesn't look too hard
    to implement.
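
The existing backups are shell scripts driven by cron; purely as an illustration of the dump-then-copy flow described above, the same steps could be sketched in Python. Everything named here is an assumption: the function names, the file naming scheme and the example database name are hypothetical. Only the ``pg_dump`` options ``-F c`` (custom archive format) and ``-f`` (output file) and the plain ``scp`` source/target form are relied upon.

```python
from datetime import datetime
from typing import List


def dump_filename(dbname: str, when: datetime) -> str:
    """Build a timestamped file name so successive dumps never overwrite each other."""
    return f"{dbname}-{when:%Y%m%d-%H%M%S}.dump"


def pg_dump_command(dbname: str, target: str) -> List[str]:
    """Argument list for a custom-format dump of one database."""
    # -F c selects pg_dump's custom archive format, -f names the output file.
    return ["pg_dump", "-F", "c", "-f", target, dbname]


def scp_command(local_path: str, remote: str) -> List[str]:
    """Argument list that copies the finished dump to another host."""
    return ["scp", local_path, remote]
```

Each list can be handed to ``subprocess.run(cmd, check=True)``, so that a failed dump stops the script before anything is copied.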

System Monitoring and Notification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Abstract
^^^^^^^^

In order to prevent catastrophic events we need proper information
about the state of our servers.

Our catastrophic events fall into three categories:

- hardware failures (for which we need early warning), especially hard drive
  failures (RAM testing would not be a bad idea, but I think we must pass on
  it due to the horrific time that a full RAM test takes).
- OS related failures (which are not failures per se, but states of incapacity
  to fulfil the demand). In this category we need to monitor the hard drive
  usage of our backups and postgres runaway processes, and establish CPU and
  RAM usage patterns (i.e. CPU and RAM usage peak points and their reasons).
- postgres related failures, in which category we include unpredictable DB
  behaviours like shutdowns, inability to serve certain queries, and most
  importantly queries (i.e. processes) in a dubious state (e.g. IDLE IN
  TRANSACTION).

Our monitoring solution must gather information about all these issues. As a
bonus we would like to gather some information about DB patterns, like the
most used queries, the longest running SQL statements, and connection details
(IP, time and so on). Once this information is gathered it must be delivered
as a report via mail or other means, in a prioritized fashion (in order to
easily recognize the most important items). On the other hand we want that
information stored locally on the server too, in a DB, for profiling reasons.
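
As one possible building block for the "longest running SQL statements" part of such a report, the sketch below pulls slow statements out of a server log and ranks them. It assumes log lines of the shape ``duration: <number> ms  statement: <text>``, which is what PostgreSQL writes when ``log_min_duration_statement`` is set; the function name and its parameters are hypothetical, and continuation lines of multi-line statements are ignored.

```python
import re
from typing import Iterable, List, Tuple

# Matches the duration/statement part of a log line, whatever prefix precedes it.
_DURATION = re.compile(r"duration: (\d+(?:\.\d+)?) ms\s+statement: (.*)")


def slowest_statements(lines: Iterable[str], min_ms: float,
                       limit: int = 10) -> List[Tuple[float, str]]:
    """Return up to `limit` (duration_ms, statement) pairs, slowest first."""
    hits = []
    for line in lines:
        match = _DURATION.search(line)
        if match and float(match.group(1)) >= min_ms:
            hits.append((float(match.group(1)), match.group(2).strip()))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits[:limit]
```

Sorting by duration gives the report a natural priority order: the entries at the top are the ones worth reading first.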

- monitoring disk space, backup file sizes, excerpts from
  certain log files, system status and stats (CPU usage, RAM etc.), monitoring
  pg sessions because the internal pg monitor sucks. We need a good policy on
  killing idle sessions and runaway processes.
- monitoring network traffic is a good measurement as well. On the one hand it
  helps to find possible network related bottlenecks, on the other it also
  provides valuable information for intrusion detection.

  Possible tools:

  - sancp (prelude aware!)