Clearly I missed an entire week. I need to build a better system to make this easier…

Random notes.

  • Kicked out NFS from the Tool Labs proxies (with 1). Yay! This hopefully explains the lockup of tools-proxy-01 last night, maybe? It’s been restarted since, and I hope to no longer have instances randomly locking up on me. Infrastructure standards of 2009, here we come! :D I’ve also removed NFS from the tools-redis instances, and migrated them to Jessie as well.
  • Fixed up all the races in how kubernetes workers are set up, with 2.
  • Another instance is ‘stuck’ again. Sigh. AAAAAAAAAAAAAAAAAAAAAAAAA. Paravoid helped debug this, tracking it down to NFS client issues in the 4.2 kernel (see phab). I moved the k8s nodes back to a working 3.19 kernel (after filing an issue about the other 3.19 kernel package I tried, which didn’t work).
  • Moved the tools proxies to 4.2 (lol?) after finding huge ksoftirqd spikes on them. Let’s see if that improves things.
  • I split up the individual components of PAWS, and have a working nbserve in there now! Exciting times. Need to fix up nbserve to use traitlets for config.
  • Big Tool Labs outage (again!). Some tool accidentally sent about 12 million job requests and crashed gridengine’s underlying backing store (BerkeleyDB). Reset to a clean slate after many hours (thanks, Coren!) and things are mostly back up now. I’m reading through the Berkeley DB reference manual now.
  • Persistence failed for ores’s redises again, mostly because vm_overcommit was turned off. Fixed in the core redis module so it does not happen again.
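
For the redis persistence item above, a quick sanity check on a host could look roughly like this. It is only a sketch, not what the fix in the redis module actually does: it assumes the setting in question is the Linux `vm.overcommit_memory` sysctl (the one Redis warns about at startup, since background saves fork the server process), and the function names are made up for illustration.

```python
from pathlib import Path

# Standard Linux location of the vm.overcommit_memory sysctl.
OVERCOMMIT_PATH = Path("/proc/sys/vm/overcommit_memory")


def overcommit_mode(path=OVERCOMMIT_PATH):
    """Return the overcommit mode (0, 1 or 2), or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def safe_for_background_save(mode):
    """Hypothetical helper: True only for mode 1, the value Redis recommends."""
    return mode == 1


if __name__ == "__main__":
    mode = overcommit_mode()
    if mode is None:
        print("could not read vm.overcommit_memory (not Linux?)")
    elif safe_for_background_save(mode):
        print("vm.overcommit_memory = 1, background saves should be able to fork")
    else:
        print(f"vm.overcommit_memory = {mode}, redis background saves may fail")
```

Running it on each redis host (or wiring the same check into monitoring) would at least flag the problem before a save fails.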