DevLog for 21-30 Dec 2015

Clearly I missed an entire week. I need to build a better system to make this easier…

Random notes.

Kicked out NFS from the Tool Labs proxies (with 1). Yay! This hopefully explains the lockup of tools-proxy-01 yesterday night, maybe? It’s been restarted since, and I hope to no longer have instances randomly locoking on me. Infrastructure standards of 2009, here we come! :D I’ve also removed NFS from tools-redis, and migrated them to Jessie as well.
Fixed up all the races in how kubernetes workers are setup with 2
Another instance is ‘stuck’ again. Sigh. AAAAAAAAAAAAAAAAAAAAAAAAA. Paravoid helped debug this, tracking it down to NFS client issues in the 4.2 kernel (See phab). I moved k8s nodes back to a working 3.19 kernel (after filing issue about the other 3.19 kernel package I tried that didn’t work).
Moved the tools proxies to 4.2 (lol?) after finding out huge ksoftirqd spikes in them. Let’s see if that improves things
I split up the individual components of PAWS, and have a working nbserve in there now! Exciting times. Need to fixup nbserve to use traitlets for config
Big Tool Labs outage (again!). Some tool accidentally sent about 12million job requests and crashed gridengine’s underlying backing store (BerkeleyDB). Reset to a clean slate after many hours (Thanks Coren!) and mostly things are back up now. I’m reading through the Berkeley DB reference manual now.
Persistance failed for ores’s redises again, mostly because vm_overcommit was turned off. Fixed in the core redis module so it does not happen again.

Contents