Back from an awesome vacation. Too awesome to write about, even :) Suffice it to say, England has some really pretty places.

Some Android app work, and lots of monitoring work

  • Fixed bugs causing the Wikipedia Android Alpha from building properly. Now it builds properly whenever there is a new commit. Hooray! This was primarily caused by me forgetting to give it lots of RAM (8G VMEM) to execute the mvn build commands (https://gerrit.wikimedia.org/r/#/c/159482/) and also not cleaning up previous .alpha subfolders (https://gerrit.wikimedia.org/r/#/c/159481/) – this causes a chain of .alpha.alpha.alpha.* subfolders, breaking the build.
  • Added a patch to the Android alpha app itself that checks for updates every day or so and notifies you if there’s a new one. Was fairly trivial to write, although I was hoping to make it more seamless (i.e. download the apk myself and just pop it up for people to tap). It now requires 4 clicks to install it, should be able to bring it down to 2 at some point in the future if people care enough.
  • Added a method to our check_graphite code that lets you individually check a bunch of metrics for thresholds (https://gerrit.wikimedia.org/r/#/c/159473/). This makes it much simpler to do icinga checks on a bunch of metrics that are all measuring the same thing but from different machines. BetaLabs and ToolLabs checks use this.
  • Cleaned up a bunch of minor things with our check_graphite script. Also fucked up trying to replace all double quotes in it with single quotes for consistency – it replaced the double quotes being used inside single quotes, and caused all checks to fail. Fixed shortly by https://gerrit.wikimedia.org/r/#/c/159711/
  • Added more monitoring for betalabs! Now checks for stale puppet runs (https://gerrit.wikimedia.org/r/#/c/159701/) and low space on the root partition (https://gerrit.wikimedia.org/r/#/c/159694/). All are green now, thanks to some work from bd808.
  • Added monitoring for ToolLabs! Now checks for stale puppet runs, low space on root and /var, and puppet failure events (https://gerrit.wikimedia.org/r/#/c/159709/). Also checks for high sustained CPU usage (https://gerrit.wikimedia.org/r/#/c/159751/). Then spent some time (with help from scfc_de (whose nick I kept spelling as scfe_de until today)) cleaning up the puppet failures. They are all green now as well.
  • Did a bunch of cleaning up work around the graphite role, removing the realm branching (https://gerrit.wikimedia.org/r/#/c/159759/). Ori says everytime realm branching code is removed, an angel gets its wings, so well done there.

Not a bad day, eh? I’ve been trying to wake up early, perhaps that is helping.