DevLog for 28 Aug 2014

Let’s see. I’m also going to attempt to include patch links wherever possible.

Might spend some more time with Hive over the next few days – figured out an approach for using it from Python, and it should be fun to do so!

DevLog for 25 and 26 Aug 2014

Not very code heavy

  • Couple of pull requests (#1 and #2) for the Atom Autosave plugin. One adds a preference to not autosave when you are explicitly closing a window / pane, and the other just sets the ‘enabled’ preference to a default. CoffeeScript isn’t too bad either! I should consider writing more plugins (I currently use Atom for CSS/JS/Puppet, should try other languages)
  • Added CSV, TSV and JSON download options to Quarry. There’s another Webinar tomorrow by the Grantmaking team, and J-Mo asked for it. Streaming TSV and CSV are implemented in a neat way; I’ll write a blog post about it tomorrow.
  • Started work on a ‘number of editors’ per country metric for WPDMZ, needs to be finished up.
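The core idea behind the streaming downloads can be sketched as a generator that emits one CSV row at a time instead of building the whole response in memory. This is just a toy illustration of the technique, not Quarry’s actual code (the function name is made up):

```python
import csv
import io

def stream_csv(header, rows):
    """Yield CSV output one row at a time, so a response of any size
    can be streamed without holding it all in memory."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in [list(header)] + [list(r) for r in rows]:
        buf.seek(0)
        buf.truncate()
        writer.writerow(row)
        yield buf.getvalue()

# A web framework can consume the generator lazily as the response body:
chunks = stream_csv(["page", "edits"], [["Main_Page", 10], ["Sandbox", 3]])
print("".join(chunks))
```

For TSV, the same sketch works with `csv.writer(buf, delimiter="\t")`.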

I feel a bit exhausted (physically and mentally) from the intense coding over the last few weeks, might have a few chill days to recharge myself. I’m growing old! :(

DevLog for Aug 24, 2014

Wooooo!

  • Moved labsbooks (described in yesterday’s devlog) to use a shared readonly IPython virtualenv maintained by me. Also installed a bunch of modules people might want to use (SciPy, NumPy, Pandas, PyTables, matplotlib). Am considering just installing IPython notebook globally via puppet and using that, since that’ll enable users to just use the system packages. However, the version of IPython notebook from Ubuntu is ancient, so that’s probably a non-starter.
  • Have a basic version of the IPython publishing process working! Any toollabs user can create a notebook by:
    • Creating a ~/notebooks folder
    • Doing a chmod +x ~/notebooks
    • Doing a chmod +x ~
    • Putting IPython notebooks into ~/notebooks (as .ipynb) files
    • Going to https://tools.wmflabs.org/notebooks/<user-shell-name>/<path-to-ipynb-file>. I’ll have to do a bit more work before it can be considered ‘production grade’ (user pages, a nicer theme, etc.), BUT YAY GOOD START. It already caches the HTML output in Redis and invalidates based on the file’s mtime, so it should be pretty quick.
  • Made the ssh tunneling process for labsbooks purely python, without requiring the ProxyCommand. This makes things simpler (and more portable!). I’ll need to work on securing this properly before I can publish it for broader use.
  • Wrote an email to the analytics mailing list about making public the ‘edits per country’ data. I hope to make this publicly available with enough granularity that not just me but others can use this for fun research as well.

I’ve been using Atom for puppet stuff, PyCharm for Python and IntelliJ for Java stuff, and that seems to be doing ok. They all have decent Vim keybindings as well, and good replacements for other functionality – and I might stick with Atom for a while to see how it goes :)

DevLog for 23 Aug 2014

Back in Glasgow! It was actually not very cold today, only cold! Progress.

Started working on my long-abandoned labsbook project, which aims to make Tool Labs a first-class environment for people who want to run (and publish) iPython Notebooks while also being able to access the replica databases and the dumps. Doing this in a secure manner is kinda hard, but I think I’ve a neat solution that lets everyone run a personal iPython kernel on the Grid, access it from their local machine, and also publish it to the web from a standard location. So far, I’ve gotten my script to a point where it’ll set up an iPython environment for you if it doesn’t already exist, start the kernel if it isn’t already running, and tunnel the editing interface back to you to use! Things left to do include:

  1. Open up the browser when tunnel is open
  2. Find a sane way to kill a kernel that hasn’t been doing anything since forever
  3. Set up a shared iPython environment (just code, readonly) so people don’t have to set up their own environments every time (this is primarily a performance enhancement)
  4. Find a nice and simple way for iPython notebooks to be published. I’m currently thinking of a URL such as `tools.wmflabs.org/notebooks/<username>/<notebookname>` to display them, and an index at `tools.wmflabs.org/notebooks/`. This shouldn’t be too hard with appropriate permission munging.

I’m also using paramiko for this, which makes writing SSH-related code in Python a breeze. It even supports ProxyCommand! Blunders I’ve made while getting to this point include:

  1. Running jsub run.bash -mem 4G instead of jsub -mem 4G run.bash and wondering why my script kept getting killed with OOM.
  2. Trying to do a pip install on the Grid nodes (which don’t have build tools) instead of on tools-dev and wondering why running it from the commandline works but from jsub does not.
  3. Wondering why my SSH tunnel kept dying, and trying to debug that without realizing it was dying because the iPython process was dying because it was OOMing because of my earlier jsub error.
  4. Thinking that user accounts (rather than tool accounts) cannot submit jobs to the grid, when the actual problem was that I had not set the execute bit on my script.

Once this is done (I suspect tomorrow), I’ll work on getting the data from my work with WPDMZ into a form good enough for publication (removing avenues for de-anonymization), and then use iPython notebooks to make graphs! This should be fun :)

Source so far is available on GitHub. Needs more work / documentation / cleaning up.

Devlog for Aug 22 2014

Was in Edinburgh again, missed writing it as I went along. Oh well.

  • Ported the code for the Wikipedia Android app automated builds to Python, and you can see it in action at http://tools.wmflabs.org/wikipedia-android-builds/ now. It lets you download the latest build, and notes the last successful build time. Good enough :) It was originally in bash, and porting it to Python allowed me to create a ‘fake’ API (just JSON blobs written to known locations in the file system). The next step is to write a helper app.
  • The Atom experiment is coming along well. Am using it for most of my Puppet work these days. Should give LightTable another go as well.
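The ‘fake API’ trick above is just dumping JSON to a path that a static web server already serves. A small sketch of how that might look (the function name and JSON fields are invented; the real builds script may differ):

```python
import json
import os
import tempfile

def publish_status(path, data):
    """Write a JSON blob to a known, web-served path so it acts as a
    'fake' API endpoint. Written to a temp file first and renamed into
    place, so readers never see a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)  # atomic on POSIX

# e.g. after a successful build:
# publish_status("/data/project/builds/latest.json",
#                {"last_successful_build": "2014-08-22T10:00:00Z"})
```

The atomic rename matters because the web server may be reading the file at the exact moment the build script rewrites it.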

That’s it!

Devlog for 21 Aug 2014

In Edinburgh! I’ve finally stopped spelling it as Edinborough!

  • Added ‘user group’ functionality to Quarry, and added a sudo user group that does what you would think it does. Will be assigned super, super sparingly.
  • Found out that I’d have to explicitly specify the charset of the database when creating it, or MySQL will default to a stupid charset. Forced all tables and columns to utf8, and that seems to have fixed a bunch of unicode issues. Yay?
  • Still facing occasional ‘MySQL server has gone away’ errors with SQLAlchemy for the local MySQL instance, despite asking SQLAlchemy to recycle connections every hour or so. Reduced the recycle time to 10m; hopefully that helps.
  • Read Tony Hoare’s Turing Award speech from 1980, titled “The Emperor’s Old Clothes”. I think I should read more of these papers / speeches, helps keep perspective and ‘learn from history’. Lots of warnings against complexity seem to be a very common theme, and one I’ve also personally encountered many times. Recommend reading :)
  • More DMZ work! Now running edits per country stats separated by mobile vs desktop for all countries for all wikipedias! EXCITING!
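The connection-recycling tweak above is a single argument on the engine. A minimal sketch (using an in-memory SQLite URL purely so the snippet runs standalone; the real URL would be a MySQL one, with `?charset=utf8` appended to match the charset fix):

```python
from sqlalchemy import create_engine, text

# Recycle pooled connections every 10 minutes (600s), so that connections
# older than MySQL's wait_timeout are reopened instead of erroring with
# 'MySQL server has gone away'. A real URL might look something like
# "mysql://user:pass@localhost/quarry?charset=utf8".
engine = create_engine("sqlite://", pool_recycle=600)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```

Any connection checked out of the pool after sitting idle longer than `pool_recycle` seconds is transparently replaced with a fresh one.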

Devlog for 20 Aug 2014

Woke up at 11AM again! w000t!

  • Fixed a stupid bug in Quarry that made it fail if MySQL decided that a column you were selecting is a Decimal. Fixed this in time for…
  • Helped out with a webinar from J-Mo for people to learn SQL and do research against Wikimedia stuff with Quarry! Went mostly without a glitch, so yay :) Oren asked for an API for Quarry; I’ll investigate ways to get that done.
  • Found that some queries against enwiki succeed on s1.labsdb but not on s4.labsdb. I attribute this to cache characteristics, and have switched Quarry to use s1 for now.
  • Did more work on DMZ, getting in place a runner / processor infrastructure and sqlite backed intermediary storage so I can write simpler code to get the data I want. Super excited to see how this goes. It’ll be much easier for me to add new queries now.
  • More work on DMZ! I had a late night sprint and split edits by mobile and desktop (and was surprised to see how many fewer edits came from mobile). Next would be to gather up population and internet penetration data and then GO MAKE SOME GRAPHS, probably with iPython. I’m returning to making graphs for real after, what, 6 years or so? :) I was using C# and Excel then :) /nick ExcitedPanda
  • Also made some progress on our makeshift Android CI-ish build system for the Wikipedia app. Should be done soon.

Devlog for 19 Aug 2014

Here we go again.

  • Wrote a small blog post talking about how to package python packages as debian packages.
  • More work on getting Graphite setup for Wikimedia Labs. Everything is up and running now (after a number of embarrassing typos). Need to figure out NAT rules to let labs machines talk to labmon1001, and then it can start receiving data!!!1
  • Got access to Wikimedia’s Hadoop cluster, and ran my first HIVE query ever! This is exciting :)
  • Made packages for pygeoip and ua-parser. Having to make both trusty and precise packages was fun. I ended up building them on separate machines, but I’m sure there’s a simpler way.
  • Started doing some SQL about editors from different countries. Did you know that there are 2x as many edits to English Wikipedia from India as from Germany? And that Wikidata has had 10x as many edits from Toollabs as from Germany (the largest contributing country)? I didn’t either! This is fun.

Simple python packaging for Debian / Ubuntu

(As requested by Jean-Fred)

One of the ‘pain points’ of deploying Python stuff at Wikimedia is that pip and virtualenvs are banned in production, for reasons I now understand to be good ones: the solid signing / security issues with PyPI, and the slightly less solid but nonetheless valid ‘if we use pip for Python and gem for Ruby and npm for Node, we get an EXPLOSION OF PACKAGE MANAGERS, which makes things harder to manage’. I whined about how hard Debian packaging was for quite a while without checking how easy or hard it was to package Python specifically, and when I finally did, it turned out to be not that hard.

Use python-stdeb.

Really, that is it. Ignore most other things (until you run into issues that require them :P). It can translate most python packages that are packaged for PyPI into .debs that mostly pass lintian checks. The simplest way to package, which I’ve been following for a while, is:

  1. Install python-stdeb (from pip or apt). Usually requires the packages python-all, fakeroot and build-essential, although for some reason these aren’t required by the debian package for stdeb. Make sure you’re on the same distro you are building the package for.
  2. git clone the package from its source
  3. Run python setup.py --command-packages=stdeb.command bdist_deb (or python3 if you want to make a Python 3 package)
  4. Run lintian on it. If it spots errors, go back and fix them, usually by editing the setup.py file (or sometimes a stdeb.cfg file). This is usually rather obvious and easy enough to fix.
  5. Run dpkg -i <package> to try to install the package. This will error out if it can’t find the packages that your package depends on, which means they haven’t been packaged for debian yet. You can mostly fix this by finding each missing package, making a deb for it, and installing it as well (recursively making debs for packages as you need them). While this sounds onerous, the fact is that most python packages already exist as deb packages, and stdeb will just work for them. You might have to do this more if you’re packaging for an older distro (cough cough precise cough cough), but it is much easier on newer distros.
  6. Put your package in a repository! If you want to use this on Wikimedia Labs, you should use Labsdebrepo. Other environments will have similar ways to make the package available via apt-get. Avoid the temptation to just dpkg -i it on machines manually :)
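As an example of the stdeb.cfg mentioned in step 4, a minimal one might look like this (the dependency and suite values here are purely illustrative; see the stdeb docs for the full option list):

```ini
[DEFAULT]
; Extra Debian dependencies that stdeb cannot infer from setup.py
Depends: python-lxml
; Distribution name to put in the generated changelog
Suite: trusty
```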

That’s pretty much it! Much simpler than I originally expected, and without much in the way of confusing / conflicting docs. The docs for stdeb are pretty nice and complete, so do read them!

Will update the post as I learn more.

Paying for IRCCloud

I’ve started paying for IRCCloud.

It is the first ‘service’ I am paying for as a subscriber, I think. I’ve considered doing so for a long time, but ‘paying for IRC’ just felt… odd. I’ve been using ZNC + LimeChat. It’s decent, but sucks on mobile. Keeping a socket open all the time on a phone just kills the battery, plus the UX of most Android clients sucks.

So after seeing Sam Smith use IRCCloud during Wikimania, I took the plunge and paid for IRCCloud. It still connects to my bouncer, so I have logs under my control. It also has a very usable Android client, syncs ‘read’ status across devices, and is quite fast.

Convenience and UX won over ‘Free Software’ this time.