Yuvi Panda

JupyterHub | MyBinder | Kubernetes | Open Culture

Freedoms for Open Source Software users in the Cloud

This post is from conversations with Matt Rocklin and others at the PANGEO developer meeting at NCAR.

Today, almost all of ‘the cloud’ is run by ruthlessly competitive, hypercapitalist, large-scale organizations. This is great & terrible.

When writing open source applications that primarily run on the cloud, I try to make sure my users (primarily people deploying my software for their users) have the following freedoms:

  1. They can run the software on any cloud provider they choose
  2. They can run the software on a bunch of computers they physically own, with the help of only other open source software

Ensuring these freedoms for my users requires the following restrictions on me:

  1. Depend on Open Source Software that has hosted cloud versions, not on proprietary cloud-vendor-only software.

I’ll use PostgreSQL over Google Cloud Datastore, and Kubernetes with autoscaling over talking to the EC2 API directly.

  2. Use abstractions that allow swappable implementations anytime you have to talk to a cloud provider API directly.

Don’t talk to the S3 API directly, but have an abstract interface that defines exactly what your application needs, and then write an S3 implementation for it. Ideally, also write a minio / ceph / file-system implementation for it, to make sure your abstraction actually works.
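A minimal sketch of what such an abstraction could look like in Python (all the class and method names here are hypothetical, invented for this illustration; they are not from any real project):

import abc
import os
import shutil

class ObjectStore(abc.ABC):
    """Defines exactly what the application needs - nothing more."""

    @abc.abstractmethod
    def put(self, key, local_path):
        """Upload the file at local_path under key."""

    @abc.abstractmethod
    def get(self, key, local_path):
        """Download the object at key to local_path."""

class S3ObjectStore(ObjectStore):
    def __init__(self, bucket):
        import boto3  # only this implementation needs boto3
        self.bucket = bucket
        self.s3 = boto3.client('s3')

    def put(self, key, local_path):
        self.s3.upload_file(local_path, self.bucket, key)

    def get(self, key, local_path):
        self.s3.download_file(self.bucket, key, local_path)

class FileSystemObjectStore(ObjectStore):
    """Useful for local dev & tests - and proves the abstraction holds."""

    def __init__(self, root):
        self.root = root

    def put(self, key, local_path):
        shutil.copy(local_path, os.path.join(self.root, key))

    def get(self, key, local_path):
        shutil.copy(os.path.join(self.root, key), local_path)

The application only ever sees ObjectStore, so adding a minio or ceph implementation later doesn’t touch application code.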

These restrictions are easy to follow once you are aware of them, and they provide good design tradeoffs for Open Source projects. Remember, these are necessary but not sufficient to ensure some of your users’ fundamental freedoms.

Aug 2018 Work Plan

I’m writing up monthly ‘work plans’ to plan the work I’m trying to do each month, with a retrospective afterwards to see how much I got done. I work across a variety of open source projects with ambiguous responsibilities, so my work planning isn’t very structured. This has proven quite stressful for everyone involved. Let’s see if this helps!

JupyterCon

JupyterCon is in NYC towards the end of August, and it is going to set the pace for a bunch of stuff. I have 2.5-ish talks to give. Need to prepare for those and do a good job.

Matomo (formerly Piwik) on mybinder.org

mybinder.org currently uses Google Analytics. I am not a big fan. It has troubling privacy implications, and we don’t get data at the granularity we want. I am going to try deploying Matomo (formerly Piwik) and using that instead. We’ll run both together for a while and see how we like it! Matomo requires a MySQL database & is written in PHP - let’s see how this goes ;)

The Littlest JupyterHub 0.1 release

The Littlest JupyterHub is doing great! I’ve done a lot of user tests, and the distribution has changed drastically over time. It’s also the first time I’m putting my newfound strong convictions around testing, CI & documentation into practice. You can already check it out on GitHub. I want to make sure we (the JupyterHub team) get out a 0.1 release in early August.

Pangeo Workshop

I despair quite a bit about climate change and how little agency I seem to have around it. I’m excited to go to the PANGEO workshop in Colorado. I’m mostly hoping to listen & understand their world some more.

Berkeley DataHub deployment

Aug 20-ish is when the next semester starts at UC Berkeley. I need to have a pretty solid JupyterHub running there by then. I’d like it to have good CI/CD set up in a generic way, rather than something super specific to Berkeley. However, I’m happy to shortcut this if needed, since there are already so many things on my plate haha.

UC Davis

I’m trying to spend a day or two a month at UC Davis. Partially because I like being on Amtrak! I also think there’s a lot of cool work happening there, and I’d like to hang out with all the cool people doing all the cool work.

Personal

On top of this, there are ongoing medical conditions to be managed. I’m getting Carpal Tunnel Release surgery sometime in October, so I need to make sure I do not super fuck up my hands before then. I’m also getting a cortisone shot for my back in early August to deal with sciatica. Fun!

Things I’m not doing!

The grading-related stuff I’ve been working on is going to the backburner for a while. I think I bit off far more than I can chew, so it’s time to back off. I also do not have a good intuition for the problem domain, since I’ve never written grading keys nor been a student in a class that got autograded.

In conclusion…

Shit, I’ve a lot of things to do lol! I’m sure I’m forgetting some things here that I’ve promised people. Let’s see how this goes!

Conda Constructor Thoughts

Inspired by conversations with Nick Bollweg and Matt Rocklin, I experimented with using conda constructor as the installer for The Littlest JupyterHub. Theoretically, it fit the bill perfectly - I wanted a way to ship arbitrary packages in multiple languages (python & node) in an easy-to-install, self-contained way, didn’t want to make debian packages, & wanted to use a tool that people in the Jupyter ecosystem were familiar with. Constructor seemed to provide just that.

I sort of got it working, but in the end ran into enough structural problems that I decided it isn’t the right tool for this job. This blog post is a note to my future self on why.

This isn’t a ‘takedown’ of conda or conda constructor - just a particular use case where it didn’t work out and a demonstration of how little I know about conda. It probably works great if you are doing more scientific computing and less ‘ship a software system’!

Does not work with conda-forge

I <3 conda-forge and the community around it. I know there’s a nice jupyterhub package there, which takes care of installing JupyterHub, node, and required node modules.

However, this doesn’t actually work. conda constructor does not support noarch packages, and JupyterHub relies on several noarch packages. From my understanding, more conda-forge packages are moving towards being noarch (for good reason!).

Looking at this issue, it doesn’t look like this is a high priority item for them to fix anytime soon. I understand that - they don’t owe the world free work! It just makes conda constructor a no-go for my use case…

No support for pip

You can pip install packages in a conda environment, and they mostly just work. There are a lot of python packages on PyPI that are installable via pip that I’d like to use. constructor doesn’t support bundling these, which is entirely fair! This PR attempted something along those lines, but was rejected.

So if I want to keep using packages that don’t exist in conda-forge yet but do exist in pip, I would have to make sure these packages and all their dependencies exist as conda packages too. This would be fine if constructor was giving me enough value to justify it, but right now it is not. I’ve also tried going down a similar road (cough debian cough) and did not want to do that again :)

Awkward post-install.bash

I wanted to set up systemd units post-install. Right off the bat, this should have made me realize conda constructor was not the right tool for the job :D The only injected environment variable is $PREFIX, which is not super helpful if you wanna do stuff like ‘copy this systemd unit file somewhere’. I ended up writing a small python module that does all these things, and calling it from post-install. However, even then I couldn’t pass any environment variables to it, making testing / CI hard.

Current solution

Currently, we have a bootstrap script that downloads miniconda, & bootstraps from there to a full JupyterHub install. Things like systemd units & sudo rules are managed by a python module that is called from the bootstrap script.
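For a rough idea of the division of labour, here’s a stripped-down sketch of what the systemd part of such a python module might look like (the file paths and unit name are hypothetical, not the actual TLJH code):

import shutil
import subprocess

UNIT_SRC = 'resources/jupyterhub.service'             # hypothetical path
UNIT_DEST = '/etc/systemd/system/jupyterhub.service'  # hypothetical unit name

def install_systemd_unit():
    # Copy the unit file into place, then ask systemd to reload its
    # config and start the service both on boot and right now
    shutil.copy(UNIT_SRC, UNIT_DEST)
    subprocess.check_call(['systemctl', 'daemon-reload'])
    subprocess.check_call(['systemctl', 'enable', '--now', 'jupyterhub'])

Because this is plain Python called from the bootstrap script, passing environment variables to it & testing it in CI is straightforward - exactly what post-install.bash made hard.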

Kubectl verbose logging tricks

Recently I had to write some code that called the Kubernetes API directly, without any language wrappers. While there are pretty good reference docs, I didn’t want to construct all the JSON manually in my programming language.

I discovered that kubectl’s -v parameter is very useful for this! With this, I can do the following:

  1. Perform the actions I need to perform with just kubectl commands
  2. Pass -v=8 to kubectl when doing this, and it will print all the HTTP traffic (requests and responses!) in an easy-to-read way
  3. Copy paste the JSON requests and template them as needed!

This was very useful! The fact that you can see the response bodies is also nice, since it gives you a good intuition for how to handle them in your own code.
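For example, after capturing a pod creation with kubectl -v=8, replaying it from Python could look roughly like this (the server URL, token, CA path & pod body are placeholders you’d fill in from your own cluster and the captured output):

import json
import requests

API_SERVER = 'https://my-cluster:6443'  # placeholder
TOKEN = 'REDACTED'                      # placeholder service account token

# JSON body copy-pasted from the kubectl -v=8 output, then templated
pod = {
    'apiVersion': 'v1',
    'kind': 'Pod',
    'metadata': {'name': 'test-pod'},
    'spec': {'containers': [{'name': 'main', 'image': 'busybox'}]},
}

resp = requests.post(
    API_SERVER + '/api/v1/namespaces/default/pods',
    headers={'Authorization': 'Bearer ' + TOKEN},
    json=pod,
    verify='/path/to/ca.crt',  # placeholder cluster CA certificate
)
print(resp.status_code)
print(json.dumps(resp.json(), indent=2))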

If you’re shelling out to kubectl directly in your code (for some reason!), you can also use this to figure out all the RBAC rules your code would need. For example, if I’m going to run the following in my script:

kubectl get node

and need to figure out which RBAC rules are needed for this, I can run:

kubectl -v=8 get node 2>&1 | grep -P 'GET|POST|DELETE|PATCH|PUT'

This should list all the API requests the code is making, making it easier to figure out what rules are needed.

Note that you might have to rm -rf ~/.kube/cache to ‘really’ get the full API requests list, since kubectl caches a bunch of API autodiscovery. The minimum RBAC for kubectl is:

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: kubectl-minimum
rules:
- nonResourceURLs: ["/api", "/apis/*"]
  verbs: ["get"]

You will need to add additional rules for the specific commands you want to execute.

More Kubectl Tips

  1. Slides from the ‘Stupid Kubectl Tricks’ KubeCon talk
  2. On the CoreOS blog
  3. Terse but useful official documentation

Side Project: UA Emoji Firefox Extension

Note: I’m trying to explicitly set aside time for random side projects that are not related in any way to what I’m actively working on as my main project.

A random thread started by Ironholds on a random mailing list I was wearily catching up on contained a joke from bearloga about malformed User Agents. This prompted me to write UAuliver (source), a Firefox extension that randomizes your user agent to be a random string of emoji. This breaks a surprising amount of software, I’m told! (Gmail & Gerrit being the ones I explicitly remember)

Things I learnt from writing this:

  1. Writing add-ons for Firefox is far easier to get started with than it was the last time I looked. Despite the confusing naming (Jetpack API == SDK != WordPress’ Jetpack API != Addons != Plugins != WebExtension), the documentation and tooling were nice enough that I could finish all of this in a few hours!
  2. I can still write syntactically correct Javascript! \o/
  3. Generating a ‘string of emoji’ is easier/harder than you would think, depending on how you would like to define ‘emoji’. The fact that Unicode deals in blocks that, at least in this case, aren’t too split up made this quite easy (I used the list on Wikipedia to generate them). JS’s String.fromCodePoint can also be used to check that the codepoint you just generated randomly is actually valid (see the sketch after this list).
  4. I don’t actually know how HTTP headers deal with encoding and unicode. This is something I need to actually look up. Perhaps a re-read of the HTTP RfC is in order!
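For illustration, here is roughly the same random-emoji trick sketched in Python rather than JS (the block ranges below are real Unicode emoji blocks; the function itself is made up for this sketch):

import random

# A few contiguous Unicode emoji blocks, as (start, end) inclusive
EMOJI_BLOCKS = [
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
]

def random_emoji_string(length=12):
    chars = []
    for _ in range(length):
        start, end = random.choice(EMOJI_BLOCKS)
        chars.append(chr(random.randint(start, end)))
    return ''.join(chars)

print(random_emoji_string())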

It was a fun exercise, and I might write more Firefox extensions in the future!

DNS servers, localhost and asynchronous code

localhost is always 127.0.0.1, right? Nope - it can also be ::1 if your system only has IPv6 (apparently).

Asking a DNS server for an A record for localhost should give you back 127.0.0.1, right? Nope – it varies wildly! 8.8.8.8 gives me an NXDOMAIN, which means it tells you straight up THIS DOMAIN DOES NOT EXIST! Which is true, since localhost isn’t a domain. But if you ask the same thing of any dnsmasq server, it’ll tell you localhost is 127.0.0.1. Other servers vary wildly – I found one that returned an NXDOMAIN for AAAA but 127.0.0.1 for A (which is pretty wild, since NXDOMAIN makes most software treat the domain as not existing and not attempt other lookups). So localhost and DNS servers don’t mix very well.

But why is this a problem, in general? Most DNS resolution happens via the gethostbyname libc call, which reads /etc/hosts properly, right? The problem is that there is popular software that’s completely asynchronous (cough nginx cough) that does not use gethostbyname (since that’s synchronous) and instead queries DNS servers directly (asynchronously). This works perfectly well until you try to hit localhost and it tells you ‘no such thing!’.
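You can see the difference from Python: socket.getaddrinfo goes through libc (and thus reads /etc/hosts), while querying a DNS server directly bypasses it entirely (sketched here with the dnspython package, purely as an illustration):

import socket

# Resolves via libc, which consults /etc/hosts - returns 127.0.0.1 (or ::1)
print(socket.getaddrinfo('localhost', 80, proto=socket.IPPROTO_TCP))

# Asks 8.8.8.8 directly, bypassing /etc/hosts - this raises
# dns.resolver.NXDOMAIN instead of returning 127.0.0.1
import dns.resolver  # pip install dnspython
resolver = dns.resolver.Resolver()
resolver.nameservers = ['8.8.8.8']
for rdata in resolver.query('localhost', 'A'):
    print(rdata)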

I should probably file a bug with nginx to have them read /etc/hosts as well, and in the meantime work around it by sending 127.0.0.1 to nginx rather than localhost.

How did your Thursday go?

Simple python packaging for Debian / Ubuntu

(As requested by Jean-Fred)

One of the ‘pain points’ of deploying python stuff at Wikimedia is that pip and virtualenvs are banned on production, for some (what I now understand as) good reasons: the solid signing / security issues with PyPI, and the slightly less solid but nonetheless valid ‘if we use pip for python and gem for ruby and npm for node, we get an EXPLOSION OF PACKAGE MANAGERS, which makes things harder to manage’. I was whining about how hard debian packaging was for quite a while without checking how easy or hard it was to package python specifically. When I finally did, it turned out to be not that hard at all.

Use python-stdeb.

Really, that is it. Ignore most other things (until you run into issues that require them :P). It can translate most python packages that are packaged for PyPI into .debs that mostly pass lintian checks. The simplest way to package, which I’ve been following for a while, is:

  1. Install python-stdeb (from pip or apt). It usually requires the packages python-all, fakeroot and build-essential, although for some reason these aren’t required by the debian package for stdeb. Make sure you’re on the same distro you are building the package for.
  2. git clone the package from its source
  3. Run python setup.py --command-packages=stdeb.command bdist_deb (or python3 if you want to make a Python 3 package)
  4. Run lintian on it. If it spots errors, go back and fix them, usually by editing the setup.py file (or sometimes a stdeb.cfg file). This is usually rather obvious and easy enough to fix.
  5. Run dpkg -i <package> to try to install the package. This will error out if it can’t find the packages that your package depends on, which means they haven’t been packaged for debian yet. You can mostly fix this by finding each such package, making a deb for it, and installing it as well (recursively making debs for dependencies as you need them). While this sounds onerous, most python packages already exist as deb packages, and stdeb will just work for them. You might have to do this more if you’re packaging for an older distro (cough cough precise cough cough), but it is much easier on newer distros.
  6. Put your package in a repository! If you want to use this on Wikimedia Labs, you should use Labsdebrepo. Other environments will have similar ways to make the package available via apt-get. Avoid the temptation to just dpkg -i it on machines manually :)

That’s pretty much it! Much simpler than I originally expected, with not many confusing / conflicting docs. The docs for stdeb are pretty nice and complete, so do read them!

Will update the post as I learn more.

Paying for IRCCloud

I’ve started paying for IRCCloud.

It is the first ‘service’ I am paying for as a subscriber, I think. I’ve considered doing that for a long time, but ‘paying for IRC’ just felt… odd. I’ve been using ZNC + LimeChat. It’s decent, but sucks on mobile - keeping a socket open all the time on a phone just kills the battery, plus the UX on most Android clients sucks.

So after seeing Sam Smith use IRCCloud during Wikimania, I took the plunge and paid for IRCCloud. It still connects to my bouncer, so I have logs under my control. It also has a very usable Android client, syncs ‘read’ status across devices, and is quite fast.

Convenience and UX won over ‘Free Software’ this time.

“Write once, attempt to debug everywhere”

“Write once, attempt to debug everywhere” is probably way more accurate than “Write once, run anywhere” – at least for GUI apps. There’s something rather special about ideas that are theoretically amazingly wonderful yet end up being a major pain in the butt when you try to put them into practice, isn’t there?

The older Wikipedia App happened to be in PhoneGap, and I consider it one of my biggest blunders to not have torn it down on day 0 and rewritten it in something saner. I once started writing a blog post for the Wikimedia Blog about why we switched from PhoneGap to native apps, but it was killed for having too much profanity in it :) Someday, perhaps.