I walked for about 6 miles yesterday evening, after doing 3-4 miles each day for about 2-3 days before. The roads were empty, and it was lovely. I’ve listened to maybe 6-10 hours of Deborah Frances-White in the last week or so, split between The Guilty Feminist and Global Pillage. Next time though, I’m going to try to listen to podcasts less & observe my surroundings more. I am doing this (plus PT) mostly to rehab my knee. I slept for 12h last night :)
I don’t have enough experience with the pitfalls of concurrent programming where you have to use synchronization techniques. I had my first deadlock today, and after a few minutes realized it was actually an infinite recursion from a typo. I need to get a better theoretical understanding of both event loop based async programming and synchronization methods. I’ve used them with threading & Java, but feel shaky in asyncio.
simpervisor now has 100% unit test coverage. But it combines asyncio,
processes & signals - so that doesn’t give me as much confidence as it might have otherwise. I’ll take it though :)
I’ve been re-reading The Pragmatic Programmer (after initially reading it about 12 years or so ago). A lot of it still holds up, although some stuff is dated (love for Broken Window policing & Perl). It reminded me that I hadn’t really spent much time customizing VSCode and being more productive in it, so I am spending time today doing that.
I switch between vscode and my terminal quite a bit, mostly for git operations and running tests. I’m going to see if I can stay inside vscode comfortably for running pytest based tests.
It made me very sad that the only thing I wanted from the pytest integration in vscode is something the maintainers aren’t actively working on - a shortcut to run the test the cursor is currently on. I also can’t seem
to see test output directly.
I think I’ll be using the Terminal for pytest runs for now. I’m not even going to try with git.
If I was younger and not already full of projects to do, I’d have picked up and tried to make a PR for this. Boo time commitments.
I rallied late in the day & wrote some code around readiness checks in simpervisor. I don’t fully understand what I want it to do, but I’m going to look at what kubernetes does & try to follow that. Since this is being written for nbserverproxy, I am also going to try porting nbserverproxy to simpervisor to see what kind of API affordances I’m missing. Primarily, I feel there should be a lock somewhere, but:
- I don’t have a unified theory of what needs locking & why
- I don’t know if I need a lock or another synchronization mechanism
- I don’t know how it’ll actually be used by the application
So designing by porting nbserverproxy seems right.
Gracefully exiting asyncio application
Continuing yesterday’s work on my simple supervisor library,
I continued trying to propagate signals cleanly to child processes
before exiting. I remembered that it isn’t enough to just propagate
signals - you also have to actually reap the child processes. This meant waiting
for the wait calls on them to return.
I had a task running concurrently that is waiting on these
processes. So ‘all’ I had to do was make sure the application
does not exit until these tasks are done. This turned
out to be harder than I thought! After a bunch of reading, I
recognized that what I needed to do was make sure I wait
for all pending tasks before actually exiting the application.
This was more involved than I thought. It also must be done
at the application level rather than in the library - you don’t
want libraries doing sys.exit, and definitely don’t want them
closing event loops.
After a bunch of looking and playing, it looks like what I want
in my application code is something like:
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
This waits for all tasks to complete before exiting, and seems
to make sure all child processes are reaped. However, I have a
few unresolved questions:
- What happens when one of the tasks is designed to run forever,
and never exit? Should we cancel all tasks? Cancel tasks after
a timeout? Cancelling tasks after a timeout seems most
- If a task schedules more tasks, do those get run? Or are they
abandoned? This seems important - can tasks keep adding more
tasks in a loop?
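The snippet above is cut short in this copy, so here is a minimal sketch of the general idea rather than the original code. It assumes Python 3.7+ and picks the 'cancel after a timeout' option from the questions above. The names shutdown_pending, quick and forever are made up for illustration.

```python
import asyncio


async def shutdown_pending(timeout: float = 5.0) -> None:
    """Wait for every other pending task, cancelling stragglers after `timeout`."""
    current = asyncio.current_task()
    pending = [t for t in asyncio.all_tasks() if t is not current]
    if not pending:
        return
    # Give tasks a chance to finish on their own.
    _, still_running = await asyncio.wait(pending, timeout=timeout)
    # Anything left is assumed to be a run-forever task: cancel it.
    for task in still_running:
        task.cancel()
    # Await the cancelled tasks so the cancellation is actually delivered.
    await asyncio.gather(*still_running, return_exceptions=True)


async def main(results: list) -> None:
    async def quick():
        await asyncio.sleep(0.01)
        results.append("quick done")

    async def forever():
        try:
            while True:
                await asyncio.sleep(3600)
        except asyncio.CancelledError:
            results.append("forever cancelled")
            raise

    asyncio.create_task(quick())
    asyncio.create_task(forever())
    await asyncio.sleep(0)  # let both tasks start
    await shutdown_pending(timeout=0.2)


if __name__ == "__main__":
    log = []
    asyncio.run(main(log))
    print(log)
```

Note that all_tasks() is only a snapshot: tasks scheduled while the wait is in progress are not covered by this single pass, which is exactly the second open question above.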
I am getting a better handle on what people mean by ‘asyncio is
more complicated than needed’. I’m going to find places to read
up on asyncio internals - particularly how the list of pending
tasks is maintained.
This series of blog posts
and this EuroPython talk
from Lynn Root helped a lot.
So did the talk on asyncio internals by Saúl Ibarra Corretgé
(one of the asyncio core devs).
Testing code that involves asyncio, signals and processes is
hard. I attempted to do so with
os.fork, but decided that is
super-hard mode and I’d rather not play. Instead, I wrote Python
code verbatim that is then spawned as a subprocess, and use
stdout to communicate back to the parent process. The child process’
code itself is inline
in the test file, which is terrible. I am going to move it to its
I also added tests for multiple signal handlers. I’ve been writing
a lot more tests in the last few months than I was before. I credit
Tim Head for a lot of this. It definitely
gives me a lot more confidence in my code.
I have enjoyed keeping running logs of my coding work (devlogs)
in the past, and am going to start doing those again now.
This ‘holiday’ season, I am spending time teaching myself
skills I sortof know about but do not have a deep understanding of.
I spent the first part of the day (before I started devlogging)
working on finishing up a jupyterlab extension
I started the day before. It lets you edit notebook metadata.
I got started since I wanted to use Jupytext
for my work on publishing mybinder.org-analytics.
TypeScript was easy to pick up coming from C#. I wish the
phosphor / JupyterLab code had more documentation though.
I ran into a bug.
While following instructions to set up a JupyterLab dev
setup, I somehow managed to delete my source code.
Thankfully I got most of it back thanks to a saved copy
in vscode. It was a sour start to the morning though.
I’ll get back on to this once the sour taste is gone,
and hopefully the bug is fixed :)
asyncio: what’s next | Yury Selivanov @ PyBay 2018
I’ve been trying to get a better handle on asyncio. I can
use it, but I don’t fully understand it - I am probably leaving
From one of the asyncio maintainers. Gave me some impetus
to push the default version of Python on mybinder.org to 3.7 :D
I’m most excited about getting features from Trio & Curio into
the standard library. Was good to hear that nobody can quite
figure out exception handling, and not just me.
I discovered aioutils
while searching around after this. I’ve copy pasted code that
theoretically does the same things as
aioutils, but I’ve no idea if it is right. I’ll be using
this library from now on!
I’m writing a simple process supervisor library to replace the
janky parts of nbserverproxy. It should have the following
1. Restart processes when they die
2. Propagate signals appropriately
3. Support a sense of ‘readiness’ probes (not liveness)
4. Be very well tested
5. Run on asyncio
This is more difficult than it seems, and am slowly working my
way through it. (1) isn’t too difficult.
(2) is a fair bit more difficult.
atexit is useless since it doesn’t do
anything with SIGTERM. So I need to manage my own SIGTERM
handlers. However, this means there needs to be a centralish
location of some sort that decides when to exit. This introduces
global state, and I don’t like that at all. But unix signals are
global, and maybe there’s nothing for me to do here.
I initially created a Supervisor class that holds a bunch of
SupervisedProcess’s, but it was still calling sys.exit in it.
Since signals are global, I realize there’s no other real
way to handle this, and so I made a global handler setup too.
This has the additional advantage of being able to remove
handlers when a SupervisedProcess dies, avoiding memory
leaks and stuff.
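A rough sketch of what such a global handler setup might look like. This is not simpervisor’s actual code; add_handler, remove_handler, dispatch and install are hypothetical names, and the registry is deliberately just a module-level dict.

```python
import signal

# Module-level (global) registry: signal number -> list of callbacks.
_handlers = {}


def add_handler(signum, callback):
    """Register `callback` to be called when `signum` is received."""
    _handlers.setdefault(signum, []).append(callback)


def remove_handler(signum, callback):
    """Forget `callback`, e.g. when the process it belongs to has died."""
    _handlers[signum].remove(callback)


def dispatch(signum, frame=None):
    """Call every callback registered for `signum`, in registration order."""
    # Iterate over a copy so callbacks may remove themselves safely.
    for callback in list(_handlers.get(signum, [])):
        callback(signum)


def install(signum):
    """Route `signum` to dispatch(). Must be called from the main thread."""
    signal.signal(signum, dispatch)
```

The application would call install(signal.SIGTERM) once and decide for itself when to exit. In an asyncio program, loop.add_signal_handler(signum, dispatch, signum) would be the analogous registration.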
Testing this stuff is hard!
I also need to make sure I don’t end up with lots of races.
I’m still writing concurrent code, even without threads.
Gotta be careful. Especially with signals thrown in. Although
I guess once you get a SIGTERM or SIGINT, inconsistent state
is not particularly worrisome.
This post is from conversations with Matt
Rocklin and others at the
PANGEO developer meeting at
Today, almost all of ‘the cloud’ is run by
ruthlessly competitive hypercapitalist large scale
organizations. This is great & terrible.
When writing open source applications that primarily
run on the cloud, I try to make sure my users (primarily
people deploying my software for their users) have
the following freedoms:
- They can run the software on any cloud provider they
- They can run the software on a bunch of computers they
physically own, with the help of other open source software
Ensuring these freedoms for my users requires the following
restrictions on me:
- Depend on Open Source Software with hosted cloud versions,
not proprietary cloud-vendor-only software.
I’ll use PostgreSQL over Google Cloud Datastore. Kubernetes with
autoscaling over talking to the EC2 api directly.
- Use abstractions that allow swappable implementations anytime
you have to talk to a cloud provider API directly.
Don’t talk to the S3 API directly, but have an abstract
interface that defines exactly what your application needs,
and then write an S3 implementation for it. Ideally, also
write a minio / ceph / file-system implementation for it,
to make sure your abstraction actually works.
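As an illustration of the second restriction, here is a minimal sketch of an abstract storage interface with a file-system and an in-memory implementation. BlobStore, FileSystemBlobStore, InMemoryBlobStore and save_report are made-up names; an S3 or minio implementation would be another subclass and is not shown.

```python
import abc
import pathlib


class BlobStore(abc.ABC):
    """Exactly what the application needs from storage, and nothing more."""

    @abc.abstractmethod
    def put(self, key: str, data: bytes) -> None:
        ...

    @abc.abstractmethod
    def get(self, key: str) -> bytes:
        ...


class FileSystemBlobStore(BlobStore):
    """Stores each key as a file under a root directory."""

    def __init__(self, root):
        self.root = pathlib.Path(root)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()


class InMemoryBlobStore(BlobStore):
    """Keeps everything in a dict; handy for tests."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


def save_report(store: BlobStore, name: str, text: str) -> None:
    """Application code only ever sees the abstract interface."""
    store.put(f"reports/{name}", text.encode("utf-8"))
```

Running the same application code against two implementations is a cheap check that the abstraction does not leak provider-specific details.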
These are easy to follow once you are aware of them, and provide
good design tradeoffs for Open Source projects. Remember these are
necessary but not sufficient to ensure some of your users’ fundamental
I’m writing up monthly ‘work plans’ to plan what work I’m trying to do every
month, and do a retrospective after to see how much I got done. I work across
a variety of open source projects with ambiguous responsibilities, so work
planning isn’t very set. This has proven to be quite stressful for
everyone involved. Let’s see if this helps!
JupyterCon is in NYC towards the end of August, and it is going to set the pace
for a bunch of stuff. I have 2.5-ish talks to give. Need to prepare for those and
do a good job.
Matomo (formerly Piwik) on mybinder.org
mybinder.org currently uses Google Analytics. I am not a big fan. It has troubling
privacy implications, and we don’t get data as granularly as we want to. I am going
to try deploying Matomo (formerly Piwik) and using that
instead. Run both together for a while and see how we like it! Matomo requires
a MySQL database & is written in PHP - let’s see how this goes ;)
The Littlest JupyterHub 0.1 release
The Littlest JupyterHub is
doing great! I’ve done a lot of user tests, and the distribution has changed
drastically over time. It’s also the first time I’m putting my newly found
strong convictions around testing, CI & documentation to practice. You can already
check it out on GitHub. I
want to make sure we (the JupyterHub team) get out a 0.1 release in early August.
I despair about climate change and how little agency I seem to have around it
quite a bit. I’m excited to go to the PANGEO workshop
in Colorado. I’m mostly hoping to listen & understand their world some more.
Berkeley DataHub deployment
Aug 20-ish is when next semester starts at UC Berkeley. I need to have a pretty
solid JupyterHub running there by then. I’d like it to have good CI/CD set up
in a generic way, rather than something super specific to Berkeley. However,
I’m happy to shortcut this if needed, since there’s already so many things on
my plate haha.
I’m trying to spend a day or two a month at UC Davis. Partially because I like
being on Amtrak! I also think there’s a lot of cool work happening there,
and I’d like to hang out with all the cool people doing all the cool work.
On top of this, there’s ongoing medical conditions to be managed. I’m getting
Carpal Tunnel Release surgery sometime in October, so need to make sure I do
not super fuck up my hands before then. I’m also getting a cortisone shot
for my back in early August to deal with Sciatica. Fun!
Things I’m not doing!
The grading related stuff I’ve been working on is going to the backburner
for a while. I think I bit off far more than I can chew, so time to back off.
I also do not have a good intuition for the problem domain since I’ve never
written grading keys nor have I been a student in a class that got autograded.
Shit, I’ve a lot of things to do lol! I’m sure I’m forgetting some things
here that I’ve promised people. Let’s see how this goes!
Inspired by conversations with Nick Bollweg and
Matt Rocklin, I experimented with using
conda constructor as the installer for
The Littlest JupyterHub.
Theoretically, it fit the bill perfectly - I wanted a way to ship arbitrary
packages in multiple languages (python & node) in an easy to install self-contained way,
didn’t want to make debian packages & wanted to use a tool that people in the Jupyter
ecosystem were familiar with. Constructor seemed to provide just that.
I sortof got it working, but in the end ran into enough structural problems that
I decided it isn’t the right tool for this job. This blog post is a note to my
future self on why.
This isn’t a ‘takedown’ of conda or conda constructor - just a particular
use case where it didn’t work out and a demonstration of how little I know
about conda. It probably works great if you are doing more scientific computing
and less ‘ship a software system’!
Does not work with noarch packages
I <3 conda-forge and the community around
it. I know there’s a nice jupyterhub
package there, which takes care of installing JupyterHub, node, and required
However, this doesn’t actually work. conda constructor does not support
noarch packages, and JupyterHub relies on several noarch packages.
From my understanding, conda-forge packages are moving towards being
noarch (for good reason!).
Looking at this issue,
it doesn’t look like this is a high priority item for them to fix anytime soon.
I understand that - they don’t owe the world free work! It just makes conda
constructor a no-go for my use case…
No support for pip
You can pip install packages in a conda environment, and they mostly just work.
There are a lot of python packages on PyPI that are installable via pip that
I’d like to use. constructor doesn’t support bundling these, which is entirely
fair! This PR attempted something
here, but was rejected.
So if I want to keep using packages that don’t exist in
conda-forge yet but
do exist in pip, I would have to make sure these packages and all their dependencies
exist as conda packages too. This would be fine if constructor was giving
me enough value to justify it, but right now it is not. I’ve also tried going
down a similar road (cough debian cough) and did not want to do that again :)
I wanted to set up systemd units post install. Right off the bat this should
have made me realize conda constructor was not the right tool for the job :D
The only injected environment variable is $PREFIX, which is not super helpful
if you wanna do stuff like ‘copy this systemd unit file somewhere’. I ended up
writing a small python module that does all these things, and calling it from
post-install. However, even then I couldn’t pass any environment variables to it,
making testing / CI hard.
Currently, we have a bootstrap script
that downloads miniconda, & bootstraps from there to a full JupyterHub install.
Things like systemd units & sudo rules are managed by a python module
that is called from the bootstrap script.
Recently I had to write some code that had to call the kubernetes API directly,
without any language wrappers. While there are pretty good reference docs,
I didn’t want to go and construct all the JSON manually in my programming language.
I discovered that kubectl’s -v parameter is very useful for this! With this,
I can do the following:
- Perform the actions I need to perform with just kubectl commands
- Pass -v=8 to kubectl when doing this, and this will print all the HTTP traffic
(requests and responses!) in an easy to read way
- Copy paste the JSON requests and template them as needed!
This was very useful! The fact you can see the response bodies is also nice,
since it gives you a good intuition of how to handle this in your own code.
If you’re shelling out to kubectl directly in your code (for some reason!),
you can also use this to figure out all the RBAC rules your code would need. For
example, if I’m going to run the following in my script:
kubectl get node
and need to figure out which RBAC rules are needed for this, I can run:
kubectl -v=8 get node 2>&1 | grep -P 'GET|POST|DELETE|PATCH|PUT'
This should list all the API requests the code is making, making it easier
to figure out what rules are needed.
Note that you might have to rm -rf ~/.kube/cache to ‘really’ get the
full API requests list, since kubectl caches a bunch of API autodiscovery.
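If the grep output gets long, a small script can boil it down to unique (verb, path) pairs, which map fairly directly onto RBAC rules. This is a sketch: the log lines in SAMPLE are hand-written to imitate the -v=8 output and the exact format may differ between kubectl versions. extract_requests and the example hostname are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches request lines such as "...] GET https://host/api/v1/nodes?limit=500".
REQUEST_RE = re.compile(r"\]\s+(GET|POST|PUT|PATCH|DELETE)\s+(https?://\S+)")


def extract_requests(log_text):
    """Return unique (verb, path) pairs in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        match = REQUEST_RE.search(line)
        if match is None:
            continue
        verb, url = match.groups()
        pair = (verb, urlsplit(url).path)  # drop host and query string
        if pair not in seen:
            seen.append(pair)
    return seen


# Illustrative imitation of `kubectl -v=8 get node 2>&1` output (an assumption).
SAMPLE = """\
I0101 00:00:00.000000    1234 round_trippers.go:383] GET https://k8s.example.invalid/api/v1/nodes?limit=500
I0101 00:00:00.000000    1234 round_trippers.go:390] Request Headers:
I0101 00:00:00.000000    1234 round_trippers.go:393]     Accept: application/json
I0101 00:00:00.000000    1234 round_trippers.go:408] Response Status: 200 OK in 12 milliseconds
"""

if __name__ == "__main__":
    for verb, path in extract_requests(SAMPLE):
        print(verb, path)
```

In practice the input would be the captured stderr of the real kubectl command rather than SAMPLE.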
The minimum RBAC for kubectl is:
- nonResourceURLs: ["/api", "/apis/*"]
You will need to add additional rules for the specific commands you
want to execute.
More Kubectl Tips
- Slides from the ‘Stupid Kubectl Tricks’ KubeCon talk
- On the CoreOS blog
- Terse but useful official documentation
Note: I’m trying to spend time explicitly writing random side projects that are not related to what I’m actively working on as my main project in some form.
A random thread started by Ironholds on a random mailing list I was wearily catching up on contained a joke from bearloga about malformed User Agents. This prompted me to write UAuliver (source), a Firefox extension that randomizes your user agent to be a random string of emoji. This breaks a surprisingly large amount of software, I’m told! (GMail & Gerrit being the ones I explicitly remember)
Things I learnt from writing this:
- Writing Addons for Firefox is far easier to get started with than they were the last time I looked. Despite the confusing naming (Jetpack API == SDK != WordPress’ Jetpack API != Addons != Plugins != WebExtension), the documentation and tooling were nice enough that I could finish all of this in a few hours!
- Generating a ‘string of emoji’ is easier/harder than you would think, depending on how you would like to define ’emoji’. The fact that Unicode deals in blocks that at least in this case aren’t too split up made this quite easy (I used the list on Wikipedia to generate them). JS’s String.fromCodePoint can also be used to detect if the codepoint you just generated randomly is actually allocated.
- I don’t actually know how HTTP headers deal with encoding and unicode. This is something I need to actually look up. Perhaps a re-read of the HTTP RFC is in order!
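The ‘pick random code points from a Unicode block’ idea looks roughly like this. It is a Python sketch rather than the extension’s JavaScript, and it assumes a single block (Emoticons, U+1F600 to U+1F64F) instead of the full list from Wikipedia. random_emoji_string is a hypothetical name.

```python
import random

# The Unicode "Emoticons" block: U+1F600 through U+1F64F inclusive.
EMOTICONS = range(0x1F600, 0x1F650)


def random_emoji_string(length, rng=None):
    """Return `length` characters drawn uniformly from the Emoticons block."""
    rng = rng or random.Random()
    return "".join(chr(rng.choice(EMOTICONS)) for _ in range(length))


if __name__ == "__main__":
    print(random_emoji_string(8))
```

Passing a seeded random.Random makes the output reproducible, which helps when testing.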
It was a fun exercise, and I might write more Firefox extensions in the future!
localhost is always 127.0.0.1, right? Nope, can also be ::1 if your system only has IPv6 (apparently).
Asking a DNS server for an A record for localhost should give you back 127.0.0.1, right? Nope - it varies wildly!
126.96.36.199 gives me an NXDOMAIN which means it tells you straight up THIS DOMAIN DOES NOT EXIST! Which is true, since localhost isn’t a domain. But if you ask the same thing of any dnsmasq server, it’ll tell you localhost is 127.0.0.1. Other servers vary wildly - I found one that returned an NXDOMAIN for AAAA but 127.0.0.1 for A (which is pretty wild, since NXDOMAIN makes most software treat the domain as not existing and not attempt other lookups). So localhost and DNS servers don’t mix very well.
But why is this a problem, in general? Most DNS resolution happens via the gethostbyname libc call, which reads /etc/hosts properly, right? Problem there is that there is popular software that’s completely asynchronous (cough nginx cough) that does not use gethostbyname (since that’s synchronous) and directly queries DNS servers (asynchronously). This works perfectly well until you try to hit localhost and it tells you ‘no such thing!’.
I should probably file a bug with nginx to have them read /etc/hosts as well, and in the meantime work around by sending 127.0.0.1 to nginx rather than localhost.
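To see what the system resolver path (the one that consults /etc/hosts on typical setups) returns for a name, a quick check could look like this. It is a sketch, and resolve_with_system_resolver and is_loopback are hypothetical helper names.

```python
import ipaddress
import socket


def resolve_with_system_resolver(hostname):
    """Resolve via getaddrinfo, i.e. the libc path rather than raw DNS queries."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def is_loopback(address):
    """True for 127.0.0.0/8 and ::1."""
    return ipaddress.ip_address(address).is_loopback


if __name__ == "__main__":
    for addr in resolve_with_system_resolver("localhost"):
        print(addr, is_loopback(addr))
```

What this prints for localhost depends on the machine: it may be 127.0.0.1, ::1, or both.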
How did your Thursday go?
(As requested by Jean-Fred)
One of the ‘pain points’ with working on deploying python stuff at Wikimedia is that virtualenvs are banned on production, for some (what I now understand as) good reasons (the solid Signing / Security issues with PyPI, and the slightly less solid but nonetheless valid ‘If we use pip for python and gem for ruby and npm for node, EXPLOSION OF PACKAGE MANAGERS and makes things harder to manage’). I was whining about how hard debian packaging was for quite a while without checking how easy/hard it was to package python specifically, and when I finally did, it turned out to be not that hard.
Really, that is it. Ignore most other things (until you run into issues that require them :P). It can translate most python packages that are packaged for PyPI into .debs that mostly pass lintian checks. The simplest way to package, which I’ve been following for a while, is:
- Install python-stdeb (from pip or apt). Usually requires the packages build-essential, although for some reason these aren’t required by the debian package for stdeb. Make sure you’re on the same distro you are building the package for.
- git clone the package from its source
- Run python setup.py --command-packages=stdeb.command bdist_deb (or python3 if you want to make a Python 3 package)
- Run lintian on it. If it spots errors, go back and fix them, usually by editing the setup.py file (or sometimes a stdeb.cfg file). This is usually rather obvious and easy enough to fix.
- Run dpkg -i <package> to try to install the package. This will error out if it can’t find the packages that your package depends on. This means that they haven’t been packaged for debian yet. You can mostly fix this by finding that package, and making a deb for it, and installing it as well (recursively making debs for packages as you need them). While this sounds onerous, the fact is that most python packages already exist as deb packages and stdeb will just work for them. You might have to do this more if you’re packaging for an older distro (cough cough precise cough cough), but it is much easier on newer distros.
- Put your package in a repository! If you want to use this on Wikimedia Labs, you should use Labsdebrepo. Other environments will have similar ways to make the package available via apt-get. Avoid the temptation to just dpkg -i it on machines manually :)
That’s pretty much it! Much simpler than I originally expected, and there aren’t many confusing / conflicting docs. The docs for stdeb are pretty nice and complete, so do read these!
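The build and lint steps can be wrapped in a small script. This is a sketch under assumptions: it assumes stdeb and lintian are installed, and that stdeb writes its .deb files into a deb_dist directory next to setup.py. build_command and build_and_lint are hypothetical names.

```python
import glob
import os
import subprocess


def build_command(python="python3"):
    """argv for the stdeb build step quoted in the list above."""
    return [python, "setup.py", "--command-packages=stdeb.command", "bdist_deb"]


def build_and_lint(source_dir, python="python3"):
    """Build the .deb in `source_dir`, run lintian on each result, return the paths."""
    subprocess.run(build_command(python), cwd=source_dir, check=True)
    # Assumption: stdeb places built packages under deb_dist/.
    debs = sorted(glob.glob(os.path.join(source_dir, "deb_dist", "*.deb")))
    for deb in debs:
        # lintian exits non-zero when it finds errors, so do not raise here.
        subprocess.run(["lintian", deb], check=False)
    return debs
```

Calling build_and_lint(".") from a checked-out package would run both steps; installing with dpkg -i and publishing to a repository stay manual.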
Will update the post as I learn more.