I walked for about 6 miles yesterday evening, after doing 3-4 miles each day for about 2-3 days before. The roads were empty, and it was lovely. I’ve listened to maybe 6-10 hours of Deborah Frances-White in the last week or so, split between The Guilty Feminist and Global Pillage. Next time though, I’m going to try to listen to podcasts less & observe my surroundings more. I am doing this (plus PT) mostly to rehab my knee. I slept for 12h last night :)
I don’t have enough experience with the pitfalls of concurrent programming where you have to use synchronization techniques. I had my first deadlock today, and after a few minutes realized it was actually an infinite recursion from a typo. I need to get a better theoretical understanding of both event loop based async programming and synchronization methods. I’ve used them with threading & Java, but feel shaky in asyncio.
simpervisor now has 100% unit test coverage. But it combines asyncio,
processes & signals - so that doesn’t give me as much confidence as it might have otherwise. I’ll take it though :)
I’ve been re-reading The Pragmatic Programmer (after initially reading it about 12 years or so ago). A lot of it still holds up, although some stuff is dated (love for Broken Window policing & Perl). It reminded me that I hadn’t really spent much time customizing VSCode and being more productive in it, so I am spending time today doing that.
I switch between vscode and my terminal quite a bit, mostly for git operations and running tests. I’m going to see if I can stay inside vscode comfortably for running pytest based tests.
It made me very sad that the only thing I wanted from the pytest integration in vscode is something the maintainers aren’t actively working on - a shortcut to run the test the cursor is currently at. I also can’t seem
to see test output directly.
I think I’ll be using the Terminal for pytest runs for now. I’m not even going to try with git.
If I was younger and not already full of projects to do, I’d have picked up and tried to make a PR for this. Boo time commitments.
I rallied late in the day & wrote some code around readiness checks in simpervisor. I don’t fully understand what I want it to do, but I’m going to look at what kubernetes does & try to follow that. Since this is being written for nbserverproxy, I am also going to try porting nbserverproxy to simpervisor to see what kind of API affordances I’m missing. Primarily, I feel there should be a lock somewhere, but:
- I don’t have a unified theory of what needs locking & why
- I don’t know if I need a lock or another synchronization mechanism
- I don’t know how it’ll actually be used by the application
So designing by porting nbserverproxy seems right.
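A Kubernetes-style readiness check is, at its core, a probe that gets polled until it succeeds or a deadline passes. Here is a minimal asyncio sketch of that idea. The name `wait_until_ready` and its parameters are hypothetical, made up for illustration, and are not simpervisor's actual API. It also sidesteps the locking question entirely.

```python
import asyncio
import time


async def wait_until_ready(is_ready, timeout=5.0, interval=0.1):
    """Poll the async callable `is_ready` until it returns True.

    Returns True as soon as the probe succeeds, or False if `timeout`
    seconds pass first. (Hypothetical helper, not simpervisor's API.)
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if await is_ready():
                return True
        except Exception:
            # A probe that raises is treated as "not ready yet".
            pass
        await asyncio.sleep(interval)
    return False
```

The probe itself could be anything awaitable, for example a coroutine that tries to open a TCP connection to the port the supervised process should be listening on.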
Gracefully exiting asyncio application
Continuing yesterday’s work on my simple supervisor library,
I continued trying to propagate signals cleanly to child processes
before exiting. I remembered that it isn’t enough to just propagate
signals - you also have to actually reap the child processes. This meant waiting for
wait calls on them to return.
I had a task running concurrently that is waiting on these
processes. So ‘all’ I had to do was make sure the application
does not exit until these tasks are done. This turned
out to be harder than I thought! After a bunch of reading, I
recognized that what I needed to do was make sure I wait
for all pending tasks before actually exiting the application.
This was more involved than I thought. It also must be done
at the application level rather than in the library - you don’t
want libraries doing sys.exit, and definitely don’t want them
closing event loops.
After a bunch of looking and playing, it looks like what I want
in my application code is something like:
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
This waits for all tasks to complete before exiting, and seems
to make sure all child processes are reaped. However, I have a
few unresolved questions:
- What happens when one of the tasks is designed to run forever,
and never exit? Should we cancel all tasks? Cancel tasks after
a timeout? Cancelling tasks after a timeout seems most
- If a task schedules more tasks, do those get run? Or are they
abandoned? This seems important - can tasks keep adding more
tasks in a loop?
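One possible answer to the first question is "wait for pending tasks, then cancel whatever is left after a timeout". The sketch below shows that approach, assuming Python 3.7+ (it uses `asyncio.run`, `asyncio.all_tasks` and `asyncio.current_task`). The `shutdown` helper and the two demo tasks are hypothetical, and this is not necessarily what the snippet above ended up doing.

```python
import asyncio


async def shutdown(timeout=5.0):
    """Wait for all other pending tasks, cancelling stragglers after `timeout`."""
    current = asyncio.current_task()
    pending = [t for t in asyncio.all_tasks() if t is not current]
    if not pending:
        return
    # Snapshot only: tasks created after this point are not waited for.
    _done, still_pending = await asyncio.wait(pending, timeout=timeout)
    for task in still_pending:
        task.cancel()
    # Let cancelled tasks run their cleanup code before we return.
    await asyncio.gather(*still_pending, return_exceptions=True)


async def main(results):
    async def reaper():
        # Stands in for a task that waits on a child process.
        await asyncio.sleep(0.05)
        results.append("reaped")

    async def forever():
        # Stands in for a task designed to never exit on its own.
        try:
            while True:
                await asyncio.sleep(3600)
        except asyncio.CancelledError:
            results.append("cancelled")
            raise

    tasks = [asyncio.create_task(reaper()), asyncio.create_task(forever())]
    await asyncio.sleep(0)  # give both tasks a chance to start
    await shutdown(timeout=0.5)
    return tasks
```

Note that the sketch only waits on a snapshot of the task list, so it does not answer the second question: a task that keeps scheduling new tasks during shutdown would need the wait to be repeated until nothing is pending.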
I am getting a better handle on what people mean by ‘asyncio is
more complicated than needed’. I’m going to find places to read
up on asyncio internals - particularly how the list of pending
tasks is maintained.
This series of blog posts
and this EuroPython talk
from Lynn Root helped a lot.
So did the talk on asyncio internals
by Saúl Ibarra Corretgé (one of the asyncio core devs).
Testing code that involves asyncio, signals and processes is
hard. I attempted to do so with
os.fork, but decided that is
super-hard mode and I’d rather not play. Instead, I wrote Python
code verbatim that is then spawned as a subprocess, and use
stdout to communicate back to the parent process. The child process’
code itself is inline
in the test file, which is terrible. I am going to move it to its
I also added tests for multiple signal handlers. I’ve been writing
a lot more tests in the last few months than I was before. I credit
Tim Head for a lot of this. It definitely
gives me a lot more confidence in my code.
I have enjoyed keeping running logs of my coding work (devlogs)
in the past, and am going to start doing those again now.
This ‘holiday’ season, I am spending time teaching myself
skills I sortof know about but do not have a deep understanding of.
I spent the first part of the day (before I started devlogging)
working on finishing up a jupyterlab extension
I started the day before. It lets you edit notebook metadata.
I got started since I wanted to use Jupytext
for my work on publishing mybinder.org-analytics.
TypeScript was easy to pick up coming from C#. I wish the
Phosphor / JupyterLab code had more documentation though.
I ran into a bug.
While following instructions to set up a JupyterLab dev
setup, I somehow managed to delete my source code.
Thankfully I got most of it back thanks to a saved copy
in vscode. It was a sour start to the morning though.
I’ll get back on to this once the sour taste is gone,
and hopefully the bug is fixed :)
asyncio: what’s next | Yury Selivanov @ PyBay 2018
I’ve been trying to get a better handle on asyncio. I can
use it, but I don’t fully understand it - I am probably leaving
From one of the asyncio maintainers. Gave me some impetus
to push the default version of Python on mybinder.org to 3.7 :D
I’m most excited about getting features from Trio & Curio into
the standard library. Was good to hear that nobody can quite
figure out exception handling, and not just me.
I discovered aioutils
while searching around after this. I’ve copy pasted code that
theoretically does the same things as
aioutils, but I’ve no idea if they are right. I’ll be using
this library from now!
I’m writing a simple process supervisor library to replace the
janky parts of nbserverproxy. It should have the following
1. Restart processes when they die
2. Propagate signals appropriately
3. Support a sense of ‘readiness’ probes (not liveness)
4. Be very well tested
5. Run on asyncio
This is more difficult than it seems, and I am slowly working my
way through it. (1) isn’t too difficult.
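As a rough illustration of why restarting on death is the easy part, here is a minimal asyncio sketch. The `supervise` function, the `max_restarts` cap and `DEMO_CMD` are hypothetical and only there to keep the example finite; a real supervisor would presumably loop until told to stop.

```python
import asyncio
import sys

# Hypothetical demo command: a child that exits immediately with status 3.
DEMO_CMD = [sys.executable, "-c", "import sys; sys.exit(3)"]


async def supervise(cmd, max_restarts=3):
    """Run `cmd`, restarting it each time it exits, up to `max_restarts` times.

    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        proc = await asyncio.create_subprocess_exec(*cmd)
        # wait() returns the exit code and also reaps the child.
        exit_codes.append(await proc.wait())
    return exit_codes
```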
(2) is a fair bit more difficult.
atexit is useless since it doesn’t do
anything with SIGTERM. So I need to manage my own SIGTERM
handlers. However, this means there needs to be a centralish
location of some sort that decides when to exit. This introduces
global state, and I don’t like that at all. But unix signals are
global, and maybe there’s nothing for me to do here.
I initially created a Supervisor class that holds a bunch of
SupervisedProcess’s, but it was still calling sys.exit in it.
Since signals are global, I realize there’s no other real
way to handle this, and so I made a global handler setup too.
This has the additional advantage of being able to remove
handlers when a SupervisedProcess dies, avoiding memory
leaks and stuff.
Testing this stuff is hard!
I also need to make sure I don’t end up with lots of races.
I’m still writing concurrent code, even without threads.
Gotta be careful. Especially with signals thrown in. Although
I guess once you get a SIGTERM or SIGINT inconsistent state
is not particularly worrisome.
This post is from conversations with Matt
Rocklin and others at the
PANGEO developer meeting at
Today, almost all of ‘the cloud’ is run by
ruthlessly competitive hypercapitalist large scale
organizations. This is great & terrible.
When writing open source applications that primarily
run on the cloud, I try to make sure my users (primarily
people deploying my software for their users) have
the following freedoms:
- They can run the software on any cloud provider they
- They can run the software on a bunch of computers they
physically own, with the help of other open source software
Ensuring these freedoms for my users requires the following
restrictions on me:
- Depend on Open Source Software with hosted cloud versions,
not proprietary cloud-vendor-only software.
I’ll use PostgreSQL over Google Cloud Datastore. Kubernetes with
autoscaling over talking to the EC2 api directly.
- Use abstractions that allow swappable implementations anytime
you have to talk to a cloud provider API directly.
Don’t talk to the S3 API directly, but have an abstract
interface that defines exactly what your application needs,
and then write an S3 implementation for it. Ideally, also
write a minio / ceph / file-system implementation for it,
to make sure your abstraction actually works.
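The "abstract interface plus swappable implementations" idea can be sketched as follows. `BlobStore`, its two methods and `save_report` are hypothetical names chosen for this example. An S3-backed class would be one more subclass and is left out here to keep the sketch dependency-free.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class BlobStore(ABC):
    """Exactly the storage operations the application needs, and nothing more."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class FileSystemStore(BlobStore):
    """Stores each key as a file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()


class InMemoryStore(BlobStore):
    """Keeps everything in a dict; handy for tests."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


def save_report(store: BlobStore, name: str, text: str) -> None:
    """Application code depends only on the interface, never on a vendor API."""
    store.put(f"reports/{name}.txt", text.encode("utf-8"))
```

Having at least two implementations from the start is what checks that the interface is not accidentally shaped around one provider.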
These are easy to follow once you are aware of them, and provide
good design tradeoffs for Open Source projects. Remember these are
necessary but not sufficient to ensure some of your users’ fundamental
I’m writing up monthly ‘work plans’ to plan what work I’m trying to do every
month, and do a retrospective after to see how much I got done. I work across
a variety of open source projects with ambiguous responsibilities, so work
planning isn’t very set. This has proven to be quite stressful for
everyone involved. Let’s see if this helps!
JupyterCon is in NYC towards the end of August, and it is going to set the pace
for a bunch of stuff. I have 2.5-ish talks to give. Need to prepare for those and
do a good job.
Matomo (formerly Piwik) on mybinder.org
mybinder.org currently uses Google Analytics. I am not a big fan. It has troubling
privacy implications, and we don’t get data as granularly as we want to. I am going
to try deploying Matomo (formerly Piwik) and using that
instead. Run both together for a while and see how we like it! Matomo requires
a MySQL database & is written in PHP - let’s see how this goes ;)
The Littlest JupyterHub 0.1 release
The Littlest JupyterHub is
doing great! I’ve done a lot of user tests, and the distribution has changed
drastically over time. It’s also the first time I’m putting my newly found
strong convictions around testing, CI & documentation to practice. You can already
check it out on GitHub. I
want to make sure we (the JupyterHub team) get out a 0.1 release in early August.
I despair about climate change and how little agency I seem to have around it
quite a bit. I’m excited to go to the PANGEO workshop
in Colorado. I’m mostly hoping to listen & understand their world some more.
Berkeley DataHub deployment
Aug 20-ish is when next semester starts at UC Berkeley. I need to have a pretty
solid JupyterHub running there by then. I’d like it to have good CI/CD set up
in a generic way, rather than something super specific to Berkeley. However,
I’m happy to shortcut this if needed, since there’s already so many things on
my plate haha.
I’m trying to spend a day or two a month at UC Davis. Partially because I like
being on Amtrak! I also think there’s a lot of cool work happening there,
and I’d like to hang out with all the cool people doing all the cool work.
On top of this, there’s ongoing medical conditions to be managed. I’m getting
Carpal Tunnel Release surgery sometime in October, so need to make sure I do
not super fuck up my hands before then. I’m also getting a cortisone shot
for my back in early August to deal with Sciatica. Fun!
Things I’m not doing!
The grading related stuff I’ve been working on is going to the backburner
for a while. I think I bit off far more than I can chew, so time to back off.
I also do not have a good intuition for the problem domain since I’ve never
written grading keys nor have I been a student in a class that got autograded.
Shit, I’ve a lot of things to do lol! I’m sure I’m forgetting some things
here that I’ve promised people. Let’s see how this goes!
Inspired by conversations with Nick Bollweg and
Matt Rocklin, I experimented with using
conda constructor as the installer for
The Littlest JupyterHub.
Theoretically, it fit the bill perfectly - I wanted a way to ship arbitrary
packages in multiple languages (python & node) in an easy to install self-contained way,
didn’t want to make debian packages & wanted to use a tool that people in the Jupyter
ecosystem were familiar with. Constructor seemed to provide just that.
I sortof got it working, but in the end ran into enough structural problems that
I decided it isn’t the right tool for this job. This blog post is a note to my
future self on why.
This isn’t a ‘takedown’ of conda or conda constructor - just a particular
use case where it didn’t work out and a demonstration of how little I know
about conda. It probably works great if you are doing more scientific computing
and less ‘ship a software system’!
Does not work with noarch packages
I <3 conda-forge and the community around
it. I know there’s a nice jupyterhub
package there, which takes care of installing JupyterHub, node, and required
However, this doesn’t actually work. conda constructor does not support noarch packages,
and JupyterHub relies on several
noarch packages. From my understanding,
conda-forge packages are moving towards being
noarch (for good reason!).
Looking at this issue,
it doesn’t look like this is a high priority item for them to fix anytime soon.
I understand that - they don’t owe the world free work! It just makes conda
constructor a no-go for my use case…
No support for pip
You can pip install packages in a conda environment, and they mostly just work.
There are a lot of python packages on PyPI that are installable via pip that
I’d like to use. constructor doesn’t support bundling these, which is entirely
fair! This PR attempted something
here, but was rejected.
So if I want to keep using packages that don’t exist in
conda-forge yet but
do exist in pip, I would have to make sure these packages and all their dependencies
exist as conda packages too. This would be fine if constructor was giving
me enough value to justify it, but right now it is not. I’ve also tried going
down a similar road (cough debian cough) and did not want to do that again :)
I wanted to set up systemd units post install. Right off the bat this should
have made me realize conda constructor was not the right tool for the job :D
The only injected environment variable is
$PREFIX, which is not super helpful
if you wanna do stuff like ‘copy this systemd unit file somewhere’. I ended up
writing a small python module that does all these things, and calling it from
post-install. However, even then I couldn’t pass any environment variables to it,
making testing / CI hard.
Currently, we have a bootstrap script
that downloads miniconda, & bootstraps from there to a full JupyterHub install.
Things like systemd units & sudo rules are managed by a python module
that is called from the bootstrap script.
This idea comes from brainstorming along with Lindsey
Head & Nick
Bollweg at the Jupyter Team Meeting 2018. Most
of the good ideas are theirs! The name is inspired by one of the favorite TV
series of one of my
I really love the idea of JupyterHub distributions - opinionated combination
of components that target a specific use case. The Zero to JupyterHub
distribution is awesome & works for most people. However, it requires
Kubernetes - a distributed system with inherent complexities that is not
worth it below a certain threshold.
This blog post lays out ideas for implementing a simpler, smaller distribution called
The Littlest JupyterHub. The Littlest JupyterHub serves the long tail of potential JupyterHub users
who have the following needs only.
- Support a very small number of students (around 20–30, maybe 50)
- Run on only one node, either a cheap VPS or a
VM on their favorite cloud provider
- Provide the same environment for all students
- Allow the instructor / admin to easily modify the environment for students with no specialized knowledge
- Be extremely low maintenance once set up & easily fixable when it breaks
- Allow easy upgrades
- Enforce memory / CPU limits for students
The target audience is primarily educators teaching small classes with
Jupyter Notebooks. It should be an extremely focused distribution, with new
feature requests facing higher scrutiny than usual. It has a legitimate
chance of actually reaching 1.0 & being stable, requiring minimal ongoing
run as standard systemd services.
Systemd spawner is used - it is lightweight, allows JupyterHub
restarts without killing user servers & provides CPU / memory isolation.
Something like First Use
+ a user whitelist might be good enough for a large number of users. New
authenticators are added whenever users ask for them.
The JupyterHub system is in its own root owned conda environment or
virtualenv, to prevent accidental damage from users.
There is a single
conda environment shared by all the users. JupyterHub admins have
write access to this environment, and everyone else has read access. Admins
can install new libraries for all users with conda/pip. No extra steps
needed, and you can do this from inside JupyterHub without needing to ssh.
Each user gets their own home directory, and can install packages there if
they wish. systemdspawner puts each user server in a systemd service, and
provides fine grained
over memory & cpu usage. Users also get their own system user, providing an
additional layer of security & standardized home directory locations.
YAML is used for config - it is the
least bad of all the currently available languages, IMO. Ideally, something
would exist for editing & applying this config. It’ll
open the config file in an editor, allow users to edit it, and apply it only
if it is valid. Advanced users can sidestep this and edit files directly. The
YAML file is read and processed directly in
simplifies things & gives us fewer things to break.
Upgrading the distribution
Backwards compatible upgrading will be supported across one
minor version only - so you can go from 0.7 to 0.8, but not 0.9. Upgrades
should not cause outages.
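The "one minor version only" rule is easy to state as a check. The sketch below is one hypothetical reading of it: `can_upgrade` is a made-up name, and treating an upgrade within the same minor version (a patch release) as allowed is an assumption, since the text only covers the 0.7 to 0.8 and 0.7 to 0.9 cases.

```python
def can_upgrade(current: str, target: str) -> bool:
    """True if `target` is the same minor version or exactly one minor ahead.

    Versions are "major.minor[.patch]" strings. Downgrades and skipping
    a minor version are both rejected.
    """
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    return cur_major == tgt_major and 0 <= tgt_minor - cur_minor <= 1
```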
Users run a command on a fresh server to install this distribution. This could use
conda constructor (thanks to Nick Bollweg & Matt
Rocklin for convincing me!)
or debian packages (with fpm or dh-virtualenv). The user environments will be
curl <some-url> | sudo bash command is available in a
nice looking website for the distribution that users can copy paste into
their fresh VM. This website also has instructions for creating a fresh VM in
popular cloud providers & VPS providers.
All systems exist in a partially degraded state all the time. Good systems
self-heal & continue to run as well as they can. When they can’t, they break
cleanly in known ways. They are observable enough to debug the issues that
cause 80% of the problems.
The Littlest JupyterHub should be a good system. Systemd captures
logs from JupyterHub, user servers & the proxy. Strong validation of the
config file catches fatal misconfigurations. Reboots actually fix most issues
and never make anything worse. Screwed up user environments are recoverable.
We’ll discover how this breaks as users of varying skill levels use it, and
update our tooling accordingly.
But, No Docker?
Docker has been explicitly excluded from this tech stack. Building custom
docker images & dealing with registries is too complex for most educators. A good
distribution embraces its constraints & does well!
Are you a person who would use a distribution like this? We would love to
hear from you! Make an issue on GitHub,
tweet at me, or send me an email.
Recently I had to write some code that had to call the kubernetes API directly,
without any language wrappers. While there are pretty good reference docs,
I didn’t want to go and construct all the JSON manually in my programming language.
I discovered that kubectl’s
-v parameter is very useful for this! With this,
I can do the following:
- Perform the actions I need to perform with just kubectl commands
- Pass -v=8 to kubectl when doing this, and this will print all the HTTP traffic
(requests and responses!) in an easy to read way
- Copy paste the JSON requests and template them as needed!
This was very useful! The fact you can see the response bodies is also nice,
since it gives you a good intuition of how to handle this in your own code.
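As an illustration of the "copy the JSON and template it" step, here is a sketch that builds (but does not send) a request to create a namespace, using only the standard library. The function name, the example server URL and the token are hypothetical. The body is what a Namespace creation request looks like in the core v1 API, but the real thing to template is whatever kubectl printed for your own action.

```python
import json
import urllib.request


def namespace_request(api_server: str, token: str, name: str) -> urllib.request.Request:
    """Build a POST request that would create a namespace called `name`."""
    # Body copied from kubectl's verbose output, with the name templated in.
    body = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name},
    }
    return urllib.request.Request(
        url=f"{api_server}/api/v1/namespaces",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Actually sending it would be a `urllib.request.urlopen(request)` call with whatever TLS settings the cluster needs.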
If you’re shelling out to
kubectl directly in your code (for some reason!),
you can also use this to figure out all the RBAC rules your code would need. For
example, if I’m going to run the following in my script:
kubectl get node
and need to figure out which RBAC rules are needed for this, I can run:
kubectl -v=8 get node 2>&1 | grep -P 'GET|POST|DELETE|PATCH|PUT'
This should list all the API requests the code is making, making it easier
to figure out what rules are needed.
Note that you might have to
rm -rf ~/.kube/cache to ‘really’ get the
full API requests list, since
kubectl caches a bunch of API autodiscovery.
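If the grep output gets long, the same filtering can be done in Python to get a de-duplicated list of HTTP method and API path pairs, which maps fairly directly onto RBAC verbs and resources. This is a sketch: the function name is hypothetical, and it assumes the verbose log lines contain the method followed by a space and the full URL, which is what the grep above relies on too.

```python
import re
from urllib.parse import urlsplit

# An HTTP method followed by a full URL, as seen in kubectl's verbose output.
REQUEST_RE = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE) (https?://\S+)")


def extract_requests(log_text):
    """Return a sorted, de-duplicated list of (method, path) pairs."""
    found = set()
    for line in log_text.splitlines():
        match = REQUEST_RE.search(line)
        if match:
            method, url = match.groups()
            # Drop scheme, host and query string; keep only the API path.
            found.add((method, urlsplit(url).path))
    return sorted(found)
```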
The minimum RBAC for kubectl is:
- nonResourceURLs: ["/api", "/apis/*"]
You will need to add additional rules for the specific commands you
want to execute.
More Kubectl Tips
- Slides from the ‘Stupid Kubectl Tricks’ KubeCon talk
- On the CoreOS blog
- Terse but useful official documentation
I haven’t done a ‘year in retrospective’ publicly for a long time, but after reading
Alice Goldfuss’ 2017 year in review decided
to do one for me too!
This is a very filtered view - there are lots of important people & events in 2017
that are not contained here, and that is ok.
I finished around 6-ish years at the Wikimedia Foundation, and joined UC Berkeley’s
Data Science Division early in the year. I grew
immensely as a person & programmer in that time. The new job gives me a lot more
responsibility and it is quite fun.
At Berkeley, I build infrastructure for students to dive into writing code to solve
their own problems in their own fields without having to navigate the accidental
complexities of software installation & configuration as much as possible. This
is in line with my previous work like Quarry or
PAWS, except it’s my main paid-for job now rather than a
side project, which is great! It lets me work
full time in realizing some of the ideas from my talk on
I’m happy with the kind of work I’m doing,
the people I am doing it with, the scale I am doing it at and the impact I think
it is having. I feel lucky & privileged to be able to do it!
Wherever I go, whatever I do - good or bad - Wikimedia
will always be partially responsible for that :)
Working closer to users
At my Wikimedia Job, I was partially responsible for maintaining the
Tool Labs infrastructure. Others (mostly volunteers) built the tools
that end users actually used. While this was still good, it made me one step
removed from the actual end users. At Berkeley, end users (both students &
faculty) directly use the infrastructure I build
This increase in directness has given me a lot of joy, happiness &
confidence about the impact of the work I’m doing.
I helped rewrite & redeploy mybinder.org as part of
the mybinder team, which was one of the high points of the year! It has had
the most public facing impact of all the projects I’ve worked on this year - even
got a glowing review
from Julia Evans! We’re now temporarily funded via a grant from the Moore
Foundation, and need to find long term sustainable solutions. We
have a lot of low hanging fruit to take on in the next year, so I am super
excited for it!
I’m now sort-of accidentally ‘inside’ Academia as defined in the US, which
is a strange and surreal experience. I’m ‘staff’, which seems to
be a distinct and different track than the grad student -> post grad -> faculty
track. From the inside, it is many moving parts rather than one behemoth - some move
fast, some slow & super cool stuff / tension at the intersections.
I don’t fully understand my place in it yet, but maybe someday I will!
At Wikimedia, I was in a team of (otherwise amazing!) operations folks that was mostly white and
male. Now, I’m in multiple diverse & multi-disciplinary teams, and it is amazing. I find it easier to do more impactful work, grow technically & professionally, build consensus and have fun. Hard to go back!
I spend time at the Berkeley Institute for Data Science,
with the interesting variety of people who are there. They’re all very smart
in different fields than I am in, and the intersection is great. I walk away
from every conversation with anyone feeling both dumber & smarter for the new knowledge
of things I now know I didn’t know! Cool (and sometimes uncomfortable) things
happen at intersections, and I want to make sure I keep being in those spaces.
I am a Maintainer
With enough involvement in the Jupyter community, I have
now found myself to be an actual Maintainer of open source projects in ways
I was not when I was at Wikimedia. Took me a while to realize this comes with a lot of
responsibility and work that’s not just ‘sit and write code’. I am still
coming to terms with it, and it’s not entirely clear to me what the
responsibilities I now have are. Thankfully I’m not a solo maintainer but have
wonderful people who have a lot of experience in this kinda stuff doing it with
I was involved in 3 talks
and 1 tutorial at JupyterCon this year, which was a
mistake I shall not make again. I also gave one talk at KubeCon NA 2017.
I am a little out of practice in giving good talks - while these were
okay, I know I can do better. I gave a number of talks to smaller internal audiences
at UC Berkeley & ran a number of JupyterHub related workshops - I quite enjoyed
those and will try to do more of that :)
I finally understood how little I had valued writing good documentation for
my projects and spent time correcting it this year. I still have a long way to
go, but the Jupyter community in general has helped me understand and get better
I’ve started working on python projects again, rather than just scripts. Some of
my skills here have rusted over years of not being heavily used. I got into
writing better tests and found lots of value in them. This is another place where
being part of the Jupyter ecosystem has made it pretty awesome for me.
This year I’ve had far more operational responsibilities than I had at Wikimedia,
and it has forced me to both learn more about automation / autonomous systems &
implement several of them. It’s been an intense personal growth spurt. I also
have the ability to work with public clouds & a lot of personal freedom on technology
choices (as long as I can support them!), and it’s been liberating. It will
be hard for me to go back to working at a place that’s automated a lot less.
Performance analysis + fixing
I did a lot of performance analysis of JupyterHub, in a ‘profile -> fix -> repeat’
loop. We got it from failing at around 600ish active users to about 4k-5k now, which
is great. I also learnt a lot about profiling in the process!
I learnt a lot about how containers work at the kernel level. Liz Rice’s
talk Building a container from scratch
made me realize that yes I could also understand containers internally! LWN’s
series of articles on cgroups and
namespaces helped a lot too. I feel better
understanding the hype & figuring out what is actually useful to me :) It pairs
well with the kubernetes knowledge I gained from 2016.
Lots happened here that I can not talk about publicly, but here is some!
Election & Belonging
The 2016 US Elections were very tough on me, causing a lot of emotional turmoil.
I participated in some protests, became disillusioned with current political systems,
despondent about possible new ones & generally just sad. I feel a bit more resilient,
but know even less than before if the US will be a good long term place for me.
I’d like it to be, and am currently operating on the assumption that the Nov 2018
elections in the US will turn better, and I can continue living here. But I am
starting German classes in a week just in case :)
My visa situation has stabilized somewhat. Due to wonderful efforts of many
people at UC Berkeley, I am possibly going to start my Green Card process soon.
My visa is getting renewed, and I’ll have to go back to India in a few months
to get it sorted. It’s a lot more stable than it was last year this time!
I did not travel out of the country much this year. I had the best Fried Chicken
of my life in New Orleans, and good Chicken 65 (!!!) in Austin. I also did my
first ever ‘road trip’, from the Bay Area to Seattle! I spent a bit of time in
New York, Portland & Seattle as well - not enough though. Paying bay area rents does
not help with travel :(
I cooked a lot more of the food I ate! I can make it as spicy or sweet as I want,
and it is still healthy if I make it at home (right?). Other people even actually
liked some of the food I made.
I haven’t fully recovered from a knee injury I had in 2016 :( It made me realize
how much I had taken my body for granted. I am taking better care of it now, and
shall continue to. I’m doing weights at home, having admitted I won’t have the
discipline to actually go to a gym regularly when it is more than a 3 minute walk…
It’s been mostly red this year! I might just stick to red from now on. I switched
out my profile picture from random stick figure to a smiling selfie that I actually
like, and it seems to have generally improved my mood.
- My primary community is now the Jupyter community, rather than the Wikimedia community.
This has had a lot of good cascading changes.
- Lots of personal changes, many I can’t publicly talk about.
- The world is a bleaker & more hopeful place than I had imagined.
The wonderful Graham Dumpleton asked on twitter why we built an entirely new tool (repo2docker) instead of using OpenShift’s cool source2image tool.
This is a very good question, and not a decision we made lightly. This post lays out some history, and explains the reasons we decided to stop using s2i. s2i is still a great tool for most production use cases, and you should use it if you’re building anything like a PaaS!
Before discussing, I want to clarify & define the various projects we are talking about.
- s2i is a nice tool from the OpenShift project that is used to build images out of git repositories. You can use heroku-like buildpacks to specify how the image should be built. It’s used in OpenShift, but can also be easily used standalone.
- BinderHub is the UI + scheduling component of Binder. This is what you see when you go to https://mybinder.org
- repo2docker is a standalone python application that takes a git repository & converts it into a docker image containing the environment that is specified in the repository. This heavily overlaps with functionality in s2i.
When repo2docker just wrapped s2i…
When we started building BinderHub, I looked around for a good heroku-like ‘repository to container image’ builder project. I first looked at Deis’ slugbuilder and dockerbuilder - they didn’t quite match our needs, and seemed a bit tied into Deis. I then found OpenShift’s source2image, and was very happy! It worked pretty well standalone, and
#openshift on IRC was very responsive.
So until July 1, we actually used s2i under the hood!
repo2docker was a wrapper that performed the following functions:
- Detect which s2i buildpack to use for a given repository
- Support building arbitrary Dockerfiles (s2i couldn’t do this)
- Support the Legacy Dockerfiles that were required under the old version of mybinder.org. The older version of mybinder.org munged these Dockerfiles, and so we needed to replicate that for compatibility.
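To make the detection step concrete, here is a minimal sketch of how a wrapper might pick a builder from the files in a repository. The marker files come from this post, but the builder names, the precedence order and the `detect_buildpack` function are assumptions for illustration, not the actual repo2docker code.

```python
from pathlib import Path

# Hypothetical mapping from marker file to builder name, checked in order.
# Neither the names nor the precedence are taken from repo2docker itself.
MARKERS = [
    ("Dockerfile", "dockerfile"),
    ("environment.yml", "conda"),
    ("requirements.txt", "python3"),
]


def detect_buildpack(repo_dir):
    """Return the name of the first builder whose marker file exists, else None."""
    repo = Path(repo_dir)
    for filename, buildpack in MARKERS:
        if (repo / filename).is_file():
            return buildpack
    return None
```

A first-match-wins lookup like this is simple, but it can only ever pick one builder per repository, which is the limitation discussed further down.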
@minrk did some wonderful work in allowing us to package the s2i binary into our python package, so users didn’t even need to download s2i separately. It worked great, and we were happy with it!
Moving off s2i
Sometime in July, we started adding support for Julia to binder/repo2docker. This brought up an interesting & vital issue - composability.
If a user had a requirements.txt in their repo and a REQUIRE file, then we’d have to provide both a Python3 and a Julia environment. To support this in s2i, we’d have needed to make a buildpack for that specific combination.
If it had a runtime.txt with contents python-2.7 and a REQUIRE file, we’d have to provide a Python3 environment, a Python2 environment, and a Julia environment. To support this in s2i, we’d have needed to make a buildpack for that combination too.
If it had an environment.yml file and a REQUIRE file, we’d have to provide a conda environment and a Julia environment. To do this, we’d have to make yet another combined buildpack.
As we add support for other languages (such as R), we’d need to keep expanding the set of buildpacks we had. It’d become a combinatorial explosion of buildpacks. This isn’t a requirement or a big deal for PaaS offerings - usually a container image should only contain one ‘application’, and those are usually built using only one language. If you use multiple languages, you just make them each into their own container & communicate over the network. However, Binder was building images that contained environments that people could explore and do things in, rather than specific applications. Since a lot of scientific computing uses multiple languages (looking at you, the people who do everything in R but scrape using Python), this was a core feature / requirement for Binder. So we couldn’t restrict people to single-language buildpacks.
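A rough way to see the explosion is to count the buildpacks needed if every combination of environments gets its own. This sketch assumes the environments can be mixed freely, which overstates things a little (some pairs are alternatives rather than additions), and the hyphenated names are hypothetical.

```python
from itertools import combinations

# Environments mentioned in this post; treating them as freely combinable
# is a simplifying assumption.
environments = ["python3", "python2", "conda", "julia", "r"]


def combined_buildpacks(envs):
    """Name one buildpack for every non-empty combination of environments."""
    return [
        "-".join(combo)
        for size in range(1, len(envs) + 1)
        for combo in combinations(envs, size)
    ]


# Show how the count grows as each environment is added.
for n in range(1, len(environments) + 1):
    print(n, "environments ->", len(combined_buildpacks(environments[:n])), "buildpacks")
```

The count is 2^n - 1 for n environments, so five of them would already mean 31 separate buildpacks to write and maintain.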
So I decided that we can generate these combinatorial buildpacks in repo2docker. We can have a script that generates the buildpacks at build time, and then we can just check in the generated code. This would let us keep using s2i for doing image builds and pushes, and allow others using s2i to use our buildpacks. Win-win!
This had the following problems:
- I was generating bash from python. This was quite error prone, since the bash also needed to carefully support the various complex environment specifications we wanted to support.
- We needed to sometimes run assemble scripts as root (such as when there is an ‘apt.txt’ requiring package installs). This would require careful usage of sudo in the generated bash for security reasons.
- This was very ‘clever’ code, and after running into a few bugs here I was convinced this ‘generate bash with python’ idea was too clever for us to use reliably.
At this point I considered making the assemble script into Python, but then I’d be either generating Python from Python, or basically writing a full library that would be invoked from inside each buildpack. We’d still need to keep repo2docker around (for Dockerfile + Legacy Dockerfile support), and the s2i buildpacks would be quite complex. This would also affect Docker image layer caching, since all activities of assemble are cached as one layer. Since a lot of repositories have similar environments (or are just building successive versions of the same repo), this gives up a good amount of caching.
So instead I decided that the right thing to do here is to dynamically generate a Dockerfile in python code, and build / push the image ourselves. S2I was great for generating a best-practices production container that runs one thing and does it well, but for binder we wanted to generate container images that captured complex environments without regard to what can run in them. Forcing s2i to do what we wanted seemed like trying to get a square peg into a round hole.
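As a minimal sketch of the idea (not the actual repo2docker implementation), a Dockerfile can be assembled from independent steps, one per configuration file found in the repository. The base image, the commands, the user id and the `install-julia-packages` helper are all placeholders assumed for illustration.

```python
# Each step: (marker file, Dockerfile lines to emit when that file is present).
# All commands below are illustrative placeholders, not repo2docker's output.
STEPS = [
    ("apt.txt", [
        "USER root",
        "RUN apt-get update && xargs -a /tmp/apt.txt apt-get install --yes",
        "USER 1000",
    ]),
    ("requirements.txt", ["RUN pip3 install -r /tmp/requirements.txt"]),
    # Hypothetical helper script standing in for the Julia install step.
    ("REQUIRE", ["RUN /usr/local/bin/install-julia-packages /tmp/REQUIRE"]),
]


def render_dockerfile(repo_files, base_image="ubuntu:16.04"):
    """Compose a Dockerfile from every step whose marker file is in repo_files."""
    lines = [f"FROM {base_image}"]
    for marker, step_lines in STEPS:
        if marker in repo_files:
            lines.append(f"COPY {marker} /tmp/{marker}")
            lines.extend(step_lines)
    return "\n".join(lines) + "\n"


print(render_dockerfile({"requirements.txt", "REQUIRE"}))
```

Because the steps compose, a repository with both requirements.txt and REQUIRE needs no special combined buildpack. Each instruction also becomes its own layer, so repositories that share early steps can reuse cached layers instead of rebuilding everything.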
So in this heavily squashed commit I removed s2i, and repo2docker became standalone. It was sad, since I really would have liked to not write extra code & keep leveraging s2i. But after we removed it the code is cleaner, easier for people to understand and maintain, and the composing works pretty well in understandable ways. So IMO it was the right thing to do!
I personally would be happy to go back to using s2i if we can find a clean way to support composability + caching there, but IMO that would make s2i too complex for its primary purpose of building images for a PaaS. I don’t see repo2docker and s2i as competitors, as much as tools of similar types in different domains. Lots of <3 to the s2i / openshift folks!
I hope this was a useful read!
Thanks to Chris Holdgraf, MinRK and Carol Willing for helping read, reason about and edit this blog post.