Yuvi Panda

JupyterHub | MyBinder | Kubernetes | Open Culture

Why repo2docker? Why not s2i?

https://xkcd.com/927/

The wonderful Graham Dumpleton asked on twitter why we built an entirely new tool (repo2docker) instead of using OpenShift’s cool source2image tool.

This is a very good question, and not a decision we made lightly. This post lays out some history, and explains the reasons we decided to stop using s2i. s2i is still a great tool for most production use cases, and you should use it if you’re building anything like a PaaS!

Terminology

Before going any further, I want to clarify & define the various projects we are talking about.

  1. s2i is a nice tool from the OpenShift project that is used to build images out of git repositories. You can use heroku-like buildpacks to specify how the image should be built. It’s used in OpenShift, but can also be easily used standalone.
  2. BinderHub is the UI + scheduling component of Binder. This is what you see when you go to https://mybinder.org
  3. repo2docker is a standalone python application that takes a git repository & converts it into a docker image containing the environment that is specified in the repository. This heavily overlaps with functionality in s2i.

When repo2docker just wrapped s2i…

When we started building BinderHub, I looked around for a good heroku-like ‘repository to container image’ builder project. I first looked at Deis’ slugbuilder and dockerbuilder - they didn’t quite match our needs, and seemed a bit tied into Deis. I then found OpenShift’s source2image, and was very happy! It worked pretty well standalone, and #openshift on IRC was very responsive.

So until July 1, we actually used s2i under the hood! repo2docker was a wrapper that performed the following functions:

  1. Detect which s2i buildpack to use for a given repository
  2. Support building arbitrary Dockerfiles (s2i couldn’t do this)
  3. Support the Legacy Dockerfiles that were required under the old version of mybinder.org. The older version of mybinder.org munged these Dockerfiles, and so we needed to replicate that for compatibility.

@minrk did some wonderful work in allowing us to package the s2i binary into our python package, so users didn’t even need to download s2i separately. It worked great, and we were happy with it!

Moving off s2i

Sometime in July, we started adding support for Julia to binder/repo2docker. This brought up an interesting & vital issue - composability.

If a user had a requirements.txt in their repo and a REQUIRE file, then we’d have to provide both a Python3 and Julia environment. To support this in s2i, we’d have needed to make a python3-julia buildpack.

If it had a requirements.txt, a runtime.txt with contents python-2.7 and a REQUIRE file, we’d have to provide a Python3 environment, a Python2 environment, and a Julia environment. To support this in s2i, we’d have needed to make a python3-python2-julia buildpack.

If it had an environment.yml file and a REQUIRE file, we’d have to provide a conda environment and a Julia environment. To do this, we’d have to make a conda-julia buildpack.

As we add support for other languages (such as R), we’d need to keep expanding the set of buildpacks we had. It’d become a combinatorial explosion of buildpacks. This isn’t a requirement or a big deal for PaaS offerings - usually a container image should only contain one ‘application’, and those are usually built using only one language. If you use multiple languages, you just make them each into their own container & communicate over the network. However, Binder was building images that contained environments that people could explore and do things in, rather than specific applications. Since a lot of scientific computing uses multiple languages (looking at you, the people who do everything in R but scrape using Python), this was a core feature / requirement for Binder. So we couldn’t restrict people to single-language buildpacks.
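To make the combinatorics concrete, here is a minimal sketch in Python of the kind of detection involved. The function name and the file-to-runtime mapping are illustrative (taken from the examples above), not repo2docker's actual code:

import os

# Hypothetical sketch (not actual repo2docker code): map the files a
# repository contains to the set of runtimes its environment needs.
def detect_environments(repo_path):
    markers = {
        "requirements.txt": "python3",
        "environment.yml": "conda",
        "REQUIRE": "julia",
    }
    needed = set()
    for filename, runtime in markers.items():
        if os.path.exists(os.path.join(repo_path, filename)):
            needed.add(runtime)
    # A runtime.txt containing 'python-2.7' asks for an additional Python 2 env
    runtime_txt = os.path.join(repo_path, "runtime.txt")
    if os.path.exists(runtime_txt):
        with open(runtime_txt) as f:
            if f.read().strip() == "python-2.7":
                needed.add("python2")
    return needed

# A repo with requirements.txt + runtime.txt (python-2.7) + REQUIRE needs
# {'python3', 'python2', 'julia'} -- and with single-language s2i buildpacks,
# every such combination needs its own hand-written buildpack.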

So I decided that we could generate these combinatorial buildpacks in repo2docker. We could have a script that generates the buildpacks at build time, and then just check in the generated code. This would let us keep using s2i for doing image builds and pushes, and allow others using s2i to use our buildpacks. Win-win!

This had the following problems:

  1. I was generating bash from Python. This was quite error-prone, since the generated bash also needed to carefully support the various complex environment specifications we wanted to support.
  2. We needed to sometimes run assemble scripts as root (such as when there is an ‘apt.txt’ requiring package installs). This would require careful usage of sudo in the generated bash for security reasons.
  3. This was very ‘clever’ code, and after running into a few bugs here I was convinced this ‘generate bash with python’ idea was too clever for us to use reliably.

At this point I considered making the assemble script into Python, but then I’d be either generating Python from Python, or basically writing a full library that would be invoked from inside each buildpack. We’d still need to keep repo2docker around (for Dockerfile + Legacy Dockerfile support), and the s2i buildpacks would be quite complex. This would also hurt Docker image layer caching, since everything the assemble script does is cached as a single layer. Since a lot of repositories have similar environments (or are just building successive versions of the same repo), this gives up a good amount of caching.

So instead I decided that the right thing to do here was to dynamically generate a Dockerfile in Python code, and build / push the image ourselves. S2I was great for generating a best-practices production container that runs one thing and does it well, but for binder we wanted to generate container images that captured complex environments without regard to what can run in them. Forcing s2i to do what we wanted seemed like trying to get a square peg into a round hole.
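As a rough illustration of that approach (the snippet templates and function name here are made up for this post, not repo2docker's actual code), composition falls out naturally once each detected environment contributes its own chunk of Dockerfile:

# Hypothetical sketch of "generate a Dockerfile in Python": each environment
# is an independent snippet, and snippets compose by concatenation.
CONDA_SNIPPET = """\
COPY environment.yml /tmp/environment.yml
RUN conda env update -n root -f /tmp/environment.yml"""

JULIA_SNIPPET = """\
COPY REQUIRE /srv/REQUIRE
RUN julia -e 'Pkg.init(); Pkg.resolve()'"""

def render_dockerfile(base_image, snippets):
    # Each snippet becomes its own set of layers, so repositories that share
    # a prefix of snippets can share Docker's build cache.
    parts = ["FROM {}".format(base_image)]
    parts.extend(snippets)
    parts.append('CMD ["jupyter-notebook", "--ip", "0.0.0.0"]')
    return "\n\n".join(parts)

print(render_dockerfile("ubuntu:16.04", [CONDA_SNIPPET, JULIA_SNIPPET]))

Because each environment's commands land in their own layers, similar repositories (or successive builds of the same repository) get back the layer caching that the single assemble-script approach gave up.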

So in this heavily squashed commit I removed s2i, and repo2docker became standalone. It was sad, since I really would have liked to not write extra code & keep leveraging s2i. But the code is cleaner, easier for people to understand and maintain, and composition works pretty well, in understandable ways, now that s2i is gone. So IMO it was the right thing to do!

I personally would be happy to go back to using s2i if we can find a clean way to support composability + caching there, but IMO that would make s2i too complex for its primary purpose of building images for a PaaS. I don’t see repo2docker and s2i as competitors, as much as tools of similar types in different domains. Lots of <3 to the s2i / openshift folks!

I hope this was a useful read!

TLDR

S2I was great for generating a best-practices production container that runs one thing and does it well, but for binder we wanted to generate container images that captured complex environments without regard to what can run in them. Forcing s2i to do what we wanted seemed like trying to get a square peg into a round hole.

Thanks to Chris Holdgraf, MinRK and Carol Willing for helping read, reason about and edit this blog post.

systemd simple containment for GUI applications & shells

I earlier had a vaguely working setup (systemd + sudo + shell scripts) for making sure browsers, shells and other applications don’t eat all the RAM / CPU on my machine.

It was a hacky solution, and also had complications when used to launch shells. It wasn’t passing in all the environment variables it should, causing interesting-to-debug issues. The sudo rules were complex, and hard to get right securely.

I had also been looking for an excuse to learn more Golang, so I ended up writing systemd-simple-containment or ssc.

It’s a simple golang application that produces a binary that can be setuid to root, and thus get around all our sudo complexity, at the price of having to be very, very careful about the code. Fortunately, it’s short enough (~100 lines) and systemd-run helps it keep the following invariants:

  1. It will never spawn any executable as any user other than the ‘real’ uid / gid of the user calling the binary.
  2. It doesn’t allow arbitrary systemd properties to be set, ensuring a more limited attack surface.

However, this is the first time I’m playing with setuid and with Go, so I probably fucked something up. I feel ok enough about my understanding of real and effective uids to use it myself for now, but not to recommend it to other people. Hopefully I’ll be confident enough to say that soon :)
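ssc itself is in Go, but the distinction it leans on is easy to illustrate in a few lines of Python (purely illustrative, not part of ssc):

import os

# Under a setuid-root binary, the *effective* uid becomes 0 (root's
# privileges) while the *real* uid stays that of whoever invoked it.
# ssc only ever spawns processes as that real uid.
real_uid = os.getuid()        # the user who actually ran this
effective_uid = os.geteuid()  # the privileges we're currently operating with
print("real uid:", real_uid, "effective uid:", effective_uid)

# Before exec'ing anything on the user's behalf, a setuid wrapper drops back
# to the real uid (gid first, since changing gid needs the elevated uid):
# os.setgid(os.getgid())
# os.setuid(real_uid)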

By using a real programming language, I also easily get commandline flags for sharing a tty or not (so I can use the same program for launching GUI & interactive terminal applications), the ability to pass all environment variables through (which can’t just be standard child inheritance, since systemd-run doesn’t work that way) & the ability to setuid (you can’t do that easily with a script).

I was sure I’d hate writing Go because of the constant if err != nil checks, but it hasn’t bothered me that much. I would like to write more Go, to get a better feel for it. This code is too short to really judge a language by, but I definitely hate it less :)

Anyway, now I can launch GUI applications with ssc -tty=false -isolation=strict firefox and it does the right thing. I currently have -isolation=strict and -isolation=relaxed available, the former performing stronger sandboxing (NoNewPrivileges, PrivateTmp) than the latter (just MemoryMax). I’ll slowly add more protections here, but keep to just two modes (ideally).

My Gnome Terminal shell command is now ssc -isolation=relaxed /bin/bash -i and it works great :)

I am pretty happy with ssc as it exists now. The only thing I still want is to be able to use it from the GNOME launcher (I am using GNOME3 with gnome-shell). Apparently shortcuts are no longer cool and hence pretty hard to create in modern desktop environments :| I shall keep digging!

systemd gui applications

Update: There’s a follow-up post with a simpler solution now.

Ever since I read Jessie Frazelle’s amazing setup (1, 2, 3) for running GUI applications in docker containers, I’ve wanted to do something similar. However, I want to install things on my computer - not in docker images. So what I wanted was just isolation (no more Chrome / Firefox freezing my laptop), not images. I’m also not as awesome (or knowledgeable!) as Jess, so I’ll naturally have to settle for less…

So I am doing it in systemd!

Before proceeding, I want to warn y’all that I don’t entirely know what I am doing. Don’t take any of this as security advice, since I don’t entirely understand X’s security model. Works fine for me though!

GUI applications

I started out using a simple systemd templated service to launch GUI applications, but soon realized that systemd-run is probably the better way. So I’ve a simple script, /usr/local/bin/safeapp:

#!/bin/bash
exec sudo systemd-run  \
    -p CPUQuota=100% \
    -p MemoryMax=70% \
    -p WorkingDirectory=$(pwd) \
    -p PrivateTmp=yes \
    -p NoNewPrivileges=yes \
    --setenv DISPLAY=${DISPLAY} \
    --setenv DBUS_SESSION_BUS_ADDRESS=${DBUS_SESSION_BUS_ADDRESS} \
    --uid ${USER} \
    --gid ${USER} \
    --quiet \
    "$1"

I can run safeapp /opt/firefox/firefox now and it’ll start firefox inside a nice systemd unit with a 70% Memory usage cap and CPU usage of at most 1 CPU. There’s also other minimal security stuff applied - NoNewPrivileges being the most important one. I want to get ProtectSystem + ReadWriteDirectories going too, but there seems to be a bug in systemd-run that doesn’t let it parse ProtectSystem properly…

Also, there’s an annoying bug in systemd v231 (which is what my current system has) - you can’t set CPUQuota over 100% (aka > 1 CPU core). This is annoying if you want to give each application 3 of your 4 cores (which is what I want). The next version of Ubuntu has v232, so my GUI applications will just have to make do with an aggregate of 1 full core until then.

The two environment variables seem to be all that’s necessary for X applications to work.

And yes, this might ask you for your password. I’ll clean this up into a nice non-bash script hopefully soon, and make all of these better.

Anyway, it works! I can now open sketchy websites with scroll hijacking without fear it’ll kill my machine!

CLI

I wanted each tab in my terminal to be its own systemd service, so they all get an equitable amount of CPU time & can’t crash the machine by themselves with an OOM.

So I have this script as /usr/local/bin/safeshell:

#!/bin/bash
exec sudo systemd-run \
    -p CPUQuota=100% \
    -p MemoryMax=70% \
    -p WorkingDirectory=$(pwd) \
    --uid yuvipanda \
    --gid yuvipanda \
    --quiet \
    --tty \
    /bin/bash -i

The --tty is the magic here: it does the right thing with passing the tty that GNOME Terminal provides all the way through to the shell. Now, my login command (set under profile preferences > command in gnome-terminal) is sudo /usr/local/bin/safeshell. In addition, I add the following line to /etc/sudoers:

%sudo ALL = (root) NOPASSWD:SETENV: /usr/local/bin/safeshell

This + just specifying the username directly in safeshell are both hacks that make me cringe a little. I need to either fully understand how sudo’s -E works, or use this as an opportunity to learn more Go and make a setuid binary.

To do

[ ] Generalize this to not need hacks (either with better sudo usage or a setuid binary)
[ ] Investigate adding more security related options.
[ ] Make these work with desktop / dock icons.

I’d normally have just never written this post, on account of ‘oh no, it is imperfect’ or something like that. However, that also seems to have gotten in the way of my ability to find joy in learning simple things :D So I shall follow b0rk’s lead in spending time learning for fun again :)

My Vim setup

I moved from emacs to vim a while ago, and have been steadily accumulating a series of plugins in my .vim. They’re all up in my rather messy dotfiles repo. Here’s a slightly more neatly organized list of the plugins I currently use:

  1. command-t – File opener and buffer switcher. In-fuckin-credibly useful.
  2. vimpress – What I use to blog since moving to wordpress.
  3. matchit – Lets % work with html tags
  4. commentary – Generic commenting and uncommenting script. NERDCommenter has since replaced commentary for me, due to being more flexible and having more options.
  5. fugitive – Incredibly awesome git wrapper for vim. I rarely go to the commandline for git these days
  6. tagbar – Useful code-exploratory plugin when I’m looking around a codebase trying to familiarize myself.
  7. supertab – Buffer completion in insert mode only when I need it.
  8. gist – Put stuff up in gist to pass it around
  9. BufClose – So I can close a buffer without messing up my splits
  10. extradite – :Glog replacement that builds on top of fugitive. I don’t understand why this isn’t bundled with fugitive.
  11. TwitVim – Yes, so I don’t have to go to the browser (and be consumed by chat/reddit/hn) just to post a tweet.
  12. ack.vim – Ack integration for vim. Do yourself a favor and use ack instead of grep.
  13. Syntastic – Automatic syntax checking so that I don’t miss a semicolon and not know about it
  14. php-doc – Insert boilerplate PHP doc compatible strings in my PHP files whenever I want to. Very PHP specific, need to find something that works across languages. (Note: This plugin has quite an identity crisis. It’s named PDV but its filename is php-doc. Since php-doc is more descriptive, I’m using that)
  15. delimitMate – Automatically closes quotes, parens, braces, etc for you. I initially thought this would be super annoying, but in fact it is rather pleasant.

I started out on the default desert color scheme – after trying out the wombat and jellybeans color schemes, I have settled on wombat for now. Suggestions welcome – both for the color scheme and for new/replacement plugins.

Suggestions for more plugins still welcome :)

This is the list as of 24 Aug 2011. Updated as of September 2 2011 (added 10, 11, 12 since last update). Updated as of September 5 2011 (changed 4, added colorscheme change). Updated as of September 12 2011 (added 13, 14, 15). I am moving quite fast, am I not? Will keep this post updated as and when things change.

Non-ASCII Characters in HTTP Headers

I was debugging an issue at work today where a (generated) file refused to download in Chrome, but the same URL worked just fine with wget. I remember reading in the HTTP Spec that HTTP headers can only be lower ASCII, so when wget mangled the output file’s name, the problem was obvious – the file name contained a character that wasn’t in lower ASCII (an accented A). Chrome had borked on encountering it, while wget soldiered on. Using iconv to strip non-ASCII characters in the file name on the server side fixed the issue.
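For reference, here's a rough Python sketch of the kind of server-side fix involved (the helper name and the filename are made up for illustration; the actual fix used iconv, and the filename presumably travelled in a Content-Disposition header):

import unicodedata

# Make sure the filename that goes into an HTTP header is plain ASCII,
# much like stripping it with iconv.
def ascii_safe_filename(filename):
    # Decompose accented characters (e.g. 'Á' -> 'A' + combining accent),
    # then drop anything that still isn't ASCII.
    decomposed = unicodedata.normalize("NFKD", filename)
    return decomposed.encode("ascii", "ignore").decode("ascii")

header_value = 'attachment; filename="{}"'.format(ascii_safe_filename("reportÁ.pdf"))
print(header_value)  # attachment; filename="reportA.pdf"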

Moral of the story? Read the RFCs! The HTTP one, in particular, is remarkably readable and you should read it if you’re doing non-trivial Web Development.

P.S: If I had had time, I’d have gone around testing this behavior in several user agents and documenting their behavior (and possibly submitting bug reports) – but .

 

You are a programmer if you can run code in your head

From HN:

From my experience, the largest hurdle first time programmers have is being able to execute programs in their head. It takes a cognitive leap to go from the source code in front of them, and what happens at runtime.

(emphasis mine)

I believe that if you have made that cognitive leap, you can call yourself a programmer. It means you’ve entered into the second hump – you are a programmer. One of us. Welcome :)

This is also why solving programming challenges at places like InterviewStreet Challenges, CodeChef, TopCoder, SPOJ, etc. increases your general programming skills – they require that you continuously run code in the interpreter in your head, which helps you train your procedural memory. It’s the same reason learning different language paradigms (OOP, Purely Functional, Procedural, etc) makes you a better programmer.

Reminds me that I have four more chapters in SICP to finish and Clojure to learn :)

 

SleepSort from 4Chan

4Chan isn’t exactly what you’d associate with ‘shit, that’s a cool piece of code!’, but look what I found!

(via HN)

Nice hack. And its running time is linear in the value of the largest number – i.e. exponential in its bit size – ignoring all the overhead of forking new processes, the race conditions (sort 0 and 0.1? How about 0.01 and 0.011?), processor contention, etc. They basically delegated the sorting to the process scheduler, treating it as a sort of priority queue.
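For the curious, the idea translates into a few lines of Python (a sketch of the same trick, not the original 4chan shell version):

import threading
import time

# Sleep sort: one thread per number, each sleeping proportionally to its
# value, so the scheduler hands them back in (roughly) ascending order.
# Values that are very close together can still race, as noted above.
def sleep_sort(numbers, scale=0.01):
    result = []
    lock = threading.Lock()

    def worker(n):
        time.sleep(n * scale)
        with lock:
            result.append(n)

    threads = [threading.Thread(target=worker, args=(n,)) for n in numbers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

print(sleep_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # usually [1, 1, 2, 3, 4, 5, 6, 9]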

Sometimes you just see a hack so out of the box that it makes you say ‘woah!’ and leaves your mouth hanging open for a while. This was one such time :)

Moving back to WordPress

A while back, I moved my blog (from a wordpress install whose data I lost) to my own platform (HiSlain).

After several close calls, I’m officially moving from HiSlain back to WordPress. I just don’t have enough time to maintain all the things I wanted in a blog platform in HiSlain, hence the move back. However, it served its purpose – I learnt to write code I could use, others could use, and others could contribute to. The old blog is still around (all permalinks still work) if you’re interested.

It’s good to be back :)

I miss writing posts in Vim though – expect to find/write a plugin that’ll let me blog/compose from Vim. Suggestions?

SICP and BrainFuck

No, I’m not going to blow your mind by telling you I’m going to do the SICP exercises in BrainFuck :)

I just wrote my first brainfuck program. It might as well be my last, but I’m not sure. Tape-based turing-ish programming seems fun, so maybe I’ll try that again with something more expressive than brainfuck (or maybe that is pointless?)

And I’m doing the SICP Exercises. In MIT-Scheme. I’ve put them up on GitHub. I’m doing this with the FPUG-C SICP Study Group, so I hope I’m able to complete it all this time :) Big props to Balaji for organizing it, and letting us get away with the free coke and coffee!