Yuvi Panda

JupyterHub | MyBinder | Kubernetes | Open Culture

Devlog 2018 12 26

Physical activity

I walked for about 6 miles yesterday evening, after doing 3-4 miles each day for about 2-3 days before. The roads were empty, and it was lovely. I’ve listened to maybe 6-10 hours of Deborah Frances-White in the last week or so, split between The Guilty Feminist and Global Pillage. Next time though, I’m going to try to listen to podcasts less & observe my surroundings more. I am doing this (plus PT) mostly to rehab my knee. I slept for 12h last night :)

Asyncio deadlock

I don’t have enough experience with the pitfalls of concurrent programming where you have to use synchronization techniques. I had my first deadlock today, and after a few minutes realized it was actually an infinite recursion from a typo. I need to get a better theoretical understanding of both event loop based async programming and synchronization methods. I’ve used them with threading & Java, but feel shaky in asyncio.

simpervisor now has 100% unit test coverage. But it combines asyncio, processes & signals - so that doesn’t give me as much confidence as it might have otherwise. I’ll take it though :)

VSCode customization

I’ve been re-reading The Pragmatic Programmer (after initially reading it about 12 years ago). A lot of it still holds up, although some stuff is dated (love for Broken Window policing & Perl). It reminded me that I hadn’t really spent much time customizing VSCode to be more productive in it, so I am spending time today doing that.

I switch between vscode and my terminal quite a bit, mostly for git operations and running tests. I’m going to see if I can stay inside vscode comfortably for running pytest based tests.

It made me very sad that the only thing I wanted from the pytest integration in vscode is something the maintainers aren’t actively working on - a shortcut to run the test the cursor is currently at. I also can’t seem to see test output directly.

I think I’ll be using the Terminal for pytest runs for now. I’m not even going to try with git.

If I was younger and not already full of projects to do, I’d have picked this up and tried to make a PR for it. Boo time commitments.

Readiness check

I rallied late in the day & wrote some code around readiness checks in simpervisor. I don’t fully understand what I want it to do, but I’m going to look at what kubernetes does & try to follow that. Since this is being written for nbserverproxy, I am also going to try to port nbserverproxy to simpervisor to see what kind of API affordances I’m missing. Primarily, I feel there should be a lock somewhere, but:

  1. I don’t have a unified theory of what needs locking & why
  2. I don’t know if I need a lock or another synchronization mechanism
  3. I don’t know how it’ll actually be used by the application

So designing by porting nbserverproxy seems right.
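
A rough sketch of what a kubernetes-style readiness check could look like on asyncio, mostly to have something concrete to poke at. Everything here is an assumption for illustration: `wait_until_ready`, its `timeout` / `interval` parameters and the fake check are made-up names, not simpervisor's API. It just polls an async check until it succeeds or time runs out, and treats an exception like a failed probe.

```python
import asyncio
import time


async def wait_until_ready(check, timeout=5.0, interval=0.1):
    """Poll `check` until it returns True or `timeout` seconds pass.

    `check` is any zero-argument coroutine function. An exception from it
    is treated as "not ready yet", loosely like a failing probe.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if await check():
                return True
        except Exception:
            pass  # not ready yet, try again after the interval
        await asyncio.sleep(interval)
    return False


def make_fake_check(ready_after):
    """Hypothetical check that fails `ready_after` times, then succeeds."""
    calls = {"n": 0}

    async def check():
        calls["n"] += 1
        return calls["n"] > ready_after

    return check
```

Something like `asyncio.run(wait_until_ready(make_fake_check(3), interval=0.01))` would exercise it. Note there is no lock anywhere in this, which is exactly the part I haven’t figured out.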

Devlog 2018 12 24

Gracefully exiting asyncio application

Continuing yesterday’s work on my simple supervisor library, I kept trying to propagate signals cleanly to child processes before exiting. I remembered that it isn’t enough to just propagate signals - you also have to actually reap the child processes. This meant waiting for the wait calls on them to return.

I had a task running concurrently that was waiting on these processes. So ‘all’ I had to do was make sure the application does not exit until these tasks are done. This turned out to be harder than I thought! After a bunch of reading, I recognized that what I needed to do was wait for all pending tasks before actually exiting the application.

This also must be done at the application level rather than in the library - you don’t want libraries doing sys.exit, and definitely don’t want them closing event loops.

After a bunch of looking and playing, it looks like what I want in my application code is something like:

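import asyncio

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        # Wait for every task still pending on this loop (e.g. the ones
        # reaping child processes) before closing it.
        # asyncio.all_tasks(loop) needs Python 3.7+.
        loop.run_until_complete(asyncio.gather(*asyncio.all_tasks(loop)))
        loop.close()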

This waits for all tasks to complete before exiting, and seems to make sure all child processes are reaped. However, I have a few unresolved questions:

  1. What happens when one of the tasks is designed to run forever, and never exit? Should we cancel all tasks? Cancel tasks after a timeout? Cancelling tasks after a timeout seems most appropriate.
  2. If a task schedules more tasks, do those get run? Or are they abandoned? This seems important - can tasks keep adding more tasks in a loop?
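
For the first question, here is a minimal sketch of ‘cancel after a timeout’, assuming Python 3.7+. `drain_tasks` and `demo` are hypothetical names for illustration, and I am not sure this is the right shape for a real application: it takes a snapshot of the other tasks, waits a bounded time, then cancels whatever is left and waits for the cancellations to finish.

```python
import asyncio


async def drain_tasks(timeout):
    """Wait up to `timeout` seconds for all other tasks, then cancel stragglers.

    Returns how many tasks had to be cancelled.
    """
    current = asyncio.current_task()
    pending = {t for t in asyncio.all_tasks() if t is not current}
    if not pending:
        return 0
    done, still_pending = await asyncio.wait(pending, timeout=timeout)
    for task in still_pending:
        task.cancel()
    # Give cancelled tasks a chance to run their cleanup code.
    await asyncio.gather(*still_pending, return_exceptions=True)
    return len(still_pending)


async def demo():
    async def quick():
        await asyncio.sleep(0.01)

    async def forever():
        while True:
            await asyncio.sleep(0.01)

    # Keep references so the tasks are not garbage collected early.
    tasks = [asyncio.create_task(quick()), asyncio.create_task(forever())]
    return await drain_tasks(timeout=0.2)
```

This does not answer the second question: tasks that get scheduled after the snapshot is taken are not covered, so a task that keeps adding tasks would need the drain to run in a loop.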

I am getting a better handle on what people mean by ‘asyncio is more complicated than needed’. I’m going to find places to read up on asyncio internals - particularly how the list of pending tasks is maintained.

This series of blog posts and this EuroPython talk from Lynn Root helped a lot. So did the talk on asyncio internals by Saúl Ibarra Corretgé (one of the asyncio core devs).

Testing

Testing code that involves asyncio, signals and processes is hard. I attempted to do so with os.fork, but decided that is super-hard mode and I’d rather not play. Instead, I wrote the child’s Python code verbatim, spawn it as a subprocess, and use stdout to communicate back to the parent process. The child process’ code itself is inline in the test file, which is terrible. I am going to move it to its own file.

I also added tests for multiple signal handlers. I’ve been writing a lot more tests in the last few months than I was before. I credit Tim Head for a lot of this. It definitely gives me a lot more confidence in my code.

DevLog 2018 December 23

I have enjoyed keeping running logs of my coding work (devlogs) in the past, and am going to start doing those again now.

This ‘holiday’ season, I am spending time teaching myself skills I sort of know about but do not have a deep understanding of.

JupyterLab extension

I spent the first part of the day (before I started devlogging) working on finishing up a jupyterlab extension I started the day before. It lets you edit notebook metadata. I got started since I wanted to use Jupytext for my work on publishing mybinder.org-analytics.

TypeScript was easy to pick up coming from C#. I wish the Phosphor / JupyterLab code had more documentation though.

I ran into a bug. While following instructions to set up a JupyterLab dev setup, I somehow managed to delete my source code. Thankfully I got most of it back thanks to a saved copy in vscode. It was a sour start to the morning though.

I’ll get back on to this once the sour taste is gone, and hopefully the bug is fixed :)

asyncio: what’s next | Yury Selivanov @ PyBay 2018

I’ve been trying to get a better handle on asyncio. I can use it, but I don’t fully understand it - I am probably leaving bugs everywhere…

From one of the asyncio maintainers. Gave me some impetus to push the default version of Python on mybinder.org to 3.7 :D I’m most excited about getting features from Trio & Curio into the standard library. Was good to hear that nobody can quite figure out exception handling, and not just me.

I discovered aioutils while searching around after this. I’ve copy pasted code that theoretically does the same things as Group and Pool from aioutils, but I’ve no idea if they are right. I’ll be using this library from now on!

Processing Signals

I’m writing a simple process supervisor library to replace the janky parts of nbserverproxy. It should have the following features:

  1. Restart processes when they die
  2. Propagate signals appropriately
  3. Support a sense of ‘readiness’ probes (not liveness)
  4. Be very well tested
  5. Run on asyncio

This is more difficult than it seems, and I am slowly working my way through it. (1) isn’t too difficult.
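
To show why (1) is the easy part, here is a minimal sketch of a restart loop on asyncio. `run_with_restarts` and `max_restarts` are hypothetical names for illustration, not the library's API; a real supervisor would presumably loop forever and back off between restarts.

```python
import asyncio
import sys


async def run_with_restarts(argv, max_restarts):
    """Run `argv`, starting it again each time it exits.

    Stops after `max_restarts` restarts and returns the exit codes seen.
    Awaiting proc.wait() is also what reaps the child.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        proc = await asyncio.create_subprocess_exec(*argv)
        exit_codes.append(await proc.wait())
    return exit_codes
```

For example, `asyncio.run(run_with_restarts([sys.executable, "-c", "import sys; sys.exit(3)"], max_restarts=2))` starts the child three times.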

(2) is a fair bit more difficult. atexit is useless since it doesn’t do anything with SIGTERM. So I need to manage my own SIGTERM handlers. However, this means there needs to be a centralish location of some sort that decides when to exit. This introduces global state, and I don’t like that at all. But unix signals are global, and maybe there’s nothing for me to do here.

I initially created a Supervisor class that holds a bunch of SupervisedProcess objects, but it was still calling sys.exit itself. Since signals are global, I realize there’s no other real way to handle this, and so I made a global handler setup too. This has the additional advantage of being able to remove handlers when a SupervisedProcess dies, avoiding memory leaks and stuff.
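
A sketch of what such a global handler setup could look like, assuming a Unix event loop running in the main thread. The registry and the `add_handler` / `remove_handler` names are made up for illustration and are not the library's actual code: one real handler per signal is installed on the loop, and it fans out to whatever callbacks are currently registered.

```python
import asyncio
import os
import signal

# Module-level (global) registry: signal number -> list of callbacks.
_handlers = {}


def _dispatch(signum):
    """Call every callback currently registered for `signum`."""
    for callback in list(_handlers.get(signum, [])):
        callback(signum)


def add_handler(loop, signum, callback):
    """Register `callback`; install the real handler on first use."""
    if signum not in _handlers:
        _handlers[signum] = []
        loop.add_signal_handler(signum, _dispatch, signum)
    _handlers[signum].append(callback)


def remove_handler(signum, callback):
    """Unregister `callback`, e.g. when a supervised process dies."""
    _handlers[signum].remove(callback)


async def demo():
    loop = asyncio.get_running_loop()
    seen = []
    got_signal = asyncio.Event()

    def first(signum):
        seen.append("first")
        got_signal.set()

    def second(signum):
        seen.append("second")

    add_handler(loop, signal.SIGUSR1, first)
    add_handler(loop, signal.SIGUSR1, second)
    remove_handler(signal.SIGUSR1, second)  # pretend its process died

    os.kill(os.getpid(), signal.SIGUSR1)  # signal ourselves
    await asyncio.wait_for(got_signal.wait(), timeout=5)

    # Clean up so nothing outlives the loop.
    loop.remove_signal_handler(signal.SIGUSR1)
    _handlers.pop(signal.SIGUSR1, None)
    return seen
```

The demo uses SIGUSR1 rather than SIGTERM so that a mistake doesn’t kill the test run. Deciding when to actually exit would still live in the application, not in this registry.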

Testing this stuff is hard!

I also need to make sure I don’t end up with lots of races. I’m still writing concurrent code, even without threads. Gotta be careful. Especially with signals thrown in. Although I guess once you get a SIGTERM or SIGINT, inconsistent state is not particularly worrisome.

DevLog for 21-30 Dec 2015

Clearly I missed an entire week. I need to build a better system to make this easier…

Random notes.

  • Kicked out NFS from the Tool Labs proxies (with 1). Yay! This hopefully explains the lockup of tools-proxy-01 yesterday night, maybe? It’s been restarted since, and I hope to no longer have instances randomly locking up on me. Infrastructure standards of 2009, here we come! :D I’ve also removed NFS from tools-redis, and migrated them to Jessie as well.
  • Fixed up all the races in how kubernetes workers are set up with 2
  • Another instance is ‘stuck’ again. Sigh. AAAAAAAAAAAAAAAAAAAAAAAAA. Paravoid helped debug this, tracking it down to NFS client issues in the 4.2 kernel (See phab). I moved k8s nodes back to a working 3.19 kernel (after filing issue about the other 3.19 kernel package I tried that didn’t work).
  • Moved the tools proxies to 4.2 (lol?) after finding out huge ksoftirqd spikes in them. Let’s see if that improves things
  • I split up the individual components of PAWS, and have a working nbserve in there now! Exciting times. Need to fixup nbserve to use traitlets for config
  • Big Tool Labs outage (again!). Some tool accidentally sent about 12million job requests and crashed gridengine’s underlying backing store (BerkeleyDB). Reset to a clean slate after many hours (Thanks Coren!) and mostly things are back up now. I’m reading through the Berkeley DB reference manual now.
  • Persistence failed for ores’s redises again, mostly because vm_overcommit was turned off. Fixed in the core redis module so it does not happen again.

DevLog for 2015 Dec 20

Probably going to take it easy and chill. Already sent a trivial PR up though.

Also saw the old devlog mailing list and feeling happy memories. Clearly need to bring something back like that, but I don’t know / think mailing lists are the best medium. More thinking!

Ended up learning some Tornado and wrote nbserve to serve rendered notebooks and static files in a configurable way. I should refactor PAWS to have separate jupyterhub, proxy and nbserve pods tomorrow. Also need to test for path traversal attacks and whatnot. Also coroutine based programming is a lot easier than I had originally thought! woo!

DevLog for 2015 Dec 19

  • Today looks like a day of finding, reporting and fixing bugs in WikiChatter. I had made a stupid mistake yesterday that meant not all of the Teahouse pages were being parsed, and I immediately started running into bugs. Have reported (and fixed!) two (1 and 2), and ran into another bug in mediawikiparserfromhell itself. I’m sure I’ll find more.

DevLog for Jan 5, 2015

Trying out the ByWord app, since it has MarkDown support and also publishes to WordPress. Paid Rs. 900 for it, let’s hope it is useful. Hate the default font, though.

  • Need to parse View definitions on labsdb public databases and verify that we aren’t leaking any info (https://phabricator.wikimedia.org/T85783). Was going to use the sqlparse library, but that doesn’t seem very complete or useful. Will attempt to use regexes now. Wish I had learnt how to build parsers properly.
  • Ignore lament ^, I spent some time learning to use PyParsing, and ended up with a very simple parser that parsed the small subset of SQL I cared about! Yay :) Still need to set aside some time at some point to fully understand parsers/parsing.
  • Fixed wikidata beta not resolving (https://phabricator.wikimedia.org/T85793). Of course, the problem was caused by me as well, not verifying that DNS changes I was making didn’t have unintended consequences. It’s also somewhat surprising that it was down for 6 days before someone noticed.
  • Added a labsdb-auditor report that verifies views have definitions that do not expose private information (https://gerrit.wikimedia.org/r/#/c/182848/). This was fun to write. I should definitely learn more parsing things and seek out other simple things I can write a parser for. Code quality is getting better, although could still be better if I can remove the regex part entirely – it currently cleans up the SQL via a regex and then parses it with pyparsing.
  • Fixed a consistency bug in the Perl script that maintains public view replicas (https://gerrit.wikimedia.org/r/#/c/182819/).

Reading list:

  • What most young programmers need to learn was nice. I just feel lucky to have been gifted a copy of Code Complete and Pragmatic Programmer when I was 15 (by Colinizer).

DevLog for Jan 3-4 2015

Nothing much going on…

  • Continue reading Real World Haskell. This is indeed expanding my horizons, but I haven’t found any project I could use haskell in yet. Will keep looking.
  • Helped fix a parsoid outage yesterday. No biggie. I still do not understand fully how parsoid works / how it is set up, should do.
  • Spent most of the rest of the time in ‘weekend mode’, which includes mostly not doing things. That went mostly well.
  • Watched ‘Pink Floyd The Wall’ movie. HOLY FUCK MY MIND IT IS GONE OMG IT IS QUITE REALLY NICE. Recommended.

I should probably write a ‘Year in Review 2014’ post, but so much of that is going to be NSFW and private that there probably is no point… ;)

DevLog for Jan 2, 2015

Woohoo, 2015! :) Let’s see if I can keep this going this time :)

  • Have started taking weekends slightly more seriously. My brain feels less overwhelmed after that. Should take weekends even more seriously going forward.
  • Merged labsdb-auditor rewrite patch. Code clean enough for my tastes now, although not documented enough. Lots of back and forth CR with valhallasw was really nice – it’s something I’ve been missing for the past few months. Debating over style and cleanliness was also interesting. Felt like I had been taking code cleanliness and proper design not as seriously as I should have, and this is now time to change that. Also spent time consciously thinking about the commit message I was writing, and shall continue doing that. I think I’ve taken a few things for granted wrt code quality and design, and should be more careful. And learnt a few more tricks about decorators (haven’t built decorators that are parameterized before).
  • Read a few chapters of Real World Haskell. Quite nice, although the syntax feels a bit too dense for my tastes. Still should maybe write things with ChickenScheme instead :)
  • Continuing re-reading Coders at Work, which continues to be a thought provoking, wonderful book.

DevLog Sep 12-24, 2014

That was a long break. Should get into better habits. I’m sure I’d miss some here. Oh well.

It was a very interesting week, culminating in some very interesting things at Trafalgar Square. I feel much better as a person and calmer/more chilled out now :) Should remember to take weekends off.

  • Helped the WikiMetrics people with some puppet changes. They were requiring mysql directly instead of using the class we have in ops/puppet, which meant it put data in the /var partition instead of in /srv. This was because they want to be a ‘proper’ independent puppet module, not depending on the other modules in ops/puppet (since the module is used in vagrant as well). Puppet has terrible support for this kind of thing, and ops/puppet by itself also suffers from an incredible amount of NIH syndrome when it comes to modules. Some of the NIH-so-rewrite is justified, but I don’t think all of it is. Oh well. Hopefully something that can be fixed over time…
  • Added basic icinga/graphite based monitoring for contint project (runs our jenkins installation). Quite trivial, but should be ported over to shinken at some point soon.
  • Contributed a fair bit to the Shout Web IRC project. I was looking around for an open source replacement to IRCCloud, and found a bunch of OSS projects (Subway and http://ircanywhere.com/ being the others). Shout seemed the most active (and simplest) of them all – I couldn’t actually get the others to a working installation. I fixed a bunch of code formatting issues, and also added a couple of features (Desktop notifications being the most significant, I think). The maintainer was fun to work with, and merged most of my changes quite promptly. I hang out on #shout-irc channel now, and activity seems to be speeding along. Should be fun to see how it goes :)
  • Moved off IRCCloud. Stopped paying for it. I’ve also stopped doing email on my phone, in an effort to be more minimalistic / less distracted all the time. This meant no more IRC on phone, so my primary reason for using IRCCloud disappeared. Am back to using LimeChat with my ZNC bouncer now. And nothing on my phone. In case of emergencies, I still have the IRCCloud app on my phone, and can use the free edition to connect to my bouncer temporarily.
  • Started working on labs-specific monitoring, with shinken – test instance at shinken.wmflabs.org. Discussion in the ops list about how to best implement this. Shinken is supposed to horizontally scale much better than icinga since it is implemented as a bunch of different daemons that can talk over the network. Right now they are just one machine, but will probably expand at some point not too far away. They are running as a labs project rather than on a physical machine (labmon1001) because it is much easier from a networking perspective. Right now it only has a simple http check for betacluster, but more coming.
  • Started refactoring all our icinga code. Separating it out into a nagios_common module for config that can be shared between shinken and icinga, and into an icinga module for everything else. Moved all the custom checks and check config we have, moving other things as we go. This is a bit hard / frustrating since I don’t have access to the machine icinga runs on (neon), and also because I still do not have +2 in our ops repo (should change in a month, tho!). Am about 40 patches in (small ones!), let’s see how many more it takes!
  • Moved our dsh code into a module. Was fairly trivial. Now to find ways to get it merged :)
  • Started trying out pinboard. Interesting, barebones design. The bookmarklet is a bit slow to load, though. Let’s see how often I use this. (I did pay for it)

Five more days left in the UK :( Let me see if I can continue doing these when here, in a more prompt manner.