Announcing the nbgitpuller Link Generator browser extensions

(Leave comments or discuss this post on the jupyter discourse)

nbgitpuller is my favorite way to distribute content (notebooks, data files, etc) to students on a JupyterHub. The student mental model is ‘I click a link, and can start working on my notebook’, which is as close to ideal as we have today. This is possible because all the information required for this workflow is embedded in the link itself - so it can be distributed easily via your pre-existing communication channels (email, course website, etc), rather than requiring your students to use yet another tool.
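For example, a link for a hypothetical hub and repository looks something like this (the query parameters are URL-encoded in practice, and all the names here are made up for illustration):

https://hub.example.edu/hub/user-redirect/git-pull?repo=https://github.com/example-org/course-materials&branch=main&urlpath=tree/course-materials/lab1.ipynb

The hub, the repository, the branch, and the path to open are all right there in the query string.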

However, creating these links has often been a bit clunky and error-prone. The current link generator is pretty awesome, but requires a lot of manual copy-pasting, and is prone to errors. GitHub switching the default branch from master to main was particularly problematic, and tripped up many instructors.

To make life easier, I’ve now written a browser extension that lets you create these links straight from the GitHub interface!

On the GitHub page for files, folders and repositories, it adds an ‘nbgitpuller’ button.

nbgitpuller button

On clicking this, you can enter a JupyterHub URL and the application you want to use to open this file, folder or repository. Then you can just copy the nbgitpuller URL, and share it with your students!

nbgitpuller popover

The JupyterHub URL and application you choose are remembered, so you do not need to enter them over and over again.

How to install?

On Mozilla Firefox

I ❤️ Mozilla Firefox, and you can install the extension there easily from the official addons store. You’ll also get automatic updates with new features and bug fixes this way.

On Google Chrome

On Google Chrome / Chromium, I have submitted the extension to the Chrome Web Store - but apparently there is a manual review process, and it can take weeks. In the meantime, you can install it with the following steps:

  1. Download the .zip version of the latest release of the extension. You want the file named nbgitpuller_link_generator-<version>.zip.
  2. Extract the .zip file you downloaded.
  3. In your Google Chrome / Chromium, go to chrome://extensions.
  4. Enable the Developer Mode toggle in the top right. This should make a few options visible in a new toolbar.
  5. Select Load Unpacked, and choose the directory into which you extracted the .zip file. This directory should contain at least a manifest.json file from the .zip file.

Other browsers

The extension is written as a WebExtension, so it should work easily in other browsers - Safari, Edge, etc. However, since I do not use those browsers myself, I don’t have instructions for installing it there. If you do use them, I’d love contributions on how to install the extension there!

Contribute!

The extension was an overnight hack, and it made me very happy to get it shipped. There is a lot of room for improvement - I would love for you to provide feedback and contribute in any way possible on GitHub.

Learning R

I’m paying money to learn data science with R via this Johns Hopkins course on Coursera. Hopefully I can spend about 4 hours a week on it.

Why?

I try to have a ‘T’ shaped set of competencies - some competence in a lot of areas, and deeper knowledge in some. Having some competence in many areas lets you notice where a particular skill can be very helpful in solving a problem, and either apply it yourself or bring in someone who can. Interesting problems are found in intersections, and learning this will give me access to more intersections :)

It is also an extremely useful skillset - I am sure I can apply it to more problems than I can apply (for example) cloud computing knowledge.

I’ve been building tools for data scientists for a while now, but since I don’t actually know much about data science itself my effectiveness is limited.

Why this course?

I went to coursera.org, typed ‘data science’ and this was the first that showed up :D

Why R?

I already know Python as a programming language, which sometimes makes it difficult for me to learn data science via it. Many courses targeted at people with my level of skill in data science / stats also teach some Python alongside, and I often found that distracting.

With my JupyterHub contributor hat on, I think it’s extremely important that R is a first class citizen in all the teaching & research tools I build. Getting some experience using it will help in this goal.

Why this blog post?

Just as a form of external accountability.

The many ways to access your JupyterHub environment

JupyterHub was designed to provide multi-user access to Classic Jupyter Notebook and JupyterLab, running on machines far away with access to data & compute your local machine doesn’t have. However, this leaves out a lot of users - R users on RStudio, folks who prefer ssh based interfaces, people on local IDEs like vscode or pycharm, etc. We trust that these users know better than ‘us’ developers & admins what user interface works best for them, and should make sure that our JupyterHub setups work for them.

I propose that we try to provide the following in all JupyterHub setups, but particularly in cloud-based deployments focused on research.

  1. Arbitrary web applications in the browser, not just classic notebook & JupyterLab. This could be RStudio / Shiny for R folks, Pluto.jl for Julia folks, Theia IDE or vscode for arbitrary languages, etc. These need to be first class citizens with amazing support. jupyter-server-proxy does a lot of this already - but still needs to be ‘set up’ in your deployment (see the configuration sketch right after this list). DO IT!

  2. ssh support, so you can ssh yuvipanda@myhub.org, and have it work as you would expect. It should have the same environment & home directories as your browser based sessions. Along with something like kbatch, this could let folks who like their current HPC style workflows keep it. SFTP for file transfer, SSH Tunneling, scripting via command execution, etc should also come along with this, allowing re-use with a vast array of existing tools & workflows.

  3. Connecting from local IDEs, for notebook execution. The Jupyter protocol is designed to work with a variety of frontends, and many IDEs support connecting directly to a Jupyter kernel - Visual Studio Code, PyCharm, spyder, emacs, etc. There is already decent-ish support for this, but it would be nice to explicitly document, bugfix and promote this to users.

  4. Connecting from local IDEs for a full remote development environment. Connecting to a remote Jupyter kernel often doesn’t give you all that you want from your IDE - no access to the remote filesystem, limited access from other IDE features (version control, global find, etc). Many IDEs offer a ‘remote development’ experience that gives you all this, usually requiring just a few SSH features to work. VSCode’s remote development feature is very well done and popular, though not open source - we should support it anyway. PyCharm supports this, and Emacs’ TRAMP is highly rated as well.
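To make point 1 concrete, here is a minimal configuration sketch for jupyter-server-proxy: you declare the processes it should start and proxy. The rserver flags below are an assumption on my part - treat this as a starting point, not a tested recipe (in practice, the jupyter-rsession-proxy package packages this up for you):

# In jupyter_server_config.py - proxy RStudio's rserver via jupyter-server-proxy
c.ServerProxy.servers = {
    'rstudio': {
        # {port} is filled in by jupyter-server-proxy at launch time
        'command': ['rserver', '--www-port={port}'],
        'timeout': 30,
        # Adds an 'RStudio' entry to the JupyterLab launcher
        'launcher_entry': {'title': 'RStudio'},
    }
}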

You might not need some of these in your deployment - for example, you might have no R users, rendering RStudio not very useful. However, I think the world is getting more and more polyglot, and we need to move with it. Consider deploying these even if your users don’t explicitly ask for them - I suspect many will use them once they are there. <3

Work to be done

There’s a lot of work to be done here to make this all a great experience.

  1. Documentation and publicity for jupyter-server-proxy. For example, this README is all the docs that exist on how to set up RStudio to work in your JupyterHub. This is low hanging fruit with huge impact. Same for running remote desktop applications, Pluto.jl, and many many others.
  2. Work on jupyterhub-ssh. Aims to provide full SSH support - terminal emulation, command execution, tunneling and SFTP. This would go a long way in supporting (2) and (4) above. It sortof-kinda works now, but effort poured into it will be well spent. I long for a day when setting this up would be as simple as setting up z2jh or TLJH, and it actually isn’t that far away if effort is put into it.
  3. More outreach to non-Jupyter communities about leveraging JupyterHub for their workflows. The name ‘Jupyter’ in ‘JupyterHub’ is no longer fully accurate, given all the other things that can be run with it. We should reach out to other communities and ‘bring them into the fold’, so to speak - so their workflows, desires and bugs can drive development.
  4. A unified setup guide for ‘research hubs’, that guides first-time admins through setting these up - including great documentation. A z2jh for research hubs on the cloud would be extremely well received I believe.

Will you help?

I’ve been talking to Chris Holdgraf a lot on how to go about getting folks into doing stuff like this. I see two primary options:

  1. Get funding from various entities (foundations, universities, corporations, etc) for folks to do this. Hopefully 2i2c can be a vehicle for this.
  2. Pour a lot more effort into getting more new developers into JupyterHub development than we are now. This would mean things like participating in Google Summer of Code, Outreachy, development sprints in various conferences, etc.

Would love to hear thoughts from you, dear reader :) Email yuvipanda@gmail.com.

kbatch: sbatch, but for kubernetes

Submitting ‘non-interactive’ jobs that can run independently of a user’s current session on a JupyterHub is a pressing problem. There are many, many solutions to this, but I think there’s a lot of value in having an extremely simple solution that solves exactly one well defined problem.

We will only consider JupyterHubs running on top of Kubernetes, since that has all the actual capabilities we need to solve this problem. We only need to have a UX wrapper around it.

Problem to solve

If my JupyterHub session is terminated, all the code running there dies - including any dask sessions I might have - even though they are handled externally by dask-gateway.

This is trivially accomplished in batch scheduling systems like slurm, but JupyterHub doesn’t have a simple paradigm for it. It would be very, very useful for it to have one.

Stealing from slurm

Slurm has the sbatch command for submitting jobs, and the squeue command for viewing job information. They have a large number of options that can do extremely complicated things - but I think we can copy a tiny subset of their functionality to solve our specific problem. We will not be copying all their behaviors, nor trying to be compatible with them - just be inspired.

kbatch

kbatch would be an equivalent to sbatch - just submitting jobs. You can specify the resources your job needs (CPUs, memory, GPUs, etc), give it a script to execute, and it’ll just submit the job and return. Nothing more! At least to begin with, it wouldn’t have a lot of the complex features sbatch has - no array tasks, no multi-step jobs, etc. It would just be a script (or a notebook) that can run independently of the current notebook session. If you want parallelism, you should use dask from that script.

Other names for this could be something like ksubmit or kbackground or similar - the core thing it does is to submit jobs.

Options this command would have:

  1. Script (or notebook to run)
  2. Resources it needs (CPU, memory, GPU, etc)
  3. Where to send stdout / stderr (by default, mimic sbatch behavior and put it on your homedir)
  4. Job name / comment for you to keep track of it

Users shouldn’t have to specify environment or home directories - these should be automatically detected to match the current JupyterHub notebook environment. We could probably allow #KBATCH directives in the script, similar to #SBATCH directives - but more conversation with people who actually use this is needed :)
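To make that concrete, a purely hypothetical invocation (none of these flags exist yet - this is just the shape I have in mind) might look like:

kbatch --name my-analysis --cpu 2 --memory 8G --output ~/my-analysis.log analyze.py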

kqueue

kqueue would let you check in on the status of your jobs - whether they are running, stopped, failed, etc.
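Since all the meta-information would live on the Kubernetes Job objects (see Implementation below), kqueue could be a thin wrapper over a label-filtered list call. A sketch with the official kubernetes python client - the label key and namespace are hypothetical:

from kubernetes import client, config

# Hypothetical core of kqueue: list this user's Jobs, selected via labels
config.load_kube_config()
jobs = client.BatchV1Api().list_namespaced_job(
    'default', label_selector='kbatch.dev/user=yuvipanda'
)
for job in jobs.items:
    print(job.metadata.name, 'active:', job.status.active,
          'succeeded:', job.status.succeeded, 'failed:', job.status.failed)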

That’s it. I think just with these two, we can provide users the ability to run background jobs, from the commandline, similar to how they do so with slurm. We can eventually write GUIs for this fairly easily, since the functionality provided by them is quite simple.

Extra features?

We can easily add more features - tailing logs, getting a shell into your running job to poke around, array jobs, etc. But it would be very nice if we can keep it a simple tool that can be marked complete once it solves the very specific problem it set out to solve.

Implementation

Kubernetes Jobs provide literally everything we need. We will write a simple python application that uses the kubernetes API to do everything. All meta-information will be stored as labels in the Kubernetes Job object, and the python application will be responsible for providing the appropriate UX to our users.
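A rough sketch of what job submission could look like with the kubernetes python client - all names, label keys and defaults here are illustrative, not a spec:

from kubernetes import client, config

def submit_job(name, image, command, cpu='1', memory='1G'):
    # Hypothetical core of kbatch: wrap a user's script in a Kubernetes Job
    config.load_kube_config()
    job = client.V1Job(
        api_version='batch/v1',
        kind='Job',
        metadata=client.V1ObjectMeta(
            name=name,
            # Meta-information is stored as labels on the Job object
            labels={'kbatch.dev/job-name': name},
        ),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy='Never',
                    containers=[client.V1Container(
                        name='job',
                        # Should be auto-detected to match the user's hub environment
                        image=image,
                        command=command,
                        resources=client.V1ResourceRequirements(
                            limits={'cpu': cpu, 'memory': memory},
                        ),
                    )],
                ),
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job('default', job)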

Caveats

This is extremely simple, and isn’t meant for more complex DAG use cases. For those, look at something like Airflow or Prefect or any of the million other things that exist. This is a small UX improvement over kubectl, basically.

Now, let’s go do it!

Setting up a "Production Ready" TLJH

The Littlest JupyterHub is an extremely capable hub distribution that I’d recommend for situations where you expect, on average, under 100 active users.

Why not Kubernetes?

The primary reason to use Zero to JupyterHub on k8s over TLJH in cases with a smaller number of users would be to reduce costs - Kubernetes can spin down nodes when not in use. However, you’ll always have at least one node running - for the hub / proxy pods. The extra complexity that comes with it - particularly around needing to build your own docker images - is not worth it. TLJH works perfectly well for these cases!

What is ‘production’?

A JupyterHub that can run securely, without lots of intervention from the person who created it, is what I’ll call a production-ready JupyterHub. It’s a pretty arbitrary standard. In this blog post, I’ll lay out what I want in the TLJH hubs I run before I let users on them.

Authentication

Use a real Authenticator, not the default FirstUseAuthenticator. The default authenticator is pretty insecure, and should really not be used. If you don’t know what to use, I suggest the Google or GitHub authenticators.
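With TLJH, switching authenticators is a few tljh-config calls from an admin terminal. For GitHub it looks roughly like this - double-check the TLJH documentation for the exact keys for your version:

sudo tljh-config set auth.type github
sudo tljh-config set auth.GitHubOAuthenticator.client_id '<your-client-id>'
sudo tljh-config set auth.GitHubOAuthenticator.client_secret '<your-client-secret>'
sudo tljh-config reload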

HTTPS

Enable HTTPS - an absolute requirement, and TLJH makes it quite easy. You do need to get a domain for this to work, which can be a source of friction. Totally worth it, and many authenticators require it too.
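With a domain in hand, the Let’s Encrypt setup is just a few commands (substitute your own email & domain):

sudo tljh-config set https.enabled true
sudo tljh-config set https.letsencrypt.email you@example.com
sudo tljh-config add-item https.letsencrypt.domains hub.example.edu
sudo tljh-config reload proxy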

Resource Limits

In many systems, a single user can often write code that accidentally crashes the whole system. By default, TLJH protects against this by enforcing memory limits. I think the default is a 1G limit, but you should tune it to fit your own users. TLJH has more documentation on how to estimate your VM size based on your expected usage patterns.
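Tuning the limit is a single tljh-config call - 2G here is just an example value:

sudo tljh-config set limits.memory 2G
sudo tljh-config reload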

Sizing your VM correctly

If you choose a VM that’s too big, you’ll end up spending a lot of cash for unused resources. If it’s too small, your users will not have the resources they need to do their work. TLJH provides some helpful docs estimating your VM size, and you can always resize your VM afterwards if you get it wrong.

Disk backups

TLJH contains everything on the VM’s disk - your user environment, users’ home directories, current hub configuration, etc. It is very important you back this up, to recover in case of disasters. Automated disk snapshots from your cloud provider are an easy way to do this. Most major cloud providers offer a way to do this - Google Cloud, Digital Ocean, AWS, etc. Some let you automate it as well - Google & AWS certainly do, I’m not sure about other cloud providers. This isn’t the best way to do backups - there are approximately 1 billion ways to do so. However, this is an absolute minimum, and it might just be enough.

If you want to be more fancy, I’d suggest using a separate disk / volume for your user home directories, possibly on ZFS, and snapshot much more aggressively. Talk to your nearest google search bar for your options.

Pin your public IP

Some cloud providers change your VM’s public IP address if you stop / start it. This can be pretty bad - you’ll have to change your domain’s DNS entry, and re-acquire your HTTPS certificate. A hassle! You can tell your cloud provider to hang on to your IP even if your VM goes down / changes. And you should! DigitalOcean doesn’t require this, but Google does. I think AWS does too, but I’m not sure how you reserve the public IP there - since what AWS gives you is usually a domain name itself.

Base environment setup + snapshot

TLJH has a shared conda environment that is used by all users. Everyone can read from it, but only users who are admins can write to it (via sudo). This is one of TLJH’s core design tradeoffs - admins can install packages the way they are used to, without requiring a separate image-build step. But it also means an admin can mess it up - conda environments can sometimes be fickle! So it’s not a bad idea to spend some time in the beginning setting everything up - python packages, JupyterLab extensions, etc. Then make a disk snapshot, so you can revert to it if things go bad (this is where having a separate disk for your user home directories comes in handy).
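For example, an admin setting up the shared environment from a terminal on the hub would run something like this - the packages are illustrative, and sudo -E is what routes the install into the shared environment:

sudo -E conda install -c conda-forge numpy pandas xarray
sudo -E pip install nbgitpuller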

SSH admin access

The TLJH documentation strives hard to make sure SSH isn’t required for setup and most common usage. However, if your TLJH breaks in certain ways, you can no longer access the machine - since all access is via TLJH! For this, I recommend making sure someone who is admin has SSH access to the VM. Most cloud providers offer a way to set the root ssh key on creation. If not, you can follow the many guides on the internet to making it happen.

You can also just put your ssh keys in $HOME/.ssh/authorized_keys, and ssh in as jupyter-<username>@<hub-ip>. This works for any / all users!

Others?

I’m sure this isn’t the end - probably need something about firewalls, monitoring and automated system package upgrades. But hey, great start!

Access dask-kubernetes clusters running on a cloud from your local machine

You can run your Jupyter Notebook locally, and connect easily to a remote dask-kubernetes cluster on a cloud-based Kubernetes cluster with the help of kubefwd. This notebook will show you an example of how to do so. While this example is a Jupyter Notebook, the code will work in any local python medium - REPL, IDE (vscode), or just plain ol’ .py files.

Latest executable version of this notebook can be found in this repository

Credit

This work was sponsored by Quansight ❤️

Create & setup a Kubernetes cluster

You need to have a working kubernetes cluster that is configured correctly. If you can get kubectl get ns to work properly, your cluster is connected & working well enough for this guide.

Install & run kubefwd

kubefwd lets you access services in your cloud Kubernetes cluster as if they were local, with a clever combination of ol’ school /etc/hosts hacks & fancy kubernetes port-forwarding. It requires root on Mac OS & Linux, and should theoretically work on Windows too (I haven’t tested).

Once you have installed it, run it in a separate terminal.

sudo kubefwd svc -n default -n kube-system

If you’ve created your own namespace for your cluster, use that instead of default. The kube-system namespace is required until this issue is fixed.

If the kubefwd command runs successfully, we’re good to go!

Install libraries we’ll need

In addition to dask-kubernetes, we’ll also need numpy to test our cluster with dask arrays.

%pip install numpy dask distributed dask-kubernetes

Setup dask-kubernetes configuration

Normally, the pod template would come from an external configuration file. We keep this in the notebook to make it more self contained.

POD_SPEC = {
    'kind': 'Pod',
    'metadata': {},
    'spec': {
        'restartPolicy': 'Never',
        'containers': [
            {
                'image': 'daskdev/dask:latest',
                'args': [
                    'dask-worker',
                    '--death-timeout', '60'
                ],
                'name': 'dask',
            }
        ]
    }
}

Create a remote cluster & connect to it

We create a KubeCluster object, with deploy_mode='remote'. This creates the scheduler as a pod on the cluster, so worker <-> scheduler communication is easy & efficient. kubefwd helps us communicate with this remote scheduler, so we can pretend we are actually on the remote cluster.

If you are using kubectl to watch the objects created in your namespace, you’ll see a service and a pod created for this. kubefwd should also list a log line about forwarding the service port locally.

from dask_kubernetes import KubeCluster

cluster = KubeCluster.from_dict(POD_SPEC, deploy_mode='remote')

Create some workers

We have a scheduler in the cloud, now time to create some workers in the cloud! We create 2, and can watch with glee in kubectl as the worker pods come up.

All scaling methods (adaptive scaling, using the widget, etc) should work here.

cluster.scale(2)

Run some computation!

We test our cluster by doing some trivial calculations with dask arrays. You can use any dask code as you normally would here, and it would run on the cloud Kubernetes cluster. This is especially helpful if you have large amounts of data in the cloud, since the workers would be really close to where the data is.

You might get warnings about version mismatches. This is ok for the demo; in production you’d probably build your own docker image with pinned versions.

from dask.distributed import Client
import dask.array as da

# Connect Dask to the cluster
client = Client(cluster)

# Create a large array and calculate the mean
array = da.ones((1000, 1000, 1000))
print(array.mean().compute())  # Should print 1.0

Cleanup dask cluster

When you’re done with your cluster, remember to clean it up to release the resources!

This doesn’t affect your kubernetes cluster itself - you’ll need to clean that up manually.

cluster.close()

Next steps?

Lots more we can do from here.

Ephemeral Kubernetes Clusters

We can wrap kubernetes cluster creation in some nice python functions, letting users create kubernetes clusters just-in-time for running a dask-kubernetes cluster, and tearing them down when they’re done. Users can thus ‘bring their own compute’ - since the clusters will be in their own cloud accounts - without the complication of understanding how the cloud works. This is where this would differ from the wonderful dask-gateway project, I think.
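As a sketch of what those python functions could look like - shelling out to gcloud is just one option, and every name here is illustrative:

import subprocess

def create_cluster(name, num_nodes=3, zone='us-central1-b'):
    # Create a throwaway GKE cluster; eksctl / az would work similarly
    subprocess.run([
        'gcloud', 'container', 'clusters', 'create', name,
        '--num-nodes', str(num_nodes), '--zone', zone,
    ], check=True)

def delete_cluster(name, zone='us-central1-b'):
    # Tear the cluster down when done, so nothing is left running (or billing)
    subprocess.run([
        'gcloud', 'container', 'clusters', 'delete', name,
        '--zone', zone, '--quiet',
    ], check=True)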

Remove kubefwd

kubefwd isn’t strictly necessary, and should ideally be replaced by a kubectl port-forward call that doesn’t require root. This should be possible with some changes to the dask-kubernetes code, so the client can connect to the scheduler via a different address (say, localhost:8974, since that’s what kubectl port-forward gives us) vs the workers (which need something like dask-cluster-scheduler-c12.namespace:8786, since that is the in-cluster address).
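The call being sketched here is something like the following - the service name is illustrative - which forwards a local port to the scheduler service without needing root:

kubectl port-forward service/dask-cluster-scheduler-c12 8974:8786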

Longer term, it would be great if we can get rid of spawning other processes altogether, if/when the python kubernetes client gains the ability to port-forward.

Integration with TLJH

I love The Littlest JupyterHub (TLJH). A common use case is that a group of users need a JupyterHub, mostly doing work that’s well served by TLJH. However, sometimes they need to scale up to a big dask cluster to do some work, but not for long. In these cases, I believe a combination of TLJH + Ephemeral Kubernetes Clusters is far simpler & easier to manage than running a full Kubernetes based JupyterHub. In addition, we can share the conda environment from TLJH with the dask workers, removing the need for users to think about docker images or environment mismatches completely. This is a massive win, and merits further exploration.

???

I am not actually an end user of dask, so I’m sure actual dask users will have much more ideas. Or they won’t, and this will just end up being a clever hack that gives me some joy :D Who knows!

Check if an organization is using GSuite

I needed to find out if an organization was using GSuite for their email, so I could allow their users to log in to a JupyterHub I was setting up. I use Google Auth in other places, and wanted to reuse it here if possible.

I could’ve just asked them, but where’s the fun in that? Instead, you can use dig to find out!

For example, if I were to test berkeley.edu, I can run:

$ dig -t mx berkeley.edu

This gives me a bunch of output, the important bits of which are:

;; ANSWER SECTION:
berkeley.edu.		242	IN	MX	5 alt1.aspmx.l.google.com.
berkeley.edu.		242	IN	MX	10 alt3.aspmx.l.google.com.
berkeley.edu.		242	IN	MX	10 alt4.aspmx.l.google.com.
berkeley.edu.		242	IN	MX	1 aspmx.l.google.com.
berkeley.edu.		242	IN	MX	5 alt2.aspmx.l.google.com.

This means that mail to anyone@berkeley.edu is routed via Google, so I know this organization is using GSuite for their users!
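If you want to script this check, here is a small sketch using dnspython (my choice of library is an assumption - any DNS library would do):

import dns.resolver  # pip install dnspython

def uses_google_mail(domain):
    # MX records pointing at google.com mean mail is routed via Google
    answers = dns.resolver.resolve(domain, 'MX')
    return any(str(r.exchange).lower().endswith('google.com.') for r in answers)

print(uses_google_mail('berkeley.edu'))  # True for GSuite-hosted domains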

On Being an 'Entitled User'

Entitled users burn out maintainers. While my own massive burnout wasn’t directly related, this didn’t help either. For a lot of other maintainers, it has been a primary source of burnout. This is an unfortunate reality of our time. There are a ton of stories from maintainers out there - see this for a recent example. Not going to rehash that.

However, I’ve recently found myself on the ‘other side’ of this coin. I exhibited the following symptoms:

  • Whining about lack of documentation. I discover useful nuggets from helpful people on chat and by reading through the code, but don’t actually contribute those to docs.
  • Complaining about technology choices, without a lot of actual experience in the domain.
  • When technology choices I favor were made, not actually putting my effort into them. I acknowledge that I don’t have to do this - I have little time too. But with whatever time I have, I wish I would do this rather than whine.
  • Hit-and-run engagement, where I only show up every few months whenever I am trying to solve a particular problem, complain / do things, then run away. Things seem to get more to my liking each time I come back, but I still seem to be complaining anyway. I also acknowledge that this is ok - I am burnt out too. But it does mean I can’t expect to find working with this code base easy. It also means that the team of people actually working on it only interact with me when I’m in this mode, which I don’t like either.
  • Wading into emotionally charged situations based on events that had already happened in the project before I was part of it, trampling around slightly, and then withdrawing on recognizing I don’t have the emotional bandwidth to fully understand what was happening. This makes the situation worse for everyone, including myself.
  • There’s probably more behavior here that I don’t recognize.

In a lot of ways, I think my frustrations and criticisms are valid. However, there are very constructive ways to engage, and then there’s just useless whining that burns other people out. I would like to think I’ve generally picked the constructive way in most projects. But that has blinded me to my behavior in a few projects.

I’m going to try and change my behavior to match what I’d like a frustrated user to do in open source projects where I’m the maintainer.

  • I’ll contribute documentation (either as blog posts, issues, or doc PRs) when I find solutions to problems that I am frustrated by
  • If I favor other technology choices, I’ll provide constructive criticism of why, and then STFU. Or, I’ll actually work on making the change. I will not continue to whine.
  • When I recognize an emotionally charged situation, I will limit myself to actions I can fully emotionally engage in. This is a limitation of my own emotional bandwidth, fueled by burnout and other things. I’m simply recognizing the limitation, and trying to not make anything worse.
  • I do have a right to whine, and I’ll do so in private to people not otherwise engaged in the project.
  • I will generally ask myself ‘how would I like a user to behave if I was the maintainer?’ and try to act in that way

To be clear, these are the things I want to change about myself based on my situation. This isn’t a prescriptive list for other people to follow.

To the wonderful hardworking people who are maintainers of any popular open source projects - thank you for putting up with me.

Note taking: Why bother?

Talking to Ankur Sethi made me think more clearly about why I want note-taking and bookmarking setups.

I am trying to stop feeling so overwhelmed by moving things out of my puny brain.

The solution to this is twofold.

  1. Commit to and do fewer things. Already working on this.
  2. Be more efficient in the things you are doing.

A few years ago I gave up on chasing efficiency gains because they were all so marginal (hello, maintaining a super customized .vimrc). But I think I threw out the baby with the bathwater there, by focusing on the wrong kinds of efficiency. So I am trying again!

Note Taking II - Retaining consumed content

In conversation with Nirbheek about my blog post on note taking, he said something like ‘my system of knowledge has holes, and I lose track of stuff I learnt because of that’. This was complementary to my earlier quest - the outliner helped me structure how I think and produce content, while a bookmarking system would help me consume and retain knowledge. Both need to be integrated, but I currently have only one half of the puzzle.

We ended up talking about ‘bookmarks’ and how he has a lot of them. I’ve felt the need for saving stuff I read in useful ways, but never used bookmarks because I never look at them again. It would be awesome to have a system of that sort - especially if it was integrated into my notetaking and todo list solutions.

We talked about some current solutions - Pratul’s meticulously tagged pinboard account, Ankur’s blog’s links category. All this was too much work for me, but I couldn’t quite articulate why. I’ve tried both these options before and couldn’t keep them up. Longevity was my goal here, so I kept looking.

Memex

This led us to Memex, an open source system (with a paid hosted version) that seemed to do everything we wanted. They make a particular point of not taking VC money, which is a big draw for me.

Nirbheek tried it out, and found it buggy / wanting. In particular, it only worked on half his bookmarks - the rest had rotted away, or the import itself was buggy. This could be fixed with Internet Archive integration, but there were a few more bugs about data loss in their issue tracker that didn’t give me too much confidence. It does still look amazing, and I will try it out at some point soon - unlike Nirbheek, I don’t have a few thousand bookmarks to import.

However, it gave me the name Memex, with a history behind it - an article by Vannevar Bush (Claude Shannon was his student) from 1945 called As We May Think. This was an extremely fascinating read - I read it in the original layout, with fascinating ads by the side (men’s garters, for example).

Here is my annotated version, with the origins of the rest of this blog post at the bottom. It’s fairly short, so I suggest you read the original. I’m not going to rehash it, but just talk about what I think I need in my ‘bookmarking’ solution.

What is a bookmark?

Bookmark is a terrible word for what I want, but I can’t think of a better one right now. Bush defines the ‘memex’ as:

A memex is a device in which an individual stores all his books, records and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory.

(Gendered pronouns in the source left in for faithful reproduction. I’d like to think that if this were written today, it would use gender neutral pronouns)

So a ‘bookmark’ is an individual thing that I come across that I might possibly ever want to remember again. I can’t remember it all, so I want to toss it into a ‘bookmarking system’ that can be searched with ’exceeding speed and flexibility’.

Based on this, I figured out my current criteria for looking at & trying out such systems to find one that meets my needs.

Search first

I want my interface to be a simple text box into which I can type things, and it’ll search the entire contents of everything I’ve ‘bookmarked’. This should be articles I’ve read, books, tweets, videos I’ve watched, podcasts I’ve listened to, my notes, etc. I don’t want to do any tagging or categorizing myself.

Tagging and categorizing remind me of dmoz and Yahoo Directories. Way too much work for me, and I’ll never look at them again.

Bush calls selection (what we would call ‘search’ today) the primary problem, and rails against indexing (IMO what we would call tagging / categorizing today) as insufficient.

This search needs to be extremely fast, and very relevant.

Never forgets

I should have access to these ‘bookmarks’ all the time. This means cross-device sync (Android, iPad, Mac OS for me).

It also needs to actually archive all the content I throw at it. Save web pages fully. Download (and transcribe) videos / podcasts. Find some way to keep books in there. No link rot, no ‘shit this content is now behind a paywall’, no ‘damn, I wish I had my laptop with me’.

Hyperlinking

I should be able to connect bookmarks together. I guess this could be done with tags, but I want something better. Bush talks about being able to link ‘bookmarks’ to each other with a blurb about the link - this is what we would call a ‘hyperlink’ now.

My note taker has hyperlinks with other notes, but I should be able to hyperlink to content I have saved. I should also be able to link together content I’ve saved that doesn’t have links themselves.

Strong integration between this and my note taker is probably the way to go. Each piece of content should have highlights, and notes attached to them. These could contain hyperlinks easily. The highlights and notes should also be searchable.

Publishable

Bush talks about ’trailblazers’ - people who find new content, link it together in useful ways, and publish it for others.

This was how I used Google Reader’s Shared functionality. You could publish a link with all the content you have consumed that you marked as ‘shared’, with a note as well. I would subscribe to other people’s shared feeds, and find useful content (and blogs) to follow.

RIP Google Reader. It continues to make me wary of relying on any of Google’s services.

Having a note taker has already drastically changed how I think and produce content. I feel integrating this into my life will have similar effects on how I consume content. Now that I have a better idea of what I want, I’ll spend some more time looking.

Credits

Thanks to Nirbheek, Ankur Sethi and Prateek for conversations and thoughts on this topic :)