Access dask-kubernetes clusters running on a cloud from your local machine
Contents
You can run your Jupyter Notebook locally, and connect easily to a remote dask-kubernetes
cluster
on a cloud-based Kubernetes Cluster with the help of kubefwd. This notebook
will show you an example of how to do so. While this example is a Jupyter Notebook, the code will work
any local python medium - REPL, IDE (vscode), or just plain ol’ .py
files
Latest executable version of this notebook can be found in this repository
Credit
This work was sponsored by Quansight ❤️
Create & setup a Kubernetes cluster
You need to have a working kubernetes cluster that is configured correctly. If you can get
kubectl get ns
to work properly, it means your cluster is working fine & connected for
this to go.
Install & run kubefwd
kubefwd
lets you access services in your cloud Kubernetes cluster as if they were
localy, with a clever combination of ol’ school /etc/hosts
hacks & fancy kubernetes
port-forwarding. It requires root on Mac OS & Linux, and should theoretically work on
Windows too (haven’t tested).
Once you have installed it, run it in a separate terminal.
sudo kubefwd svc -n default -n kube-system
If you’ve created your own namespace for your cluster, use that instead of default
.
The kube-system
is required until this issue
is fixed.
If the kubefwd
command runs successfully, we’re good to go!
Install libraries we’ll need
In addition to dask-kubernetes
, we’ll also need numpy
to test our cluster with dask arrays.
%pip install numpy dask distributed dask-kubernetes
Setup dask-kubernetes configuration
Normally, the pod template would come from an external configuration file. We keep this in the notebook to make it more self contained.
POD_SPEC = {
'kind': 'Pod',
'metadata': {},
'spec': {
'restartPolicy': 'Never',
'containers': [
{
'image': 'daskdev/dask:latest',
'args': [
'dask-worker',
'--death-timeout', '60'
],
'name': 'dask',
}
]
}
}
Create a remote cluster & connect to it
We create a KubeCluster
object, with deploymode='remote'
. This creates
the scheduler as a pod on the cluster, so worker <-> scheduler communication
is easy & efficient. kubefwd
helps us communicate to this remote scheduler,
so we can pretend we are actually on the remote cluster.
If you are using kubectl
to watch the objects created in your namespace,
you’ll see a service
and a pod
created for this. kubefwd
should
also list a log line about forwarding the service port locally.
from dask_kubernetes import KubeCluster
cluster = KubeCluster.from_dict(POD_SPEC, deploy_mode='remote')
Create some workers
We have a scheduler in the cloud, now time to create some workers in the
cloud! We create 2, and can watch the worker pods come up with glee in
kubectl
All scaling methods (adaptive scaling, using the widget, etc) should work here.
cluster.scale(2)
Run some computation!
We test our cluster by doing some trivial calculations with dask arrays. You can use any dask code as you normally would here, and it would run on the cloud Kubernetes cluster. This is especially helpful if you have large amounts of data in the cloud, since the workers would be really close to where the data is.
You might get warnings about version mismatches. This is ok for the demo, in production you’d probably build your own docker image that will have fixed versions.
from dask.distributed import Client
import dask.array as da
# Connect Dask to the cluster
client = Client(cluster)
# Create a large array and calculate the mean
array = da.ones((1000, 1000, 1000))
print(array.mean().compute()) # Should print 1.0
Cleanup dask cluster
When you’re done with your cluster, remember to clean it up to release the resources!
This doesn’t affect your kubernetes cluster itself - you’ll need to clean that up manually
cluster.close()
Next steps?
Lots more we can do from here.
Ephemeral Kubernetes Clusters
We can wrap kubernetes cluster creation in some nice python functions, letting users create kubernetes clusters just-in-time for running a dask-kubernetes cluster, and tearing it down when they’re done. Users can thus ‘bring their own compute’ - since the clusters will be in their cloud accounts - without having the complication of understanding how the cloud works. This is where this would be different from the wonderful dask-gateway project I think.
Remove kubefwd
kubefwd
isn’t strictly necessary, and should ideally be replaced by a kubectl port-forward
call that doesn’t require root. This should be possible with some changes to the dask-kubernetes
code, so the client can connect to the scheduler via a different address (say, localhost:8974
, since that’s what kubectl port-forward
gives us) vs the workers (which need something like dask-cluster-scheduler-c12.namespace:8786
, since that is in-cluster address).
Longer term, it would be great if we can get rid of spawning other processes altogether, if/when the python kubernetes client gains the ability to port-forward.
Integration with TLJH
I love The Littlest JupyterHub (TLJH). A common use case is that a group of users need a JupyterHub, mostly doing work that’s well seved by TLJH. However, sometimes they need to scale up to a big dask cluster to do some work, but not for long. In these cases, I believe a combination of TLJH + Ephemeral Kubernetes Clusters is far simpler & easier to manage than running a full Kubernetes based JupyterHub. In addition, we can share the conda environment from TLJH with the dask workers, removing the need for users to think about docker images or environment mismatches completely. This is a massive win, and merits further exploration.
???
I am not actually an end user of dask, so I’m sure actual dask users will have much more ideas. Or they won’t, and this will just end up being a clever hack that gives me some joy :D Who knows!
Author
LastMod 2020-06-27