Kubernetes HOST_CFG problem w/ Calico/RKE2

Hi, this is a separate issue from getting set up with Kubernetes in general (I think). I’ve got OOD doing everything generally right connecting to an RKE2 system with Calico as the CNI. I’m testing the bc_k8s_jupyter application and the container keeps crashing, and I think it’s because the generated config looks like this:

c.NotebookApp.port = 8080
c.NotebookApp.ip = '0.0.0.0'
c.NotebookApp.disable_check_xsrf = True
c.NotebookApp.allow_origin = '*'
c.Application.log_level = 'DEBUG'
c.NotebookApp.password=u'sha1::c941202f1232e9d06224f9b32dce3ec96777fdf6'
c.NotebookApp.base_url='/node/10-115-20-84.kubernetes.default.svc.cluster.local.
10-115-20-84.calico-typha.calico-system.svc.cluster.local/31732/'

The actual error is in the log output from jupyter:

[E 15:28:16.932 NotebookApp] Exception while loading config file /ood/ondemand_config.py
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.9/site-packages/traitlets/config/application.py", line 738, in _load_config_files
        config = loader.load_config()
      File "/opt/conda/lib/python3.9/site-packages/traitlets/config/loader.py", line 614, in load_config
        self._read_file_as_dict()
      File "/opt/conda/lib/python3.9/site-packages/traitlets/config/loader.py", line 646, in _read_file_as_dict
        exec(compile(f.read(), conf_filename, 'exec'), namespace, namespace)
      File "/ood/ondemand_config.py", line 7
        c.NotebookApp.base_url='/node/10-115-20-84.kubernetes.default.svc.cluster.local.
                                                                                        ^
    SyntaxError: EOL while scanning string literal

In the submit.yml.erb file, we have the standard line:

   host_port_cfg = "c.NotebookApp.base_url=\'/node/${HOST_CFG}/${PORT_CFG}/\'"

I’m guessing that HOST_CFG is getting either a newline or lots of extra spaces. Is there any way you can think of to strip out newlines and stuff from those variables? I could try a different CNI to see if it generates more sane hostnames, but it’d be nice to clear out the garbage from the variable too.

You could try something like this:

<%- host_cfg = ENV['HOST_CFG'].strip -%>
host_port_cfg = "c.NotebookApp.base_url=\'/node/<%= host_cfg %>/${PORT_CFG}/\'"

That may do the trick to strip the whitespace and insert the variable correctly.
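In plain Ruby, `String#strip` trims leading and trailing whitespace, newlines included, before the value gets interpolated. A minimal sketch (the hostname here is just the example value from the broken config above):

```ruby
# Hypothetical HOST_CFG value with a trailing newline attached.
raw_host = "10-115-20-84.kubernetes.default.svc.cluster.local\n"

# strip trims leading/trailing whitespace (spaces, tabs, newlines), so the
# interpolated base_url line stays a single-line Python string.
host_cfg = raw_host.strip
puts host_cfg
```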

Nice. Let me try that.

It doesn’t seem to like having the <%- -%> inside of the main <% %> section.

#<SyntaxError: /var/www/ood/apps/dev/user/gateway/bc_k8s_jupyter/submit.yml.erb:5: syntax error, unexpected '<', expecting end-of-input
   <%- stuff = ENV[HOST_CFG} 
   ^

These are Ruby-isms that I don’t think I get.

The <%- -%> syntax just suppresses the extra whitespace that the plain <% %> version can leave behind in the rendered output, that’s all.
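A quick illustration of the difference using Ruby’s stdlib ERB (the template string here is made up):

```ruby
require "erb"

# Made-up template: a line of output, an ERB assignment tag, another line.
template = "before\n<%- x = 1 -%>\nafter\n"

# With trim_mode "-", <%- trims leading whitespace on the tag's line and
# -%> swallows the newline after the tag, so no blank line is left behind.
puts ERB.new(template, trim_mode: "-").result(binding)
```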

There’s a closing tag on that as well, right? I only see the opening one in the error.

If you look at the Jupyter app you can see this happening at the top of the file to set the variables needed:

Breaking it down a little, if I put this at the beginning of my submit.yml.erb file, I get an error:

<%- host_cfg = ENV['HOST_CFG'].strip -%>
undefined method `strip' for nil:NilClass

Ok, sounds like ENV['HOST_CFG'] isn’t something we can get, since it’s returning nil.

What happens if you try:

<%- host_cfg = HOST_CFG.strip -%>

Might be it’s a regular variable and not an ENV var we can access and call the method on right there.

That gives me:

uninitialized constant BatchConnect::App::HOST_CFG

I think I might be seeing it. The HOST_CFG is actually generated in the utility container from the /bin/find_host_port command right at the end.

export HOST_CFG=$HOST_CFG
export PORT_CFG=$PORT_CFG

So, maybe I could hack up something like:

source /bin/find_host_port; export HOST_CFG=$(echo "$HOST_CFG" | tr -d '[:space:]'); /bin/add_line.....
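One wrinkle worth noting: the generated config at the top shows the hostname broken across two lines, i.e. whitespace in the *middle* of the value. `strip` only trims the ends, while `tr -d '[:space:]'` deletes every whitespace character. In Ruby terms (hostnames shortened for illustration):

```ruby
# A value with an embedded newline, like the two-line hostname in the
# broken config above (shortened here).
raw = "10-115-20-84.a.cluster.local\n10-115-20-84.b.cluster.local\n"

raw.strip            # trims the ends only; the inner newline survives
raw.gsub(/\s+/, "")  # removes all whitespace, like tr -d '[:space:]'
```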

Ok, I’m not sure where that variable is actually coming from, looking at the code and the ENV vars. Is it set somewhere else in your app maybe? Part of the script.sh? It may be you need to export those variables to make them available.

I think this could work. Look here at the before.sh in the Jupyter app and I think you can use the export here to set this correctly.
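Something like the following at the end of a before.sh might do it; this is an untested sketch that just combines the pieces quoted above (the /bin/find_host_port path and variable names are taken from your snippet):

```sh
# Sketch: populate HOST_CFG/PORT_CFG, then re-export HOST_CFG with all
# whitespace removed so the generated Jupyter config stays on one line.
source /bin/find_host_port
export HOST_CFG="$(echo "$HOST_CFG" | tr -d '[:space:]')"
export PORT_CFG
```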

Thanks. I’ll take a look at that larger example. The line I put in actually did get the container running so it fixed up the original issue. There’s still something funky with the nodeport that’s being created as I can’t connect to the host on that port.

Definitely something up with the way the ondemand pod/service is being set up, such that I can’t reach the NodePort on the k8s host. iptables is blocking it, so something didn’t get created right and I’m not really seeing it.

For example, if I have a simple pod/service setup:

apiVersion: v1
kind: Pod
metadata:
  labels:
    app.kubernetes.io/name: test
  name: test
  namespace: default
spec:
  containers:
  - image: rancher/hello-world
    imagePullPolicy: Always
    name: container-0
    ports:
    - containerPort: 80
      protocol: TCP
  restartPolicy: Always
---
apiVersion: v1
kind: Service
metadata:
  name: myport
spec:
  type: NodePort
  selector:
    app.kubernetes.io/name: test
  ports:
    - protocol: TCP
      port: 80

This gets created fine and if I open a browser to the host:port I get the test web page.

When launching with OnDemand, I see the service being created, the jupyter session is running and all that, but I can’t get to it (just hangs and times out). Just to see what’s up I logged in on the console of the kubernetes node and I can reach jupyter. Something in the way the networking is being set up in the yaml is not quite right.

Hmm, have you looked at this part of the docs over the networking portion:
https://osc.github.io/ood-documentation/latest/installation/resource-manager/kubernetes.html#deploy-hooks-to-bootstrap-users-kubernetes-configuration

It may be you need to set that NETWORK_POLICY_ALLOW_CIDR in order to reach the container.
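For reference, I believe the hook scripts read their settings from /etc/ood/config/hook.env; a hypothetical fragment (the CIDR is a placeholder — substitute your site’s actual client network, per the docs linked above):

```sh
# /etc/ood/config/hook.env (illustrative value only)
# CIDR allowed through the generated NetworkPolicy so browsers on your
# network can reach the pod; replace with your real client subnet.
NETWORK_POLICY_ALLOW_CIDR="192.168.0.0/16"
```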

Yeah, I think that might be a big piece of it. I wonder how that really should be set up though, since the intention is to block access from other namespaces, yet you still need to be able to browse into the container from outside the cluster.

I just went ahead and deleted the network policy, now when I click on the link, I get a 404: Not found error. My URL is:

https://ondemand.example.org/node/k8s-node.example.org/31094/login

If I click on the ‘jupyter notebook’ link above that I get a proxy error.

If I just connect directly to k8s-node.example.org:31094 I do connect (also getting a 404 error), and clicking ‘Jupyter’ above that takes me to the password page.

Seems somewhat closer.

I think the containers are still namespaced correctly and separated, the /etc/ood/config/hook.env looks to handle just the networking.

I’ll be honest, I’ve never even seen these hooks until your post, so I’m still trying to figure this out myself.

But, from what I understand of k8s, we are allowing requests from the local network into the container in this namespace, and that won’t break the separation between namespaces.

Hmmm, so this is making a bit more sense. The hook files below will likely answer your questions around networking and namespacing:

Namespace hook:

Networking hook:

So when all of this is cobbled together, you’ll be able to use the hooks to grant network access with those settings while still keeping your containers namespaced. I’m not an expert at k8s, so hopefully all these words are used right.