Problems with creating jupyter session with k8s cluster configuration

I’m trying to use OnDemand with an existing Kubernetes cluster. Thanks to this thread, I was able to install OOD on my Ubuntu server using Ansible. Now I am trying to create a cluster configuration in the /etc/ood/config/clusters.d directory and run a Jupyter session.

Following is my cluster configuration in the clusters.d folder.

v2:
    batch_connect:
        ssh_allow: false
    job:
        adapter: kubernetes
        all_namespaces: false
        auth:
            type: managed
        auto_supplemental_groups: false
        bin: /usr/local/bin/kubectl
        cluster: cluster.local
        config_file: ~/.kube/config
        context: kubernetes-admin@cluster.local
        mounts: []
        server:
            cert_authority_file: /etc/kubernetes/pki/ca.crt
            endpoint: <internal_ip>:6443
    login:
        host: <external_ip>
    metadata:
        title: kubernetes

Since I don’t have any OIDC IdP, I selected auth type ‘managed’ to ignore context changes. The cluster/context properties are the current cluster/context I’m working with.

I installed the bc_k8s_jupyter app (GitHub: OSC/bc_k8s_jupyter) and tried to create a session via the OOD Interactive Apps menu. When I click the ‘Launch’ button, I get an error message like this.

Failed to submit session with the following error:

error: error loading config file “~/.kube/config”: open ~/.kube/config: permission denied

  • If this job failed to submit because of an invalid job name please ask your administrator to configure OnDemand to set the environment variable OOD_JOB_NAME_ILLEGAL_CHARS.
  • The Jupyter (Kubernetes) session data for this session can be accessed under the staged root directory.

Am I missing something from the K8s configuration tutorial? Or is there anything I misconfigured?

Hi and welcome!

Is ~/.kube/config readable to you? That appears to be the error. If the auth type is managed, that means you manage it yourself. However this file was created, it does not seem to be readable by the current user.
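One way to check, from a shell on the OOD host (the deepops username and path here are just the ones from this thread; the check itself is generic):

```shell
# Minimal readability check for a kubeconfig as seen by a given user.
# Run it as the user OnDemand serves, e.g.:
#   sudo -u deepops bash -c 'check_kubeconfig /home/deepops/.kube/config'
check_kubeconfig() {
  if [ -r "$1" ]; then
    echo "readable"
  else
    # missing file, bad file mode, or a non-traversable parent directory
    echo "not readable"
  fi
}
```

A non-traversable parent directory (e.g. a 700 /root when the file lives under /root/.kube) produces the same “permission denied” as a bad file mode, so check the whole path, not just the file.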

I installed Kubernetes, and the .kube/config is located in the /root directory. I installed OnDemand on the node that is also the control plane of the Kubernetes cluster. I expected OnDemand to refer to the .kube/config file that belongs to the root user, but it seems that’s not true. I tried changing the permissions of the directory and file (even mode 777), but I still get the same error.
Can I check which user is used by OnDemand via a log or any other configuration file?

It’s who you’re logged in as.

This is resolving to your HOME. The HOME of whoever you’re currently logged in as.
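The expansion is easy to see in a shell - the same `~/.kube/config` string resolves to a different file depending on HOME (the paths below are illustrative):

```shell
# '~' expands against $HOME of the process owner, so each user's PUN
# (per-user nginx) resolves config_file: ~/.kube/config to that user's
# own home directory, never to /root:
HOME=/home/deepops bash -c 'echo ~/.kube/config'   # /home/deepops/.kube/config
HOME=/root bash -c 'echo ~/.kube/config'           # /root/.kube/config
```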

The current user name I use in OnDemand is deepops, and in /home/deepops/ there is a .kube/config file.
The mode of the file is 666, so I don’t think the mode is the cause of the issue.

And here’s another strange phenomenon; if I set the value of config_file to /home/deepops/.kube/config, then the permission error disappears. Instead, “Error from server (BadRequest): the server rejected our request for an unknown reason” is printed.
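One way to narrow down a BadRequest like this is to run kubectl by hand with the exact same kubeconfig, outside of OnDemand (the path is the one from this thread):

```shell
# Reproduce outside OnDemand with the same kubeconfig OOD is using:
kubectl --kubeconfig /home/deepops/.kube/config config current-context
# -v=6 logs each HTTP request/response, which usually shows what the
# API server actually rejected:
kubectl --kubeconfig /home/deepops/.kube/config get pods -v=6
```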

I’m trying to use my existing Kubernetes cluster and context, and I realized I don’t need to override the context/cluster configuration in the Kubernetes cluster YAML file.

v2:
    batch_connect:
        ssh_allow: false
    job:
        adapter: kubernetes
        all_namespaces: false
        auth:
            type: managed
        auto_supplemental_groups: false
        bin: /usr/local/bin/kubectl
        config_file: /home/<username>/.kube/config
        mounts: []
        server:
            endpoint: <internal_ip>:6443
    login:
        host: <external_ip>
    metadata:
        title: kubernetes

This configuration works fine for me.

You can also remove the config_file line, as that location is the default - and a hard-coded /home/&lt;username&gt; path won’t work for other users anyway. That is, every user should point to their own kube config.


Now I can make a job with interactive apps in OOD. But here’s another question.


When I click the ‘Connect to the container’ button, I expect to open a Jupyter notebook in the browser, but the response is 404 page not found.
Is there any additional custom configuration I need to make to connect to the container?
I used kubectl to check that the pod is created and running successfully.

Here are the relevant configs in ood_portal.yml. You need to enable node and rnode URIs, but beyond that you also need a regular expression for your hosts.

host_regex: <some regex>
rnode_uri: '/rnode'
node_uri: '/node'

https://osc.github.io/ood-documentation/latest/app-development/interactive/setup/enable-reverse-proxy.html
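After editing /etc/ood/config/ood_portal.yml, the Apache vhost has to be regenerated for the change to take effect; on a standard package install that is roughly:

```shell
# Regenerate the Apache config from ood_portal.yml and restart the web
# server (service name differs by distro - apache2 on Ubuntu/Debian,
# httpd on EL systems):
sudo /opt/ood/ood-portal-generator/sbin/update_ood_portal
sudo systemctl restart apache2
```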

I reconfigured the ood_portal file, and the 404 error no longer appears. I also found a problem with the TurboVNC installation, so I reinstalled TurboVNC on the compute nodes. I verified the reverse proxy using the nc -l command on the compute nodes and confirmed that the proxy works fine.

But still, when I click the button to connect to the interactive app, the server does not respond. I disabled the firewall, and the host_regex pattern is [^]+.
Which file would be relevant to this issue?

What’s the actual behavior you see? Can I get screenshots, and/or can you open your browser’s dev tools to see anything in either the network tab or the console?


When I open the browser dev tools and look at the ‘Network’ section, I can see the request is sent but the status is (pending).

After a minute, I get this error message.

Failed to connect to host:port
And the http status code is 503 service unavailable.

This would indicate a network timeout. Meaning your OOD host can find the k8s node, but when they try to connect you’re blocked by something in the network (a firewall or similar) so you end up with a timeout.

Generally this is a firewall thing. If the ports weren’t open but you had connectivity, it would fail immediately. Since it’s not failing immediately, you have no connectivity between the OOD webserver and your k8s worker node. You can test this with a simple telnet from the OOD webserver.
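The telnet-style probe can also be scripted. With TCP, a closed-but-reachable port fails immediately (connection refused), while a firewalled one hangs until the timeout:

```shell
# Probe a host:port from the OOD webserver using bash's /dev/tcp.
# Prints 'connected', or 'no connectivity' on refusal or timeout;
# an immediate refusal means you have connectivity but a closed port,
# while a 5-second hang points at a firewall drop.
probe() {
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null \
    && echo "connected" || echo "no connectivity"
}
# e.g. probe <work_node_ip> 32002
```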

I enabled all IP addresses and port numbers and tested with telnet.
Here’s the result.

From the Kubernetes master node:
telnet <work_node_ip> 32002 (the NodePort) hangs.
telnet <cluster_ip> 32002 connects.

From the Kubernetes worker node:
telnet <work_node_ip> 32002 connects.

I guess the proxy is not working as I expected. Am I right?

We use NodePorts (the first one), so it hanging there is what we’d expect given your past posts. You seem to be blocking this traffic.

I manually created a deployment with a service exposed via NodePort.

  1. When I create it in the default namespace, or any other namespace I created, the service is reachable via <work_node_ip>:<nodeport>.

  2. But if I create a deployment in the namespace managed by OOD, the service is not reachable via <work_node_ip>:<nodeport>.

Here are other clues.

  • I can’t find any specific error in the firewall settings (as I already mentioned, I disabled the entire firewall of my GCP Compute Engine instances).
  • I manage kube-proxy with ipvs, and I believe port forwarding is working. Following is the result of the ipvsadm command:
    Prot LocalAddress:Port Scheduler Flags
      → RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP 10.128.0.61:31518 rr
      → 10.233.91.20:8080 Masq 1 0 0
    (the internal IP of the node plus the NodePort are forwarded, pointing to <pod_ip>:<port>)
  • From the master node, I can telnet to the service IP, but it seems like the service IP is not forwarded to the node IP.
  • When I change the current user of the context to the user created by the OOD bootstrap hook script, I can’t get pod information using kubectl. I get the error message “kubectl error must provide idp-issuer-url”.
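Since the behavior differs only in the OOD-managed namespace, comparing that namespace against a working one seemed like a reasonable next step; these are generic kubectl checks (namespace names are placeholders):

```shell
# NetworkPolicies exist per-namespace and silently drop traffic that no
# rule allows - a prime suspect when only one namespace misbehaves:
kubectl get networkpolicy -n <ood-user-namespace> -o yaml
kubectl get networkpolicy -n default

# Also compare the services themselves (type, NodePort, endpoints):
kubectl describe svc -n <ood-user-namespace>
```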

It seems like this is hard to troubleshoot, but I hope this information helps you find the cause of the problem. Thank you!

And I’m also curious about the role of the hook.env file, which is used to bootstrap the k8s user. Is there anything I need to know to configure hook.env properly?

These two are related, in that I have questions around how you’re bootstrapping users and namespaces.

On the one hand, you can have us bootstrap user namespaces and policies through the hook environment. On the other, you can do everything out of band - that is, you set it all up and we just expect it to work.

I’m not sure which scheme you’re trying to set up, given you’re using managed auth. But yes, if you want us bootstrapping users, hook.env is what configures that. Again - you can do this out of band, or you could have us do the same every time a PUN starts in pun_root_pre_hook.

All that said, I’m not really sure, but clearly it has something to do with namespaces. If the user is able to create/modify a NodePort service, then it must be some issue with network connectivity to that new namespace. Though why that would be the case I can’t say - maybe some GKE defaults?


The cause of the issue was a misconfigured hook.env file. (I didn’t understand the role of the hook.env file, so I had just used the hooks.env.example file as-is.) Changing

NETWORK_POLICY_ALLOW_CIDR="127.0.0.1/32"

to

NETWORK_POLICY_ALLOW_CIDR="0.0.0.0/0"

or any other valid value makes it work.
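For anyone hitting the same thing: the bootstrap hook presumably templates a NetworkPolicy into each user namespace from that CIDR, so with the loopback-only value only traffic from 127.0.0.1 could reach the session pods. After fixing hook.env, the resulting policy can be inspected (the namespace name is a placeholder):

```shell
# The ingress ipBlock cidr should now read 0.0.0.0/0 (or your site's
# CIDR) rather than 127.0.0.1/32:
kubectl get networkpolicy -n <ood-user-namespace> -o yaml | grep -B2 -A2 cidr
```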