White screen when trying to start a new notebook

I often get a blank screen after a notebook is started and I try to connect. Is it possible the call to squeue is taking too long to respond? Here is the error output when trying to connect from this page: https://login.sherlock.stanford.edu/pun/sys/dashboard/batch_connect/sys/sh_jupyter/session_contexts

tail -f .ondemand/error.log

App 39493 output: [2020-02-25 15:14:35 -0800 ] INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=5913.74 view=42.40"
App 39493 output: [2020-02-25 15:14:46 -0800 ] INFO "execve = [{"SLURM_CONF"=>"/etc/slurm/slurm.conf"}, "/usr/bin/squeue", "--all", "--states=all", "--noconvert", "-o", "\u001E%a\u001F%A\u001F%B\u001F%c\u001F%C\u001F%d\u001F%D\u001F%e\u001F%E\u001F%f\u001F%F\u001F%g\u001F%G\u001F%h\u001F%H\u001F%i\u001F%I\u001F%j\u001F%J\u001F%k\u001F%K\u001F%l\u001F%L\u001F%m\u001F%M\u001F%n\u001F%N\u001F%o\u001F%O\u001F%q\u001F%P\u001F%Q\u001F%r\u001F%S\u001F%t\u001F%T\u001F%u\u001F%U\u001F%v\u001F%V\u001F%w\u001F%W\u001F%x\u001F%X\u001F%y\u001F%Y\u001F%z\u001F%Z\u001F%b", "-j", "61951312", "-M", "sherlock"]"
App 39493 output: [2020-02-25 15:14:47 -0800 ] INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=2075.57 view=23.08"
App 39493 output: [2020-02-25 15:15:01 -0800 ] INFO "execve = [{"SLURM_CONF"=>"/etc/slurm/slurm.conf"}, "/usr/bin/squeue", "--all", "--states=all", "--noconvert", "-o", "\u001E%a\u001F%A\u001F%B\u001F%c\u001F%C\u001F%d\u001F%D\u001F%e\u001F%E\u001F%f\u001F%F\u001F%g\u001F%G\u001F%h\u001F%H\u001F%i\u001F%I\u001F%j\u001F%J\u001F%k\u001F%K\u001F%l\u001F%L\u001F%m\u001F%M\u001F%n\u001F%N\u001F%o\u001F%O\u001F%q\u001F%P\u001F%Q\u001F%r\u001F%S\u001F%t\u001F%T\u001F%u\u001F%U\u001F%v\u001F%V\u001F%w\u001F%W\u001F%x\u001F%X\u001F%y\u001F%Y\u001F%z\u001F%Z\u001F%b", "-j", "61951312", "-M", "sherlock"]"
App 39493 output: [2020-02-25 15:15:03 -0800 ] INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=2996.02 view=23.98"
App 43275 output: Rails Error: Unable to access log file. Please ensure that /var/www/ood/apps/sys/dashboard/log/production.log exists and is writable (ie, make it writable for user and group: chmod 0664 /var/www/ood/apps/sys/dashboard/log/production.log). The log level has been raised to WARN and the output directed to STDERR until the problem is fixed.
App 39493 output: [2020-02-25 15:15:17 -0800 ] INFO "execve = [{"SLURM_CONF"=>"/etc/slurm/slurm.conf"}, "/usr/bin/squeue", "--all", "--states=all", "--noconvert", "-o", "\u001E%a\u001F%A\u001F%B\u001F%c\u001F%C\u001F%d\u001F%D\u001F%e\u001F%E\u001F%f\u001F%F\u001F%g\u001F%G\u001F%h\u001F%H\u001F%i\u001F%I\u001F%j\u001F%J\u001F%k\u001F%K\u001F%l\u001F%L\u001F%m\u001F%M\u001F%n\u001F%N\u001F%o\u001F%O\u001F%q\u001F%P\u001F%Q\u001F%r\u001F%S\u001F%t\u001F%T\u001F%u\u001F%U\u001F%v\u001F%V\u001F%w\u001F%W\u001F%x\u001F%X\u001F%y\u001F%Y\u001F%z\u001F%Z\u001F%b", "-j", "61951312", "-M", "sherlock"]"
App 43275 output: [2020-02-25 15:15:18 -0800 ] INFO "execve = ["git", "describe", "--always", "--tags"]"
App 43275 output: [2020-02-25 15:15:18 -0800 ] INFO "method=GET path=/pun/sys/dashboard/batch_connect/sys/sh_jupyter/session_contexts/new format=html controller=BatchConnect::SessionContextsController action=new status=200 duration=3111.58 view=73.30"
App 39493 output: [2020-02-25 15:15:19 -0800 ] INFO "method=GET path=/pun/sys/dashboard/batch_connect/sessions.js format=js controller=BatchConnect::SessionsController action=index status=200 duration=6338.24 view=19.45"

Reloaded the page:

App 137101 output: [2020-02-25 15:18:33 -0800 ] WARN "Can't verify CSRF token authenticity"
App 137101 output: [2020-02-25 15:18:33 -0800 ] INFO "method=POST path=/pun/sys/dashboard/batch_connect/sys/sh_jupyter/session_contexts format=html controller=BatchConnect::SessionContextsController action=create status=422 error='ActionController::InvalidAuthenticityToken: ActionController::InvalidAuthenticityToken' duration=11.10 view=0.00"
App 137101 output: [2020-02-25 15:18:33 -0800 ] FATAL "ActionController::InvalidAuthenticityToken (ActionController::InvalidAuthenticityToken):
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/request_forgery_protection.rb:181:in `handle_unverified_request'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/request_forgery_protection.rb:209:in `handle_unverified_request'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/request_forgery_protection.rb:204:in `verify_authenticity_token'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:432:in `block in make_lambda'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:164:in `block in halting'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:504:in `block in call'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:504:in `each'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:504:in `call'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:92:in `__run_callbacks__'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:778:in `_run_process_action_callbacks'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:81:in `run_callbacks'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/abstract_controller/callbacks.rb:19:in `process_action'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/rescue.rb:29:in `process_action'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/instrumentation.rb:32:in `block in process_action'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/notifications.rb:164:in `block in instrument'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/notifications.rb:164:in `instrument'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/instrumentation.rb:30:in `process_action'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/params_wrapper.rb:250:in `process_action'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/abstract_controller/base.rb:137:in `process'
  vendor/bundle/ruby/2.4.0/gems/actionview-4.2.11/lib/action_view/rendering.rb:30:in `process'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal.rb:196:in `dispatch'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal/rack_delegation.rb:13:in `dispatch'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_controller/metal.rb:237:in `block in action'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/routing/route_set.rb:74:in `dispatch'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/routing/route_set.rb:43:in `serve'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/journey/router.rb:43:in `block in serve'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/journey/router.rb:30:in `each'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/journey/router.rb:30:in `serve'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/routing/route_set.rb:817:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/etag.rb:24:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/conditionalget.rb:38:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/head.rb:13:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/params_parser.rb:27:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/flash.rb:260:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/session/abstract/id.rb:225:in `context'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/session/abstract/id.rb:220:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/cookies.rb:560:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/callbacks.rb:29:in `block in call'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:88:in `run_callbacks'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:778:in `_run_call_callbacks'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/callbacks.rb:81:in `run_callbacks'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/callbacks.rb:27:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/remote_ip.rb:78:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/debug_exceptions.rb:17:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/show_exceptions.rb:30:in `call'
  vendor/bundle/ruby/2.4.0/gems/lograge-0.10.0/lib/lograge/rails_ext/rack/logger.rb:15:in `call_app'
  vendor/bundle/ruby/2.4.0/gems/railties-4.2.11/lib/rails/rack/logger.rb:20:in `block in call'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/tagged_logging.rb:68:in `block in tagged'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/tagged_logging.rb:26:in `tagged'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/tagged_logging.rb:68:in `tagged'
  vendor/bundle/ruby/2.4.0/gems/railties-4.2.11/lib/rails/rack/logger.rb:20:in `call'
  vendor/bundle/ruby/2.4.0/gems/request_store-1.4.1/lib/request_store/middleware.rb:19:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/request_id.rb:21:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/methodoverride.rb:22:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/runtime.rb:18:in `call'
  vendor/bundle/ruby/2.4.0/gems/activesupport-4.2.11/lib/active_support/cache/strategy/local_cache_middleware.rb:28:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/lock.rb:17:in `call'
  vendor/bundle/ruby/2.4.0/gems/actionpack-4.2.11/lib/action_dispatch/middleware/static.rb:120:in `call'
  vendor/bundle/ruby/2.4.0/gems/rack-1.6.11/lib/rack/sendfile.rb:113:in `call'
  vendor/bundle/ruby/2.4.0/gems/railties-4.2.11/lib/rails/engine.rb:518:in `call'
  vendor/bundle/ruby/2.4.0/gems/railties-4.2.11/lib/rails/application.rb:165:in `call'
  /opt/ood/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:97:in `process_request'
  /opt/ood/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:157:in `accept_and_process_next_request'
  /opt/ood/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:110:in `main_loop'
  /opt/ood/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/request_handler.rb:415:in `block (3 levels) in start_threads'
  /opt/ood/ondemand/root/usr/share/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception'"
[ N 2020-02-25 15:18:33.4698 137022/T5 age/Cor/SecurityUpdateChecker.h:519 ]: Security update check: no update found (next check in 24 hours)
[ N 2020-02-25 15:20:15.7263 13927/T3 age/Cor/CoreMain.cpp:1117 ]: Checking whether to disconnect long-running connections for process 39493, application /var/www/ood/apps/sys/dashboard (production)
[ N 2020-02-25 15:20:15.7319 13927/T3 age/Cor/CoreMain.cpp:1117 ]: Checking whether to disconnect long-running connections for process 43275, application /var/www/ood/apps/sys/dashboard (production)

Hi @mpiercy, sorry for the delay. There's absolutely some lag there: the duration field is in milliseconds, so I see some of those are 3-6 seconds (or even 11!), which is quite a lot.

I guess I would ask: is squeue slow from the command line as well? Is it slow from the command line on the OOD web server? Maybe you can set up a simple bash script, something like the sketch below, to time squeue in a loop and see if you get lag there too. That could help us narrow it down.
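A minimal version (the flags mirror the squeue call in your log; squeue's output is discarded so only the timing prints):

while true; do
  date
  # We only care how long the call takes, not the job list itself;
  # `time` writes its report to stderr, so it still shows up.
  time squeue --all --states=all --noconvert > /dev/null
  sleep 60
done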

I'd also take a quick peek at the load average of the OOD web server itself, just to quickly rule it out.
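For example, with standard Linux tools run on the OOD web host:

uptime              # 1-, 5-, and 15-minute load averages at a glance
cat /proc/loadavg   # same averages, plus run-queue counts and the last PID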

Hi @jeff.ohrstrom
Here are some real times (in seconds) from the squeue command that OOD calls every few seconds:

/usr/bin/squeue --all --states=all --noconvert -o "%a,%A,%B,%c,%C,%d,%D,%e,%E,%f,%F,%g,%G,%h,%H,%i,%I,%j,%J,%k,%K,%l,%L,%m,%M,%n,%N,%o,%O,%q,%P,%Q,%r,%S,%t,%T,%u,%U,%v,%V,%w,%W,%x,%X,%y,%Y,%z,%Z,%b" -j 62756509 -M sherlock
I'll try to get the OOD server load averages soon.

NumSamples = 500; Min = 0.12; Max = 29.59

Mean = 3.387500; Variance = 25.797945; SD = 5.079168; Median = 1.157000

each ∎ represents a count of 4

    0.1170 -     3.0645 [   318]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    3.0645 -     6.0120 [    96]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    6.0120 -     8.9595 [    45]: ∎∎∎∎∎∎∎∎∎∎∎
    8.9595 -    11.9070 [    12]: ∎∎∎
   11.9070 -    14.8545 [     5]: ∎
   14.8545 -    17.8020 [     7]: ∎
   17.8020 -    20.7495 [     3]: 
   20.7495 -    23.6970 [     5]: ∎
   23.6970 -    26.6445 [     7]: ∎
   26.6445 -    29.5920 [     2]:
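For anyone wanting to gather a comparable sample, a loop along these lines will do it (the flags mirror the OOD call above; the sleep interval and output file name are just examples):

for i in $(seq 1 500); do
  # GNU time prints elapsed wall-clock seconds (%e) and appends (-a) to the file.
  /usr/bin/time -f "%e" -a -o squeue_times.txt \
    /usr/bin/squeue --all --states=all --noconvert -M sherlock > /dev/null
  sleep 5
done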

Thanks! Yeah, we're tied to the queue's performance pretty tightly. So I guess the takeaway here is for us to better handle the case when it's slow, and maybe to optimize the query.

I wonder if you have a large cluster, or if other Slurm users are experiencing similar issues.

Hi @jeff.ohrstrom

It's not huge, but we have many partitions since we use the condo model, so Slurm has to make a lot of decisions when allocating resources. Roughly:
1200 nodes
300-600 users running jobs at a given time
160 partitions

Can you make the time it waits on squeue configurable?

Thanks,
Mark

Hi, sorry for the late reply.

I don't think it's configurable, and indeed it's related to security anyhow (tokens are only good for so long, and so on).

But as we're going to run Slurm here, I imagine we'll get around to this eventually. Unfortunately, we just don't have the cycles right now to put into replicating the issue in a test environment.

As a workaround, you could use bin_overrides and write a simple script to put a timeout in front of the squeue command. Here I've used 10 seconds, which I just picked offhand from your distribution above; you may want to increase it.

Though this will exit 124 if it times out, it may be a better user experience to have the timeout, so you can retry instead of it crashing or hanging in some odd way.

#!/bin/bash
# Fail fast if Slurm is unresponsive: kill squeue after 10 seconds (exit 124).
# Call the real binary by full path so the wrapper can't invoke itself.
exec timeout 10 /usr/bin/squeue "$@"
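To wire it up, point bin_overrides at the wrapper in your cluster config. A sketch, assuming the wrapper is installed as /usr/local/bin/squeue_timeout (the install path and config file name below are just examples):

# /etc/ood/config/clusters.d/sherlock.yml
v2:
  job:
    adapter: "slurm"
    bin: "/usr/bin"
    bin_overrides:
      # OOD invokes this wrapper anywhere it would normally call squeue
      squeue: "/usr/local/bin/squeue_timeout"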

Hey! I believe we fixed Slurm timeouts like this in version 1.8.17. Thanks for reaching out!