Corrupt connection file

I’ve just done an initial install of OOD, and for the most part, I have it working.

However, when running an interactive desktop, I’m getting an error when
I click on the ‘My Interactive Sessions’ button on the dashboard.

I’m seeing an ‘Internal Server Error’ with text that starts with:

#<ActionView::Template::Error: (<unknown>): invalid trailing UTF-8 octet at line 1 column 1>

/opt/rh/rh-ruby22/root/usr/share/ruby/psych.rb:370:in `parse'
/opt/rh/rh-ruby22/root/usr/share/ruby/psych.rb:370:in `parse_stream'
/opt/rh/rh-ruby22/root/usr/share/ruby/psych.rb:318:in `parse'
/opt/rh/rh-ruby22/root/usr/share/ruby/psych.rb:284:in `safe_load'
/var/www/ood/apps/sys/dashboard/app/models/batch_connect/session.rb:396:in `connect'

… and many more lines.

In my PUN error log file, I’m seeing something similar:

App 9021 stdout: [2018-08-01 15:31:28 -0400 ]  INFO "execve =
[{\"SLURM_CONF\"=>\"/etc/slurm/slurm.conf\"}, \"/usr/bin/squeue\",
\"--all\", \"--states=all\", \"--noconvert\", \"-o\",
\"%a\\u001F%A\\u001F%b\\u001F%B\\u001F%c\\u001F%C\\u001F%d\\u001F%D\\u001F%e\\u001F%E\\u001F%f\\u001F%F\\u001F%g\\u001F%G\\u001F%h\\u001F%H\\u001F%i\\u001F%I\\u001F%j\\u001F%J\\u001F%k\\u001F%K\\u001F%l\\u001F%L\\u001F%m\\u001F%M\\u001F%n\\u001F%N\\u001F%o\\u001F%O\\u001F%q\\u001F%P\\u001F%Q\\u001F%r\\u001F%S\\u001F%t\\u001F%T\\u001F%u\\u001F%U\\u001F%v\\u001F%V\\u001F%w\\u001F%W\\u001F%x\\u001F%X\\u001F%y\\u001F%Y\\u001F%z\\u001F%Z\",
 \"-j\", \"190\"]"

App 9021 stdout: [2018-08-01 15:31:28 -0400 ]  INFO "method=GET
path=/pun/sys/dashboard/batch_connect/sessions format=html
controller=BatchConnect::SessionsController action=index status=500
error='ActionView::Template::Error: (<unknown>): invalid trailing
UTF-8 octet at line 1 column 1' duration=35.72 view=0.00"

App 9021 stdout: [2018-08-01 15:31:28 -0400 ] FATAL
"ActionView::Template::Error ((<unknown>): invalid trailing UTF-8
octet at line 1 column 1):\n    1: <%= session_panel session do %>\n
 2:   <%= session_view session do %>\n    3:     <%\n    4:       if
session.running?\n  app/models/batch_connect/session.rb:396:in
`connect'\n  app/models/batch_connect/session.rb:405:in `to_hash'\n
app/helpers/batch_connect/sessions_helper.rb:3:in `session_panel'\n
app/views/batch_connect/sessions/_panel.html.erb:1:in
`_app_views_batch_connect_sessions__panel_html_erb__2226473861888657584_36164260'\n
 app/views/batch_connect/sessions/index.html.erb:37:in
`_app_views_batch_connect_sessions_index_html_erb__2670672794125958996_36346460'"

Source: Originally posted by Kevin M. Hildebrand in the ood-users mailing list.

The error is caused by a corrupt connection file generated by one of your running interactive session jobs. When an Interactive Session starts running on an allocated cluster node, it attempts to create a connection.yml file in its staging directory. This file holds the information (e.g., host, port, and VNC password) the user needs to connect back to the session through the Dashboard. The Dashboard parses this file with Psych, the Ruby YAML parsing library, and generates a link that the user can use to connect to their running session. If the file contains bytes that are not valid UTF-8, Psych rejects it, which is the error in the stack trace above.
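To make the failure mode concrete, here is a minimal sketch of parsing a connection file's contents with Psych. It is not the Dashboard's actual code: the helper name parse_connection, the sample values, and the stray byte 0xE9 are all assumptions chosen for illustration.

```ruby
require "yaml"

# Hypothetical helper: parse the raw text of a connection.yml.
# Returns [:ok, hash] on success, or [:error, message] if Psych rejects the input.
def parse_connection(raw)
  [:ok, YAML.safe_load(raw)]
rescue Psych::SyntaxError => e
  [:error, e.message]
end

good = "host: compute-0-0\nport: 5902\npassword: wA2XcYm\n"

# 0xE9 is a made-up stray byte. It announces a multi-byte UTF-8 sequence,
# but the ASCII letter after it is not a valid continuation byte.
bad = "host: compute-0-0\nport: 5902\npassword: \xE9wA2XcYm\n"

p parse_connection(good)
p parse_connection(bad)
```

The second call should come back as an error rather than a hash, which is the situation the Dashboard runs into when it renders the session panel.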

Step 1: Find the connection.yml file

You can find these connection files under the staging directory of each running Interactive Session. I don’t have access to my data anymore, so I’m working from memory, but you can probably find the file under (give or take a few mistakes in the directory names):

~/ondemand/data/sys/dashboard/batch_connect/sys/bc_desktop/<cluster>/output/<unique_session_id>/connection.yml

or you can get fancy and do:

$ find ~/ondemand/data -name connection.yml

Step 2: Find and fix the bug

You will want to read the contents of those files and look for any odd characters that stand out. If you can track down the offending file, that will help narrow down the bug and point to a possible fix.
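If there are many sessions, checking each file by eye gets tedious. The sketch below lists the connection.yml files whose bytes are not valid UTF-8. The helper name corrupt_connection_files is hypothetical, and the ~/ondemand/data root is the same from-memory guess as above, so adjust it to your site.

```ruby
# Hypothetical helper: return the connection.yml files under `root`
# whose raw bytes do not form valid UTF-8.
def corrupt_connection_files(root)
  Dir.glob(File.join(root, "**", "connection.yml")).reject do |path|
    # Read the raw bytes, then ask Ruby whether they are well-formed UTF-8.
    File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
  end
end

# Assumed data root; falls back to the current directory if HOME is unset.
data_root = File.join(ENV.fetch("HOME", "."), "ondemand", "data")
puts corrupt_connection_files(data_root)
```

Any path it prints is a candidate for the file that breaks the ‘My Interactive Sessions’ page.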

Take the following connection.yml file as an example.

host: compute-0-0.xxx.yyy.zzz
port: 5902
password: �wA2XcYm
display: 2
websocket: 26532
spassword: �s�r�oym

It looks like whatever’s generating the passwords for VNC and Websockets is putting non-printable characters in them (which look like they’re being interpreted as UTF-8, even though they’re not valid UTF-8 characters).
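The � marks are what a viewer typically shows for bytes it cannot decode as UTF-8. As a rough illustration, the sketch below takes a password containing one stray byte and checks it under two encodings. The byte 0xE9 is an assumption (the real bytes in the file above are not recoverable from the post), and both helper names are hypothetical.

```ruby
# Hypothetical helper: are these raw bytes well-formed in the given encoding?
def valid_in?(bytes, encoding)
  bytes.dup.force_encoding(encoding).valid_encoding?
end

# Hypothetical helper: render bytes the way a UTF-8 viewer would,
# replacing undecodable bytes with U+FFFD (the replacement character).
def as_viewer_shows(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

raw = "\xE9wA2XcYm".b  # 0xE9 is a made-up stray byte

puts valid_in?(raw, Encoding::ISO_8859_1)  # true: every byte is a character in ISO-8859-1
puts valid_in?(raw, Encoding::UTF_8)       # false: 0xE9 must be followed by continuation bytes
puts as_viewer_shows(raw)                  # replacement character followed by wA2XcYm
```

The same bytes can be a perfectly legal string in a single-byte encoding and still be rejected by a parser that expects UTF-8.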

Ariel’s comment explains in detail the steps to debug the problem.

This problem occurs when the connection file generated by one of the running interactive sessions is corrupt because it contains non-printable characters that are not valid UTF-8. In this particular instance, the locale on the compute node was set to ‘en_US’ instead of ‘en_US.UTF-8’. Since our code was using /dev/urandom to generate the password stored in the connection file, some non-UTF-8 characters were getting into the file. We changed the code that generates passwords (see https://github.com/OSC/ood_core/issues/91), so hopefully this problem will be solved in OnDemand 1.4.
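One way to make password generation independent of the node's locale is to draw from a fixed ASCII alphabet. The sketch below shows that idea only. It is not the change made in ood_core, and the helper name ascii_password and the default length of 8 are assumptions.

```ruby
require "securerandom"

# Fixed alphabet of ASCII letters and digits, unaffected by locale settings.
ASCII_ALPHABET = [*"a".."z", *"A".."Z", *"0".."9"].freeze

# Hypothetical helper (not the ood_core implementation): build a password
# by picking random entries from ASCII_ALPHABET.
def ascii_password(length = 8)
  Array.new(length) { ASCII_ALPHABET[SecureRandom.random_number(ASCII_ALPHABET.size)] }.join
end

pw = ascii_password
puts pw
puts pw.ascii_only?  # true: every character comes from ASCII_ALPHABET
```

Because every character is plain ASCII, the resulting connection.yml is valid UTF-8 whatever locale the compute node uses.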