Table jobs not found: Job Composer Error

Hello everybody.

Today we are trying to install OOD on another cluster and we are running into some issues with the Job Composer application.

We get an error message, and the logs show a Ruby process trying to query a table called “jobs”.

We are not sure whether this table is related to the Slurm head node in any way or whether we are doing something wrong; this is the first time we have seen this type of error. The log looks like this:

INFO “[323363b8-eccd-4db0-8f11-0685ff932aa2] method=GET path=/pun/sys/myjobs/ format=html controller=WorkflowsController action=index status=500 error=‘ActiveRecord::StatementInvalid: Could not find table ‘jobs’’ duration=70.64 view=0.00 db=8.93”
App 396321 output: [2023-12-15 12:26:35 +0000 ] FATAL “[323363b8-eccd-4db0-8f11-0685ff932aa2] \n[323363b8-eccd-4db0-8f11-0685ff932aa2] ActiveRecord::StatementInvalid (Could not find table ‘jobs’):\n[323363b8-eccd-4db0-8f11-0685ff932aa2] \n[323363b8-eccd-4db0-8f11-0685ff932aa2] app/models/workflow.rb:16:in block in <class:Workflow>'\n[323363b8-eccd-4db0-8f11-0685ff932aa2] app/controllers/workflows_controller.rb:225:in update_jobs’”

Thank you for your help!

This can happen if the database doesn’t initialize correctly because you navigated to the URL directly. It’s not safe to navigate to the URL directly - you need to go through the navigation bar’s link.

To fix the issue, you need to remove the database ~/ondemand/data/sys/myjobs/production.sqlite3 then navigate to myjobs through the navigation bar’s link.

Well, we are using the menu items as you say, and we checked with one of our users that “production.sqlite3” is created but is empty.

To check whether the file was corrupt or similar, we deleted it for this user. When we tried to access the app from the nav bar again, we had the same problem and the file was created empty again.
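For anyone who wants to confirm what "created but empty" means here, the following is a minimal sketch that lists the tables inside an SQLite file. It uses Python's standard sqlite3 module rather than anything shipped with OnDemand, and the function name list_tables is made up for illustration. A zero-byte file should come back with no tables at all, while a healthy Job Composer database would be expected to include "jobs".

```python
import os
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the table names found in an SQLite file.

    A zero-byte file is a valid, empty SQLite database, so it yields [].
    """
    if not os.path.exists(db_path):
        raise FileNotFoundError(db_path)
    # Open read-only via a file: URI so the check never creates or edits the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

It could be called as list_tables(os.path.expanduser("~/ondemand/data/sys/myjobs/production.sqlite3")), run as the affected user.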

In this case we had to install OOD from the RPMs in a more manual way (due to proxy restrictions), so I don’t know whether something is wrong with the installation on this server, or whether you have any idea what might be going on here.

Once again thanks for your time Jeff.
Christian

Actually I just tried by removing the file then touching it and it just worked. Now that I’m thinking about it, I’m sure I patched this at some point. What version are you running? I just tried on 3.0.3. I wonder if you have to restart your webserver to get rid of any stale file handles?

By ‘restart your webserver’ I mean just restarting the users’ PUN from the help menu, not the whole server.

We installed version 3.0.3, but because of this bug we tried downgrading to 3.0.1, since we saw that 3.0.2 contained security fixes (we chose this version to rule out versioning problems).

Not sure about must, but I’d try that at least.

I’d have to check the changelog, but myjobs doesn’t get a lot of updates, and even without checking I could say with 90% confidence that nothing in myjobs got updated in 3.0.1 to 3.0.3.

I think so, because we have done other installations with version 3.0.1 and this problem does not exist.

So I’m afraid we missed something in the installation process because of how we had to do it (manually, I mean).

Comparing with the same file, “~/ondemand/data/sys/myjobs/production.sqlite3”, from previous installations (other clusters), it seems that on this cluster the Job Composer is not creating the file the way it should (and leaves it blank).

From the full changelog it appears the only change is that we upgraded Rails.

Navigating from the navigation bar is important here because this link will run this file that sets up the database. I was able to replicate your issue by removing the database file, bouncing the webserver and navigating directly to /pun/sys/myjobs.

You cannot navigate directly to /pun/sys/myjobs – You have to navigate through the link in the dashboard. After I replicated the issue I

  • removed the database file
  • bounced my PUN (restart the webserver in the help menu)
  • landed on /pun/sys/dashboard and clicked the link in the navigation bar

and it came back. I was able to replicate by navigating directly to /pun/sys/myjobs – this is what’s causing your errors.
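The first step in that list can be scripted. The sketch below is only an illustration: instead of deleting the database it renames it with a timestamp suffix, which is a swapped-in variation so the old file can still be inspected afterwards. The helper name move_database_aside is hypothetical. Restarting the PUN and clicking through the navigation bar still have to be done in the browser.

```python
import os
import time


def move_database_aside(db_path):
    """Rename the database file so the app has to recreate it.

    Returns the backup path, or None if there was no file to move.
    """
    if not os.path.exists(db_path):
        return None
    backup = f"{db_path}.bak-{time.strftime('%Y%m%d-%H%M%S')}"
    os.rename(db_path, backup)
    return backup
```

For the path discussed in this thread, that would be move_database_aside(os.path.expanduser("~/ondemand/data/sys/myjobs/production.sqlite3")), run as the affected user.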

Thank you Jeff. We are finishing work for the day, so next Monday I will check this with my coworkers and let you know the results.
Thanks once again for all the help.

No problem. Like I said, I was able to replicate it with the steps above on 3.0.3, so I’m sure that’s what it is.

Also - you should definitely upgrade to 3.0.3. This pull request below fixed something extremely dangerous. Startup times end up being very large (60 seconds or more) because of it. So it’s actually a really big deal, because it can cause outages fairly easily.

Hello again Jeff, I hope you had a good weekend.

Coming back to this problem: we tried your solution, but that doesn’t seem to be the cause. We replicated your process and nothing changed. At this point I realised something, and I want to ask: what permissions do the “ondemand/data/sys/myjobs/production.sqlite3” files need? I suspect the accounts we were given are having problems on the shared file systems, and that this may be interfering with our attempts to launch the Job Composer (we’ve had to change permissions on paths like /etc/ood because these accounts set incorrect ones when we edit files, for example ood_portal.yml).

That is unfortunate!

Ownership for sure, and probably RW.
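A quick way to check both points (ownership and read/write) is sketched below. It only looks at the classic owner permission bits, so ACLs or NFS-side restrictions would not show up; treat it as a first look, not a full answer. The function name check_owner_rw is made up for this example.

```python
import os
import stat


def check_owner_rw(path, uid=None):
    """Report whether `path` is owned by `uid` and is owner-readable and -writable.

    `uid` defaults to the user running the script.
    """
    uid = os.getuid() if uid is None else uid
    st = os.stat(path)
    return {
        "owned_by_user": st.st_uid == uid,
        "owner_can_read": bool(st.st_mode & stat.S_IRUSR),
        "owner_can_write": bool(st.st_mode & stat.S_IWUSR),
        "mode": stat.filemode(st.st_mode),  # same notation as `ls -l`
    }
```

Run as the affected user against the production.sqlite3 file and against its parent directory, all three booleans should be True.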

That sounds strange - but one thing that comes to mind when working with sqlite3 on NFS is locking (or rather, mounting the NFS share with local locks disabled). We mount our HOME drives with local_lock=none.

I wonder if there’s any error in /var/log/ondemand-nginx/$USER complaining about locks?
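Here is a small sketch for scanning a log for lock complaints. The pattern is an assumption about what such messages might contain (for example SQLite's standard "database is locked" text); it matches "lock", "locks", "locked" and "locking" as whole words, so the "block in <class:Workflow>" lines from the stack trace are not reported as hits.

```python
import re

# Whole-word match, so "block" in a Ruby backtrace does not count.
LOCK_PATTERN = re.compile(r"\block(?:s|ed|ing)?\b", re.IGNORECASE)


def find_lock_messages(lines):
    """Return the lines that mention locks, with trailing newlines removed."""
    return [line.rstrip("\n") for line in lines if LOCK_PATTERN.search(line)]
```

Assuming the per-user log is a file named error.log under /var/log/ondemand-nginx/$USER, it could be used as: with open(log_path) as f: hits = find_lock_messages(f).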

This is my log. I just logged into the OOD portal and tried to launch the Job Composer. I didn’t see anything about locks, but I’m posting it here so you can check whether anything looks weird. (The launch time is 15:31, but I included earlier entries just in case.)

[ E 2023-12-18 15:22:07.0297 585672/T1a age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/ood/apps/sys/shell: The application process exited prematurely.
Error ID: 5707f621
Error details saved to: /tmp/passenger-error-tvmMLd.html

[ E 2023-12-18 15:22:07.0370 585672/Te age/Cor/Con/CheckoutSession.cpp:283 ]: [Client 4-4] Cannot checkout session because a spawning error occurred. The identifier of the error is 5707f621. Please see earlier logs for details about the error.
[ N 2023-12-18 15:22:49.1297 585672/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 586288, application /var/www/ood/apps/sys/myjobs (production)
App 586723 output: innerError Error: Cannot find module ‘…/build/Debug/pty.node’
App 586723 output: Require stack:
App 586723 output: - /var/www/ood/apps/sys/shell/node_modules/node-pty/lib/unixTerminal.js
App 586723 output: - /var/www/ood/apps/sys/shell/node_modules/node-pty/lib/index.js
App 586723 output: - /var/www/ood/apps/sys/shell/app.js
App 586723 output: - /opt/ood/ondemand/root/usr/share/passenger/helper-scripts/node-loader.js
App 586723 output: at Function.Module._resolveFilename (internal/modules/cjs/loader.js:931:15)
App 586723 output: at Function.Module._load (internal/modules/cjs/loader.js:774:27)
App 586723 output: at Module.require (internal/modules/cjs/loader.js:1003:19)
App 586723 output: at Module.require (/opt/ood/ondemand/root/usr/share/passenger/helper-scripts/node-loader.js:80:25)
App 586723 output: at require (internal/modules/cjs/helpers.js:107:18)
App 586723 output: at Object. (/var/www/ood/apps/sys/shell/node_modules/node-pty/lib/unixTerminal.js:30:15)
App 586723 output: at Module._compile (internal/modules/cjs/loader.js:1114:14)
App 586723 output: at Object.Module._extensions…js (internal/modules/cjs/loader.js:1143:10)
App 586723 output: at Module.load (internal/modules/cjs/loader.js:979:32)
App 586723 output: at Function.Module._load (internal/modules/cjs/loader.js:819:12) {
App 586723 output: code: ‘MODULE_NOT_FOUND’,
App 586723 output: requireStack: [
App 586723 output: ‘/var/www/ood/apps/sys/shell/node_modules/node-pty/lib/unixTerminal.js’,
App 586723 output: ‘/var/www/ood/apps/sys/shell/node_modules/node-pty/lib/index.js’,
App 586723 output: ‘/var/www/ood/apps/sys/shell/app.js’,
App 586723 output: ‘/opt/ood/ondemand/root/usr/share/passenger/helper-scripts/node-loader.js’
App 586723 output: ]
App 586723 output: }
App 586723 output: /var/www/ood/apps/sys/shell/node_modules/node-pty/lib/unixTerminal.js:35
App 586723 output: throw outerError;
App 586723 output: ^
App 586723 output:
App 586723 output: Error: /var/www/ood/apps/sys/shell/node_modules/node-pty/build/Release/pty.node: failed to map segment from shared object
App 586723 output: at Object.Module._extensions…node (internal/modules/cjs/loader.js:1173:18)
App 586723 output: at Module.load (internal/modules/cjs/loader.js:979:32)
App 586723 output: at Function.Module._load (internal/modules/cjs/loader.js:819:12)
App 586723 output: at Module.require (internal/modules/cjs/loader.js:1003:19)
App 586723 output: at Module.require (/opt/ood/ondemand/root/usr/share/passenger/helper-scripts/node-loader.js:80:25)
App 586723 output: at require (internal/modules/cjs/helpers.js:107:18)
App 586723 output: at Object. (/var/www/ood/apps/sys/shell/node_modules/node-pty/lib/unixTerminal.js:26:11)
App 586723 output: at Module._compile (internal/modules/cjs/loader.js:1114:14)
App 586723 output: at Object.Module._extensions…js (internal/modules/cjs/loader.js:1143:10)
App 586723 output: at Module.load (internal/modules/cjs/loader.js:979:32) {
App 586723 output: code: ‘ERR_DLOPEN_FAILED’
App 586723 output: }
[ E 2023-12-18 15:23:58.0321 585672/T1g age/Cor/App/Implementation.cpp:221 ]: Could not spawn process for application /var/www/ood/apps/sys/shell: The application process exited prematurely.
Error ID: 5017903f
Error details saved to: /tmp/passenger-error-V7RPKW.html

[ E 2023-12-18 15:23:58.0401 585672/T8 age/Cor/Con/CheckoutSession.cpp:283 ]: [Client 1-5] Cannot checkout session because a spawning error occurred. The identifier of the error is 5017903f. Please see earlier logs for details about the error.
[ N 2023-12-18 15:26:00.1059 585672/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 586088, application /var/www/ood/apps/sys/dashboard (production)
App 586896 output: [2023-12-18 15:31:01 +0000 ] WARN “Error opening MOTD at \nException: bad URI(is not URI?): nil”
App 586896 output: [2023-12-18 15:31:01 +0000 ] INFO “method=GET path=/pun/sys/dashboard/ format=html controller=DashboardController action=index status=200 duration=117.20 view=24.16”
App 586896 output: [2023-12-18 15:31:05 +0000 ] INFO “method=GET path=/pun/sys/dashboard/apps/show/myjobs format=html controller=AppsController action=show status=302 duration=12.97 view=0.00 location=http://txxxxxxxxxxxxxxxxxx/pun/sys/myjobs
App 586937 output: [2023-12-18 15:31:07 +0000 ] DEBUG “[ffe1ddfd-3dd5-4a5d-8a57-eb6c54756c1e] \e[1m\e[35m (1.4ms)\e[0m \e[1m\e[34mSELECT sqlite_version(*)\e[0m”
App 586937 output: [2023-12-18 15:31:07 +0000 ] INFO “[ffe1ddfd-3dd5-4a5d-8a57-eb6c54756c1e] method=GET path=/pun/sys/myjobs/ format=html controller=WorkflowsController action=index status=500 error=‘ActiveRecord::StatementInvalid: Could not find table ‘jobs’’ duration=85.84 view=0.00 db=20.78”
App 586937 output: [2023-12-18 15:31:07 +0000 ] FATAL “[ffe1ddfd-3dd5-4a5d-8a57-eb6c54756c1e] \n[ffe1ddfd-3dd5-4a5d-8a57-eb6c54756c1e] ActiveRecord::StatementInvalid (Could not find table ‘jobs’):\n[ffe1ddfd-3dd5-4a5d-8a57-eb6c54756c1e] \n[ffe1ddfd-3dd5-4a5d-8a57-eb6c54756c1e] app/models/workflow.rb:16:in block in <class:Workflow>'\n[ffe1ddfd-3dd5-4a5d-8a57-eb6c54756c1e] app/controllers/workflows_controller.rb:225:in update_jobs’”

OK - how about how you mount the NFS? Do you disable local locks?

The NFS system is provided by the customer, so we have no in-depth knowledge of it.

You can issue mount -l | grep <your directory> to see.

But even so, you provide mount options on your server when your machine mounts it, so I think you should be in control of it.
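If you would rather check the mount options from a script than read the mount output by eye, here is a minimal sketch that parses /proc/mounts-style text and finds the mount that covers a given path. It assumes a Linux client, and the function names are made up for this example. It does not decode escaped characters (such as \040 for spaces) in mount points.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type, options) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        entries.append((mount_point, fs_type, options.split(",")))
    return entries


def mount_for_path(path, entries):
    """Return the entry whose mount point is the longest prefix of `path`."""
    path = os.path.abspath(path)
    best = None
    for entry in entries:
        mount_point = entry[0]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later mount on the same point win, as it shadows the earlier one.
            if best is None or len(mount_point) >= len(best[0]):
                best = entry
    return best
```

On the web node this could be fed with open("/proc/mounts").read() and the user's home directory; the interesting part of the result is whether the options list contains an entry starting with local_lock=.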

We really are in a situation where we have no power to touch the NFS; what we have to do is identify the error and see what is going on. I don’t know whether this problem is due to the shared file system or something else. Can you think of anything for this situation?

NFS file locking is the only thing that comes to mind as what could be causing this issue. It would appear that you can create the file (that suggests you have read & write permissions to that location) but are unable to actually issue SQL queries against it.
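One way to test that theory without OnDemand in the picture is to try a tiny create/write/read cycle with SQLite directly in the suspect directory. The sketch below uses Python's built-in sqlite3 module, which is an assumption on my part: the Job Composer goes through Rails and its own SQLite library, so a pass here does not prove the app will work, but a failure here would point strongly at the file system. The function name probe_sqlite is hypothetical.

```python
import os
import sqlite3


def probe_sqlite(directory):
    """Create, write and read a throwaway SQLite database in `directory`.

    Returns None on success, or a string describing the SQLite error.
    """
    db_path = os.path.join(directory, f"sqlite_probe_{os.getpid()}.sqlite3")
    try:
        conn = sqlite3.connect(db_path, timeout=5)
        try:
            # Writing forces SQLite to take file locks on the database.
            conn.execute("CREATE TABLE probe (id INTEGER PRIMARY KEY, note TEXT)")
            conn.execute("INSERT INTO probe (note) VALUES (?)", ("ok",))
            conn.commit()
            conn.execute("SELECT note FROM probe").fetchall()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return f"{type(exc).__name__}: {exc}"
    finally:
        # Clean up the probe file and any leftover rollback journal.
        for suffix in ("", "-journal"):
            try:
                os.remove(db_path + suffix)
            except FileNotFoundError:
                pass
    return None
```

Running probe_sqlite(os.path.expanduser("~/ondemand/data/sys/myjobs")) as the affected user, and comparing with probe_sqlite("/tmp") on the same node, would show whether the shared file system behaves differently from local disk.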

Searching the internet for sqlite3 nfs lock will show that using sqlite3 on NFS storage (as OnDemand does) has some caveats.

If you go through the flow that I’ve provided (removing the file, bouncing the server, then navigating through the navigation bar) you should see something relevant in your error logs (should being the operative word here). It seems like it’s unable to create the database (in the setup-production file linked above), and so it should log why that failed.

I’ll double check to confirm that we catch & log any errors when we run this script.