I have a group of users who noticed that when they launch interactive jobs and the request times out (e.g., over a flaky long-distance connection), they can end up in a weird state: no job card appears on the interactive jobs page, but the job is actually running. This particular group has learned to work around it by manually constructing an appropriate URL to reach their running job.
However, I feel that is really bad, because users who don't know this is a possible outcome will assume the job didn't launch and submit a second one, while the first job continues to burn through their CPU allocation units and take up space on the cluster. Imagine, for example, a machine-learning user's interactive Jupyter notebook requesting multiple GPUs and effectively taking an entire node offline; the user later wonders where all their CPU allocation units are going.
If this is not a known issue, I would like to submit a bug report. I assume I would need to provide additional technical details for reproduction.