Occasional random(?) PUN error.log blowups with millions of lines of WebSocket errors

Once or twice a month, /var/log suddenly fills up on our OOD v2.0.32 host (rhel8.7).
I’m hoping someone can give me a path to debugging this, or at least understanding what’s going on.

Upon investigating any given blow-up, I’ve been finding a single /var/log/ondemand-nginx/USER/error.log with (usually) millions of consecutive lines reading something like:

App 123456 output: Send error: WebSocket is not open: readyState 3 (CLOSED)

In one case in June, a user’s error.log was almost 16M lines long… and all but 4000 of those lines were the websocket error.

## one extreme case
$ grep -c 'Send error: WebSocket is not open' error.log
15793572

$ grep -vc 'Send error: WebSocket is not open' error.log
3897

## different user with simpler uniq|grep output
$ uniq -c error.log | grep -C10 'Send error: WebSocket is not open'
       1 App 1143850 output: [2023-06-19 23:15:35 -0700 ]  INFO "[ActiveJob] [TransferLocalJob] [eb8d57ed-ab21-4098-bb47-cd57d76a734e] Performed TransferLocalJob (Job ID: eb8d57ed-ab21-4098-bb47-cd57d76a734e) from Async(default) in 5382.58ms"
       1 App 1143850 output: [2023-06-19 23:15:36 -0700 ]  INFO "method=GET path=/pun/sys/dashboard/transfers/t71e347d3-a9d5-4dbc-9e85-992fdea7154b.json format=json controller=TransfersController action=show status=200 duration=17.48 view=1.09"
       1 App 1143850 output: [2023-06-19 23:15:36 -0700 ]  INFO "method=GET path=/pun/sys/dashboard/files/fs/scratch/NAU_USER4/genome format=html controller=FilesController action=fs status=200 duration=39.24 view=6.33"
       1 App 1143850 output: [2023-06-19 23:15:52 -0700 ]  INFO "method=GET path=/pun/sys/dashboard/files/fs/scratch format=html controller=FilesController action=fs status=200 duration=1510.95 view=767.86"
       1 App 1143850 output: [2023-06-19 23:15:55 -0700 ]  INFO "method=GET path=/pun/sys/dashboard/files/fs/scratch/NAU_USER4 format=html controller=FilesController action=fs status=200 duration=42.55 view=12.43"
       1 App 916256 output: Closed terminal: 1113626
       1 App 916256 output: Connection established
       1 App 916256 output: Opened terminal: 1146627
       1 App 916256 output: Send error: write EPIPE
       1 App 916256 output: Closed terminal: 1146627
   26148 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
       1 App 916256 output: Connection established
       1 App 916256 output: Opened terminal: 1148401
  782733 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
       1 [ N 2023-06-19 23:25:23.9963 869511/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 1143850, application /var/www/ood/apps/sys/dashboard (production)
11459433 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
       1 App 916256 output: Closed terminal: 916325
 3525258 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
       1 App 916256 output: Closed terminal: 1112699
       1 App 916256 output: Closed terminal: 1036162
       1 App 916256 output: Closed terminal: 1064723
       1 App 916256 output: Closed terminal: 1148401
       1 App 916256 output: Closed terminal: 1073691
       1 App 916256 output: Closed terminal: 1076412
       1 [ N 2023-06-20 01:08:19.3749 869511/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 916256, application /var/www/ood/apps/sys/shell (production)
       1
       1 [ N 2023-06-20 02:00:03.4952 869511/T8 age/Cor/CoreMain.cpp:671 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)
       1 [ N 2023-06-20 02:00:03.4953 869511/T1 age/Cor/CoreMain.cpp:1246 ]: Received command to shutdown gracefully. Waiting until all clients have disconnected...

The issue is definitely not constant, nor widespread. When I look at rotated older logs from an affected user, grepping for the same error shows it can sometimes occur just a handful of times:

## USER2
$ zgrep -c 'Send error: WebSocket is not open' error.log* | column -ts':'
error.log              18637659
error.log-20230618.gz  3
error.log-20230625.gz  467938
error.log-20230702.gz  150172
error.log-20230709.gz  23


## USER3
$ zgrep -c 'Send error: WebSocket is not open' error.log* | column -ts':'
error.log              14356374
error.log-20221006.gz  0
error.log-20230120.gz  203
error.log-20230129.gz  0
error.log-20230319.gz  0

## USER4
$ zgrep -c 'Send error: WebSocket is not open' error.log* | column -ts':'
error.log              17156556
error.log-20230702.gz  0
error.log-20230709.gz  8
error.log-20230716.gz  0
error.log-20230723.gz  4

Anyone have any ideas what’s going on? Or a shortlist of things to check into?

I assume it’s driven by user behavior. Specifically, they have a shell session open and it disconnects, perhaps erroneously. I don’t think there’s any thing we can do really WRT to the message and it being created.

That said - you may be able to be more aggressive about rotating logs. These appear to rotate based on the date. I just filed this ticket upstream to rotate by size. It seems like you can enable that on your side to just overwrite the logrotate file that we ship.

1 Like

I wanted to give some more solid examples after reading treydock’s response to that issue.
I personally will be implementing MAXSIZE verbiage to remediate our issues here, but (if this blown-up-log issue isn’t just local to us) even hourly rotation may not be sufficient to prevent a log fs from filling up-- many of the duplicate lines accrue extremely rapidly.

In many of the selected cases below, hundreds-of-thousands/millions of lines are produced in 2-5 minutes. (The true rapidity is probably faster than that, because the below parsing method is pretty basic and will fail to group the [many] messages that get broken over multiple lines, for whatever reason.)

Anonymized logs are available for the time being via my “nau pun log blowups” globus collection.

## de-duplicate (and count) lines (mostly "App # output" lines)
$ for i in `ls <error.log files>`; do uniq -c $i; done > uniq_x4_users.txt

## select 4-digit+ counts and pre/post timestamps
$ awk '
  $0~/2023/{PREVTIME=$0};
  $1>999{print PREVTIME;print;NEEDNEXT=1;}; 
  $0~/2023/ && NEEDNEXT==1{print;NEEDNEXT=0}
' uniq_x4_users.txt > bracketed_x4_users.txt

## select just 6-digit+ counts, and timestamps
$ grep -C1 -P '^ *\d{6,}' bracketed_x4_users.txt | cut -c1-90
      1 App 1745892 output: [2023-07-14 22:38:11 -0700 ]  INFO "method=PUT path=/pun/sys/d
 115335 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 [ N 2023-07-14 22:38:55.3943 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
      1 [ N 2023-07-14 22:38:55.3943 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
 100113 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1757591 output: [2023-07-14 22:39:10 -0700 ] DEBUG "\e[1m\e[36mWorkflow Load (
--
      1 App 1745892 output: [2023-07-14 22:39:25 -0700 ]  INFO "method=GET path=/pun/sys/d
 274026 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1757591 output: [2023-07-14 22:40:16 -0700 ] DEBUG "\e[1m\e[36mWorkflow Load (
      1 App 1745892 output: [2023-07-14 22:40:17 -0700 ]  INFO "method=GET path=/pun/sys/d
 266388 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1757591 output: [2023-07-14 22:41:20 -0700 ] DEBUG "\e[1m\e[36mWorkflow Load (
--
      1 App 1757591 output: [2023-07-14 22:41:25 -0700 ]  INFO "method=GET path=/pun/sys/m
 375027 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1745892 output: [2023-07-14 22:42:34 -0700 ]  INFO "method=PUT path=/pun/sys/d
--
      1 App 1745892 output: [2023-07-14 22:42:48 -0700 ]  INFO "method=GET path=/pun/sys/d
 177342 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1745892 output: [2023-07-14 22:43:27 -0700 ]  INFO "method=GET path=/pun/sys/d
--
      1 App 1745892 output: [2023-07-14 22:43:42 -0700 ]  INFO "method=GET path=/pun/sys/d
 135778 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1745892 output: [2023-07-14 22:44:16 -0700 ]  INFO "files: initiating asynchro
--
      1 App 1757591 output: [2023-07-14 22:44:26 -0700 ]  INFO "method=GET path=/pun/sys/m
 159374 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1745892 output: [2023-07-14 22:44:59 -0700 ]  INFO "method=PUT path=/pun/sys/d
--
      1 App 1757591 output: [2023-07-14 22:46:04 -0700 ]  INFO "method=GET path=/pun/sys/m
1225388 App 1669740 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 [ N 2023-07-14 22:48:39.7113 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
      1 App 1770504 output: [2023-07-14 23:03:25 -0700 ]  INFO "method=GET path=/pun/sys/d
1344728 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1770504 output: [2023-07-14 23:07:46 -0700 ]  INFO "method=GET path=/pun/sys/d
--
      1 App 1770504 output: [2023-07-14 23:08:05 -0700 ]  INFO "method=GET path=/pun/sys/d
 103089 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1770504 output: [2023-07-14 23:08:18 -0700 ]  INFO "method=GET path=/pun/sys/d
--
      1 App 1770504 output: [2023-07-14 23:08:18 -0700 ]  INFO "method=GET path=/pun/sys/d
 256814 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1770452 output: [2023-07-14 23:08:49 -0700 ] DEBUG "\e[1m\e[36mWorkflow Load (
--
      1 App 1770452 output: [2023-07-14 23:08:59 -0700 ]  INFO "method=GET path=/pun/sys/m
 241976 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1770452 output: [2023-07-14 23:09:41 -0700 ] DEBUG "\e[1m\e[36mWorkflow Load (
--
      1 App 1770452 output: [2023-07-14 23:09:53 -0700 ]  INFO "method=GET path=/pun/sys/m
 452961 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1770504 output: [2023-07-14 23:11:13 -0700 ]  INFO "method=GET path=/pun/sys/d
--
      1 App 1770504 output: [2023-07-14 23:11:28 -0700 ]  INFO "method=GET path=/pun/sys/d
1051467 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1770504 output: [2023-07-14 23:14:59 -0700 ]  INFO "method=GET path=/pun/sys/d
--
      1 App 1770504 output: [2023-07-14 23:15:25 -0700 ]  INFO "method=GET path=/pun/sys/d
1000354 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 [ N 2023-07-14 23:19:03.3091 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
      1 [ N 2023-07-14 23:19:03.3091 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
 226751 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 [ N 2023-07-14 23:21:54.1875 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
--
      1 App 1792841 output: [2023-07-14 23:22:04 -0700 ]  INFO "method=GET path=/pun/sys/m
 148060 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1793748 output: [2023-07-14 23:22:50 -0700 ]  INFO "method=GET path=/pun/sys/d
      1 App 1793748 output: [2023-07-14 23:22:51 -0700 ]  INFO "method=GET path=/pun/sys/d
 756740 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1792841 output: [2023-07-14 23:25:53 -0700 ] DEBUG "\e[1m\e[36mWorkflow Load (
--
      1 App 1793748 output: [2023-07-14 23:25:59 -0700 ]  INFO "method=GET path=/pun/sys/d
1608793 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 [ N 2023-07-14 23:32:50.5516 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
      1 [ N 2023-07-14 23:32:50.5523 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet
6701370 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 App 1822441 output: [2023-07-15 00:02:10 -0700 ]  INFO "method=GET path=/pun/sys/d
--
      1 App 1822441 output: [2023-07-15 00:02:34 -0700 ]  INFO "method=GET path=/pun/sys/d
 631019 App 1770764 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
      1 [ N 2023-07-15 00:12:27.5060 1669542/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whet

Yea, the other side of this is to actually figure out what’s going on on the server. So we should likely do that as well to see if there’s anything serverside we can do so that it doesn’t spew so many logs all the time.

This topic was automatically closed 180 days after the last reply. New replies are no longer allowed.