Once or twice a month, /var/log suddenly fills up on our OOD v2.0.32 host (RHEL 8.7).
I’m hoping someone can give me a path to debugging this, or at least understanding what’s going on.
Upon investigating any given blow-up, I’ve been finding a single /var/log/ondemand-nginx/USER/error.log with (usually) millions of consecutive lines reading something like:
App 123456 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
In one case in June, a user’s error.log was almost 16M lines long… and all but 4000 of those lines were the websocket error.
## one extreme case
$ grep -c 'Send error: WebSocket is not open' error.log
15793572
$ grep -vc 'Send error: WebSocket is not open' error.log
3897
## different user with simpler uniq|grep output
$ uniq -c error.log | grep -C10 'Send error: WebSocket is not open'
1 App 1143850 output: [2023-06-19 23:15:35 -0700 ] INFO "[ActiveJob] [TransferLocalJob] [eb8d57ed-ab21-4098-bb47-cd57d76a734e] Performed TransferLocalJob (Job ID: eb8d57ed-ab21-4098-bb47-cd57d76a734e) from Async(default) in 5382.58ms"
1 App 1143850 output: [2023-06-19 23:15:36 -0700 ] INFO "method=GET path=/pun/sys/dashboard/transfers/t71e347d3-a9d5-4dbc-9e85-992fdea7154b.json format=json controller=TransfersController action=show status=200 duration=17.48 view=1.09"
1 App 1143850 output: [2023-06-19 23:15:36 -0700 ] INFO "method=GET path=/pun/sys/dashboard/files/fs/scratch/NAU_USER4/genome format=html controller=FilesController action=fs status=200 duration=39.24 view=6.33"
1 App 1143850 output: [2023-06-19 23:15:52 -0700 ] INFO "method=GET path=/pun/sys/dashboard/files/fs/scratch format=html controller=FilesController action=fs status=200 duration=1510.95 view=767.86"
1 App 1143850 output: [2023-06-19 23:15:55 -0700 ] INFO "method=GET path=/pun/sys/dashboard/files/fs/scratch/NAU_USER4 format=html controller=FilesController action=fs status=200 duration=42.55 view=12.43"
1 App 916256 output: Closed terminal: 1113626
1 App 916256 output: Connection established
1 App 916256 output: Opened terminal: 1146627
1 App 916256 output: Send error: write EPIPE
1 App 916256 output: Closed terminal: 1146627
26148 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
1 App 916256 output: Connection established
1 App 916256 output: Opened terminal: 1148401
782733 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
1 [ N 2023-06-19 23:25:23.9963 869511/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 1143850, application /var/www/ood/apps/sys/dashboard (production)
11459433 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
1 App 916256 output: Closed terminal: 916325
3525258 App 916256 output: Send error: WebSocket is not open: readyState 3 (CLOSED)
1 App 916256 output: Closed terminal: 1112699
1 App 916256 output: Closed terminal: 1036162
1 App 916256 output: Closed terminal: 1064723
1 App 916256 output: Closed terminal: 1148401
1 App 916256 output: Closed terminal: 1073691
1 App 916256 output: Closed terminal: 1076412
1 [ N 2023-06-20 01:08:19.3749 869511/T4 age/Cor/CoreMain.cpp:1147 ]: Checking whether to disconnect long-running connections for process 916256, application /var/www/ood/apps/sys/shell (production)
1
1 [ N 2023-06-20 02:00:03.4952 869511/T8 age/Cor/CoreMain.cpp:671 ]: Signal received. Gracefully shutting down... (send signal 2 more time(s) to force shutdown)
1 [ N 2023-06-20 02:00:03.4953 869511/T1 age/Cor/CoreMain.cpp:1246 ]: Received command to shutdown gracefully. Waiting until all clients have disconnected...
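Incidentally, when /var/log does fill, this is how I find the offending log quickly. This is just a sketch using the path pattern from the logs above, not anything OOD-specific:

```shell
# list the largest per-user OOD nginx error logs, biggest first
# (path pattern taken from the logs in this thread; adjust as needed)
du -sh /var/log/ondemand-nginx/*/error.log 2>/dev/null | sort -rh | head
```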
The issue is definitely not constant, nor widespread. When I look at older rotated logs from an affected user, grepping for the same error shows it sometimes occurs just a handful of times.
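For a concrete per-log check, something like the following works. `USER` and the rotation suffixes are placeholders, so adjust for your own host:

```shell
# count the websocket error in each of one user's logs, including rotated ones
# (USER is a placeholder; the glob catches error.log, error.log.1, etc.)
for f in /var/log/ondemand-nginx/USER/error.log*; do
    printf '%8d  %s\n' "$(grep -c 'Send error: WebSocket is not open' "$f")" "$f"
done
```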
I assume it’s driven by user behavior. Specifically, they have a shell session open and it disconnects, perhaps erroneously. I don’t think there’s really anything we can do about the message itself being generated.
That said, you may be able to be more aggressive about rotating logs. These appear to rotate based on the date. I just filed this ticket upstream to rotate by size. It seems like you could enable that on your side now by simply overwriting the logrotate file that we ship.
I wanted to give some more solid examples after reading treydock’s response to that issue.
I personally will be adding `maxsize` verbiage to remediate our issue here, but (if this blown-up-log issue isn’t just local to us) even hourly rotation may not be sufficient to keep a log filesystem from filling up; many of the duplicate lines accrue extremely rapidly.
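As a concrete sketch of that override: the path `/etc/logrotate.d/ondemand-nginx` and the surrounding directives here are assumptions, so start from whatever file your OOD package actually ships and just add the `maxsize` line:

```
/var/log/ondemand-nginx/*/error.log {
    daily
    maxsize 100M        # rotate as soon as the log exceeds 100 MB
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
}
```

Note that `maxsize` is only evaluated when logrotate actually runs, so you also need an hourly cron entry or systemd timer for it to catch a blow-up mid-day, and given how fast these lines accrue, even that only bounds the damage rather than preventing it.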
In many of the selected cases below, hundreds of thousands to millions of lines are produced in 2–5 minutes. (The true rate is probably even higher, because the parsing method below is pretty basic and fails to group the [many] messages that get broken across multiple lines, for whatever reason.)
Yeah, the other side of this is to actually figure out what’s going on on the server. We should likely do that as well, to see if there’s anything server-side we can do so that it doesn’t spew so many logs all the time.