Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Post here your questions about SFS2X. Here we discuss all server-side matters. For client API questions see the dedicated forums.

Moderators: Lapo, Bax

FlyingSerg1
Posts: 8
Joined: 31 Aug 2018, 02:18

Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby FlyingSerg1 » 11 Jun 2019, 01:49

Hello,

We've developed multiplayer game using html5 sdk on client. Server creates room extensions using java.

Production servers started to fail after a week with with "Too many open files" exceptions. When I checked lsof it shows 500k+ file descriptors in CLOSE_WAIT state. All of them are IPv6 TCP connections from different external IPs (for sure related to connected players). Please advise how to fix it ?

We use smartfox: 2.13.4
server runs on aws c5.large
java:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)

Our players usually close browsers or tab after completing game or set of games (though we have stand up and disconnect buttons). Usual load on server CCU wise 100-300 24x7.

What I also noticed that during stress test number of file descriptors in ESTABLISHED state for 500 CCU = 95K+
root@banana:~# lsof -n | grep smartfox | grep ESTABLISHED | wc -l
95570
For stress test we used java client with direct tcp connection and not html5 client.

Other question: potentially our 500 CCU pulled load equals 4500 CCU and server was able to handle it. But if 500 CCU has 95k+ files opened than 4500 will open almost 700K which is very close to file limit that can be open on c5.large. I guess these two problems might be somewhat related. But CLOSE_WAIT problem is more pressing.

Thank you,
Sergey.
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 11 Jun 2019, 09:16

Hi,
Production servers started to fail after a week with with "Too many open files" exceptions. When I checked lsof it shows 500k+ file descriptors in CLOSE_WAIT state. All of them are IPv6 TCP connections from different external IPs (for sure related to connected players). Please advise how to fix it ?

500k connections seem a lot! :shock: Holy moly.
In any case this may not be not directly connected to SFS2X per se. It could be more of a network issue, I dare say, and I think it should be investigated with your network admin or hosting provider.

What I also noticed that during stress test number of file descriptors in ESTABLISHED state for 500 CCU = 95K+
root@banana:~# lsof -n | grep smartfox | grep ESTABLISHED | wc -l
95570

I am not sure what you're seeing here.
I've tested a similar scenario in AWS, running a 300 CCU test and inspecting the connections with this command

Code: Select all

sudo lsof -a -i4 -i6 -itcp | grep java | grep ESTABLISHED | wc -l

The server's OS is Ubuntu 18. The lsof command filters out all TCP connections (over both IPv4 and IPv6), then filters the "java" process (which is how SFS2X looks like under Linux) and finally counts all the connections in ESTABLISHED state.

There are no surprises here. The result matches the number of connections perfectly.
I would recommend you use the same command to check on your side, unless your version of lsof uses different switches (I kind of doubt it, since the same command works identically on macOS Mojave, and Linux Mint).

Also, after the test I monitored the CLOSE_WAIT state of the previous connections and there were none.
The test was done using socket connections on port 9933 with the standard Java API.

Let us know.
Lapo
--
gotoAndPlay()
...addicted to flash games
FlyingSerg1
Posts: 8
Joined: 31 Aug 2018, 02:18

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby FlyingSerg1 » 11 Jun 2019, 18:05

Hi Lapo,

Thank you for fast response. 500K connections is really holy moly :(.

Also, after the test I monitored the CLOSE_WAIT state of the previous connections and there were none.
The test was done using socket connections on port 9933 with the standard Java API.


We also don't have such problem with standard java API connecting to port 9933. It is only websocket connections from our players browsers.

sudo lsof -a -i4 -i6 -itcp | grep java | grep ESTABLISHED | wc -l

For some reason lsof does not show you full picture of open file descriptors if you use any switches like -i or -u. Example (smartfox is our user under which we run our servers) 1. list of all files that opened excluding smartfor. 2. list of all opened files. 3. list of all files that opened by smartfox user only. Numbers does not add up.:
root@blood-orange:~# lsof -u^smartfox | wc -l
2831
root@blood-orange:~# lsof | wc -l
208853
root@blood-orange:~# lsof -u smartfox | wc -l
1201
And system fails because of these 200K numbers and not filtered one.

It might be problem with openSdk. I don't know yet. We'll try to test it with different jvm.

Could you try run similar to your env command and compare numbers while running java test ? if they match ?
lsof -n | grep smartfox | grep ESTABLISHED | wc -l


Also some extra information: number of unique entries are much less 3000 out of 500K (flies.log contains lsof output):
root@banana:/home/smartfox$ cat files.log | grep smartfox | grep CLOSE_WAIT | awk '{print $6 " " $9}' | uniq -u | wc -l
2698
root@banana:/home/smartfox$ cat files.log | grep smartfox | grep CLOSE_WAIT | awk '{print $6 " " $9}' | wc -l
571976
So something within jvm or smartfox opens bunch of file descriptors for one socket connection. Silly question could it be leaking though class loaders that you use for each game room that our game app creates/destorys dynamically ?

Another confusing thing is this: number of file descriptors in process directory much less than lsof shows
root@banana:/proc/26992# ls -l ./fd/ | grep 'socket:' | awk '{print $11}' | uniq -u | wc -l
2205
root@banana:/proc/26992# lsof -n | grep smartfox | grep CLOSE_WAIT | wc -l
242008

We also use NPC that we create on demand to emulate "robot" in 2-4 player table game. Server disconnects NPC when game is completed or table became empty(no human players). Though it might not affect server connections since NPC does not have connection anymore. Just FYI.

We have another game which runs on smartfox 2.8.1 version. Game does not use room extension approach but zone extension approach. It does not create game room per game. It has predefined set of rooms on zone level and creates games as separate entities. All requests are handled though predefined room. This game does not have any problem with CLOSE_WAIT connections. It is more we did not have any problem with the running servers for 2 years.

Thank you,
Sergey.
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 12 Jun 2019, 08:26

Hi,
I have repeated yesterday's test with a standard SFS2X 2.13.5, Ubuntu 18, default JVM (the one shipped with SFS2X), and using websocket instead of raw sockets.

I am not able to see any lingering connections in CLOSE_WAIT state (or any other partially-closed state, for that matter).
To avoid issues with lsof I also double checked with netstat.
The clients are simulated from a MacOS client machine in our office, going out through a VPN and connecting to the SFS2X server in Germany (AWS, Frankfurt region)

Are there any firewalls o proxies in front of your servers that may cause these connections to stick for longer time?

So something within jvm or smartfox opens bunch of file descriptors for one socket connection.

I am not following. You're saying the SFS2X is connecting somewhere else?
What is the host and port for this connection?

Silly question could it be leaking though class loaders that you use for each game room that our game app creates/destorys dynamically ?

I don't think so. Unless you're opening connections from your Extension code, Extensions by default do not deal with connections, at all.
And if the former is true, then you are responsible for shutting down those connections in the destroy() method of the Extension itself.

We also use NPC that we create on demand to emulate "robot" in 2-4 player table game. Server disconnects NPC when game is completed or table became empty(no human players). Though it might not affect server connections since NPC does not have connection anymore. Just FYI.

Yes, NPCs do not generate network-based connections.

We have another game which runs on smartfox 2.8.1 version. Game does not use room extension approach but zone extension approach. It does not create game room per game. It has predefined set of rooms on zone level and creates games as separate entities. All requests are handled though predefined room. This game does not have any problem with CLOSE_WAIT connections. It is more we did not have any problem with the running servers for 2 years.

As I said. Extensions don't deal with low level connections, unless your code does.

Hope it helps
Lapo

--

gotoAndPlay()

...addicted to flash games
FlyingSerg1
Posts: 8
Joined: 31 Aug 2018, 02:18

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby FlyingSerg1 » 14 Jun 2019, 20:21

Hi Lapo,

We've identified two problems. First jetty does not close some of the org.eclipse.jetty.websocket.server.WebSocketServerConnection. It get stucked in "CLOSING" state. Second problem is openJdk leaks file handles. Oracle Jdk does not leak handles but still keeps it between 168-176 per connection.

So we kept several servers with openJdk and switch several to Oracle Jdk. Result of this command after 24 hours of running
lsof -n | grep smartfox | grep CLOSE_WAIT | awk '{print $10}' | sort | uniq -c

OpenJdk
...
234 10.99.66.150:8443->99.90.3.201:57862
234 10.99.66.150:8443->99.93.137.152:55459
651 (CLOSE_WAIT)

Oracle Jdk
...
176 10.99.45.164:8443->99.93.137.152:55457
176 10.99.45.164:8443->99.93.137.152:55660
636 (CLOSE_WAIT)

As you can see number of close wait connections are roughly the same. But number of file handles are worst in openJdk(234 vs 176). Number of handles will steadily grow to a 1000. That is what I saw before server died.

We analyzed heap dumps from different servers. It turns out that 80-90% of WebSocketServerConnection stucked in "CLOSING" state. Please notice id. This 67.143.176.92:48109 is still on server after 24 hours in CLOSE_WAIT state. Also close handshake source is "LOCAL".
memory-dump-smartfox3.png
(90.58 KiB) Not downloaded yet


Here is picture of connections in "closing" state.
memory-dump-smartfox4.png
(145.96 KiB) Not downloaded yet


Unfortunately we can't identify circumstances under which connection became stuck yet. Please advise what kind of diagnostic/tool we can use on smartfox to figure this out ?

Thank you,
Sergey.
Attachments
memory-dump-smartfox.png
(63.74 KiB) Not downloaded yet
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 15 Jun 2019, 07:18

Hi,
thanks for the additional details. We'll look into it.
As I said, so far we were not able to reproduce your situation with a standard SFS2X 2.13.5. In any case we'll try another live test next week, with more clients and longer run time and see what happens.

I'll post more details when available

Unfortunately we can't identify circumstances under which connection became stuck yet. Please advise what kind of diagnostic/tool we can use on smartfox to figure this out ?

I don't have specific advices as of now. We don't control the inner working of Jetty. We need to be able to reproduce the problem first.

Recently we've addressed a different issue in Jetty where packet compression would leak native memory. I am not sure if the two things could be connected. Probably they're not.
In any case we are releasing a new update next week which addresses the problem.

Cheers
Lapo

--

gotoAndPlay()

...addicted to flash games
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 18 Jun 2019, 10:16

Hi,
we have been running another test for over 24 hours and we'll be letting it run for the next few days.

The test setup is as follows:

» SmartFoxServer 2X 2.13.5 running on an EC2 t2.micro instance, Ubuntu 18.04.2 LTS
» Java stress test tool (using standard SFS2X Java client API), running on an EC2 t2.small instance, Ubuntu 18.04.2 LTS.

The stress test connects ~800 CCU (via websocket) which join various Rooms (each can be 5-8 users) and send messages to each other. Typically every client stays connected for 90-100 seconds and then it disconnects. Whenever a client leaves a new one is created, and the test keeps going.

The test runs in the internal network rather than over the internet.
I think it's the best approach to determine if Jetty or the JRE are responsible for the incomplete disconnection issues. Over the internet it is much easier to encounter CLOSE_WAIT states due to various network issues. We want to exclude those external causes and observe the behavior of the server in an "undisturbed" network.

At this point we've had over 240K clients joining and leaving the server and we've kept monitoring the count of CLOSE_WAIT connections almost every hour. There are no signs of leaks whatsoever.

We also tried force-disconnecting all clients at once, to see if it would make things worse.
In this case we've seen a number of lingering CLOSE_WAIT connections but they gradually disappear over time and the net result is always zero.

We'll keep running / monitoring the test and post updates in the next days.
Cheers
Lapo

--

gotoAndPlay()

...addicted to flash games
FlyingSerg1
Posts: 8
Joined: 31 Aug 2018, 02:18

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby FlyingSerg1 » 18 Jun 2019, 21:23

Hi Lapo,

Thank you for looking into this! We actually might have this compression native memory leak you mentioned before. 2k not closed connections should not eat up 3G of memory. Here is top command for the process:
9670 smartfox 20 0 11.486g 6.553g 7680 S 13.0 87.9 77:48.19 java
Virtual memory even more 11G, compare to RES 6.5G
At the same time smartfox admin console shows that jvm uses ony 700M of memory and small gc jumps between 300M and 500M

Is this compression leak related to secure socket ? We use WSS to connect to server.

We also single out specific browser or OS issue. CLOSE_WAIT exists for all type of browsers and OSes. FF, Opera, Edge, Chrome, Mac, Win10, Win8, CrOs, even linux etc

Thank you,
Sergey.
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 19 Jun 2019, 07:43

FlyingSerg1 wrote:Hi Lapo,

Thank you for looking into this! We actually might have this compression native memory leak you mentioned before. 2k not closed connections should not eat up 3G of memory. Here is top command for the process:
9670 smartfox 20 0 11.486g 6.553g 7680 S 13.0 87.9 77:48.19 java
Virtual memory even more 11G, compare to RES 6.5G
At the same time smartfox admin console shows that jvm uses ony 700M of memory and small gc jumps between 300M and 500M

Is this compression leak related to secure socket ? We use WSS to connect to server.

It could be. We've just released patch 2.13.6 which, among other things, addresses that particular issue.
You can download it here:
https://www.smartfoxserver.com/download#p=updates
Lapo

--

gotoAndPlay()

...addicted to flash games
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 19 Jun 2019, 09:25

We've have taken another look at the test that has now been running for ~48 hours and we still have no evidence of leaked connections.
During the past two days we haven't even seen anything even remotely suspicious about lingering connections, so I am not convinced there's any issue with either Jetty or the Java Runtime.

If this was a problem you would easily notice this even in a local environment. Have you tried reproducing it like that?
Thanks
Lapo

--

gotoAndPlay()

...addicted to flash games
FlyingSerg1
Posts: 8
Joined: 31 Aug 2018, 02:18

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby FlyingSerg1 » 21 Jun 2019, 22:24

Hi Lapo,

Great news ! We've updated our servers to 2.13.6. Memory consumption is greatly improved ! We've been running server for one day but it is much better than it was before.

We was able to pin point scenario when connection became stuck in CLOSE_WAIT state. All our stress test scenarios are revolved around active users. It turns out that we need to look at idle users. Scenario is very simple and 100% reproducible(at least on our environment):
1. Using browser and html5 sdk connect to server and zone. We've tested so far FF and Chrome but I believe it is generic problem.
2. Run this "lsof -n | grep 'XXX.153.4.143' | grep ESTAB | awk '{print $10}' | sort | uniq -c" to see IP:port
3. Wait till server kicks your connection out because of idling
4. Run this "lsof -n | grep 'XXX.153.4.143' | grep CLOSE_WAIT | awk '{print $10}' | sort | uniq -c"
and connection became CLOSE_WAIT and never goes away.

root@pomelo:~# lsof -n | grep 'XXX.153.4.143' | grep ESTAB | awk '{print $10}' | sort | uniq -c
187 10.99.34.224:8443->XXX.153.4.143:10679
1 (ESTABLISHED)
root@pomelo:~# lsof -n | grep 'XXX.153.4.143' | grep ESTAB | awk '{print $10}' | sort | uniq -c
root@pomelo:~# lsof -n | grep 'XXX.153.4.143' | grep CLOSE_WAIT | awk '{print $10}' | sort | uniq -c
179 10.99.34.224:8443->XXX.153.4.143:10679
1 (CLOSE_WAIT)

I guess to fix it client should close connection before server kicks out idling connection. But I believe there is problem with initiating connection closing from server side. In memory dump provided before web socket connection has "CLOSING" state and close handshake source is "LOCAL".
I did not check scenarios when server kicks out users because of flood filter, bad word filter. It might be affected too.

FYI out idle time 15 minutes.
<overrideMaxUserIdleTime>900</overrideMaxUserIdleTime>

Our network experts think it might be related to MTU which is too high (though it is default for AWS c5.large instances)
https://docs.aws.amazon.com/AWSEC2/late ... k_mtu.html
They have been capturing the tcp packets and found a lot of tcp "ack" packets malformed. which could cause the sockets leaks.
Ack packets leak from server to client after it receives the "FIN" packet from the client:
network-error.png
(49.99 KiB) Not downloaded yet


root@pomegranate:~# ifconfig
ens5 Link encap:Ethernet HWaddr 12:c5:e1:db:e7:48
inet addr:10.99.68.5 Bcast:10.99.79.255 Mask:255.255.240.0
inet6 addr: XXXX::10c5:e1ff:fedb:e748/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:9001 Metric:1
RX packets:4198267 errors:0 dropped:0 overruns:0 frame:0
TX packets:4508611 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:864250637 (864.2 MB) TX bytes:967702065 (967.7 MB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:5952 errors:0 dropped:0 overruns:0 frame:0
TX packets:5952 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:523650 (523.6 KB) TX bytes:523650 (523.6 KB)

We've changed it to 1500 but it did not help. We still have connections in CLOSE_WAIT state on these hosts.

Update: I just tested bad word filter. Kicking because of swearing does not cause CLOSE_WAIT connection. Or not 100% repro. I'll try to test it more.

Thank you,
Sergey.
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 24 Jun 2019, 09:05

Hi,
I think you may be onto something, but I also feel like it's uniquely related to your environment.
We cannot reproduce any of this from our side.

I tested 5 HTML clients connected via websocket to SFS2X running on Linux Mint. As soon as they are kicked out (once the idle timer triggers) all of the connections are gone.
Lapo

--

gotoAndPlay()

...addicted to flash games
FlyingSerg1
Posts: 8
Joined: 31 Aug 2018, 02:18

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby FlyingSerg1 » 03 Dec 2019, 03:02

Hi Lapo,

We've started to see this problem again when we introduced mobile version of our website. We've solved problem by disconnecting client manually after idling 4 minutes 30 seconds before server kicks client out. It worked for a while. We've been seeing unclosed connections still but it was 20-25 connections per day, so we've been restarting server every 2 weeks. Now it started to crush every 5 days. We still investigating the problem but looks like mobile client does not disconnects manually after 4.30 when browser goes into background(to be confirmed).

Sorry, I've re-read my previous posts and looks like I did not mention that we use WSS. We can't use regular WS because all our website works under https. Could you please try on your side WSS with simple login to zone scenario and tell if you can reproduce it ?

Thank you,
Sergey.
User avatar
Lapo
Site Admin
Posts: 21117
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Server fails with "Too many open files" exceptions and 500K+ file descriptors in CLOSE_WAIT state

Postby Lapo » 03 Dec 2019, 09:19

Hi,
as regards WS vs WSS it doesn't really matter. The problem of lingering connections happens at the TCP level (not at the application level) when the client sends a FIN packet and the kernel sets the connection in CLOSE_WAIT state, meaning that the application (SFS2X/Jetty) should close its own side of the connection.

In any case I've done a test opening 10 browser clients, connect them to a remote SFS2X (2.13.3) via WSS + login + join Room.
Then I checked via netstat and naturally I can see the 10 connections in ESTABLISHED state, on server port 8443.
I closed the browser after a while, checked again via netstat and found no CLOSE_WAIT connections lingering.

I don't think this is something that can be reproduced, otherwise we'd have many customers reporting it. At the same time I have no idea why it happens on your setup. Anyways, searching around the internet I found this small perl script that might help in your case: https://github.com/rghose/kill-close-wait-connections

Hope it helps
Lapo

--

gotoAndPlay()

...addicted to flash games

Return to “SFS2X Questions”

Who is online

Users browsing this forum: No registered users and 20 guests