SFS2x Memory Leak/Crash
Posted: 08 Sep 2018, 10:22
Hello,
We launched our game 3 days ago and have 5 servers, each with 1-3k concurrent users online at all times (the average is around 1,500). Every machine runs vanilla CentOS 7 with only SFS2X 2.13.3 and NetData installed (and a 300k open-files limit). They are all Intel(R) Xeon(R) CPU E3-1270 v6 @ 3.80GHz (8 cores) with 32GB RAM and unmetered 1 Gbps lines.
We do not use any Extensions or MMO features; just simple rooms, chat and UserVariable updates. No database, no emails, no UDP, no encryption, etc. Each machine uses the default number of threads (since the docs suggest not fiddling with them), but we did set the JVM heap to 8GB min and 16GB max. On average our users each send 5 messages per second in rooms with a max size of 4 people, which generates approximately 80-160 Mbps of traffic.
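For reference, the heap bounds mentioned above are set via the standard JVM sizing flags, roughly like this (exactly where these options live in the SFS2X start script is an assumption on our side, so treat this as a sketch):

```shell
# Heap bounds as described above: 8GB initial, 16GB max.
# (Sketch only; in practice these flags go into the java invocation
# in the SFS2X start script.)
JAVA_OPTS="-Xms8g -Xmx16g"
echo "$JAVA_OPTS"
```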
The problem is that these instances crash constantly and we have to restart them manually. According to NetData, the servers will suddenly start running out of memory, despite the 16GB max set in the JVM, the machine having 32GB total, and there being nothing else running (other than NetData, which uses a tiny amount of RAM). The logs show the SFS process being killed by the OS's OOM killer. Sometimes (but rarely) the entire machine locks up completely and needs a full restart.
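To illustrate, this is how we spot the kills in the kernel log (via `dmesg -T | grep ...`); the sample log line below is paraphrased from memory, not a verbatim copy from our machines:

```shell
# A kernel log line of the kind we see when the OOM killer fires
# (paraphrased sample), filtered the same way we grep the real
# kernel log:
echo "Out of memory: Kill process 1234 (java) score 900 or sacrifice child" \
  | grep -i -E "out of memory|oom-killer|killed process"
```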
In a desperate attempt to mitigate the problem I have set all SFS instances to restart hourly. Whilst it has helped, it hasn't solved the issue, and one instance recently crashed like this after just 20 minutes of uptime. It's also not a great experience for our players, as each restart creates a 10+ second lag while the client automatically remakes/rejoins games.
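For completeness, the hourly restart is just a cron job along these lines; the service name below is a placeholder, not our actual setup:

```
# Hypothetical crontab entry: restart the SFS2X service at the top of
# every hour (service name "sfs2x" is a placeholder)
0 * * * * systemctl restart sfs2x
```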
Since it's happening on all 5 machines, each rented from different datacenters, I think it's safe to say it's not a hardware fault.
I should note that I have never seen any of the thread pools grow automatically, and all message queues show green in terms of load.
The problem does not appear to be directly linked to traffic. It's happened to machines with only 800 online, and I've also seen an instance run for hours with 3k online and not have issues. It feels very random.
We have tried:
- Increasing JVM memory to 12GB min / 24GB max: it didn't stop the problem, but it might have made it slightly rarer (hard to tell).
- Lowering JVM memory to 4GB min / 8GB max: this caused SFS to reboot itself every 5 minutes.
- Increasing the core and extension threads to 64/32: this didn't seem to make any difference.
Do you have any suggestions for us? We're getting a bit desperate now, to be honest.