SFS Pro 1.6.6 patched to 1.6.19
Ubuntu Server 12.04 "Precise Pangolin" LTS
Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
Cores : 12
Cache : 15360KB
2 x 2000 GB
On to my issue. Yesterday a player spammed the server with malformed XML requests, à la the Billion Laughs attack. This overloaded the server and eventually crashed it. I modified some of the AntiFlood settings in config.xml:
Code:
<WarningMessage><![CDATA[No flooding allowed!]]></WarningMessage>
<KickMessage><![CDATA[You've been warned! No flooding! Now you're kicked]]></KickMessage>
<BanMessage><![CDATA[Stop Flooding!! You're being banned]]></BanMessage>
Since making those changes I haven't seen the attack again, but I can't confirm the underlying problem is actually fixed.
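For what it's worth, while investigating I also looked at how entity bombs can be blocked at the parser level. This is only a sketch of standard JAXP hardening for any extension code that parses client-supplied XML (the class name is mine; I'm not claiming this is what SFS does internally):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class SafeXml {
    // Build a DocumentBuilderFactory hardened against entity-expansion
    // attacks such as Billion Laughs.
    public static DocumentBuilderFactory hardenedFactory()
            throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        // Reject any document that declares a DOCTYPE at all; entity
        // bombs require one, so this blocks them outright.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        // Belt and braces: disable external entity resolution too.
        dbf.setFeature("http://xml.org/sax/features/external-general-entities", false);
        dbf.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);
        return dbf;
    }
}
```

With this factory, a parse of anything containing a DOCTYPE throws a SAXParseException instead of expanding entities.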
Anyway, the reason I mention this is that today I've been running into some different issues, and it seems unlikely they're unrelated. The server started logging a stream of "I/O Error during accept loop: Too many open files" errors, eventually causing it to crash or randomly kick players. This is the first time I've seen this error, even with 1500+ concurrent users online. I've since increased the ulimit on the server and can confirm the change is permanent, surviving both relogging and a server restart. I've also added the ulimit to the start() function in both the start.sh script and the sfs init file. Running ulimit -n now shows the correct numbers (currently 64,000 globally and 20,000 for SFS).
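For reference, this is roughly how I made the limit permanent (assuming the stock pam_limits setup on Ubuntu; the values match what I quoted above):

```
# /etc/security/limits.conf -- picked up via pam_limits at next login
*    soft    nofile    64000
*    hard    nofile    64000

# start.sh, inside start(), before the JVM is launched:
#   ulimit -n 20000
```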
However, since then the server will eventually (not immediately) begin lagging and refusing to complete the login process for users. The top command shows Java CPU usage at over 100%, while Server Load in the Admin Tool is always 0%. I located the thread causing the issue and created a thread dump with jstack. I can provide the entire dump if necessary, but here's the info on that specific thread:
Code:
"selector" prio=10 tid=0x00007b076c179400 nid=0x2ecc runnable [0x00007b07a890a000..0x00007b07a890ab40]
at java.lang.Thread.run(Unknown Source)
Locked ownable synchronizers:
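In case it helps anyone reproduce the hunt: the nid in the dump is the native thread id in hex, while top -H and ps -eLf show the same id in decimal, so a quick conversion lets you match the hot thread from top against the dump. A tiny helper (class and method names are mine):

```java
public class NidToTid {
    // jstack prints each thread's native id (nid) as hex with an "0x"
    // prefix; strip the prefix and parse base 16 to get the decimal
    // LWP id shown by `top -H -p <pid>`.
    public static long nidToDecimal(String nid) {
        return Long.parseLong(nid.replaceFirst("^0x", ""), 16);
    }

    public static void main(String[] args) {
        // The "selector" thread above: nid=0x2ecc -> LWP 11980 in top -H
        System.out.println(nidToDecimal("0x2ecc")); // prints 11980
    }
}
```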
Before the server crashes, but after the lag begins, I profiled how long each request to the server takes. Every request, and even tasks run by schedulers, takes significantly longer than normal to execute.
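The profiling itself was nothing fancy; roughly this kind of wall-clock wrapper around each handler (the Runnable stands in for the real request handler, and the names here are illustrative only):

```java
public class RequestTimer {
    // Run a handler and return how long it took in milliseconds,
    // measured with the monotonic nanoTime clock.
    public static long timeMillis(Runnable handler) {
        long start = System.nanoTime();
        handler.run();
        return (System.nanoTime() - start) / 1000000L;
    }
}
```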
I'm not sure if this is because of the attacker, poor server configuration, or something else. I don't believe it's the extension code itself (e.g. an infinite loop), because I haven't modified the server code in a while and only started experiencing these issues after the XML Bomb attack.
Do you have any ideas?