Problems after upgrading to SFS2X 2.13.0


Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 20 Jun 2018, 09:10

rewb0rn wrote: Yes it is the extension queue. However the problem only appears on 2.13.0 and 2.13.1 (2.13.2 only installed yesterday). The last 3 months we were switching back to 2.11.0 and the problem is not happening there. But we are publishing to HTML5 now so we need binary wss protocol.

Okay. I don't think there's anything specific in 2.11 compared to 2.13 that changes how the queue works. In fact, that aspect of the server has been the same for a long time (since 2.9.x, I believe).

It seems like there's a lot of pressure on the Extension queue, due to those slow I/O operations. Unfortunately the only "cure" for that is parallelism, which in turn means more threads, which in turn require more memory (each thread has its own stack memory).
Using thread pools alleviates the memory issue greatly.
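
For illustration only, this is the kind of hand-off I mean; a minimal sketch (the pool size and the doSlowIo/notifyUser helpers are hypothetical, not SFS2X API): slow blocking calls go to a bounded pool so they don't hold up the Extension queue.

Code: Select all

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IoOffloadExample {
    // A fixed pool caps the number of threads (and thus stack memory)
    // dedicated to blocking work such as DB or HTTP calls.
    private final ExecutorService ioPool = Executors.newFixedThreadPool(64);

    public void handleRequest(final String userName) {
        ioPool.execute(() -> {
            String result = doSlowIo(userName);  // blocking call runs on the pool, not the request thread
            notifyUser(userName, result);        // deliver the response when it completes
        });
    }

    private String doSlowIo(String userName) { return "..."; }   // placeholder
    private void notifyUser(String userName, String result) { }  // placeholder
}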

If you're running thousands of threads and have high traffic (i.e. the 10K CCU you mentioned), you will probably need 4-8 GB of RAM for SFS2X alone, though I would not worry too much about memory if no OOM errors have appeared in the logs.

I was thinking that our backup settings were too sensitive, so I switched the backup triggering time from 3 seconds to 20 seconds, but 2 days ago it happened again.

Yeah, 3 seconds seems excessive; I'd probably use 15-20 seconds, which is already pretty low.


By the way: Is there a way to read the values that are tracked in the dashboard through code? I would like to have a better record of statistics and write them to a file instead of having the dashboard open in another tab.

Depends on which values you need to read.
Send an email to our support@... mailbox with the details and we'll try to help you further on this topic.
(add a link to this thread for reference)

Cheers
Lapo
--
gotoAndPlay()
...addicted to flash games
Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 20 Jun 2018, 10:09

Side note: we have just released patch 2.13.3, which fixes a regression introduced by the previous patch.
I highly recommend the update.

See here for more:
viewtopic.php?f=18&t=20677
Lapo

--

gotoAndPlay()

...addicted to flash games
rewb0rn
Posts: 177
Joined: 26 Jan 2009, 12:54
Location: Berlin, Germany

Re: Problems after upgrading to SFS2X 2.13.0

Postby rewb0rn » 20 Jun 2018, 14:00

Hi Lapo,

thanks for the info. With the help of your email I have now prepared the stats logging and updated the server to 2.13.3. I will report back if I can gather more insights into the problem.

Cheers,
Ruben
Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 20 Jun 2018, 14:33

Great!
Lapo

--

gotoAndPlay()

...addicted to flash games
rewb0rn
Posts: 177
Joined: 26 Jan 2009, 12:54
Location: Berlin, Germany

Re: Problems after upgrading to SFS2X 2.13.0

Postby rewb0rn » 31 Jul 2018, 11:31

Hi Lapo,

so today it happened again, and this time the logs give a detailed picture of what is happening (and specifically, in which order), so it should help a lot in debugging what's going on.

At 12:26 I received an alert that users were dropping from the live server in all our games because the server did not respond to them with a PONG in time (clients must receive a custom PONG command within 20 seconds of sending a custom PING command, otherwise they assume they lost the connection and restart the SmartFox connect flow). Within seconds, the user count dropped from 5500 to 600.
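
For context, the server side of that keep-alive is just a custom request handler; a minimal sketch using the standard SFS2X extension API (the PING/PONG command names match the description above, the handler body itself is illustrative):

Code: Select all

import com.smartfoxserver.v2.entities.User;
import com.smartfoxserver.v2.entities.data.ISFSObject;
import com.smartfoxserver.v2.entities.data.SFSObject;
import com.smartfoxserver.v2.extensions.BaseClientRequestHandler;

public class PingHandler extends BaseClientRequestHandler {
    @Override
    public void handleClientRequest(User user, ISFSObject params) {
        // Reply immediately; if this response is not delivered within 20 seconds
        // the client assumes the connection is dead and reconnects.
        send("PONG", new SFSObject(), user);
    }
}

// Registered in the extension's init() with: addRequestHandler("PING", PingHandler.class);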

We now log a couple of key values twice per second on the server, so we can make a proper analysis. In short, it looks like, for whatever reason, the write queue of the SmartFox gets stuck for a period of time (about 30 seconds) and no data is sent from the SmartFox to the users.

Please check this image of the data:
[attachment: data.png]


You can see at line 10 that the writer queue size starts to build up while no packets are being written; the value is stuck at 105133394 for 30 seconds. 20 seconds later, at 31.07.2018 12:24:17:053, we see a massive drop of users, since at that point nobody receives the PONG command. Only at 31.07.2018 12:24:26:982 do the written packets start counting up again, so the SmartFox resumes sending data, but by then all the users have already disconnected:
[attachment: data2.png]



As I wrote before, this behavior is only observable in SFS 2.13.0. We have never experienced it in SFS 2.11.0.

Did you change anything about how the writer queue is handling data in SFS 2.13.0 or 2.12.0? Why does the writer queue increase without data being sent? Any change in thread locking or similar? I hope we can shed some light on this issue and come up with a solution.

I have a complete CSV file with more data (including memory etc.) but have uploaded only the data that seemed relevant to the problem. Let me know if you want to look at the raw file. Additional info: 5500 users is not really peak for us; we usually have 10000 concurrent users online in the evening, so I do not think this is directly related to load. The server instance was started this morning, so it's also unlikely to be related to a long production cycle.

Looking forward to your reply. Thanks in advance,
Ruben
Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 31 Jul 2018, 13:41

rewb0rn wrote: Hi Lapo,
Did you change anything about how the writer queue is handling data in SFS 2.13.0 or 2.12.0?

No, we did not.
That's one department that hasn't been touched in a long time.

Why does the writer queue increase without data being sent?

It can happen if the server is not able to write to the network (due to issues on the network itself) while other requests are being executed, which in turn generate more updates to be sent out. These updates will fill up the queue very quickly.

Any change in thread locking or similar? I hope we can shed some light on this issue and come up with a solution..

No, and we've stress-tested the heck out of the SFS2X engine, and we have many clients running it in high-traffic environments. We're very, very confident there are no issues there.

Since these issues seem sporadic you could try to capture a full thread dump when they occur. With a quick search I've found a code snippet that could do it:
https://stackoverflow.com/questions/129 ... mmatically

So you could use it to take a snapshot of the threads after 15-20 seconds of no data being written out.
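
For reference, the standard java.lang.management API can produce such a dump directly (this is a generic sketch, not necessarily the snippet from the link above):

Code: Select all

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class ThreadDumper {
    public static String fullDump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // true, true = include locked monitors and locked synchronizers,
        // which is what reveals who holds the lock the writer threads wait on.
        for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
            sb.append(info.toString()); // note: toString() truncates very deep stack traces
        }
        return sb.toString();
    }
}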

I would also investigate the state of the network at the time of the incident (via your hosting provider's history graphs): there should be a sharp drop in the outgoing data. What about the incoming data?

Thanks
Lapo

--

gotoAndPlay()

...addicted to flash games
rewb0rn
Posts: 177
Joined: 26 Jan 2009, 12:54
Location: Berlin, Germany

Re: Problems after upgrading to SFS2X 2.13.0

Postby rewb0rn » 02 Aug 2018, 10:24

Hi Lapo,

thanks for your response!

To answer your question: Incoming data continues to be processed, the values for read packets and read bytes increase while the count of written bytes is stuck. I will have a look at dumping the threads from within the code, thanks for the hint! Unfortunately we have no history graphs of outgoing traffic on our servers except Munin, which does not offer a high enough resolution.

I have been digging a bit more in the logs and it turns out that there was a similar incident a couple of minutes earlier. It took only 14 seconds though, so it did not trigger disconnects. From 12:21:05 to 12:21:19 no outgoing data was written. In the Smartfox logs, I see a lot of exception logs related to the Bluebox in this timeframe:

Code: Select all

31 Jul 2018 | 12:19:42,909 | WARN  | SocketWriter-1 | v3.data.BBClient |     | Client BBox message queue limit was reached. Disconnecting: {BBClient - { Id: 59837, Type: BLUEBOX, Logged: Yes, IP: [USER-IP] }, 911bc001d88a06fe8c34423bcc556b1a}
31 Jul 2018 | 12:19:52,841 | WARN  | qtp1625939772-6387 | bluebox.v3.BlueBox |     | Invalid request, session expired. Message: { ssid: 8006f0737fb3000e96a9b30ef2694559, cmd: poll, data: NoData }
31 Jul 2018 | 12:20:29,590 | WARN  | qtp1625939772-6442 | bluebox.v3.BlueBox |     | Invalid request, session expired. Message: { ssid: b6432b6ada3a8d6736f2a3eb7efa7a18, cmd: data, data: 48 bytes }
31 Jul 2018 | 12:20:34,347 | WARN  | qtp1625939772-6912 | bluebox.v3.BlueBox |     | Invalid request, session expired. Message: { ssid: b6432b6ada3a8d6736f2a3eb7efa7a18, cmd: poll, data: NoData }
31 Jul 2018 | 12:20:54,890 | WARN  | SocketWriter-1 | v3.data.BBClient |     | Client BBox message queue limit was reached. Disconnecting: {BBClient - { Id: 62122, Type: BLUEBOX, Logged: Yes, IP: [USER-IP] }, d4e4bc74787ab73a71a821a3c2767d85}
31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-3 | bitswarm.core.SocketWriter |     | Error during write. Session: { Id: 58616, Type: DEFAULT, Logged: No, IP: [USER-IP] }
31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-3 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processRegularSession(SocketWriter.java:273)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:217)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-1,5,main]
31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-1,5,main]
31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

31 Jul 2018 | 12:21:19,712 | WARN  | SocketWriter-3 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-3,5,main]
31 Jul 2018 | 12:21:19,715 | WARN  | SocketWriter-3 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

31 Jul 2018 | 12:21:19,717 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-1,5,main]
31 Jul 2018 | 12:21:19,720 | WARN  | SocketWriter-2 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-2,5,main]
31 Jul 2018 | 12:21:19,722 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

31 Jul 2018 | 12:21:19,722 | WARN  | SocketWriter-2 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

31 Jul 2018 | 12:21:19,723 | WARN  | SocketWriter-2 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-2,5,main]
31 Jul 2018 | 12:21:19,723 | WARN  | SocketWriter-2 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)

31 Jul 2018 | 12:21:19,724 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-1,5,main]
31 Jul 2018 | 12:21:19,724 | WARN  | SocketWriter-3 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-3,5,main]
31 Jul 2018 | 12:21:19,724 | WARN  | SocketWriter-2 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-2,5,main]
31 Jul 2018 | 12:21:19,730 | WARN  | SocketWriter-3 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)


(a lot more of those follow)

Could the exception in the Bluebox handling be a reason for the data writer being stuck? Or is it rather a result of the data writer being stuck? Maybe this exception is blocking the writer queue?

TIA for your ongoing support!
Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 02 Aug 2018, 10:56

rewb0rn wrote:
To answer your question: Incoming data continues to be processed, the values for read packets and read bytes increase while the count of written bytes is stuck.

This is not enough though... because data might be buffered locally (already read by the OS).
I would like you to check the bandwidth graph from your hosting provider and, if possible, show us the network pattern during the time when you say no data was being written.

Without that it would be very difficult to pinpoint where the problem might be.

I have been digging a bit more in the logs and it turns out that there was a similar incident a couple of minutes earlier. It took only 14 seconds though, so it did not trigger disconnects. From 12:21:05 to 12:21:19 no outgoing data was written. In the Smartfox logs, I see a lot of exception logs related to the Bluebox in this timeframe:

This is indicative of more network issues and a confirmation that SFS2X is actually working as expected.
The server is actively writing to the sessions (otherwise you would not see the error messages) but the BlueBox queues are full and no more messages can be added until the client consumes the data already sent.

It looks like you need to investigate more deeply what is going on in your network.
It sounds like you're experiencing some "hiccups" that are preventing the server from writing data fast enough, or at all.

Hope it helps
Lapo

--

gotoAndPlay()

...addicted to flash games
rewb0rn
Posts: 177
Joined: 26 Jan 2009, 12:54
Location: Berlin, Germany

Re: Problems after upgrading to SFS2X 2.13.0

Postby rewb0rn » 02 Aug 2018, 11:03

Thanks Lapo,

Will it be enough to log the network data at the operating system level? Or do we need data from the hardware level (i.e. the switch)? That will not be available to us.

Just to be sure, have you scrolled through the error log I posted? Because it sounds like you are referring to this message:

Code: Select all

31 Jul 2018 | 12:20:54,890 | WARN  | SocketWriter-1 | v3.data.BBClient |     | Client BBox message queue limit was reached. Disconnecting: {BBClient - { Id: 62122, Type: BLUEBOX, Logged: Yes, IP: [USER-IP] }, d4e4bc74787ab73a71a821a3c2767d85}
31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-3 | bitswarm.core.SocketWriter |     | Error during write. Session: { Id: 58616, Type: DEFAULT, Logged: No, IP: [USER-IP] }


But what concerns me more are these exceptions:

Code: Select all

31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | Problems in SocketWriter main loop, Thread: Thread[SocketWriter-1,5,main]
31 Jul 2018 | 12:21:19,710 | WARN  | SocketWriter-1 | bitswarm.core.SocketWriter |     | java.lang.NullPointerException
   com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:244)
   com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   java.lang.Thread.run(Thread.java:745)


Thanks
Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 02 Aug 2018, 11:35

The two things are connected. If the session dies during processing it can cause an NPE, as the two processes are asynchronous (the connection dying and the thread writing to that connection).

Also, these are only BlueBox-related, but I'd expect the problem to affect all users, and typically BlueBox users are a small minority.

Finally, as regards investigating the network state, it would be much better to start at the source by watching the traffic stats outside of your machine. If everything looks good you could move one step up and check at the OS level.

Are you using a dedicated hosting provider? If not, could you explain what environment and internet connectivity type your machine(s) use?

Thanks
Lapo

--

gotoAndPlay()

...addicted to flash games
rewb0rn
Posts: 177
Joined: 26 Jan 2009, 12:54
Location: Berlin, Germany

Re: Problems after upgrading to SFS2X 2.13.0

Postby rewb0rn » 02 Aug 2018, 11:46

Okay, understood. You are right, BlueBox is the minority, but it was the only meaningful exception in the logs at that time.

We are renting dedicated bare-metal machines to run SmartFox on. The operating system is Ubuntu 16.04 and the bandwidth is 1 Gbit/s up and down. Our provider does offer network traffic monitoring measured on the hardware network interface, however the resolution is coarse (5 minutes) and gets worse the further back you go. So unfortunately it does not show anything meaningful for a 30-second incident from 2 days ago. Let me know if you need more info about our machines.

I am looking at high-resolution logging at the OS level now, as described here: http://xmodulo.com/measure-packets-per- ... rface.html

This can be used to log bandwidth per second like this:

Code: Select all

./netspeed.sh eth0
TX eth0: 2255 kB/s RX eth0: 883 kB/s
TX eth0: 1805 kB/s RX eth0: 739 kB/s
TX eth0: 2697 kB/s RX eth0: 973 kB/s
TX eth0: 1592 kB/s RX eth0: 708 kB/s
TX eth0: 2355 kB/s RX eth0: 939 kB/s
TX eth0: 2342 kB/s RX eth0: 936 kB/s
TX eth0: 1507 kB/s RX eth0: 754 kB/s
TX eth0: 2591 kB/s RX eth0: 961 kB/s
TX eth0: 1684 kB/s RX eth0: 784 kB/s
TX eth0: 2227 kB/s RX eth0: 913 kB/s
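
For the record, a rough Java equivalent of that sampling (reading the interface counters from /proc/net/dev once per second; the interface name and field positions are Linux-specific assumptions):

Code: Select all

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class NetSpeedLogger {
    public static void main(String[] args) throws Exception {
        String iface = args.length > 0 ? args[0] : "eth0";
        long prevRx = -1, prevTx = -1;
        while (true) {
            long[] c = readBytes(iface); // c[0] = RX bytes, c[1] = TX bytes
            if (prevRx >= 0) {
                System.out.printf("TX %s: %d kB/s RX %s: %d kB/s%n",
                        iface, (c[1] - prevTx) / 1024, iface, (c[0] - prevRx) / 1024);
            }
            prevRx = c[0];
            prevTx = c[1];
            Thread.sleep(1000);
        }
    }

    // /proc/net/dev line: "eth0: rxBytes rxPackets ... (8 RX fields) txBytes txPackets ..."
    private static long[] readBytes(String iface) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/dev"), StandardCharsets.UTF_8);
        for (String line : lines) {
            String t = line.trim();
            if (t.startsWith(iface + ":")) {
                String[] f = t.substring(iface.length() + 1).trim().split("\\s+");
                return new long[] { Long.parseLong(f[0]), Long.parseLong(f[8]) };
            }
        }
        throw new IOException("Interface not found: " + iface);
    }
}
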
Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 02 Aug 2018, 15:29

Hi,
sure you can check the OS network stats.
Another thing to check (maybe you've already done it) is to take a look at the CPU usage patterns and see if there's some correlation with extra CPU usage during those "blackout" moments, but I don't think it's CPU-related. If it were, you would see the whole server slow down, not just the outgoing traffic.

The thread dump could be useful to see what's going on in the server, but if the issue is triggered from outside, it won't help us find the cause.

Unless you're messing with the server's internals, there's really no good reason why the output threads should pause. Maybe you can review your server-side code and make sure you're not writing data in for loops? That might cause performance issues, though it can't block the server as you have described.

Cheers
Lapo

--

gotoAndPlay()

...addicted to flash games
rewb0rn
Posts: 177
Joined: 26 Jan 2009, 12:54
Location: Berlin, Germany

Re: Problems after upgrading to SFS2X 2.13.0

Postby rewb0rn » 02 Aug 2018, 15:50

You are right, it's not CPU-related; there is no increase in CPU load during that phase.

OS-level traffic logging is active now and the automatic thread dump is ready to be deployed with the next update, so we will have to wait for it to happen again. I will tell our developer to review the code for data output in for loops. But I agree: for loops might increase the data on the queue, but they should not block the data being written to the network.

I will update this thread as soon as I have news.
rewb0rn
Posts: 177
Joined: 26 Jan 2009, 12:54
Location: Berlin, Germany

Re: Problems after upgrading to SFS2X 2.13.0

Postby rewb0rn » 27 Aug 2018, 07:55

Hello,

last Friday it happened again, and this time we have a thread dump that was triggered after the writer queue had been stuck for 20 seconds. It looks like the writer threads are blocked on a lock, waiting for a BlueBox request. Here is the relevant lock data; let me know if you need more info:


Code: Select all

"SocketWriter-3" #44 prio=5 os_prio=0 tid=0x00007f68fcf37800 nid=0x27da waiting for monitor entry [0x00007f682f993000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   at sfs2x.bluebox.v3.data.BBClient.enqueueMessage(BBClient.java:132)
   - waiting to lock <0x0000000653b9e490> (a sfs2x.bluebox.v3.data.BBClient)
   at com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:254)
   at com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   at com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)

"SocketWriter-2" #43 prio=5 os_prio=0 tid=0x00007f68fcf36000 nid=0x27d9 waiting for monitor entry [0x00007f682fa94000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   at sfs2x.bluebox.v3.data.BBClient.enqueueMessage(BBClient.java:132)
   - waiting to lock <0x0000000653b9e490> (a sfs2x.bluebox.v3.data.BBClient)
   at com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:254)
   at com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   at com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)

"SocketWriter-1" #42 prio=5 os_prio=0 tid=0x00007f68fcf34800 nid=0x27d8 waiting for monitor entry [0x00007f682fb95000]
   java.lang.Thread.State: BLOCKED (on object monitor)
   at sfs2x.bluebox.v3.data.BBClient.enqueueMessage(BBClient.java:132)
   - waiting to lock <0x0000000653b9e490> (a sfs2x.bluebox.v3.data.BBClient)
   at com.smartfoxserver.bitswarm.core.SocketWriter.processBlueBoxSession(SocketWriter.java:254)
   at com.smartfoxserver.bitswarm.core.SocketWriter.processSessionQueue(SocketWriter.java:220)
   at com.smartfoxserver.bitswarm.core.SocketWriter.run(SocketWriter.java:193)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)


waiting for

Code: Select all

"qtp639467920-947676" #947676 prio=5 os_prio=0 tid=0x00007f688400f000 nid=0x15ad waiting on condition [0x00007f679a5f7000]
   java.lang.Thread.State: WAITING (parking)
   at sun.misc.Unsafe.park(Native Method)
   - parking to wait for  <0x0000000654ef2c98> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
   at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
   at org.eclipse.jetty.util.SharedBlockingCallback$Blocker.block(SharedBlockingCallback.java:210)
   at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:164)
   at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:299)
   at org.eclipse.jetty.server.HttpWriter.flush(HttpWriter.java:56)
   at org.eclipse.jetty.server.ResponseWriter.flush(ResponseWriter.java:147)
   - locked <0x0000000655612ad8> (a org.eclipse.jetty.server.Utf8HttpWriter)
   at sfs2x.bluebox.v3.BlueBox.flushMessageQueue(BlueBox.java:315)
   at sfs2x.bluebox.v3.BlueBox.handlePollingRequest(BlueBox.java:265)
   at sfs2x.bluebox.v3.BlueBox.doPost(BlueBox.java:123)
   - locked <0x0000000653b9e490> (a sfs2x.bluebox.v3.data.BBClient)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
   at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
   at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:845)
   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1689)
   at sfs2x.bluebox.v3.SessionFilter.doFilter(SessionFilter.java:124)
   at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
   at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
   at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
   at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
   at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
   at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
   at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
   at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
   at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
   at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
   at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
   at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
   at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
   at org.eclipse.jetty.server.Server.handle(Server.java:524)
   at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
   at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
   at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
   at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
   at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
   at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
   at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
   at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
   at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
   at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
   at java.lang.Thread.run(Thread.java:745)


We also have OS level network statistics:

Code: Select all

Date,kb/s Outgoing,kb/s Incoming,Packets Outgoing,Packets Incoming
2018-08-24 14:22:17,2282,815,13923,11402
2018-08-24 14:22:18,2463,973,16665,14007
2018-08-24 14:22:19,3091,1114,20565,16069
2018-08-24 14:22:20,2613,934,16796,13468
2018-08-24 14:22:21,2222,880,14485,12604
2018-08-24 14:22:22,2167,841,14026,11733
2018-08-24 14:22:23,3595,1282,23902,18661
2018-08-24 14:22:24,3343,1156,21501,16407
2018-08-24 14:22:25,174,198,1622,1786
2018-08-24 14:22:27,136,143,1318,1243
2018-08-24 14:22:28,86,105,935,925
2018-08-24 14:22:29,62,96,844,818
2018-08-24 14:22:30,73,92,882,873
2018-08-24 14:22:31,72,117,830,813
2018-08-24 14:22:32,57,81,772,746
2018-08-24 14:22:33,79,89,865,811
2018-08-24 14:22:34,60,86,816,802
2018-08-24 14:22:35,61,95,720,695
2018-08-24 14:22:36,18,20,206,200
2018-08-24 14:22:37,26,21,213,187
2018-08-24 14:22:38,29,20,205,172
2018-08-24 14:22:39,20,16,166,154
2018-08-24 14:22:40,14,32,136,142
2018-08-24 14:22:41,27,14,171,156
2018-08-24 14:22:42,89,59,653,542
2018-08-24 14:22:43,406,246,3060,2715
2018-08-24 14:22:44,417,293,3428,3344
2018-08-24 14:22:45,448,348,3580,3723
2018-08-24 14:22:46,447,340,3932,3889
2018-08-24 14:22:47,489,354,4180,4091
2018-08-24 14:22:48,569,392,4554,4236
2018-08-24 14:22:49,640,434,4852,4570
2018-08-24 14:22:50,854,495,6042,5234
2018-08-24 14:22:51,821,496,5685,5091
2018-08-24 14:22:52,694,484,4760,4649
2018-08-24 14:22:53,139,208,1761,2062
2018-08-24 14:22:54,117,160,1478,1931
2018-08-24 14:22:55,5008,3385,26765,21437
2018-08-24 14:22:56,6301,3783,32282,24222
2018-08-24 14:22:57,4763,3012,26356,20807
2018-08-24 14:22:58,3537,2518,18461,14304
2018-08-24 14:22:59,4111,2895,21920,17502


So it looks like the network traffic decreases because the writers are blocked, but it does not drop completely to zero, so it is probably not a hardware problem (?).

What exactly is the BlueBox thread waiting for? Is it something that could be optimized in a future patch? We use 512 extension threads and 512 system threads each, which is quite a lot. So maybe that explains why we hit this condition more often than other companies?

Thanks in advance
Lapo
Site Admin
Posts: 23008
Joined: 21 Mar 2005, 09:50
Location: Italy

Re: Problems after upgrading to SFS2X 2.13.0

Postby Lapo » 27 Aug 2018, 09:23

Thanks, this should be useful.
I don't remember having asked about the number of BlueBox clients connected. You said you have between 5K and 10K CCU online; what's the average number of people connected via BlueBox?

We use 512 extension threads and 512 system threads each, which is quite a lot. So maybe that explains why we hit this condition more often than other companies?

Maybe, but it's not even that unusual. With high traffic it's expected to have higher thread counts, especially if you rely on external calls to other services.
The 512 + 512 threads: do you mean they are all active? If so, can you give us some info on what these extra threads are for?

If you prefer to share this info privately, we can continue the discussion via our support@... mailbox.
Thanks.
Lapo

--

gotoAndPlay()

...addicted to flash games
