For the test we used a specific tool, called BitSmasher, that we have refined over the last 5-6 years.
At the moment this tool is not available for public use, but we might be able to release it in the future, once we have made it a bit more user-friendly and written the necessary documentation.
The tool itself is just a sophisticated client-code replicator. It takes your client logic, written with the regular Java API and wrapped in a specific class, and generates thousands of copies, each with its own connection, etc.
The difference between BitSmasher and doing the same job manually is that BitSmasher uses an optimized threading system, which lets you run, for example, 100K CCU with ~60 threads instead of 1-2 threads per connection. In other words, it replaces the threading inside the Java API with a specialized thread pool that can service thousands of client instances at once.
If you did the same manually, launching 100K Java clients in the same JVM, you would end up with roughly 2 x CCU threads, in this case about 200K threads, which is clearly too much for any server machine.
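To give a rough idea of the principle (this is not BitSmasher code, just a minimal sketch using the standard java.util.concurrent classes), the example below runs many simulated clients on one small fixed thread pool. SimulatedClient, handleEvent and the numbers used are all made up for illustration, and no real networking or Java API calls are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedPoolSketch {

    /** Hypothetical stand-in for one client instance; it does no real networking. */
    static final class SimulatedClient {
        private final AtomicInteger handled = new AtomicInteger();

        void handleEvent() { handled.incrementAndGet(); }

        int handledCount() { return handled.get(); }
    }

    /** Thread count when every client owns its own threads. */
    static long dedicatedThreads(int clients, int threadsPerClient) {
        return (long) clients * threadsPerClient;
    }

    /** Runs `rounds` events per client on one fixed pool and returns the total handled. */
    static long runOnSharedPool(int clients, int poolSize, int rounds) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        List<SimulatedClient> all = new ArrayList<>(clients);
        for (int i = 0; i < clients; i++) {
            all.add(new SimulatedClient());
        }
        CountDownLatch done = new CountDownLatch(clients * rounds);
        try {
            for (int r = 0; r < rounds; r++) {
                for (SimulatedClient c : all) {
                    // Each event is a short task; no thread belongs to a single client.
                    pool.execute(() -> {
                        c.handleEvent();
                        done.countDown();
                    });
                }
            }
            done.await(); // wait until every queued event has been processed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for clients", e);
        } finally {
            pool.shutdown();
        }
        long total = 0;
        for (SimulatedClient c : all) {
            total += c.handledCount();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(dedicatedThreads(100_000, 2));  // prints 200000
        System.out.println(runOnSharedPool(10_000, 8, 3)); // prints 30000
    }
}
```

The point is only that the number of threads is decoupled from the number of clients: here 8 pool threads serve 10,000 client objects, while the per-client model would need 2 threads for each of them.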
Another advantage is that BitSmasher supports master-slave clustering, so you can use any number of client machines to build a massive test and control them as if they were one.
For smaller tests, up to 10-15K CCU, you can probably do it manually by replicating as many clients as your stress-test machine can handle, and use multiple machines to reach a higher load.
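If you go the manual, multi-machine route, a quick way to plan it is to split the target CCU evenly across the machines. The sketch below is just arithmetic; LoadSplitSketch and split are hypothetical names, and the 15,000 per-machine limit is an assumption taken from the upper end of the range above, so measure what your own machines can really handle.

```java
import java.util.ArrayList;
import java.util.List;

final class LoadSplitSketch {

    /**
     * Splits totalClients across the smallest number of machines such that
     * no machine runs more than perMachineLimit clients.
     */
    static List<Integer> split(int totalClients, int perMachineLimit) {
        if (totalClients < 0 || perMachineLimit <= 0) {
            throw new IllegalArgumentException("invalid client count or per-machine limit");
        }
        // Ceiling division: machines needed to stay within the limit.
        int machines = (totalClients + perMachineLimit - 1) / perMachineLimit;
        List<Integer> shares = new ArrayList<>();
        if (machines == 0) {
            return shares;
        }
        int base = totalClients / machines;
        int extra = totalClients % machines; // the first `extra` machines take one more client
        for (int i = 0; i < machines; i++) {
            shares.add(base + (i < extra ? 1 : 0));
        }
        return shares;
    }

    public static void main(String[] args) {
        System.out.println(split(40_000, 15_000)); // prints [13334, 13333, 13333]
    }
}
```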
Hope it helps.