Up Topic Gosu / Gosu Exchange / Rendering Performance Suggestion
- - By Antikythera Date 2010-10-03 21:44
Hello all

I have been having fun playing around with Gosu. I really like the neat, modern coding style. It looks like it will be a solid foundation for my game.

I decided to give Gosu a stress test to check its performance, so I remade one of the demo projects for the Haaf Game Engine library. This displays 1000 copies of a single image bouncing around the window and lets the frame rate run as fast as possible. Gosu held up well in this very artificial test - after turning off vsync, it got about 650 frames per second compared to the 1000 FPS that HGE got - probably smooth enough for most people. However, when I had a look with the Very Sleepy profiler, I noticed that Gosu seemed to be spending a significant amount of time in std::_Tree insert methods and also (bizarrely) in wcsncpy. Very Sleepy reports that wcsncpy is being called by free and malloc, which are also taking a fair amount of time.

I tracked all these calls to the std::multiset that stores the DrawingOp queue. It seems to me that a sorted collection is not actually required for the drawing queue, because the queue is only ever read all in one go in performDrawOps or compileTo. I tried replacing the multiset with a std::vector and putting a std::sort at the start of performDrawOps and compileTo - this does require making these methods non-const. The idea is to allocate memory in large chunks rather than one DrawingOp at a time. Performance in my stress test program increased to 1000 FPS or above.
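For readers who want to picture the change, here is a minimal sketch of a vector-backed queue that is sorted only when it is read. It is not the actual Gosu code: DrawOp, VectorDrawQueue, schedule and perform are hypothetical names, the struct is cut down to a z value plus an id, and it assumes a C++11 compiler.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for Gosu's DrawingOp; the real one is much bulkier.
struct DrawOp {
    double z;
    int id; // illustrative payload only
};

// Ops are ordered by z alone.
inline bool operator<(const DrawOp& a, const DrawOp& b) { return a.z < b.z; }

class VectorDrawQueue {
public:
    // push_back grows the storage in chunks, whereas a std::multiset
    // allocates one tree node per inserted element.
    void schedule(const DrawOp& op) { ops_.push_back(op); }

    // Non-const, because the storage is sorted in place before it is read.
    template <typename Fn>
    void perform(Fn&& draw) {
        // std::sort makes no promise about the relative order of equal z values.
        std::sort(ops_.begin(), ops_.end());
        for (const DrawOp& op : ops_) draw(op);
    }

    void clear() { ops_.clear(); }

private:
    std::vector<DrawOp> ops_;
};
```

Calling schedule() for each draw call and perform() once per frame mirrors the "write many, read once" pattern described above.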

I am using Windows and Visual Studio 2010, in case that is of interest.

I do not know whether this change is enough to make a significant difference in actual games that do not use thousands of drawing operations. However, I cannot think of any significant downsides. I am happy to post a patch if you want it.
Parent - - By jlnr (dev) Date 2010-10-03 21:57
The reason for using multiset is simply that it was the first thing that came to my mind when I was writing this, and I agree that it might not be the best data structure. However, std::sort will not work as a replacement because IIRC it is not guaranteed to be stable. Gosu currently guarantees that stuff drawn on the same Z level will end up being drawn in the same order as draw*() was called. I still guess that using a vector and a stable sorting approach might be faster. Do you know any algo in the std library or boost that we could abuse?

FWIW, I am approaching performance from a few sides too right now. There is a lot of unnecessary std::wstring copying happening in the current Font implementation. Sadly, few STL implementations use the Small String Optimization right now, resulting in malloc/free for even single-character strings. Yuck. :(
As you saw in compileTo, there is work towards generating Vertex Arrays for large static collections of images, this also improves performance a lot when applicable.
Then I want to make the internal format of Gosu::Color better suited for direct memcpy() into OpenGL.
And finally, some stuff could be moved into a separate thread. Being able to asynchronously load images or render text would make the *perceived* performance of Gosu games much better for in-game loading.

That is just WIP trivia because I *am* running into performance trouble on the iPhone 3G, so thanks for the research, all input is welcome :)
Parent - - By Antikythera Date 2010-10-04 01:04
Fair point on the stability of std::sort. I realised stability would be necessary, but for some reason assumed std::sort was stable. However I notice that there is a std::stable_sort. I will alter my stress test to use it tomorrow and see how much that reduces the frame rate.
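To make the stability requirement concrete, here is a small sketch using std::stable_sort, which guarantees that elements comparing equal keep their original relative order. DrawOp, byZ and sortForDrawing are made-up names for illustration, and seq simply records the order in which draw calls were issued.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, cut-down op: z level plus the order in which it was queued.
struct DrawOp {
    double z;
    int seq;
};

inline bool byZ(const DrawOp& a, const DrawOp& b) { return a.z < b.z; }

// Sorts by z; ops sharing a z level stay in the order they were queued.
inline void sortForDrawing(std::vector<DrawOp>& ops) {
    std::stable_sort(ops.begin(), ops.end(), byZ);
}
```

With ops queued as z = 1, 0, 1, 0, 1, the two z = 0 ops come first in their queued order, followed by the three z = 1 ops in theirs.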

I am interested to hear your plans for performance improvement. Asynchronously loading sounds like a neat idea. I saw the work on using vertex arrays for the iPhone version. Are you thinking of using this method on all platforms? It looks like a good idea from a performance perspective, but it would take rather more work to get drawTriangle, drawQuad, macros, etc to work nicely with it.

Are macros supposed to work at the moment? They would not for me, though I did not try very hard to make them and may have been doing something stupid. Images in a macro were drawn as black squares with fuzz in them. Drawing quads in a macro caused an exception.
Parent - - By jlnr (dev) Date 2010-10-04 10:15
Oh, good call on std::stable_sort. For some reason I'd never heard of it. Well, a patch is welcome then. ;)

Well, macros *are* vertex arrays. :) It's just that on the iPhone version, I have to use VAs for everything because GLES comes without glBegin. But no, they are not supported and it will take another few months until they are.
drawTriangle and drawQuad will likely be approximated using a single-pixel white texture. drawLine should be replaced by a quad too as GL_LINE is too different across drivers.

And on a related note, with DrawOpQueue being a vector and some DrawOp layout tweaking, maybe there is a way to use DrawOpQueue as a VertexArray *directly*? That would certainly be convenient.
Parent - - By Antikythera Date 2010-10-04 22:24 Edited 2010-10-04 23:27
I had not heard of stable_sort either until I went to look up the stability of std::sort.

Unfortunately, when I swapped in stable_sort I found my stress test dropped back to about 670 FPS. I also realised that I was unfairly favouring sorting over multiset by doing my drawing in z order, with most z values the same. When I moved to random z values, the std::sort approach fell back to about 700 FPS. My profiling suggested that these drops were due to time spent copying DrawOps, which are quite bulky, so I have altered DrawOpQueue to sort a vector of pointers to DrawOps with associated z-values rather than the DrawOps themselves. This takes the FPS back up to about 900. I am still keeping the DrawOps themselves in a vector so that memory is allocated for them in large chunks rather than for each in turn.
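The pointer-sorting idea can be sketched roughly as follows. This is not the code from the patch: the DrawOp layout, the DrawOpRef fields and the sortedRefs helper are assumptions made for illustration, with a padding array standing in for the bulky vertex data.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical bulky op; the array only mimics the size of real vertex data.
struct DrawOp {
    double z;
    double vertices[16];
    int id;
};

// Small handle: a copy of z next to a pointer to the full op, so that
// sorting moves a few bytes per element instead of a whole DrawOp.
struct DrawOpRef {
    double z;
    const DrawOp* op;
};

inline bool refLess(const DrawOpRef& a, const DrawOpRef& b) { return a.z < b.z; }

// Builds refs into ops and stable-sorts the refs; ops itself is not reordered.
// The refs are only valid as long as ops is not resized or destroyed.
inline std::vector<DrawOpRef> sortedRefs(const std::vector<DrawOp>& ops) {
    std::vector<DrawOpRef> refs;
    refs.reserve(ops.size());
    for (const DrawOp& op : ops) refs.push_back(DrawOpRef{op.z, &op});
    std::stable_sort(refs.begin(), refs.end(), refLess);
    return refs;
}
```

The refs have to be built after the last op has been queued, because a push_back that reallocates the ops vector would leave earlier pointers dangling.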

I have uploaded the patch to http://www.mediafire.com/?x34uexq6vpj4wmc . I have also included a full copy of DrawOpQueue.hpp just in case, as I have not made a patch with svn before. I will be interested to hear if it makes any difference where performance is a limitation.

An improvement might be to avoid the copying of existing DrawOps when the vector is expanded - push pointers to DrawOps into a vector as they were added, but allocate the memory for the DrawOps themselves in large chunks. It would be nice to also reuse the DrawOp memory between frames - a DrawOpPool maybe? [EDIT] Actually I just checked and I am not sure copying of existing DrawOps is worth worrying about, reserving a large vector in advance does not seem to make any difference in my stress test. [/EDIT]

It would be very elegant to be able to use a DrawOps vector for the VertexArray directly, however I expect the copying costs during sorting might outweigh the benefits of not copying into a separate VertexArray.

Using a single-pixel texture to approximate drawTriangle and drawQuad sounds like a very neat way of unifying the drawing operations.

An idea that just occurred to me is that you could easily offer the option of disabling the sort operation, disabling most of the overhead of z-order, for those who wanted just to rely on order of drawing.
Attachment: ZSortingPatch.zip - Z-Ordering Patch (3k)
Parent - - By erisdiscord Date 2010-10-05 05:12 Edited 2010-10-05 05:48
Oh dear, you should put your patch on Pastebin if it's too large for a forum post. It even has syntax highlighting! For patches, even.

No, scratch that, I completely forgot about forum attachments, as Julian has pointed out.
Parent - - By jlnr (dev) Date 2010-10-05 05:45
I guess I should add a "you can attach files after posting" line under the input text box. This board software is kinda alone with the way attachments work. ;)
Parent - - By erisdiscord Date 2010-10-05 05:50
Oh jeeze, I'm pretty sure I've even used the attachment feature before and I completely forgot about it. Thanks for catching my mistake, sir!

On the other hand, I think Pastebin is a useful service and it can't hurt to let people know about it. :)
Parent - - By jlnr (dev) Date 2010-10-05 05:52
Only embedding and automatic syntax-highlighting of attached code files here could be more awesome. Sadly, my Perl is too weak, so +1 for pastebin. :)
Parent - By erisdiscord Date 2010-10-05 14:10
Calling pygments as an external script would probably be the way to go if you really wanted to do it. I think this is what GitHub does.

I used to be pretty competent in Perl but I haven't touched it in a long time since I found Ruby. Hmm.
Parent - - By banister Date 2010-10-05 20:50
gist is better (if you use github) as you can fork them, find them again easily, favorite them. They also get you high.
Parent - - By erisdiscord Date 2010-10-05 22:08
You're absolutely right, but you don't have to have an account to use Pastebin. :)
Parent - By Antikythera Date 2010-10-07 19:51
Well I am spoiled for choice if I come up with any more patches. :)

Thanks for your recommendations all.
Parent - - By jlnr (dev) Date 2010-10-05 05:50
Disabling Z ordering is not an option imho. The interface should not be made more complicated, and Gosu code should be able to mix and match without quirks. :)

The stable_sort of pointers sounds like a good option. I'll take a look at the patch. Thanks again.

BTW, if you just clear() the vector, it will keep its capacity. I haven't seen the code but you might just have an accidental DrawOpPool in there? ;)
Parent - By Antikythera Date 2010-10-07 19:49
I agree with your objections on disabling Z ordering. The potential speed up from the tweak would be small anyway.

I am just clearing the vector in the DrawOpQueue clear method, and you are quite right that this means it maintains its capacity between frames. This would explain why my sticking in calls to reserve had no effect - it's all working better than I expected. The only downside is if someone decides to draw thousands of objects in one frame and then falls back to only a few for a long period, but I suspect that is not a usual pattern of use.
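The capacity behaviour is easy to check in isolation. The helper below is a hypothetical illustration, not part of DrawOpQueue: it fills a vector, clears it, and returns the capacity that is left over.

```cpp
#include <cstddef>
#include <vector>

// Fills a vector with n elements, clears it, and reports the remaining capacity.
inline std::size_t capacityAfterClear(std::size_t n) {
    std::vector<int> queue;
    for (std::size_t i = 0; i < n; ++i) queue.push_back(static_cast<int>(i));
    queue.clear(); // destroys the elements but keeps the allocation
    return queue.capacity();
}
```

If the "one huge frame, then many small ones" case ever mattered, swapping the vector with an empty temporary is the usual way to give the memory back.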
Parent - - By jlnr (dev) Date 2011-01-08 23:31
Hello, I have now finally changed the data structure. To be honest I have not used the patch because I wanted to keep the code simple while adding custom Z-ordered OpenGL at the same time. I used your suggestion of stable_sort though, and I got 3 invaluable FPS more on my iPhone 3G running a typical game scene. Thank you :)

Let me know how the upcoming 0.7.26 release performs in your benchmark.
Parent - - By Antikythera Date 2011-01-09 22:39
No problem about the patch: I am happy if the idea has been useful.

I did another test after my last post, and it seemed the inclusion of a copy of the z-order with the pointer to make up the DrawOpRef class did not actually have a significant effect, so the code was probably overcomplicated. However sorting a vector of pointers rather than the original DrawOps vector did have a significant effect and might be worth considering if that part of the code is still a significant bottleneck.

I am pleased to hear you have added Z-ordered custom OpenGL. This is a feature I thought I would need to add for my game. I note you are saving the OpenGL state and restoring it afterwards. I am guessing this could mean a fairly large speed hit with lots of custom operations, but leaving it to the user does give an easy way to shoot yourself in the foot. I suppose I can always tweak Gosu if I find I need the speed.

I will certainly give the new release a test and let you know. I will post the benchmark code if I get a chance to clean it up a little.
Parent - - By jlnr (dev) Date 2011-01-09 23:15
Yeah, I'll still consider custom GL a special case that I won't optimize much for until some usage patterns emerge in (semi)finished games. If you have lots of GL objects on the same Z level, you can still work around the state push/pop by collecting all objects and drawing them in one run, which sounds as if it would be faster anyway.
BTW I reduced the DrawOp size quite a bit by using float instead of double. (I am not sure if I should change Gosu's interface to use only float, or just the interface on iOS, different story…) That should have sped up things by making the copied data quite a bit smaller, but didn't make the slightest bump in my benchmark. Optimization is a weird game.
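As a rough illustration of the size difference, the sketch below compares the same layout instantiated with double and with float. BasicDrawOp and its fields are invented for this example and are much simpler than the real DrawOp.

```cpp
#include <cstddef>

// Hypothetical layout: four vertices with x/y each, plus a z value.
template <typename Real>
struct BasicDrawOp {
    Real x[4];
    Real y[4];
    Real z;
};

// On typical platforms the float version is about half the size.
constexpr std::size_t doubleOpSize = sizeof(BasicDrawOp<double>);
constexpr std::size_t floatOpSize  = sizeof(BasicDrawOp<float>);
```

A smaller struct means fewer bytes moved per copy during sorting, although whether that shows up in a benchmark depends on where the time actually goes.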
Parent - By Antikythera Date 2011-01-10 00:48
Leaving optimisation until you see some usage seems reasonable. Your workaround might work for me - I am not sure yet if the individual custom drawn entities will need different levels.

It is interesting that going from double to float made no difference at all. That does seem to imply copying is not a bottleneck. I certainly won't dispute that optimization is a weird game.