
Device runs very very slow C++ code (even compared to an iPad 2)

mlfarrell
edited September 2016 in Questions And Answers

What gives? I'm seeing 30 millisecond frame times and am CPU bound for VERY simple rendering passes through my engine. An ancient iPad 2 blows this performance out of the water? Am I doing something wrong? I have optimizations turned all the way up for my release build in Visual Studio. This can't be the fastest the HoloLens can go; if it is, I'm in big trouble.

Edit: hmm, some of this seems to be the diagnostic tools lying to me. The analyzer says my frame time is 30 ms, but when I measure it in the actual render thread it's only 6 ms. So I'm assuming the GPU has more of a role in this.

Answers

  • Options


    Here's the problem. No idea how I can do anything about this, since most of that is ANGLE (OpenGL ES running on top of D3D11). Am I screwed here or what? Is there some magic silver-bullet setting in the MSVC compiler options to make it generate faster CPU code?

  • Options

    I don't think ANGLE is completely supported yet, but there's been talk about a road map.

    https://github.com/Microsoft/HoloToolkit/issues/38#issuecomment-246819579

    Stephen Hodgson
    Microsoft HoloLens Agency Readiness Program
    Virtual Solutions Developer at Saab
    HoloToolkit-Unity Moderator

  • Options

    You said you were CPU bound...is the CPU at 100%?

    ===
    This post provided as-is with no warranties and confers no rights. Using information provided is done at own risk.

    (Daddy, what does 'now formatting drive C:' mean?)

  • Options

    First, I've noticed that the shaders seem to be the biggest bottleneck, so slim those down.
    Next, there is a way to speed up MSVC builds, and that is to use the Intel compiler.

  • Options

    @spt said:
    First, I've noticed that the shaders seem to be the biggest bottleneck, so slim those down.
    Next, there is a way to speed up MSVC builds, and that is to use the Intel compiler.

    Will the Intel compiler work with HoloLens? If so, where can I grab that thing now?

    My bottleneck here is 110% the CPU taking too long to prepare the context to render. The shaders are barely causing a problem (note the relatively short burst of GPU activity compared to the CPU). I gained a massive speedup by inserting a context flush call after my biggest draw calls, but I still need more optimization.
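
    Roughly what I mean, as a minimal sketch (the buffer handles and index count are placeholders, not my actual engine code, and vertex attribute setup is assumed to happen elsewhere):

    #include <GLES2/gl2.h>

    // Draw one of the heavy meshes, then flush so the driver starts submitting
    // the queued D3D work instead of sitting on it until present.
    void DrawBigMesh(GLuint vbo, GLuint ibo, GLsizei indexCount)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, nullptr);
        glFlush(); // the "context flush call" mentioned above
    }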

  • Options

    @mlfarrell said:
    Will the Intel compiler work with HoloLens?

    I have not tried it on HoloLens. I see no reason why not, since it uses an Intel chip.

  • Options

    How much memory are you allocating each frame?

    ===
    This post provided as-is with no warranties and confers no rights. Using information provided is done at own risk.

    (Daddy, what does 'now formatting drive C:' mean?)

  • Options

    @spt said:

    @mlfarrell said:
    Will the Intel compiler work with HoloLens?

    I have not tried it on HoloLens. I see no reason why not, since it uses an Intel chip.

    I'm downloading the trial for it now. I'm hoping this will help, since I'm losing most of my time in the overhead of ANGLE's glDrawElements calls, and I have no idea whether those can be optimized further.

  • Options

    ...okay, so after wasting an hour and a half, the Intel compiler doesn't work. It just breaks on VS2015 Update 3.

    So once again I'm dead in the water, in need of a silver bullet. Is there a "best practices" guide somewhere for compiler settings that squeeze the most speed out of the MSVC compiler? I.e., I have no idea whether some of these settings, such as /GS, are hurting or not, and I'd rather not blindly rebuild thousands of lines of code checking and measuring each one.
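
    For what it's worth, these are the usual speed-oriented MSVC switches I know of; whether any of them actually helps on HoloLens is exactly what I'd have to measure:

    :: plain cl/link invocation showing the common "fast code" switches
    ::   /O2      maximize speed
    ::   /Oi      enable intrinsic functions
    ::   /Ot      favor fast code over small code
    ::   /fp:fast relaxed floating-point model
    ::   /GL      whole-program optimization (needs /LTCG when linking)
    ::   /GS-     disable buffer security checks (the /GS I mentioned)
    cl /c /O2 /Oi /Ot /fp:fast /GL /GS- engine.cpp
    link /LTCG engine.obj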

  • Options

    As an experiment, I took the ANGLE template and slightly modified it to draw 20 cubes instead of 1 (I just repeat the draw call over and over), and I'm able to keep 60 Hz. If I start adding more draw calls, I do start seeing the frame rate dip, but I'm GPU bound, not CPU bound.

    My geometry is just the color cube, so it's small and simple. From your screenshot, your app is drawing something different from what I am, and I suspect what you are drawing has more vertices/indices. What is your app trying to draw? Does the app create or update the buffers (calls to glGenBuffers or glBufferData) each frame?

    And as a potential silver bullet... have you tried calling eglSwapInterval with 0 as the interval?
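
    Roughly the pattern I'm asking about, as a minimal sketch (the EGL display handle and triangle data are placeholders, and vertex attribute setup is omitted):

    #include <EGL/egl.h>
    #include <GLES2/gl2.h>

    // Placeholder geometry, just to make the sketch self-contained.
    static const GLfloat kTriangle[] = { 0.0f, 0.5f, 0.0f,  -0.5f, -0.5f, 0.0f,  0.5f, -0.5f, 0.0f };
    static GLuint g_vbo = 0;

    void InitOnce(EGLDisplay display)
    {
        // Create and fill the vertex buffer once, at startup, not per frame.
        glGenBuffers(1, &g_vbo);
        glBindBuffer(GL_ARRAY_BUFFER, g_vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(kTriangle), kTriangle, GL_STATIC_DRAW);

        // The experiment suggested above: ask EGL not to wait on vsync.
        eglSwapInterval(display, 0);
    }

    void DrawFrame()
    {
        // Per frame: bind and draw only; no glGenBuffers/glBufferData here.
        glBindBuffer(GL_ARRAY_BUFFER, g_vbo);
        glDrawArrays(GL_TRIANGLES, 0, 3);
    }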

    ===
    This post provided as-is with no warranties and confers no rights. Using information provided is done at own risk.

    (Daddy, what does 'now formatting drive C:' mean?)

  • Options

    I change the VAO and draw a different mesh each time (30-60 different meshes); this dirties the ANGLE vertex array state and causes a very expensive synchronization.
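
    In sketch form (Mesh is a stand-in for my engine's mesh type; glBindVertexArrayOES comes from the OES_vertex_array_object extension that ANGLE exposes, and fetching it via eglGetProcAddress is omitted):

    #include <GLES2/gl2.h>
    #include <GLES2/gl2ext.h>
    #include <vector>

    struct Mesh { GLuint vao; GLsizei indexCount; }; // placeholder for my engine's mesh type

    void DrawScene(const std::vector<Mesh>& meshes,
                   PFNGLBINDVERTEXARRAYOESPROC glBindVertexArrayOES)
    {
        for (const Mesh& mesh : meshes) // 30-60 distinct meshes per frame
        {
            glBindVertexArrayOES(mesh.vao); // each switch dirties ANGLE's cached vertex array state
            glDrawElements(GL_TRIANGLES, mesh.indexCount, GL_UNSIGNED_SHORT, nullptr);
        }
    }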

    A swap interval of 0 would screw up the HoloLens predictions.

  • Options

    @mlfarrell said:
    ...the Intel compiler doesn't work. It just breaks on VS2015 Update 3.

    Would it be possible to compile it without VS2015, and then link it in as an obj/library?

  • Options
    I suppose I could. Downloading another VS isn't something I have time for currently. Last night I busted my rump to shave as much time as I could off both my rendering code and ANGLE itself. I disabled parts of the GL validation within the ANGLE code and got my performance up to 50-60 FPS.

    That's drawing the entire spatial mesh, flushing, then making about 40 unique GL draw calls. I suppose that's fine for now.

    I was able to profile the ANGLE code and determine that the biggest bottleneck per draw call is the D3D Map/Unmap calls that ANGLE uses to update the uniforms via one global constant buffer. I tried swapping that out for smaller, batched UpdateSubresource calls but couldn't get that to work yet. Once I do, I'll see whether it's faster, since my default shader uses a ton of uniforms and I suspect the global constant buffer is quite large for that shader, even though the update ranges are rather small in comparison.
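
    For anyone curious, the two update paths I'm comparing look roughly like this (an illustration, not ANGLE's actual code; note the buffer also has to be created with the matching usage flag, DYNAMIC for Map and DEFAULT for UpdateSubresource):

    #include <d3d11.h>
    #include <cstring>

    void UpdateUniformBuffer(ID3D11DeviceContext* context, ID3D11Buffer* buffer,
                             const void* data, size_t size, bool useMap)
    {
        if (useMap)
        {
            // The path ANGLE uses today: Map with WRITE_DISCARD and rewrite the whole buffer.
            D3D11_MAPPED_SUBRESOURCE mapped = {};
            if (SUCCEEDED(context->Map(buffer, 0, D3D11_MAP_WRITE_DISCARD, 0, &mapped)))
            {
                memcpy(mapped.pData, data, size);
                context->Unmap(buffer, 0);
            }
        }
        else
        {
            // The alternative I'm experimenting with. On plain D3D11 a constant buffer
            // must be updated as a whole (pDstBox == nullptr); partial updates need the
            // D3D11.1 UpdateSubresource1 path.
            context->UpdateSubresource(buffer, 0, nullptr, data, 0, 0);
        }
    }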

    Man.. HoloLens engine dev is like 1st gen iPhone dev. Can't wait for the faster generations of this hardware where all this stuff won't matter as much.
  • Options

    @mlfarrell said:
    I change the VAO and draw a different mesh each time (30-60 different meshes); this dirties the ANGLE vertex array state and causes a very expensive synchronization.

    A swap interval of 0 would screw up the HoloLens predictions.

    A swap interval of 0 might screw up the HoloLens predictions, but I'm concerned about the big gaps in the trace where the system looks idle. If you look at the documentation for rendering in DirectX, you'll notice that Present has a special requirement for best results. If GL is vsyncing and you are also following that guidance for rendering, you might be waiting for vsync twice.

    ===
    This post provided as-is with no warranties and confers no rights. Using information provided is done at own risk.

    (Daddy, what does 'now formatting drive C:' mean?)

  • Options
    mlfarrell
    edited September 2016

    @Patrick said:

    @mlfarrell said:
    I change the VAO and draw a different mesh each time (30-60 different meshes); this dirties the ANGLE vertex array state and causes a very expensive synchronization.

    A swap interval of 0 would screw up the HoloLens predictions.

    A swap interval of 0 might screw up the HoloLens predictions, but I'm concerned about the big gaps in the trace where the system looks idle. If you look at the documentation for rendering in DirectX, you'll notice that Present has a special requirement for best results. If GL is vsyncing and you are also following that guidance for rendering, you might be waiting for vsync twice.

    You're right about the dead time; however, on the HoloLens fork of ANGLE they don't honor the EGL swap interval at all, so there's no double wait. Also, on my latest profile I'm not missing the deadline nearly as much anymore. I'd still rather have vsync always on and provide a better UX.

  • Options

    Okay... I was investigating in master. I'm checking out the holographic branch.

    Looks like the spinning cube template has a memory leak.

    ===
    This post provided as-is with no warranties and confers no rights. Using information provided is done at own risk.

    (Daddy, what does 'now formatting drive C:' mean?)

  • Options
    mlfarrellmlfarrell
    edited September 2016

    Per-frame leak? I don't use much of the template anymore, since I gutted it and route all rendering through my engine now.

    The branch I'm on is actually a fork that I'm heavily modifying to suit my needs: https://github.com/mlfarrell/angle

  • Options

    Okay, the leak seems to be in your branch as well; it is likely per frame. I'll diagnose. Is the project you are working on in your branch?

    ===
    This post provided as-is with no warranties and confers no rights. Using information provided is done at own risk.

    (Daddy, what does 'now formatting drive C:' mean?)

  • Options

    No, it isn't.

    Thanks, yeah, that could definitely be a problem in the long run.

  • Options

    I have a solution for the memory leak and will publish it soon. Thanks, @Patrick, for the heads-up.

    The recent update to the holographic branch includes a way of telling it to wait on VBlank (or not) when presenting the holographic frame. You might try toggling it on and off to see if there is a measurable difference. To use this feature, add this code after creating the window surface and EGL context, and after making the context current:

    // By default, allow HolographicFrame::PresentUsingCurrentPrediction() to wait for the current frame to 
    // finish before it returns.
    eglSurfaceAttrib(mEglDisplay, mEglSurface, EGLEXT_WAIT_FOR_VBLANK_ANGLE, true);
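
    To try the non-blocking behavior instead, make the same call with false as the value:

    // Let PresentUsingCurrentPrediction() return without waiting for the frame to finish.
    eglSurfaceAttrib(mEglDisplay, mEglSurface, EGLEXT_WAIT_FOR_VBLANK_ANGLE, false);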
    

    The graphics diagnostics tools are known to reduce the framerate to 30 FPS. So, while the graphics diagnostics tools are a great way to inspect the graphics pipeline, I wouldn't recommend measuring FPS that way. For that, you should use your own QPC (QueryPerformanceCounter) timer while running in release mode.
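
    If it helps, a minimal QPC frame timer looks something like this (just a sketch, not code from the template; call Tick() once per frame and log the returned milliseconds):

    #include <windows.h>

    class FrameTimer
    {
    public:
        FrameTimer()
        {
            QueryPerformanceFrequency(&m_frequency);
            QueryPerformanceCounter(&m_last);
        }

        // Returns milliseconds elapsed since the previous Tick() call.
        double Tick()
        {
            LARGE_INTEGER now;
            QueryPerformanceCounter(&now);
            double ms = 1000.0 * double(now.QuadPart - m_last.QuadPart) / double(m_frequency.QuadPart);
            m_last = now;
            return ms;
        }

    private:
        LARGE_INTEGER m_frequency;
        LARGE_INTEGER m_last;
    };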

    @stephenhodgson: We have experimental support for ANGLE on HoloLens available here: https://github.com/Microsoft/angle/tree/ms-holographic-experimental

  • Options

    Haha, we meet again, Mike. Outside of GitHub this time.

    I use a timer to measure frames; the tool just confirmed it. Doesn't not waiting for VBlank cause tearing and/or bad holographic frame predictions?

    Right now I can do about 30-45 unique draw calls after rendering most of the spatial meshes in D3D. That should be enough for my needs for now.

    I may also try (if I can get the thing to work) compiling ANGLE itself with Intel's commercial compiler at some point to see whether it is noticeably faster.

  • Options

    @mlfarrell: Yes, it is true! I am able to respond here as well.

    Apps should wait for VBlank because the holographic frame prediction will be more up-to-date that way, and it ensures the GPU work is synchronized. However, if your app is going to take more than 1 frame to finish rendering most of the time, you might get better results by not waiting for VBlank. VSync is still enabled either way; the only difference is whether or not PresentUsingCurrentPrediction() waits until the frame is done before it returns. If it does not wait, your app can start doing work for the next frame sooner. In this case I recommend partitioning drawing work to the end of the pipeline as much as possible, so that you can use an updated frame prediction to render with.

    I just pushed a change to fix the per-frame memory leak identified by @Patrick. The fix is in ANGLE itself, as opposed to the app template, so you might want to take a look at it for your branch.

    Would love to know the performance comparison with the Intel compiler, if you decide to get it working.
