University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3
- Ruoyu Fan
- Tested on:
  - Windows 10 x64, i7-4720HQ @ 2.60GHz, 16GB Memory, GTX 970M 3072MB (personal laptop)
  - Visual Studio 2015 + CUDA 8.0
Additional third-party library used: tinyobjloader by syoyo (http://syoyo.github.io/tinyobjloader/)
Thanks to @itoupeter for pointing out an error in identifying inner/outer surfaces during refraction evaluation.
*(Preview renders)*

*(Flat shading vs. smooth shading comparison)*
- Basics:
  - Path tracing of diffuse and perfect specular materials.
  - The original glfw3 lib files don't support Visual Studio 2015. I updated glfw3, put the source version into the `external/` folder, and configured `CMakeLists.txt` so the project is compatible with Visual Studio 2015 while still building on the other supported compilers. Also upgraded the CMake FindCUDA module to solve linker errors with CUDA 8.
  - Used `thrust::remove_if` to compact the path segment array... only to find that rendering was slower after stream compaction on structs. After optimizing to run stream compaction on an index array while keeping the original array in place, it is now faster than without compaction.
  - Sorting by material after getting intersections... but the results are much slower. (Toggleable by changing `SORT_PATH_BY_MATERIAL` in `pathtrace.cu`.)
  - Caching first intersections. (Toggleable by changing `CACHE_FIRST_INTERSECTION` in `pathtrace.cu`.)
  - Performance tests for core features:
    - Additional test: sort paths by sorting indices and then reshuffling, instead of sorting in place.
    - Additional test: access structs in global memory vs. copying them to local memory first.
    - Additional optimization: compact an index array instead of the `PathSegments` array. This raised render speed to 120.6% of no stream compaction and 212.9% of the approach that runs stream compaction directly on the `PathSegments` array; see below.
- Features:
  - Loading obj models (with tinyobjloader). If a vertex normal differs from the triangle normal, my ray-triangle intersection gives an interpolated normal (aka smooth shading).
  - Refraction with Fresnel effects
  - Stochastic sampled antialiasing
Since I might abandon some features during development (such as depth of field or certain sampling schemes), I tested the results by toggling `ENABLE_STREAM_COMPACTION`, `SORT_PATH_BY_MATERIAL`, and `CACHE_FIRST_INTERSECTION` on and off, based on this commit: core_features
- `ENABLE_STREAM_COMPACTION`: whether to run stream compaction with `thrust::remove_if` after shading (which, however, involves moving fairly large objects around and slows the program down).
- `SORT_PATH_BY_MATERIAL`: whether to sort paths by material (via `thrust::sort_by_key`). Currently this sorts both the `ShadeableIntersections` and `PathSegments` arrays (which again involves moving fairly large objects around and slows the program down).
- `CACHE_FIRST_INTERSECTION`: the intersections from camera rays are cached. This slightly improves performance.
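For reference, the three switches are plain 0/1 macros guarded by `#if` in `pathtrace.cu` (the values below are just an example configuration, not the committed defaults):

```cpp
// Toggle the three core-feature switches near the top of pathtrace.cu.
// Flip each between 0 and 1 to reproduce the test cases below.
#define ENABLE_STREAM_COMPACTION 1
#define SORT_PATH_BY_MATERIAL    0
#define CACHE_FIRST_INTERSECTION 1
```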
The results are as follows (tested with the default "cornell.txt" scene, with diffuse walls and a perfect specular sphere):
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
100 | ON | OFF | OFF | 332.301 | 15.0466 |
010 | OFF | ON | OFF | 1356.74 | 3.68532 |
001 | OFF | OFF | ON | 169.077 | 29.5723 |
110 | ON | ON | OFF | 970.748 | 5.15067 |
111 | ON | ON | ON | 956.356 | 5.22818 |
Interestingly, while both `ENABLE_STREAM_COMPACTION` (100) and `SORT_PATH_BY_MATERIAL` (010) are slower than the naive configuration (000), enabling both (110) is faster than enabling `SORT_PATH_BY_MATERIAL` alone (010). That is because stream compaction removes terminated paths after each bounce, leaving far fewer elements to sort.
To make things clearer and write a more efficient path tracer, I ran some additional tests below before implementing extra features.
> Try to reduce the sorting bottleneck. Maybe instead of directly sorting the structs, sort proxy buffers of ints and then reshuffle the structs? If you want to give this a try, please document your results no matter what you end up with; interesting experiments are always good for your project (and... your grade :O)
I made an experimental change at this tag: sort_indices_rather_than_structs. Instead of sorting the `PathSegment` and `ShadeableIntersection` structs directly, I created an array of indices, sorted that, and then reshuffled the path segments by the new indices. The core change I made was from:
```cpp
#if SORT_PATH_BY_MATERIAL
thrust::sort_by_key(thrust::device, dev_intersections, dev_intersections + num_paths_active, dev_paths, compMaterialId());
#endif
```
to:
```cpp
#if SORT_PATH_BY_MATERIAL
// DONE: reorder paths by material
// DONE: sort indices only
thrust::sort_by_key(thrust::device, dev_material_ids, dev_material_ids + num_paths_active, dev_active_path_indices);
kernReshufflePaths <<<numblocksPathSegmentTracing, blockSize1d>>> (num_paths_active, dev_active_path_indices, dev_paths, dev_intersections, reshuffled_dev_paths, reshuffled_dev_intersections);
std::swap(dev_paths, reshuffled_dev_paths);
std::swap(dev_intersections, reshuffled_dev_intersections);
#endif
```
(`dev_active_path_indices` and `dev_material_ids` are two int array buffers I introduced.)
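For context, here is a minimal sketch of what `kernReshufflePaths` does; the body is my reconstruction from the call site above, so the repo's version may differ in details:

```cpp
// Gather kernel: writes each path/intersection pair to its new slot,
// following the sorted index array.
__global__ void kernReshufflePaths(int n, const int* indices,
    const PathSegment* paths, const ShadeableIntersection* intersections,
    PathSegment* outPaths, ShadeableIntersection* outIntersections)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    int src = indices[i];                      // sorted index points at the old slot
    outPaths[i] = paths[src];                  // gather into the new order
    outIntersections[i] = intersections[src];
}
```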
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
100 | ON | OFF | OFF | 332.301 | 15.0466 |
110 | ON | ON | OFF | 970.748 | 5.15067 |
110* - sort indices and shuffle | ON | ON | OFF | 510.159 | 9.80087 |
As the results show, with `ENABLE_STREAM_COMPACTION` also enabled, sorting indices by material and reshuffling (110*) is significantly faster than directly sorting the structs, but it is still slower than the approaches without sorting by material (naive, or stream compaction only).

There may be two reasons: 1. the expense of the sort itself; 2. moving large structs around still costs, even if only once per bounce.
I was thinking about leaving the `PathSegment`s and `ShadeableIntersection`s in place and just using the sorted/compacted indices to access the data (during both the sorting stage and the compaction stage). But first I ran a tiny experiment that sorts only the indices array and nothing else (no reshuffling). It turned out that, compared to 26.5557 ips for 000, just sorting an index array drops performance to 17.0504 ips. So it appears that sorting itself is costly.
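The experiment boils down to running only the key sort and skipping the gather, roughly:

```cpp
// Sort only the int index array by material id; kernReshufflePaths and the
// buffer swaps are intentionally skipped, isolating the cost of the sort itself.
thrust::sort_by_key(thrust::device, dev_material_ids,
    dev_material_ids + num_paths_active, dev_active_path_indices);
```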
In both the intersection and shading stages there are many objects in global memory that are never modified but are accessed repeatedly, for example `pathSegment` and `geom` in `computeIntersections`. When I started working on the project I naively changed them from value copies to references... But I decided to change them back and run some tests. For science. The commit is marked with this tag: copy_and_local_access_vs_global_access
In `computeIntersections()`, from:

```cpp
auto& pathSegment = pathSegments[path_index];
...
auto& geom = geoms[i];
```

to:

```cpp
auto pathSegment = pathSegments[path_index];
...
auto geom = geoms[i];
```

In `kernShadeScatterAndGatherTerminated()`, from:

```cpp
auto& intersection = intersections[path_index];
auto& material = materials[intersection.materialId];
```

to:

```cpp
auto intersection = intersections[path_index];
auto material = materials[intersection.materialId];
```
Here is the result:
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
000** - copy structs to local memory first | OFF | OFF | OFF | 171.12 | 29.2193 |
Yup, copying them to local memory first is faster than accessing them directly in global memory, at least for structs of this size; each struct is loaded from global memory once into registers/local memory instead of being re-read on every field access. I'll stick with copying to local memory first for the remainder of the assignment.
I then tried stream compaction on an index array instead of the `PathSegments` array, forwarding threads through the new index array to find the corresponding `PathSegment` in the intersection and shading stages. This approach raised render speed to 120.6% of no stream compaction and 212.9% of the approach that runs stream compaction directly on the `PathSegments` array; see below. The changes can be found in the compact_index_array tag.
This is done by changing:

```cpp
auto new_end = thrust::remove_if(thrust::device, dev_paths, dev_paths + num_paths_active, isPathTerminated());
num_paths_active = new_end - dev_paths;
```

to:

```cpp
auto new_end = thrust::remove_if(thrust::device, dev_active_path_indices, dev_active_path_indices + num_paths_active, isPathTerminatedForIndex());
num_paths_active = new_end - dev_active_path_indices;

...

// in intersection and shading stages
path_index = indices[index];

...
```
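The predicate now receives an int instead of a `PathSegment`, so it has to chase the index back to the path it refers to. A sketch of what it might look like (the functor body and the `remainingBounces` test are my assumptions, not the repo's exact code):

```cpp
// Index-based termination predicate for thrust::remove_if: true when the
// PathSegment the index points at has no bounces left.
struct isPathTerminatedForIndex {
    const PathSegment* paths;  // device pointer set before the remove_if call
    __host__ __device__ bool operator()(int idx) const {
        return paths[idx].remainingBounces <= 0;  // assumed termination test
    }
};
```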
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
100 | ON | OFF | OFF | 332.301 | 15.0466 |
100*** - compact index array instead of `PathSegments` array | ON | OFF | OFF | 156.11 | 32.0286 |
So, having shown that both sorting and moving large structs around are costly, I decided to go without material sorting but with index compaction for the remainder of this project (unless I decide to do a Wavefront Path Tracer).
I enabled obj model loading with tinyobjloader. I describe the scene like this, and my path tracer loads the file and saves the vertex data into a vertex buffer, which is then copied into GPU global memory:
```
// Dragon
OBJECT 6
mesh
material 4
TRANS 0 0 0
ROTAT 0 45 0
SCALE 3 3 3
FILE dragon.obj
```
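The loading step itself is straightforward; a minimal sketch assuming tinyobjloader's v1.x `attrib_t` API (the real loader also pulls per-vertex normals for smooth shading):

```cpp
#include <tiny_obj_loader.h>
#include <string>
#include <vector>
#include <glm/glm.hpp>

// Flatten every triangle of an obj file into a position buffer, which can
// then be copied to GPU global memory with cudaMemcpy.
std::vector<glm::vec3> loadMeshPositions(const std::string& filename) {
    tinyobj::attrib_t attrib;
    std::vector<tinyobj::shape_t> shapes;
    std::vector<tinyobj::material_t> materials;
    std::string err;
    std::vector<glm::vec3> positions;

    if (!tinyobj::LoadObj(&attrib, &shapes, &materials, &err, filename.c_str())) {
        return positions;  // empty on failure
    }
    for (const auto& shape : shapes) {
        for (const auto& index : shape.mesh.indices) {
            positions.emplace_back(
                attrib.vertices[3 * index.vertex_index + 0],
                attrib.vertices[3 * index.vertex_index + 1],
                attrib.vertices[3 * index.vertex_index + 2]);
        }
    }
    return positions;
}
```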
If a vertex normal differs from the triangle normal, my ray-triangle intersection gives an interpolated normal (aka smooth shading).
*(Flat shading vs. smooth shading comparison)*
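The smooth-shading path boils down to barycentric interpolation of the three vertex normals at the hit point (a sketch; the function and parameter names are illustrative):

```cpp
// bary holds the barycentric coordinates returned by the ray-triangle test
// (components sum to 1); n0..n2 are the triangle's vertex normals.
__host__ __device__ glm::vec3 interpolateNormal(
    const glm::vec3& n0, const glm::vec3& n1, const glm::vec3& n2,
    const glm::vec3& bary)
{
    return glm::normalize(bary.x * n0 + bary.y * n1 + bary.z * n2);
}
```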
During the loading stage, a bounding box is generated for the model. It can be toggled on and off with the `ENABLE_MESH_BBOX` macro in `intersections.h`.
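A sketch of the test guarded by that macro, written as a standard slab test (the helper name and `Ray` fields follow common base-code conventions and may not match my code exactly):

```cpp
// Slab test against the mesh bounding box; on a miss the mesh intersection
// kernel can skip every triangle of the mesh.
__host__ __device__ bool aabbIntersectionTest(
    const glm::vec3& bmin, const glm::vec3& bmax, const Ray& r)
{
    glm::vec3 invDir = 1.0f / r.direction;
    glm::vec3 t0 = (bmin - r.origin) * invDir;
    glm::vec3 t1 = (bmax - r.origin) * invDir;
    glm::vec3 tNear = glm::min(t0, t1);
    glm::vec3 tFar  = glm::max(t0, t1);
    float tEnter = glm::max(tNear.x, glm::max(tNear.y, tNear.z));
    float tExit  = glm::min(tFar.x,  glm::min(tFar.y,  tFar.z));
    return tEnter <= tExit && tExit >= 0.0f;  // intervals overlap in front of the ray
}
```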
However... I found that enabling the bounding box doesn't have much of an effect on rendering time.

It took 4452.84 seconds to render the image below (~1750 triangles, my previous dragon model) for 5000 iterations without the bounding box, which is 1.12288 iterations per second. It took 89.0504 seconds to render the same image with the bounding box for 101 iterations (I terminated it at 101 samples), which is 1.13419 iterations per second.
In my opinion, the reason it is not efficient is that my dragon has big wings, so it has a fairly large bounding box. If any thread in a warp hits the bounding box, the whole warp has to wait until that thread finishes testing triangles. I guess I need to sort the rays, use a better path tracing model, or use a scene hierarchy.
*(100 iterations without stochastic sampling vs. 100 iterations with stochastic sampling: not much difference.)*
*(5000 iterations without stochastic sampling vs. 5000 iterations with stochastic sampling: the reflection on the ball is smoother.)*
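For reference, the antialiasing itself is just a per-iteration jitter of the sample position inside each pixel when camera rays are generated; a sketch following the base code's thrust RNG pattern (`makeSeededRandomEngine` and the camera fields are assumed from that pattern, and this fragment lives inside the ray-generation kernel, which needs `<thrust/random.h>`):

```cpp
// Jitter the sample point uniformly inside the pixel each iteration;
// averaging across iterations integrates over the pixel area and smooths edges.
thrust::default_random_engine rng = makeSeededRandomEngine(iter, index, 0);
thrust::uniform_real_distribution<float> u01(0, 1);
float jx = (float)x + u01(rng);  // instead of the pixel center x + 0.5f
float jy = (float)y + u01(rng);

segment.ray.direction = glm::normalize(cam.view
    - cam.right * cam.pixelLength.x * (jx - (float)cam.resolution.x * 0.5f)
    - cam.up    * cam.pixelLength.y * (jy - (float)cam.resolution.y * 0.5f));
```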
The indices of refraction of the front balls are 1.31 (ice), 1.62 (glass?), and 2.614 (titanium dioxide), from left to right.

The pink ball is 10% perfect specular, 60% refractive (IOR 1.66), and 30% diffuse.
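For the Fresnel effect, a common and GPU-friendly choice is Schlick's approximation of the reflectance; the sketch below shows that formulation, not necessarily my shader's exact code:

```cpp
#include <cmath>

// Schlick's approximation: fraction of light reflected at a dielectric
// interface. cosTheta is |dot(incident, normal)|; iorIn/iorOut are the
// indices of refraction on the incoming and outgoing sides.
__host__ __device__ float schlickFresnel(float cosTheta, float iorIn, float iorOut) {
    float r0 = (iorIn - iorOut) / (iorIn + iorOut);
    r0 = r0 * r0;
    return r0 + (1.0f - r0) * powf(1.0f - cosTheta, 5.0f);
}
```

The scatter function can then draw a uniform random number and reflect when it falls below this value, refracting otherwise.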