University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 3
- Ruoyu Fan
- Tested on:
  - Windows 10 x64, i7-4720HQ @ 2.60GHz, 16GB Memory, GTX 970M 3072MB (personal laptop)
  - Visual Studio 2015 + CUDA 8.0
Additional third-party library used: tinyobjloader by syoyo (http://syoyo.github.io/tinyobjloader/)
Thanks to @itoupeter for pointing out an error in identifying inner/outer surfaces during refraction evaluation.
*(Preview renders)*

*(Flat shading vs. smooth shading comparison)*
- Basics:
  - Path tracing of diffuse and perfect specular materials.
  - The original glfw3 lib files don't support Visual Studio 2015. I updated glfw3, put the source version into the `external/` folder, and configured `CMakeLists.txt` so the project is compatible with Visual Studio 2015 while still building on the other supported compilers. Also upgraded the CMake FindCUDA module to solve linker errors with CUDA 8.
  - Used `thrust::remove_if` to compact the path segment array... only to find that rendering was slower after stream compaction on structs. After optimizing to run stream compaction on an index array while keeping the original array in place, it is now faster than without compaction.
  - Sorting by material after getting intersections... but the results are much slower. (Toggleable by changing `SORT_PATH_BY_MATERIAL` in `pathtrace.cu`.)
  - Caching first intersections. (Toggleable by changing `CACHE_FIRST_INTERSECTION` in `pathtrace.cu`.)
  - Performance tests for core features:
    - Additional test: sort paths by sorting indices and then reshuffling, instead of sorting in place.
    - Additional test: access structs in global memory vs. copying them to local memory first.
    - Additional optimization: compact an index array instead of the `PathSegments` array. This raised render speed to 120.6% of no stream compaction and 212.9% of the approach that runs stream compaction directly on the `PathSegments` array; see below.
- Features:
  - Loading obj models (with tinyobjloader). If a vertex normal differs from the triangle normal, my ray-triangle intersection gives an interpolated normal (aka smooth shading).
  - Refraction with Fresnel effects
  - Stochastic sampled antialiasing
Since I might abandon some features during development (such as depth of field or certain sampling schemes), I tested the results by toggling `ENABLE_STREAM_COMPACTION`, `SORT_PATH_BY_MATERIAL`, and `CACHE_FIRST_INTERSECTION` on and off, based on this commit: core_features
- `ENABLE_STREAM_COMPACTION`: whether to run stream compaction with `thrust::remove_if` after shading (which, however, involves moving fairly large objects around and slows the program down).
- `SORT_PATH_BY_MATERIAL`: whether to sort paths by material (via `thrust::sort_by_key`). Currently this sorts both the `ShadeableIntersections` and `PathSegments` arrays (which again involves moving fairly large objects around and slows the program down).
- `CACHE_FIRST_INTERSECTION`: the intersections from camera rays are cached. This slightly improves performance.
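For reference, the three switches are plain 0/1 macros guarded by `#if` in `pathtrace.cu` (the values below are just an example configuration, not the committed defaults):

```cpp
// Toggle the three core-feature switches near the top of pathtrace.cu.
// Flip each between 0 and 1 to reproduce the test cases below.
#define ENABLE_STREAM_COMPACTION 1
#define SORT_PATH_BY_MATERIAL    0
#define CACHE_FIRST_INTERSECTION 1
```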
The results are as follows (tested with the default "cornell.txt" scene, with diffuse walls and a perfect specular sphere):
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
100 | ON | OFF | OFF | 332.301 | 15.0466 |
010 | OFF | ON | OFF | 1356.74 | 3.68532 |
001 | OFF | OFF | ON | 169.077 | 29.5723 |
110 | ON | ON | OFF | 970.748 | 5.15067 |
111 | ON | ON | ON | 956.356 | 5.22818 |
Interestingly, while both `ENABLE_STREAM_COMPACTION` (100) and `SORT_PATH_BY_MATERIAL` (010) are slower than the naive configuration (000), enabling both (110) is faster than enabling `SORT_PATH_BY_MATERIAL` alone (010). That is because stream compaction removes terminated paths after each bounce, leaving far fewer elements to sort.
To make things clearer and write a more efficient path tracer, I ran some additional tests below before implementing extra features.
> Try to reduce the sorting bottleneck. Maybe instead of directly sorting the structs, sort proxy buffers of ints and then reshuffle the structs? If you want to give this a try, please document your results no matter what you end up with; interesting experiments are always good for your project (and... your grade :O)
I made an experimental change at this tag: sort_indices_rather_than_structs. Instead of sorting the `PathSegment` and `ShadeableIntersection` structs directly, I created an array of indices, sorted that, and then reshuffled the path segments by the new indices. The core change I made was from:
```cpp
#if SORT_PATH_BY_MATERIAL
thrust::sort_by_key(thrust::device, dev_intersections, dev_intersections + num_paths_active, dev_paths, compMaterialId());
#endif
```
to:
```cpp
#if SORT_PATH_BY_MATERIAL
// DONE: reorder paths by material
// DONE: sort indices only
thrust::sort_by_key(thrust::device, dev_material_ids, dev_material_ids + num_paths_active, dev_active_path_indices);
kernReshufflePaths <<<numblocksPathSegmentTracing, blockSize1d>>> (num_paths_active, dev_active_path_indices, dev_paths, dev_intersections, reshuffled_dev_paths, reshuffled_dev_intersections);
std::swap(dev_paths, reshuffled_dev_paths);
std::swap(dev_intersections, reshuffled_dev_intersections);
#endif
```
(`dev_active_path_indices` and `dev_material_ids` are two int array buffers I introduced.)
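For context, here is a minimal sketch of what `kernReshufflePaths` does; the body is my reconstruction from the call site above, so the repo's version may differ in details:

```cpp
// Gather kernel: writes each path/intersection pair to its new slot,
// following the sorted index array.
__global__ void kernReshufflePaths(int n, const int* indices,
    const PathSegment* paths, const ShadeableIntersection* intersections,
    PathSegment* outPaths, ShadeableIntersection* outIntersections)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    int src = indices[i];                      // sorted index points at the old slot
    outPaths[i] = paths[src];                  // gather into the new order
    outIntersections[i] = intersections[src];
}
```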
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
100 | ON | OFF | OFF | 332.301 | 15.0466 |
110 | ON | ON | OFF | 970.748 | 5.15067 |
110* - sort indices and shuffle | ON | ON | OFF | 510.159 | 9.80087 |
As the results show, with `ENABLE_STREAM_COMPACTION` also enabled, sorting indices by material and reshuffling (110*) is significantly faster than directly sorting the structs, but it is still slower than the approaches without sorting by material (naive, or stream compaction only).

There may be two reasons: 1. the expense of the sort itself; 2. moving large structs around still costs, even if only once per bounce.
I was thinking about leaving the `PathSegment`s and `ShadeableIntersection`s in place and just using the sorted/compacted indices to access the data (during both the sorting stage and the compaction stage). But first I ran a tiny experiment that sorts only the indices array and nothing else (no reshuffling). It turned out that, compared to 26.5557 ips for 000, just sorting an index array drops performance to 17.0504 ips. So it appears that sorting itself is costly.
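The experiment boils down to running only the key sort and skipping the gather, roughly:

```cpp
// Sort only the int index array by material id; kernReshufflePaths and the
// buffer swaps are intentionally skipped, isolating the cost of the sort itself.
thrust::sort_by_key(thrust::device, dev_material_ids,
    dev_material_ids + num_paths_active, dev_active_path_indices);
```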
In both the intersection and shading stages there are many objects in global memory that are never modified but are accessed repeatedly, for example `pathSegment` and `geom` in `computeIntersections`. When I started working on the project I naively changed them from value copies to references... But I decided to change them back and run some tests. For science. The commit is marked with this tag: copy_and_local_access_vs_global_access
In `computeIntersections()`, from:

```cpp
auto& pathSegment = pathSegments[path_index];
...
auto& geom = geoms[i];
```

to:

```cpp
auto pathSegment = pathSegments[path_index];
...
auto geom = geoms[i];
```

In `kernShadeScatterAndGatherTerminated()`, from:

```cpp
auto& intersection = intersections[path_index];
auto& material = materials[intersection.materialId];
```

to:

```cpp
auto intersection = intersections[path_index];
auto material = materials[intersection.materialId];
```
Here is the result:
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
000** - copy structs to local memory first | OFF | OFF | OFF | 171.12 | 29.2193 |
Yup, copying them to local memory first is faster than accessing them directly in global memory, at least for structs of this size; each struct is loaded from global memory once into registers/local memory instead of being re-read on every field access. I'll stick with copying to local memory first for the remainder of the assignment.
I then tried stream compaction on an index array instead of the `PathSegments` array, forwarding threads through the new index array to find the corresponding `PathSegment` in the intersection and shading stages. This approach raised render speed to 120.6% of no stream compaction and 212.9% of the approach that runs stream compaction directly on the `PathSegments` array; see below. The changes can be found in the compact_index_array tag.
This is done by changing:

```cpp
auto new_end = thrust::remove_if(thrust::device, dev_paths, dev_paths + num_paths_active, isPathTerminated());
num_paths_active = new_end - dev_paths;
```

to:

```cpp
auto new_end = thrust::remove_if(thrust::device, dev_active_path_indices, dev_active_path_indices + num_paths_active, isPathTerminatedForIndex());
num_paths_active = new_end - dev_active_path_indices;

...

// in intersection and shading stages
path_index = indices[index];

...
```
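The predicate now receives an int instead of a `PathSegment`, so it has to chase the index back to the path it refers to. A sketch of what it might look like (the functor body and the `remainingBounces` test are my assumptions, not the repo's exact code):

```cpp
// Index-based termination predicate for thrust::remove_if: true when the
// PathSegment the index points at has no bounces left.
struct isPathTerminatedForIndex {
    const PathSegment* paths;  // device pointer set before the remove_if call
    __host__ __device__ bool operator()(int idx) const {
        return paths[idx].remainingBounces <= 0;  // assumed termination test
    }
};
```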
Test Case Id | ENABLE_STREAM_COMPACTION | SORT_PATH_BY_MATERIAL | CACHE_FIRST_INTERSECTION | Time for 5000 iterations (s) | Iterations per second |
---|---|---|---|---|---|
000 | OFF | OFF | OFF | 188.283 | 26.5557 |
100 | ON | OFF | OFF | 332.301 | 15.0466 |
100*** - compact index array instead of `PathSegments` array | ON | OFF | OFF | 156.11 | 32.0286 |
So, having shown that both sorting and moving large structs around are costly, I decided to go without material sorting but with index compaction for the remainder of this project (unless I decide to do a Wavefront Path Tracer).
I enabled obj model loading with tinyobjloader. I describe the scene like this, and my path tracer loads the file and saves the vertex data into a vertex buffer, which is then copied into GPU global memory:
```
// Dragon
OBJECT 6
mesh
material 4
TRANS 0 0 0
ROTAT 0 45 0
SCALE 3 3 3
FILE dragon.obj
```
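The loading step itself is straightforward; a minimal sketch assuming tinyobjloader's v1.x `attrib_t` API (the real loader also pulls per-vertex normals for smooth shading):

```cpp
#include <tiny_obj_loader.h>
#include <string>
#include <vector>
#include <glm/glm.hpp>

// Flatten every triangle of an obj file into a position buffer, which can
// then be copied to GPU global memory with cudaMemcpy.
std::vector<glm::vec3> loadMeshPositions(const std::string& filename) {
    tinyobj::attrib_t attrib;
    std::vector<tinyobj::shape_t> shapes;
    std::vector<tinyobj::material_t> materials;
    std::string err;
    std::vector<glm::vec3> positions;

    if (!tinyobj::LoadObj(&attrib, &shapes, &materials, &err, filename.c_str())) {
        return positions;  // empty on failure
    }
    for (const auto& shape : shapes) {
        for (const auto& index : shape.mesh.indices) {
            positions.emplace_back(
                attrib.vertices[3 * index.vertex_index + 0],
                attrib.vertices[3 * index.vertex_index + 1],
                attrib.vertices[3 * index.vertex_index + 2]);
        }
    }
    return positions;
}
```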
If a vertex normal differs from the triangle normal, my ray-triangle intersection gives an interpolated normal (aka smooth shading).
*(Flat shading vs. smooth shading comparison)*
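The smooth-shading path boils down to barycentric interpolation of the three vertex normals at the hit point (a sketch; the function and parameter names are illustrative):

```cpp
// bary holds the barycentric coordinates returned by the ray-triangle test
// (components sum to 1); n0..n2 are the triangle's vertex normals.
__host__ __device__ glm::vec3 interpolateNormal(
    const glm::vec3& n0, const glm::vec3& n1, const glm::vec3& n2,
    const glm::vec3& bary)
{
    return glm::normalize(bary.x * n0 + bary.y * n1 + bary.z * n2);
}
```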
During the loading stage, a bounding box is generated for the model. It can be toggled on and off with the `ENABLE_MESH_BBOX` macro in `intersections.h`.
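A sketch of the test guarded by that macro, written as a standard slab test (the helper name and `Ray` fields follow common base-code conventions and may not match my code exactly):

```cpp
// Slab test against the mesh bounding box; on a miss the mesh intersection
// kernel can skip every triangle of the mesh.
__host__ __device__ bool aabbIntersectionTest(
    const glm::vec3& bmin, const glm::vec3& bmax, const Ray& r)
{
    glm::vec3 invDir = 1.0f / r.direction;
    glm::vec3 t0 = (bmin - r.origin) * invDir;
    glm::vec3 t1 = (bmax - r.origin) * invDir;
    glm::vec3 tNear = glm::min(t0, t1);
    glm::vec3 tFar  = glm::max(t0, t1);
    float tEnter = glm::max(tNear.x, glm::max(tNear.y, tNear.z));
    float tExit  = glm::min(tFar.x,  glm::min(tFar.y,  tFar.z));
    return tEnter <= tExit && tExit >= 0.0f;  // intervals overlap in front of the ray
}
```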
However... I found that enabling the bounding box doesn't have much of an effect on rendering time.

It took 4452.84 seconds to render the image below (~1750 triangles, my previous dragon model) for 5000 iterations without the bounding box, which is 1.12288 iterations per second. It took 89.0504 seconds to render the same image with the bounding box for 101 iterations (I terminated it at 101 samples), which is 1.13419 iterations per second.
In my opinion, the reason it is not efficient is that my dragon has big wings, so it has a fairly large bounding box. If any thread in a warp hits the bounding box, the whole warp has to wait until that thread finishes testing triangles. I guess I need to sort the rays, use a better path tracing model, or use a scene hierarchy.
*(100 iterations without stochastic sampling vs. 100 iterations with stochastic sampling: not much difference.)*
*(5000 iterations without stochastic sampling vs. 5000 iterations with stochastic sampling: the reflection on the ball is smoother.)*
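For reference, the antialiasing itself is just a per-iteration jitter of the sample position inside each pixel when camera rays are generated; a sketch following the base code's thrust RNG pattern (`makeSeededRandomEngine` and the camera fields are assumed from that pattern, and this fragment lives inside the ray-generation kernel, which needs `<thrust/random.h>`):

```cpp
// Jitter the sample point uniformly inside the pixel each iteration;
// averaging across iterations integrates over the pixel area and smooths edges.
thrust::default_random_engine rng = makeSeededRandomEngine(iter, index, 0);
thrust::uniform_real_distribution<float> u01(0, 1);
float jx = (float)x + u01(rng);  // instead of the pixel center x + 0.5f
float jy = (float)y + u01(rng);

segment.ray.direction = glm::normalize(cam.view
    - cam.right * cam.pixelLength.x * (jx - (float)cam.resolution.x * 0.5f)
    - cam.up    * cam.pixelLength.y * (jy - (float)cam.resolution.y * 0.5f));
```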
The indices of refraction of the front balls are 1.31 (ice), 1.62 (glass?), and 2.614 (titanium dioxide), from left to right.

The pink ball is 10% perfect specular, 60% refractive (IOR 1.66), and 30% diffuse.
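For the Fresnel effect, a common and GPU-friendly choice is Schlick's approximation of the reflectance; the sketch below shows that formulation, not necessarily my shader's exact code:

```cpp
#include <cmath>

// Schlick's approximation: fraction of light reflected at a dielectric
// interface. cosTheta is |dot(incident, normal)|; iorIn/iorOut are the
// indices of refraction on the incoming and outgoing sides.
__host__ __device__ float schlickFresnel(float cosTheta, float iorIn, float iorOut) {
    float r0 = (iorIn - iorOut) / (iorIn + iorOut);
    r0 = r0 * r0;
    return r0 + (1.0f - r0) * powf(1.0f - cosTheta, 5.0f);
}
```

The scatter function can then draw a uniform random number and reflect when it falls below this value, refracting otherwise.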