diff --git a/README.md b/README.md
index ae6088d..306f98f 100644
--- a/README.md
+++ b/README.md
@@ -1,66 +1,5 @@
 CIS 565 Project3 : CUDA Pathtracer
-===================
-
-Fall 2014
-
-Due Wed, 10/8 (submit without penalty until Sun, 10/12)
-
-## INTRODUCTION
-In this project, you will implement a CUDA based pathtracer capable of
-generating pathtraced rendered images extremely quickly. Building a pathtracer can be viewed as a generalization of building a raytracer, so for those of you who have taken 460/560, the basic concept should not be very new to you. For those of you that have not taken
-CIS460/560, raytracing is a technique for generating images by tracing rays of
-light through pixels in an image plane out into a scene and following the way
-the rays of light bounce and interact with objects in the scene. More
-information can be found here:
-http://en.wikipedia.org/wiki/Ray_tracing_(graphics). Pathtracing is a generalization of this technique by considering more than just the contribution of direct lighting to a surface.
-
-Since in this class we are concerned with working in generating actual images
-and less so with mundane tasks like file I/O, this project includes basecode
-for loading a scene description file format, described below, and various other
-things that generally make up the render "harness" that takes care of
-everything up to the rendering itself. The core renderer is left for you to
-implement.  Finally, note that while this basecode is meant to serve as a
-strong starting point for a CUDA pathtracer, you are not required to use this
-basecode if you wish, and you may also change any part of the basecode
-specification as you please, so long as the final rendered result is correct.
-
-## CONTENTS
-The Project3 root directory contains the following subdirectories:
-	
-* src/ contains the source code for the project. Both the Windows Visual Studio
-  solution and the OSX and Linux makefiles reference this folder for all 
-  source; the base source code compiles on Linux, OSX and Windows without 
-  modification.  If you are building on OSX, be sure to uncomment lines 4 & 5 of
-  the CMakeLists.txt in order to make sure CMake builds against clang.
-* data/scenes/ contains an example scene description file.
-* renders/ contains an example render of the given example scene file. 
-* windows/ contains a Windows Visual Studio 2010 project and all dependencies
-  needed for building and running on Windows 7. If you would like to create a
-  Visual Studio 2012 or 2013 projects, there are static libraries that you can
-  use for GLFW that are in external/bin/GLFW (Visual Studio 2012 uses msvc110, 
-  and Visual Studio 2013 uses msvc120)
-* external/ contains all the header, static libraries and built binaries for
-  3rd party libraries (i.e. glm, GLEW, GLFW) that we use for windowing and OpenGL
-  extensions
-
-## RUNNING THE CODE
-The main function requires a scene description file (that is provided in data/scenes). 
-The main function reads in the scene file by an argument as such :
-'scene=[sceneFileName]'
-
-If you are using Visual Studio, you can set this in the Debugging > Command Arguments section
-in the Project properties.
-
-## REQUIREMENTS
-In this project, you are given code for:
-
-* Loading, reading, and storing the scene scene description format
-* Example functions that can run on both the CPU and GPU for generating random
-  numbers, spherical intersection testing, and surface point sampling on cubes
-* A class for handling image operations and saving images
-* Working code for CUDA-GL interop
-
-You will need to implement the following features:
+Features implemented:
 
 * Raycasting from a camera into a scene through a pixel grid
 * Diffuse surfaces
@@ -69,205 +8,21 @@ You will need to implement the following features:
 * Sphere surface point sampling
 * Stream compaction optimization
 
-You are also required to implement at least 2 of the following features:
+![alt tag](https://github.com/zxm5010/Project3-Pathtracer/blob/master/test.0.jpg)
 
-* Texture mapping 
-* Bump mapping
+Extra Features:
 * Depth of field
+![Depth of field render](https://github.com/zxm5010/Project3-Pathtracer/blob/master/depth_field.jpg)
 * Refraction, i.e. glass
 * OBJ Mesh loading and rendering
-* Interactive camera
-* Motion blur
-* Subsurface scattering
-
-The 'extra features' list is not comprehensive.  If you have a particular feature
-you would like to implement (e.g. acceleration structures, etc.) please contact us 
-first!
-
-For each 'extra feature' you must provide the following analysis :
-* overview write up of the feature
-* performance impact of the feature
-* if you did something to accelerate the feature, why did you do what you did
-* compare your GPU version to a CPU version of this feature (you do NOT need to 
-  implement a CPU version)
-* how can this feature be further optimized (again, not necessary to implement it, but
-  should give a roadmap of how to further optimize and why you believe this is the next
-  step)
-
-## BASE CODE TOUR
-You will be working in three files: raytraceKernel.cu, intersections.h, and
-interactions.h. Within these files, areas that you need to complete are marked
-with a TODO comment. Areas that are useful to and serve as hints for optional
-features are marked with TODO (Optional). Functions that are useful for
-reference are marked with the comment LOOK.
-
-* raytraceKernel.cu contains the core raytracing CUDA kernel. You will need to
-  complete:
-    * cudaRaytraceCore() handles kernel launches and memory management; this
-      function already contains example code for launching kernels,
-      transferring geometry and cameras from the host to the device, and transferring
-      image buffers from the host to the device and back. You will have to complete
-      this function to support passing materials and lights to CUDA.
-    * raycastFromCameraKernel() is a function that you need to implement. This
-      function once correctly implemented should handle camera raycasting. 
-    * raytraceRay() is the core raytracing CUDA kernel; all of your pathtracing
-      logic should be implemented in this CUDA kernel. raytraceRay() should
-      take in a camera, image buffer, geometry, materials, and lights, and should
-      trace a ray through the scene and write the resultant color to a pixel in the
-      image buffer.
-
-* intersections.h contains functions for geometry intersection testing and
-  point generation. You will need to complete:
-    * boxIntersectionTest(), which takes in a box and a ray and performs an
-      intersection test. This function should work in the same way as
-      sphereIntersectionTest().
-    * getRandomPointOnSphere(), which takes in a sphere and returns a random
-      point on the surface of the sphere with an even probability distribution.
-      This function should work in the same way as getRandomPointOnCube(). You can
-      (although do not necessarily have to) use this to generate points on a sphere
-      to use a point lights, or can use this for area lighting.
-
-* interactions.h contains functions for ray-object interactions that define how
-  rays behave upon hitting materials and objects. You will need to complete:
-    * getRandomDirectionInSphere(), which generates a random direction in a
-      sphere with a uniform probability. This function works in a fashion
-      similar to that of calculateRandomDirectionInHemisphere(), which generates a
-      random cosine-weighted direction in a hemisphere.
-    * calculateBSDF(), which takes in an incoming ray, normal, material, and
-      other information, and returns an outgoing ray. You can either implement
-      this function for ray-surface interactions, or you can replace it with your own
-      function(s).
-
-You will also want to familiarize yourself with:
-
-* sceneStructs.h, which contains definitions for how geometry, materials,
-  lights, cameras, and animation frames are stored in the renderer. 
-* utilities.h, which serves as a kitchen-sink of useful functions
-
-## NOTES ON GLM
-This project uses GLM, the GL Math library, for linear algebra. You need to
-know two important points on how GLM is used in this project:
-
-* In this project, indices in GLM vectors (such as vec3, vec4), are accessed
-  via swizzling. So, instead of v[0], v.x is used, and instead of v[1], v.y is
-  used, and so on and so forth.
-* GLM Matrix operations work fine on NVIDIA Fermi cards and later, but
-  pre-Fermi cards do not play nice with GLM matrices. As such, in this project,
-  GLM matrices are replaced with a custom matrix struct, called a cudaMat4, found
-  in cudaMat4.h. A custom function for multiplying glm::vec4s and cudaMat4s is
-  provided as multiplyMV() in intersections.h.
-
-## SCENE FORMAT
-This project uses a custom scene description format.
-Scene files are flat text files that describe all geometry, materials,
-lights, cameras, render settings, and animation frames inside of the scene.
-Items in the format are delimited by new lines, and comments can be added at
-the end of each line preceded with a double-slash.
-
-Materials are defined in the following fashion:
-
-* MATERIAL (material ID)								//material header
-* RGB (float r) (float g) (float b)					//diffuse color
-* SPECX (float specx)									//specular exponent
-* SPECRGB (float r) (float g) (float b)				//specular color
-* REFL (bool refl)									//reflectivity flag, 0 for
-  no, 1 for yes
-* REFR (bool refr)									//refractivity flag, 0 for
-  no, 1 for yes
-* REFRIOR (float ior)									//index of refraction
-  for Fresnel effects
-* SCATTER (float scatter)								//scatter flag, 0 for
-  no, 1 for yes
-* ABSCOEFF (float r) (float b) (float g)				//absorption
-  coefficient for scattering
-* RSCTCOEFF (float rsctcoeff)							//reduced scattering
-  coefficient
-* EMITTANCE (float emittance)							//the emittance of the
-  material. Anything >0 makes the material a light source.
-
-Cameras are defined in the following fashion:
-
-* CAMERA 												//camera header
-* RES (float x) (float y)								//resolution
-* FOVY (float fovy)										//vertical field of
-  view half-angle. the horizonal angle is calculated from this and the
-  reslution
-* ITERATIONS (float interations)							//how many
-  iterations to refine the image, only relevant for supersampled antialiasing,
-  depth of field, area lights, and other distributed raytracing applications
-* FILE (string filename)									//file to output
-  render to upon completion
-* frame (frame number)									//start of a frame
-* EYE (float x) (float y) (float z)						//camera's position in
-  worldspace
-* VIEW (float x) (float y) (float z)						//camera's view
-  direction
-* UP (float x) (float y) (float z)						//camera's up vector
-
-Objects are defined in the following fashion:
-* OBJECT (object ID)										//object header
-* (cube OR sphere OR mesh)								//type of object, can
-  be either "cube", "sphere", or "mesh". Note that cubes and spheres are unit
-  sized and centered at the origin.
-* material (material ID)									//material to
-  assign this object
-* frame (frame number)									//start of a frame
-* TRANS (float transx) (float transy) (float transz)		//translation
-* ROTAT (float rotationx) (float rotationy) (float rotationz)		//rotation
-* SCALE (float scalex) (float scaley) (float scalez)		//scale
-
-An example scene file setting up two frames inside of a Cornell Box can be
-found in the scenes/ directory.
-
-For meshes, note that the base code will only read in .obj files. For more 
-information on the .obj specification see http://en.wikipedia.org/wiki/Wavefront_.obj_file.
-
-An example of a mesh object is as follows:
-
-OBJECT 0
-mesh tetra.obj
-material 0
-frame 0
-TRANS       0 5 -5
-ROTAT       0 90 0
-SCALE       .01 10 10 
-
-Check the Google group for some sample .obj files of varying complexity.
+![OBJ mesh render](https://github.com/zxm5010/Project3-Pathtracer/blob/master/obj_loader.jpg)
 
-## THIRD PARTY CODE POLICY
-* Use of any third-party code must be approved by asking on our Google Group.  
-  If it is approved, all students are welcome to use it.  Generally, we approve 
-  use of third-party code that is not a core part of the project.  For example, 
-  for the ray tracer, we would approve using a third-party library for loading 
-  models, but would not approve copying and pasting a CUDA function for doing 
-  refraction.
-* Third-party code must be credited in README.md.
-* Using third-party code without its approval, including using another
-  student's code, is an academic integrity violation, and will result in you
-  receiving an F for the semester.
+Third-party software:
 
-## SELF-GRADING
-* On the submission date, email your grade, on a scale of 0 to 100, to Harmony,
-  harmoli+cis565@seas.upenn.com, with a one paragraph explanation.  Be concise and
-  realistic.  Recall that we reserve 30 points as a sanity check to adjust your
-  grade.  Your actual grade will be (0.7 * your grade) + (0.3 * our grade).  We
-  hope to only use this in extreme cases when your grade does not realistically
-  reflect your work - it is either too high or too low.  In most cases, we plan
-  to give you the exact grade you suggest.
-* Projects are not weighted evenly, e.g., Project 0 doesn't count as much as
-  the path tracer.  We will determine the weighting at the end of the semester
-  based on the size of each project.
+* I used tinyobjloader for OBJ mesh loading.
 
-## SUBMISSION
-Please change the README to reflect the answers to the questions we have posed
-above.  Remember:
-* this is a renderer, so include images that you've made!
-* be sure to back your claims for optimization with numbers and comparisons
-* if you reference any other material, please provide a link to it
-* you wil not e graded on how fast your path tracer runs, but getting close to
-  real-time is always nice
-* if you have a fast GPU renderer, it is good to show case this with a video to
-  show interactivity.  If you do so, please include a link.
+Still working on:
+* Texture mapping
+* Subsurface scattering
 
-Be sure to open a pull request and to send Harmony your grade and why you
-believe this is the grade you should get.
+Stream compaction has little effect on performance unless the trace depth is high. OBJ mesh rendering consumes a lot of GPU memory, and its per-triangle intersection tests are expensive. To speed up OBJ mesh rendering, a bounding-volume technique such as axis-aligned bounding boxes (AABBs) is needed to reduce the number of intersection tests.
\ No newline at end of file
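Review note on the AABB suggestion in the README text above: a minimal sketch of the usual slab test, assuming the mesh's bounding box is computed once at load time and tested before any per-triangle intersection. The names `Vec3`, `AABB`, `slab`, and `ray_hits_aabb` are hypothetical and are not taken from this project's sceneStructs.h or intersections.h. It is written as plain C for illustration; in the renderer the same logic would presumably live in a device function.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical types for illustration only; not the project's own structs. */
typedef struct { float x, y, z; } Vec3;
typedef struct { Vec3 min, max; } AABB;

/* Clip the ray interval [*tmin, *tmax] against one axis-aligned slab.
   Returns false as soon as the interval becomes empty. */
static bool slab(float origin, float dir, float lo, float hi,
                 float *tmin, float *tmax)
{
    if (dir == 0.0f) {
        /* Ray is parallel to this slab: it can only hit if it starts inside. */
        return origin >= lo && origin <= hi;
    }
    float t0 = (lo - origin) / dir;
    float t1 = (hi - origin) / dir;
    if (t0 > t1) { float tmp = t0; t0 = t1; t1 = tmp; }
    if (t0 > *tmin) *tmin = t0;
    if (t1 < *tmax) *tmax = t1;
    return *tmin <= *tmax;
}

/* True when the ray origin + t * dir (t >= 0) overlaps the box. */
bool ray_hits_aabb(Vec3 origin, Vec3 dir, AABB box)
{
    float tmin = 0.0f;      /* ignore hits behind the ray origin */
    float tmax = INFINITY;
    return slab(origin.x, dir.x, box.min.x, box.max.x, &tmin, &tmax)
        && slab(origin.y, dir.y, box.min.y, box.max.y, &tmin, &tmax)
        && slab(origin.z, dir.z, box.min.z, box.max.z, &tmin, &tmax);
}
```

A ray that misses the box can skip every triangle of the mesh, so the per-triangle loop only runs for rays that could plausibly hit it. This does not reduce the GPU memory the mesh occupies; it only cuts the number of intersection tests.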
diff --git a/data/scenes/sampleScene.txt b/data/scenes/sampleScene.txt
index 6a9f5cc..777050b 100644
--- a/data/scenes/sampleScene.txt
+++ b/data/scenes/sampleScene.txt
@@ -9,9 +9,10 @@ SCATTER     0
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
 EMITTANCE   0
+TEXTURE		0
 
 MATERIAL 1 				//red diffuse
-RGB         .63 .06 .04       
+RGB         .99 .5 .60         
 SPECEX      0      
 SPECRGB     1 1 1      
 REFL        0       
@@ -21,9 +22,10 @@ SCATTER     0
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
 EMITTANCE   0
+TEXTURE		0
 
 MATERIAL 2 				//green diffuse
-RGB         .15 .48 .09      
+RGB         .0 .8 .8      
 SPECEX      0      
 SPECRGB     1 1 1      
 REFL        0       
@@ -33,6 +35,7 @@ SCATTER     0
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
 EMITTANCE   0
+TEXTURE		0
 
 MATERIAL 3 				//red glossy
 RGB         .63 .06 .04      
@@ -45,30 +48,33 @@ SCATTER     0
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
 EMITTANCE   0
+TEXTURE		0
 
-MATERIAL 4 				//white glossy
+MATERIAL 4 				// mirror
 RGB         1 1 1     
 SPECEX      0      
 SPECRGB     1 1 1      
-REFL        0       
+REFL        1       
 REFR        0        
 REFRIOR     2      
 SCATTER     0        
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
 EMITTANCE   0
+TEXTURE		0
 
 MATERIAL 5 				//glass
-RGB         0 0 0    
+RGB         1 1 1    
 SPECEX      0      
 SPECRGB     1 1 1      
-REFL        0       
+REFL        1       
 REFR        1        
 REFRIOR     2.2       
 SCATTER     0        
 ABSCOEFF    .02 5.1 5.7      
 RSCTCOEFF   13
 EMITTANCE   0
+TEXTURE		0
 
 MATERIAL 6 				//green glossy
 RGB         .15 .48 .09      
@@ -76,11 +82,12 @@ SPECEX      0
 SPECRGB     1 1 1     
 REFL        0       
 REFR        0        
-REFRIOR     2.6       
+REFRIOR     0      
 SCATTER     0        
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
 EMITTANCE   0
+TEXTURE		0
 
 MATERIAL 7				//light
 RGB         1 1 1       
@@ -92,7 +99,8 @@ REFRIOR     0
 SCATTER     0        
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
-EMITTANCE   1
+EMITTANCE   5
+TEXTURE		0
 
 MATERIAL 8				//light
 RGB         1 1 1       
@@ -105,16 +113,19 @@ SCATTER     0
 ABSCOEFF    0 0 0      
 RSCTCOEFF   0
 EMITTANCE   15
+TEXTURE		0
 
 CAMERA
-RES         800 800
+RES         1000 1000
 FOVY        25
-ITERATIONS  5000
+ITERATIONS  15000
 FILE        test.bmp
 frame 0
 EYE         0 4.5 12
 VIEW        0 0 -1
 UP          0 1 0
+FOCL		7.5
+APTR		0.5
 
 OBJECT 0
 cube
@@ -122,7 +133,7 @@ material 0
 frame 0
 TRANS       0 0 0
 ROTAT       0 0 90
-SCALE       .01 10 10 
+SCALE       .1 10 10 
 
 OBJECT 1
 cube
@@ -130,7 +141,7 @@ material 0
 frame 0
 TRANS       0 5 -5
 ROTAT       0 90 0
-SCALE       .01 10 10 
+SCALE       .1 10 10
 
 OBJECT 2
 cube
@@ -138,7 +149,7 @@ material 0
 frame 0
 TRANS       0 10 0
 ROTAT       0 0 90
-SCALE       .01 10 10
+SCALE       .1 10 10
 
 OBJECT 3
 cube
@@ -146,7 +157,7 @@ material 1
 frame 0
 TRANS       -5 5 0
 ROTAT       0 0 0
-SCALE       .01 10 10
+SCALE       .1 10 10
 
 OBJECT 4
 cube
@@ -154,19 +165,19 @@ material 2
 frame 0
 TRANS       5 5 0
 ROTAT       0 0 0
-SCALE       .01 10 10
+SCALE       .1 10 10
 
 OBJECT 5
 sphere
-material 4
+material 5
 frame 0
-TRANS       0 2 0
+TRANS       0 2 -2
 ROTAT       0 180 0
 SCALE       3 3 3
 
 OBJECT 6
 sphere
-material 3
+material 4
 frame 0
 TRANS       2 5 2
 ROTAT       0 180 0
@@ -180,11 +191,11 @@ TRANS       -2 5 -2
 ROTAT       0 180 0
 SCALE       3 3 3
 
-
 OBJECT 8
 cube
 material 8 
 frame 0
 TRANS       0 10 0
 ROTAT       0 0 90
-SCALE       .3 3 3
\ No newline at end of file
+SCALE       0.3 3 3
+
diff --git a/data/scenes/sampleScene2.txt b/data/scenes/sampleScene2.txt
new file mode 100644
index 0000000..16c054e
--- /dev/null
+++ b/data/scenes/sampleScene2.txt
@@ -0,0 +1,217 @@
+MATERIAL 0				//white diffuse
+RGB         1 1 1       
+SPECEX      0      
+SPECRGB     1 1 1      
+REFL        0       
+REFR        0        
+REFRIOR     0       
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   0
+TEXTURE		0
+
+MATERIAL 1 				//red diffuse
+RGB         .63 .06 .04       
+SPECEX      0      
+SPECRGB     1 1 1      
+REFL        0       
+REFR        0        
+REFRIOR     0       
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   0
+TEXTURE		0
+
+MATERIAL 2 				//green diffuse
+RGB         .15 .48 .09      
+SPECEX      0      
+SPECRGB     1 1 1      
+REFL        0       
+REFR        0        
+REFRIOR     0       
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   0
+TEXTURE		0
+
+MATERIAL 3 				//red glossy
+RGB         .63 .06 .04      
+SPECEX      0      
+SPECRGB     1 1 1       
+REFL        0       
+REFR        0        
+REFRIOR     2       
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   0
+TEXTURE		0
+
+MATERIAL 4 				// mirror
+RGB         1 1 1     
+SPECEX      0      
+SPECRGB     1 1 1      
+REFL        1       
+REFR        0        
+REFRIOR     2      
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   0
+TEXTURE		0
+
+MATERIAL 5 				//glass
+RGB         1 1 1    
+SPECEX      0      
+SPECRGB     1 1 1      
+REFL        1       
+REFR        1        
+REFRIOR     2.2       
+SCATTER     0        
+ABSCOEFF    .02 5.1 5.7      
+RSCTCOEFF   13
+EMITTANCE   0
+TEXTURE		0
+
+MATERIAL 6 				//green glossy
+RGB         .15 .48 .09      
+SPECEX      0      
+SPECRGB     1 1 1     
+REFL        0       
+REFR        0        
+REFRIOR     2.6       
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   0
+TEXTURE		0
+
+MATERIAL 7				//light
+RGB         1 1 1       
+SPECEX      0      
+SPECRGB     0 0 0       
+REFL        0       
+REFR        0        
+REFRIOR     0       
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   5
+TEXTURE		0
+
+MATERIAL 8				//light
+RGB         1 1 1       
+SPECEX      0      
+SPECRGB     0 0 0       
+REFL        0       
+REFR        0        
+REFRIOR     0       
+SCATTER     0        
+ABSCOEFF    0 0 0      
+RSCTCOEFF   0
+EMITTANCE   15
+TEXTURE		0
+
+CAMERA
+RES         1000 1000
+FOVY        25
+ITERATIONS  10000
+FILE        test.bmp
+frame 0
+EYE         0 4.5 12
+VIEW        0 0 -1
+UP          0 1 0
+FOCL		7.5
+APTR		0.5
+
+OBJECT 0
+cube
+material 0
+frame 0
+TRANS       0 0 0
+ROTAT       0 0 90
+SCALE       .1 10 10 
+
+OBJECT 1
+cube
+material 0
+frame 0
+TRANS       0 5 -5
+ROTAT       0 90 0
+SCALE       .1 10 10
+
+OBJECT 2
+cube
+material 0
+frame 0
+TRANS       0 10 0
+ROTAT       0 0 90
+SCALE       .1 10 10
+
+OBJECT 3
+cube
+material 1
+frame 0
+TRANS       -5 5 0
+ROTAT       0 0 0
+SCALE       .1 10 10
+
+OBJECT 4
+cube
+material 2
+frame 0
+TRANS       5 5 0
+ROTAT       0 0 0
+SCALE       .1 10 10
+
+OBJECT 5
+cube
+material 8 
+frame 0
+TRANS       0 10 0
+ROTAT       0 0 90
+SCALE       0.4 4 4
+
+OBJECT 6
+diamond.obj
+material 4
+frame 0
+TRANS       0 5 -4
+ROTAT       20 30 50
+SCALE       3 3 3
+
+OBJECT 7
+sphere
+material 5
+frame 0
+TRANS       0 2 2
+ROTAT       0 180 0
+SCALE       3 3 3
+
+OBJECT 8
+sphere
+material 4
+frame 0
+TRANS       2.5 5 2
+ROTAT       0 0 0
+SCALE       2.5 2.5 2.5
+
+OBJECT 9
+sphere
+material 6
+frame 0
+TRANS       -3 5 -2
+ROTAT       0 180 0
+SCALE       3 3 3
+
+OBJECT 10
+sphere
+material 1
+frame 0
+TRANS       -2 1 0
+ROTAT       0 180 0
+SCALE       2 2 2
+
diff --git a/depth_field.jpg b/depth_field.jpg
new file mode 100644
index 0000000..ac56121
Binary files /dev/null and b/depth_field.jpg differ
diff --git a/external/include/SOIL/SOIL.c b/external/include/SOIL/SOIL.c
new file mode 100644
index 0000000..1ee4daf
--- /dev/null
+++ b/external/include/SOIL/SOIL.c
@@ -0,0 +1,2024 @@
+/*
+	Jonathan Dummer
+	2007-07-26-10.36
+
+	Simple OpenGL Image Library
+
+	Public Domain
+	using Sean Barret's stb_image as a base
+
+	Thanks to:
+	* Sean Barret - for the awesome stb_image
+	* Dan Venkitachalam - for finding some non-compliant DDS files, and patching some explicit casts
+	* everybody at gamedev.net
+*/
+
+#define SOIL_CHECK_FOR_GL_ERRORS 0
+
+#ifdef WIN32
+	#define WIN32_LEAN_AND_MEAN
+	#include <windows.h>
+	#include <wingdi.h>
+	#include <GL/gl.h>
+#elif defined(__APPLE__) || defined(__APPLE_CC__)
+	/*	I can't test this Apple stuff!	*/
+	#include <OpenGL/gl.h>
+	#include <Carbon/Carbon.h>
+	#define APIENTRY
+#else
+	#include <GL/gl.h>
+	#include <GL/glx.h>
+#endif
+
+#include "SOIL.h"
+#include "stb_image_aug.h"
+#include "image_helper.h"
+#include "image_DXT.h"
+
+#include <stdlib.h>
+#include <string.h>
+
+/*	error reporting	*/
+char *result_string_pointer = "SOIL initialized";
+
+/*	for loading cube maps	*/
+enum{
+	SOIL_CAPABILITY_UNKNOWN = -1,
+	SOIL_CAPABILITY_NONE = 0,
+	SOIL_CAPABILITY_PRESENT = 1
+};
+static int has_cubemap_capability = SOIL_CAPABILITY_UNKNOWN;
+int query_cubemap_capability( void );
+#define SOIL_TEXTURE_WRAP_R					0x8072
+#define SOIL_CLAMP_TO_EDGE					0x812F
+#define SOIL_NORMAL_MAP						0x8511
+#define SOIL_REFLECTION_MAP					0x8512
+#define SOIL_TEXTURE_CUBE_MAP				0x8513
+#define SOIL_TEXTURE_BINDING_CUBE_MAP		0x8514
+#define SOIL_TEXTURE_CUBE_MAP_POSITIVE_X	0x8515
+#define SOIL_TEXTURE_CUBE_MAP_NEGATIVE_X	0x8516
+#define SOIL_TEXTURE_CUBE_MAP_POSITIVE_Y	0x8517
+#define SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Y	0x8518
+#define SOIL_TEXTURE_CUBE_MAP_POSITIVE_Z	0x8519
+#define SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Z	0x851A
+#define SOIL_PROXY_TEXTURE_CUBE_MAP			0x851B
+#define SOIL_MAX_CUBE_MAP_TEXTURE_SIZE		0x851C
+/*	for non-power-of-two texture	*/
+static int has_NPOT_capability = SOIL_CAPABILITY_UNKNOWN;
+int query_NPOT_capability( void );
+/*	for texture rectangles	*/
+static int has_tex_rectangle_capability = SOIL_CAPABILITY_UNKNOWN;
+int query_tex_rectangle_capability( void );
+#define SOIL_TEXTURE_RECTANGLE_ARB				0x84F5
+#define SOIL_MAX_RECTANGLE_TEXTURE_SIZE_ARB		0x84F8
+/*	for using DXT compression	*/
+static int has_DXT_capability = SOIL_CAPABILITY_UNKNOWN;
+int query_DXT_capability( void );
+#define SOIL_RGB_S3TC_DXT1		0x83F0
+#define SOIL_RGBA_S3TC_DXT1		0x83F1
+#define SOIL_RGBA_S3TC_DXT3		0x83F2
+#define SOIL_RGBA_S3TC_DXT5		0x83F3
+typedef void (APIENTRY * P_SOIL_GLCOMPRESSEDTEXIMAGE2DPROC) (GLenum target, GLint level, GLenum internalformat, GLsizei width, GLsizei height, GLint border, GLsizei imageSize, const GLvoid * data);
+P_SOIL_GLCOMPRESSEDTEXIMAGE2DPROC soilGlCompressedTexImage2D = NULL;
+unsigned int SOIL_direct_load_DDS(
+		const char *filename,
+		unsigned int reuse_texture_ID,
+		int flags,
+		int loading_as_cubemap );
+unsigned int SOIL_direct_load_DDS_from_memory(
+		const unsigned char *const buffer,
+		int buffer_length,
+		unsigned int reuse_texture_ID,
+		int flags,
+		int loading_as_cubemap );
+/*	other functions	*/
+unsigned int
+	SOIL_internal_create_OGL_texture
+	(
+		const unsigned char *const data,
+		int width, int height, int channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags,
+		unsigned int opengl_texture_type,
+		unsigned int opengl_texture_target,
+		unsigned int texture_check_size_enum
+	);
+
+/*	and the code magic begins here [8^)	*/
+unsigned int
+	SOIL_load_OGL_texture
+	(
+		const char *filename,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	int width, height, channels;
+	unsigned int tex_id;
+	/*	does the user want direct uploading of the image as a DDS file?	*/
+	if( flags & SOIL_FLAG_DDS_LOAD_DIRECT )
+	{
+		/*	1st try direct loading of the image as a DDS file
+			note: direct uploading will only load what is in the
+			DDS file, no MIPmaps will be generated, the image will
+			not be flipped, etc.	*/
+		tex_id = SOIL_direct_load_DDS( filename, reuse_texture_ID, flags, 0 );
+		if( tex_id )
+		{
+			/*	hey, it worked!!	*/
+			return tex_id;
+		}
+	}
+	/*	try to load the image	*/
+	img = SOIL_load_image( filename, &width, &height, &channels, force_channels );
+	/*	channels holds the original number of channels, which may have been forced	*/
+	if( (force_channels >= 1) && (force_channels <= 4) )
+	{
+		channels = force_channels;
+	}
+	if( NULL == img )
+	{
+		/*	image loading failed	*/
+		result_string_pointer = stbi_failure_reason();
+		return 0;
+	}
+	/*	OK, make it a texture!	*/
+	tex_id = SOIL_internal_create_OGL_texture(
+			img, width, height, channels,
+			reuse_texture_ID, flags,
+			GL_TEXTURE_2D, GL_TEXTURE_2D,
+			GL_MAX_TEXTURE_SIZE );
+	/*	and nuke the image data	*/
+	SOIL_free_image_data( img );
+	/*	and return the handle, such as it is	*/
+	return tex_id;
+}
+
+unsigned int
+	SOIL_load_OGL_HDR_texture
+	(
+		const char *filename,
+		int fake_HDR_format,
+		int rescale_to_max,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	int width, height, channels;
+	unsigned int tex_id;
+	/*	no direct uploading of the image as a DDS file	*/
+	/* error check */
+	if( (fake_HDR_format != SOIL_HDR_RGBE) &&
+		(fake_HDR_format != SOIL_HDR_RGBdivA) &&
+		(fake_HDR_format != SOIL_HDR_RGBdivA2) )
+	{
+		result_string_pointer = "Invalid fake HDR format specified";
+		return 0;
+	}
+	/*	try to load the image (only the HDR type) */
+	img = stbi_hdr_load_rgbe( filename, &width, &height, &channels, 4 );
+	/*	channels holds the original number of channels, which may have been forced	*/
+	if( NULL == img )
+	{
+		/*	image loading failed	*/
+		result_string_pointer = stbi_failure_reason();
+		return 0;
+	}
+	/* the load worked, do I need to convert it? */
+	if( fake_HDR_format == SOIL_HDR_RGBdivA )
+	{
+		RGBE_to_RGBdivA( img, width, height, rescale_to_max );
+	} else if( fake_HDR_format == SOIL_HDR_RGBdivA2 )
+	{
+		RGBE_to_RGBdivA2( img, width, height, rescale_to_max );
+	}
+	/*	OK, make it a texture!	*/
+	tex_id = SOIL_internal_create_OGL_texture(
+			img, width, height, channels,
+			reuse_texture_ID, flags,
+			GL_TEXTURE_2D, GL_TEXTURE_2D,
+			GL_MAX_TEXTURE_SIZE );
+	/*	and nuke the image data	*/
+	SOIL_free_image_data( img );
+	/*	and return the handle, such as it is	*/
+	return tex_id;
+}
+
+unsigned int
+	SOIL_load_OGL_texture_from_memory
+	(
+		const unsigned char *const buffer,
+		int buffer_length,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	int width, height, channels;
+	unsigned int tex_id;
+	/*	does the user want direct uploading of the image as a DDS file?	*/
+	if( flags & SOIL_FLAG_DDS_LOAD_DIRECT )
+	{
+		/*	1st try direct loading of the image as a DDS file
+			note: direct uploading will only load what is in the
+			DDS file, no MIPmaps will be generated, the image will
+			not be flipped, etc.	*/
+		tex_id = SOIL_direct_load_DDS_from_memory(
+				buffer, buffer_length,
+				reuse_texture_ID, flags, 0 );
+		if( tex_id )
+		{
+			/*	hey, it worked!!	*/
+			return tex_id;
+		}
+	}
+	/*	try to load the image	*/
+	img = SOIL_load_image_from_memory(
+					buffer, buffer_length,
+					&width, &height, &channels,
+					force_channels );
+	/*	channels holds the original number of channels, which may have been forced	*/
+	if( (force_channels >= 1) && (force_channels <= 4) )
+	{
+		channels = force_channels;
+	}
+	if( NULL == img )
+	{
+		/*	image loading failed	*/
+		result_string_pointer = stbi_failure_reason();
+		return 0;
+	}
+	/*	OK, make it a texture!	*/
+	tex_id = SOIL_internal_create_OGL_texture(
+			img, width, height, channels,
+			reuse_texture_ID, flags,
+			GL_TEXTURE_2D, GL_TEXTURE_2D,
+			GL_MAX_TEXTURE_SIZE );
+	/*	and nuke the image data	*/
+	SOIL_free_image_data( img );
+	/*	and return the handle, such as it is	*/
+	return tex_id;
+}
+
+unsigned int
+	SOIL_load_OGL_cubemap
+	(
+		const char *x_pos_file,
+		const char *x_neg_file,
+		const char *y_pos_file,
+		const char *y_neg_file,
+		const char *z_pos_file,
+		const char *z_neg_file,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	int width, height, channels;
+	unsigned int tex_id;
+	/*	error checking	*/
+	if( (x_pos_file == NULL) ||
+		(x_neg_file == NULL) ||
+		(y_pos_file == NULL) ||
+		(y_neg_file == NULL) ||
+		(z_pos_file == NULL) ||
+		(z_neg_file == NULL) )
+	{
+		result_string_pointer = "Invalid cube map files list";
+		return 0;
+	}
+	/*	capability checking	*/
+	if( query_cubemap_capability() != SOIL_CAPABILITY_PRESENT )
+	{
+		result_string_pointer = "No cube map capability present";
+		return 0;
+	}
+	/*	1st face: try to load the image	*/
+	img = SOIL_load_image( x_pos_file, &width, &height, &channels, force_channels );
+	/*	channels holds the original number of channels, which may have been forced	*/
+	if( (force_channels >= 1) && (force_channels <= 4) )
+	{
+		channels = force_channels;
+	}
+	if( NULL == img )
+	{
+		/*	image loading failed	*/
+		result_string_pointer = stbi_failure_reason();
+		return 0;
+	}
+	/*	upload the texture, and create a texture ID if necessary	*/
+	tex_id = SOIL_internal_create_OGL_texture(
+			img, width, height, channels,
+			reuse_texture_ID, flags,
+			SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_POSITIVE_X,
+			SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+	/*	and nuke the image data	*/
+	SOIL_free_image_data( img );
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	2nd face: try to load the image	*/
+		img = SOIL_load_image( x_neg_file, &width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_NEGATIVE_X,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	3rd face: try to load the image	*/
+		img = SOIL_load_image( y_pos_file, &width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_POSITIVE_Y,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	4th face: try to load the image	*/
+		img = SOIL_load_image( y_neg_file, &width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Y,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	5th face: try to load the image	*/
+		img = SOIL_load_image( z_pos_file, &width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_POSITIVE_Z,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	6th face: try to load the image	*/
+		img = SOIL_load_image( z_neg_file, &width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Z,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	and return the handle, such as it is	*/
+	return tex_id;
+}
+
+unsigned int
+	SOIL_load_OGL_cubemap_from_memory
+	(
+		const unsigned char *const x_pos_buffer,
+		int x_pos_buffer_length,
+		const unsigned char *const x_neg_buffer,
+		int x_neg_buffer_length,
+		const unsigned char *const y_pos_buffer,
+		int y_pos_buffer_length,
+		const unsigned char *const y_neg_buffer,
+		int y_neg_buffer_length,
+		const unsigned char *const z_pos_buffer,
+		int z_pos_buffer_length,
+		const unsigned char *const z_neg_buffer,
+		int z_neg_buffer_length,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	int width, height, channels;
+	unsigned int tex_id;
+	/*	error checking	*/
+	if( (x_pos_buffer == NULL) ||
+		(x_neg_buffer == NULL) ||
+		(y_pos_buffer == NULL) ||
+		(y_neg_buffer == NULL) ||
+		(z_pos_buffer == NULL) ||
+		(z_neg_buffer == NULL) )
+	{
+		result_string_pointer = "Invalid cube map buffers list";
+		return 0;
+	}
+	/*	capability checking	*/
+	if( query_cubemap_capability() != SOIL_CAPABILITY_PRESENT )
+	{
+		result_string_pointer = "No cube map capability present";
+		return 0;
+	}
+	/*	1st face: try to load the image	*/
+	img = SOIL_load_image_from_memory(
+			x_pos_buffer, x_pos_buffer_length,
+			&width, &height, &channels, force_channels );
+	/*	channels holds the original number of channels, which may have been forced	*/
+	if( (force_channels >= 1) && (force_channels <= 4) )
+	{
+		channels = force_channels;
+	}
+	if( NULL == img )
+	{
+		/*	image loading failed	*/
+		result_string_pointer = stbi_failure_reason();
+		return 0;
+	}
+	/*	upload the texture, and create a texture ID if necessary	*/
+	tex_id = SOIL_internal_create_OGL_texture(
+			img, width, height, channels,
+			reuse_texture_ID, flags,
+			SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_POSITIVE_X,
+			SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+	/*	and nuke the image data	*/
+	SOIL_free_image_data( img );
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	2nd face: try to load the image	*/
+		img = SOIL_load_image_from_memory(
+				x_neg_buffer, x_neg_buffer_length,
+				&width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_NEGATIVE_X,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	3rd face: try to load the image	*/
+		img = SOIL_load_image_from_memory(
+				y_pos_buffer, y_pos_buffer_length,
+				&width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_POSITIVE_Y,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	4th face: try to load the image	*/
+		img = SOIL_load_image_from_memory(
+				y_neg_buffer, y_neg_buffer_length,
+				&width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Y,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	5th face: try to load the image	*/
+		img = SOIL_load_image_from_memory(
+				z_pos_buffer, z_pos_buffer_length,
+				&width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_POSITIVE_Z,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	continue?	*/
+	if( tex_id != 0 )
+	{
+		/*	6th face: try to load the image	*/
+		img = SOIL_load_image_from_memory(
+				z_neg_buffer, z_neg_buffer_length,
+				&width, &height, &channels, force_channels );
+		/*	channels holds the original number of channels, which may have been forced	*/
+		if( (force_channels >= 1) && (force_channels <= 4) )
+		{
+			channels = force_channels;
+		}
+		if( NULL == img )
+		{
+			/*	image loading failed	*/
+			result_string_pointer = stbi_failure_reason();
+			return 0;
+		}
+		/*	upload the texture, but reuse the assigned texture ID	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				img, width, height, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP, SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Z,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+		/*	and nuke the image data	*/
+		SOIL_free_image_data( img );
+	}
+	/*	and return the handle, such as it is	*/
+	return tex_id;
+}
+
+unsigned int
+	SOIL_load_OGL_single_cubemap
+	(
+		const char *filename,
+		const char face_order[6],
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	int width, height, channels, i;
+	unsigned int tex_id = 0;
+	/*	error checking	*/
+	if( filename == NULL )
+	{
+		result_string_pointer = "Invalid single cube map file name";
+		return 0;
+	}
+	/*	does the user want direct uploading of the image as a DDS file?	*/
+	if( flags & SOIL_FLAG_DDS_LOAD_DIRECT )
+	{
+		/*	1st try direct loading of the image as a DDS file
+			note: direct uploading will only load what is in the
+			DDS file, no MIPmaps will be generated, the image will
+			not be flipped, etc.	*/
+		tex_id = SOIL_direct_load_DDS( filename, reuse_texture_ID, flags, 1 );
+		if( tex_id )
+		{
+			/*	hey, it worked!!	*/
+			return tex_id;
+		}
+	}
+	/*	face order checking	*/
+	for( i = 0; i < 6; ++i )
+	{
+		if( (face_order[i] != 'N') &&
+			(face_order[i] != 'S') &&
+			(face_order[i] != 'W') &&
+			(face_order[i] != 'E') &&
+			(face_order[i] != 'U') &&
+			(face_order[i] != 'D') )
+		{
+			result_string_pointer = "Invalid single cube map face order";
+			return 0;
+		}
+	}
+	/*	capability checking	*/
+	if( query_cubemap_capability() != SOIL_CAPABILITY_PRESENT )
+	{
+		result_string_pointer = "No cube map capability present";
+		return 0;
+	}
+	/*	1st off, try to load the full image	*/
+	img = SOIL_load_image( filename, &width, &height, &channels, force_channels );
+	/*	channels holds the original number of channels, which may have been forced	*/
+	if( (force_channels >= 1) && (force_channels <= 4) )
+	{
+		channels = force_channels;
+	}
+	if( NULL == img )
+	{
+		/*	image loading failed	*/
+		result_string_pointer = stbi_failure_reason();
+		return 0;
+	}
+	/*	now, does this image have the right dimensions?	*/
+	if( (width != 6*height) &&
+		(6*width != height) )
+	{
+		SOIL_free_image_data( img );
+		result_string_pointer = "Single cubemap image must have a 6:1 ratio";
+		return 0;
+	}
+	/*	try the image split and create	*/
+	tex_id = SOIL_create_OGL_single_cubemap(
+			img, width, height, channels,
+			face_order, reuse_texture_ID, flags
+			);
+	/*	nuke the temporary image data and return the texture handle	*/
+	SOIL_free_image_data( img );
+	return tex_id;
+}
+
+unsigned int
+	SOIL_load_OGL_single_cubemap_from_memory
+	(
+		const unsigned char *const buffer,
+		int buffer_length,
+		const char face_order[6],
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	int width, height, channels, i;
+	unsigned int tex_id = 0;
+	/*	error checking	*/
+	if( buffer == NULL )
+	{
+		result_string_pointer = "Invalid single cube map buffer";
+		return 0;
+	}
+	/*	does the user want direct uploading of the image as a DDS file?	*/
+	if( flags & SOIL_FLAG_DDS_LOAD_DIRECT )
+	{
+		/*	1st try direct loading of the image as a DDS file
+			note: direct uploading will only load what is in the
+			DDS file, no MIPmaps will be generated, the image will
+			not be flipped, etc.	*/
+		tex_id = SOIL_direct_load_DDS_from_memory(
+				buffer, buffer_length,
+				reuse_texture_ID, flags, 1 );
+		if( tex_id )
+		{
+			/*	hey, it worked!!	*/
+			return tex_id;
+		}
+	}
+	/*	face order checking	*/
+	for( i = 0; i < 6; ++i )
+	{
+		if( (face_order[i] != 'N') &&
+			(face_order[i] != 'S') &&
+			(face_order[i] != 'W') &&
+			(face_order[i] != 'E') &&
+			(face_order[i] != 'U') &&
+			(face_order[i] != 'D') )
+		{
+			result_string_pointer = "Invalid single cube map face order";
+			return 0;
+		}
+	}
+	/*	capability checking	*/
+	if( query_cubemap_capability() != SOIL_CAPABILITY_PRESENT )
+	{
+		result_string_pointer = "No cube map capability present";
+		return 0;
+	}
+	/*	1st off, try to load the full image	*/
+	img = SOIL_load_image_from_memory(
+			buffer, buffer_length,
+			&width, &height, &channels,
+			force_channels );
+	/*	channels holds the original number of channels, which may have been forced	*/
+	if( (force_channels >= 1) && (force_channels <= 4) )
+	{
+		channels = force_channels;
+	}
+	if( NULL == img )
+	{
+		/*	image loading failed	*/
+		result_string_pointer = stbi_failure_reason();
+		return 0;
+	}
+	/*	now, does this image have the right dimensions?	*/
+	if( (width != 6*height) &&
+		(6*width != height) )
+	{
+		SOIL_free_image_data( img );
+		result_string_pointer = "Single cubemap image must have a 6:1 ratio";
+		return 0;
+	}
+	/*	try the image split and create	*/
+	tex_id = SOIL_create_OGL_single_cubemap(
+			img, width, height, channels,
+			face_order, reuse_texture_ID, flags
+			);
+	/*	nuke the temporary image data and return the texture handle	*/
+	SOIL_free_image_data( img );
+	return tex_id;
+}
+
+unsigned int
+	SOIL_create_OGL_single_cubemap
+	(
+		const unsigned char *const data,
+		int width, int height, int channels,
+		const char face_order[6],
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	variables	*/
+	unsigned char* sub_img;
+	int dw, dh, sz, i;
+	unsigned int tex_id;
+	/*	error checking	*/
+	if( data == NULL )
+	{
+		result_string_pointer = "Invalid single cube map image data";
+		return 0;
+	}
+	/*	face order checking	*/
+	for( i = 0; i < 6; ++i )
+	{
+		if( (face_order[i] != 'N') &&
+			(face_order[i] != 'S') &&
+			(face_order[i] != 'W') &&
+			(face_order[i] != 'E') &&
+			(face_order[i] != 'U') &&
+			(face_order[i] != 'D') )
+		{
+			result_string_pointer = "Invalid single cube map face order";
+			return 0;
+		}
+	}
+	/*	capability checking	*/
+	if( query_cubemap_capability() != SOIL_CAPABILITY_PRESENT )
+	{
+		result_string_pointer = "No cube map capability present";
+		return 0;
+	}
+	/*	now, does this image have the right dimensions?	*/
+	if( (width != 6*height) &&
+		(6*width != height) )
+	{
+		result_string_pointer = "Single cubemap image must have a 6:1 ratio";
+		return 0;
+	}
+	/*	which way am I stepping?	*/
+	if( width > height )
+	{
+		dw = height;
+		dh = 0;
+	} else
+	{
+		dw = 0;
+		dh = width;
+	}
+	sz = dw+dh;
+	sub_img = (unsigned char *)malloc( sz*sz*channels );
+	/*	do the splitting and uploading	*/
+	tex_id = reuse_texture_ID;
+	for( i = 0; i < 6; ++i )
+	{
+		int x, y, idx = 0;
+		unsigned int cubemap_target = 0;
+		/*	copy in the sub-image	*/
+		for( y = i*dh; y < i*dh+sz; ++y )
+		{
+			for( x = i*dw*channels; x < (i*dw+sz)*channels; ++x )
+			{
+				sub_img[idx++] = data[y*width*channels+x];
+			}
+		}
+		/*	what is my texture target?
+			remember, this coordinate system is
+			LHS if viewed from inside the cube!	*/
+		switch( face_order[i] )
+		{
+		case 'N':
+			cubemap_target = SOIL_TEXTURE_CUBE_MAP_POSITIVE_Z;
+			break;
+		case 'S':
+			cubemap_target = SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Z;
+			break;
+		case 'W':
+			cubemap_target = SOIL_TEXTURE_CUBE_MAP_NEGATIVE_X;
+			break;
+		case 'E':
+			cubemap_target = SOIL_TEXTURE_CUBE_MAP_POSITIVE_X;
+			break;
+		case 'U':
+			cubemap_target = SOIL_TEXTURE_CUBE_MAP_POSITIVE_Y;
+			break;
+		case 'D':
+			cubemap_target = SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Y;
+			break;
+		}
+		/*	upload it as a texture	*/
+		tex_id = SOIL_internal_create_OGL_texture(
+				sub_img, sz, sz, channels,
+				tex_id, flags,
+				SOIL_TEXTURE_CUBE_MAP,
+				cubemap_target,
+				SOIL_MAX_CUBE_MAP_TEXTURE_SIZE );
+	}
+	/*	and nuke the sub-image data (the caller still owns the source image)	*/
+	SOIL_free_image_data( sub_img );
+	/*	and return the handle, such as it is	*/
+	return tex_id;
+}
+
+unsigned int
+	SOIL_create_OGL_texture
+	(
+		const unsigned char *const data,
+		int width, int height, int channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	)
+{
+	/*	wrapper function for 2D textures	*/
+	return SOIL_internal_create_OGL_texture(
+				data, width, height, channels,
+				reuse_texture_ID, flags,
+				GL_TEXTURE_2D, GL_TEXTURE_2D,
+				GL_MAX_TEXTURE_SIZE );
+}
+
+#if SOIL_CHECK_FOR_GL_ERRORS
+void check_for_GL_errors( const char *calling_location )
+{
+	/*	check for errors	*/
+	GLenum err_code = glGetError();
+	while( GL_NO_ERROR != err_code )
+	{
+		printf( "OpenGL Error @ %s: %i\n", calling_location, err_code );
+		err_code = glGetError();
+	}
+}
+#else
+void check_for_GL_errors( const char *calling_location )
+{
+	/*	no check for errors	*/
+}
+#endif
+
+unsigned int
+	SOIL_internal_create_OGL_texture
+	(
+		const unsigned char *const data,
+		int width, int height, int channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags,
+		unsigned int opengl_texture_type,
+		unsigned int opengl_texture_target,
+		unsigned int texture_check_size_enum
+	)
+{
+	/*	variables	*/
+	unsigned char* img;
+	unsigned int tex_id;
+	unsigned int internal_texture_format = 0, original_texture_format = 0;
+	int DXT_mode = SOIL_CAPABILITY_UNKNOWN;
+	int max_supported_size;
+	/*	If the user wants to use the texture rectangle I kill a few flags	*/
+	if( flags & SOIL_FLAG_TEXTURE_RECTANGLE )
+	{
+		/*	well, the user asked for it, can we do that?	*/
+		if( query_tex_rectangle_capability() == SOIL_CAPABILITY_PRESENT )
+		{
+			/*	only allow this if the user is _NOT_ trying to do a cubemap!	*/
+			if( opengl_texture_type == GL_TEXTURE_2D )
+			{
+				/*	clean out the flags that cannot be used with texture rectangles	*/
+				flags &= ~(
+						SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS |
+						SOIL_FLAG_TEXTURE_REPEATS
+					);
+				/*	and change my target	*/
+				opengl_texture_target = SOIL_TEXTURE_RECTANGLE_ARB;
+				opengl_texture_type = SOIL_TEXTURE_RECTANGLE_ARB;
+			} else
+			{
+				/*	not allowed for any other uses (yes, I'm looking at you, cubemaps!)	*/
+				flags &= ~SOIL_FLAG_TEXTURE_RECTANGLE;
+			}
+
+		} else
+		{
+			/*	can't do it, and that is a breakable offense (uv coords use pixels instead of [0,1]!)	*/
+			result_string_pointer = "Texture Rectangle extension unsupported";
+			return 0;
+		}
+	}
+	/*	create a copy of the image data	*/
+	img = (unsigned char*)malloc( width*height*channels );
+	memcpy( img, data, width*height*channels );
+	/*	does the user want me to invert the image?	*/
+	if( flags & SOIL_FLAG_INVERT_Y )
+	{
+		int i, j;
+		for( j = 0; j*2 < height; ++j )
+		{
+			int index1 = j * width * channels;
+			int index2 = (height - 1 - j) * width * channels;
+			for( i = width * channels; i > 0; --i )
+			{
+				unsigned char temp = img[index1];
+				img[index1] = img[index2];
+				img[index2] = temp;
+				++index1;
+				++index2;
+			}
+		}
+	}
+	/*	does the user want me to scale the colors into the NTSC safe RGB range?	*/
+	if( flags & SOIL_FLAG_NTSC_SAFE_RGB )
+	{
+		scale_image_RGB_to_NTSC_safe( img, width, height, channels );
+	}
+	/*	does the user want me to convert from straight to pre-multiplied alpha?
+		(and do we even _have_ alpha?)	*/
+	if( flags & SOIL_FLAG_MULTIPLY_ALPHA )
+	{
+		int i;
+		switch( channels )
+		{
+		case 2:
+			for( i = 0; i < 2*width*height; i += 2 )
+			{
+				img[i] = (img[i] * img[i+1] + 128) >> 8;
+			}
+			break;
+		case 4:
+			for( i = 0; i < 4*width*height; i += 4 )
+			{
+				img[i+0] = (img[i+0] * img[i+3] + 128) >> 8;
+				img[i+1] = (img[i+1] * img[i+3] + 128) >> 8;
+				img[i+2] = (img[i+2] * img[i+3] + 128) >> 8;
+			}
+			break;
+		default:
+			/*	no other number of channels contains alpha data	*/
+			break;
+		}
+	}
+	/*	if the user can't support NPOT textures, make sure we force the POT option	*/
+	if( (query_NPOT_capability() == SOIL_CAPABILITY_NONE) &&
+		!(flags & SOIL_FLAG_TEXTURE_RECTANGLE) )
+	{
+		/*	add in the POT flag */
+		flags |= SOIL_FLAG_POWER_OF_TWO;
+	}
+	/*	how large of a texture can this OpenGL implementation handle?	*/
+	/*	texture_check_size_enum will be GL_MAX_TEXTURE_SIZE or SOIL_MAX_CUBE_MAP_TEXTURE_SIZE	*/
+	glGetIntegerv( texture_check_size_enum, &max_supported_size );
+	/*	do I need to make it a power of 2?	*/
+	if(
+		(flags & SOIL_FLAG_POWER_OF_TWO) ||	/*	user asked for it	*/
+		(flags & SOIL_FLAG_MIPMAPS) ||		/*	need it for the MIP-maps	*/
+		(width > max_supported_size) ||		/*	it's too big, (make sure it's	*/
+		(height > max_supported_size) )		/*	2^n for later down-sampling)	*/
+	{
+		int new_width = 1;
+		int new_height = 1;
+		while( new_width < width )
+		{
+			new_width *= 2;
+		}
+		while( new_height < height )
+		{
+			new_height *= 2;
+		}
+		/*	still?	*/
+		if( (new_width != width) || (new_height != height) )
+		{
+			/*	yep, resize	*/
+			unsigned char *resampled = (unsigned char*)malloc( channels*new_width*new_height );
+			up_scale_image(
+					img, width, height, channels,
+					resampled, new_width, new_height );
+			/*	NOTE: this is for debug only!	*/
+			/*
+			SOIL_save_image( "\\showme.bmp", SOIL_SAVE_TYPE_BMP,
+							new_width, new_height, channels,
+							resampled );
+			*/
+			/*	nuke the old guy, then point it at the new guy	*/
+			SOIL_free_image_data( img );
+			img = resampled;
+			width = new_width;
+			height = new_height;
+		}
+	}
+	/*	now, if it is too large...	*/
+	if( (width > max_supported_size) || (height > max_supported_size) )
+	{
+		/*	I've already made it a power of two, so simply use the MIPmapping
+			code to reduce its size to the allowable maximum.	*/
+		unsigned char *resampled;
+		int reduce_block_x = 1, reduce_block_y = 1;
+		int new_width, new_height;
+		if( width > max_supported_size )
+		{
+			reduce_block_x = width / max_supported_size;
+		}
+		if( height > max_supported_size )
+		{
+			reduce_block_y = height / max_supported_size;
+		}
+		new_width = width / reduce_block_x;
+		new_height = height / reduce_block_y;
+		resampled = (unsigned char*)malloc( channels*new_width*new_height );
+		/*	perform the actual reduction	*/
+		mipmap_image(	img, width, height, channels,
+						resampled, reduce_block_x, reduce_block_y );
+		/*	nuke the old guy, then point it at the new guy	*/
+		SOIL_free_image_data( img );
+		img = resampled;
+		width = new_width;
+		height = new_height;
+	}
+	/*	does the user want us to use YCoCg color space?	*/
+	if( flags & SOIL_FLAG_CoCg_Y )
+	{
+		/*	this will only work with RGB and RGBA images */
+		convert_RGB_to_YCoCg( img, width, height, channels );
+		/*
+		save_image_as_DDS( "CoCg_Y.dds", width, height, channels, img );
+		*/
+	}
+	/*	create the OpenGL texture ID handle
+    	(note: allowing a forced texture ID lets me reload a texture)	*/
+    tex_id = reuse_texture_ID;
+    if( tex_id == 0 )
+    {
+		glGenTextures( 1, &tex_id );
+    }
+	check_for_GL_errors( "glGenTextures" );
+	/* Note: sometimes glGenTextures fails (usually no OpenGL context)	*/
+	if( tex_id )
+	{
+		/*	and what type am I using as the internal texture format?	*/
+		switch( channels )
+		{
+		case 1:
+			original_texture_format = GL_LUMINANCE;
+			break;
+		case 2:
+			original_texture_format = GL_LUMINANCE_ALPHA;
+			break;
+		case 3:
+			original_texture_format = GL_RGB;
+			break;
+		case 4:
+			original_texture_format = GL_RGBA;
+			break;
+		}
+		internal_texture_format = original_texture_format;
+		/*	does the user want me to, and can I, save as DXT?	*/
+		if( flags & SOIL_FLAG_COMPRESS_TO_DXT )
+		{
+			DXT_mode = query_DXT_capability();
+			if( DXT_mode == SOIL_CAPABILITY_PRESENT )
+			{
+				/*	I can use DXT, whether I compress it or OpenGL does	*/
+				if( (channels & 1) == 1 )
+				{
+					/*	1 or 3 channels = DXT1	*/
+					internal_texture_format = SOIL_RGB_S3TC_DXT1;
+				} else
+				{
+					/*	2 or 4 channels = DXT5	*/
+					internal_texture_format = SOIL_RGBA_S3TC_DXT5;
+				}
+			}
+		}
+		/*  bind an OpenGL texture ID	*/
+		glBindTexture( opengl_texture_type, tex_id );
+		check_for_GL_errors( "glBindTexture" );
+		/*  upload the main image	*/
+		if( DXT_mode == SOIL_CAPABILITY_PRESENT )
+		{
+			/*	user wants me to do the DXT conversion!	*/
+			int DDS_size;
+			unsigned char *DDS_data = NULL;
+			if( (channels & 1) == 1 )
+			{
+				/*	RGB, use DXT1	*/
+				DDS_data = convert_image_to_DXT1( img, width, height, channels, &DDS_size );
+			} else
+			{
+				/*	RGBA, use DXT5	*/
+				DDS_data = convert_image_to_DXT5( img, width, height, channels, &DDS_size );
+			}
+			if( DDS_data )
+			{
+				soilGlCompressedTexImage2D(
+					opengl_texture_target, 0,
+					internal_texture_format, width, height, 0,
+					DDS_size, DDS_data );
+				check_for_GL_errors( "glCompressedTexImage2D" );
+				SOIL_free_image_data( DDS_data );
+				/*	printf( "Internal DXT compressor\n" );	*/
+			} else
+			{
+				/*	my compression failed, try the OpenGL driver's version	*/
+				glTexImage2D(
+					opengl_texture_target, 0,
+					internal_texture_format, width, height, 0,
+					original_texture_format, GL_UNSIGNED_BYTE, img );
+				check_for_GL_errors( "glTexImage2D" );
+				/*	printf( "OpenGL DXT compressor\n" );	*/
+			}
+		} else
+		{
+			/*	user wants OpenGL to do all the work!	*/
+			glTexImage2D(
+				opengl_texture_target, 0,
+				internal_texture_format, width, height, 0,
+				original_texture_format, GL_UNSIGNED_BYTE, img );
+			check_for_GL_errors( "glTexImage2D" );
+			/*printf( "OpenGL DXT compressor\n" );	*/
+		}
+		/*	are any MIPmaps desired?	*/
+		if( flags & SOIL_FLAG_MIPMAPS )
+		{
+			int MIPlevel = 1;
+			int MIPwidth = (width+1) / 2;
+			int MIPheight = (height+1) / 2;
+			unsigned char *resampled = (unsigned char*)malloc( channels*MIPwidth*MIPheight );
+			while( ((1<<MIPlevel) <= width) || ((1<<MIPlevel) <= height) )
+			{
+				/*	do this MIPmap level	*/
+				mipmap_image(
+						img, width, height, channels,
+						resampled,
+						(1 << MIPlevel), (1 << MIPlevel) );
+				/*  upload the MIPmaps	*/
+				if( DXT_mode == SOIL_CAPABILITY_PRESENT )
+				{
+					/*	user wants me to do the DXT conversion!	*/
+					int DDS_size;
+					unsigned char *DDS_data = NULL;
+					if( (channels & 1) == 1 )
+					{
+						/*	RGB, use DXT1	*/
+						DDS_data = convert_image_to_DXT1(
+								resampled, MIPwidth, MIPheight, channels, &DDS_size );
+					} else
+					{
+						/*	RGBA, use DXT5	*/
+						DDS_data = convert_image_to_DXT5(
+								resampled, MIPwidth, MIPheight, channels, &DDS_size );
+					}
+					if( DDS_data )
+					{
+						soilGlCompressedTexImage2D(
+							opengl_texture_target, MIPlevel,
+							internal_texture_format, MIPwidth, MIPheight, 0,
+							DDS_size, DDS_data );
+						check_for_GL_errors( "glCompressedTexImage2D" );
+						SOIL_free_image_data( DDS_data );
+					} else
+					{
+						/*	my compression failed, try the OpenGL driver's version	*/
+						glTexImage2D(
+							opengl_texture_target, MIPlevel,
+							internal_texture_format, MIPwidth, MIPheight, 0,
+							original_texture_format, GL_UNSIGNED_BYTE, resampled );
+						check_for_GL_errors( "glTexImage2D" );
+					}
+				} else
+				{
+					/*	user wants OpenGL to do all the work!	*/
+					glTexImage2D(
+						opengl_texture_target, MIPlevel,
+						internal_texture_format, MIPwidth, MIPheight, 0,
+						original_texture_format, GL_UNSIGNED_BYTE, resampled );
+					check_for_GL_errors( "glTexImage2D" );
+				}
+				/*	prep for the next level	*/
+				++MIPlevel;
+				MIPwidth = (MIPwidth + 1) / 2;
+				MIPheight = (MIPheight + 1) / 2;
+			}
+			SOIL_free_image_data( resampled );
+			/*	instruct OpenGL to use the MIPmaps	*/
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR );
+			check_for_GL_errors( "GL_TEXTURE_MIN/MAG_FILTER" );
+		} else
+		{
+			/*	instruct OpenGL _NOT_ to use the MIPmaps	*/
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
+			check_for_GL_errors( "GL_TEXTURE_MIN/MAG_FILTER" );
+		}
+		/*	does the user want clamping, or wrapping?	*/
+		if( flags & SOIL_FLAG_TEXTURE_REPEATS )
+		{
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_S, GL_REPEAT );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_T, GL_REPEAT );
+			if( opengl_texture_type == SOIL_TEXTURE_CUBE_MAP )
+			{
+				/*	SOIL_TEXTURE_WRAP_R is invalid if cubemaps aren't supported	*/
+				glTexParameteri( opengl_texture_type, SOIL_TEXTURE_WRAP_R, GL_REPEAT );
+			}
+			check_for_GL_errors( "GL_TEXTURE_WRAP_*" );
+		} else
+		{
+			/*	unsigned int clamp_mode = SOIL_CLAMP_TO_EDGE;	*/
+			unsigned int clamp_mode = GL_CLAMP;
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_S, clamp_mode );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_T, clamp_mode );
+			if( opengl_texture_type == SOIL_TEXTURE_CUBE_MAP )
+			{
+				/*	SOIL_TEXTURE_WRAP_R is invalid if cubemaps aren't supported	*/
+				glTexParameteri( opengl_texture_type, SOIL_TEXTURE_WRAP_R, clamp_mode );
+			}
+			check_for_GL_errors( "GL_TEXTURE_WRAP_*" );
+		}
+		/*	done	*/
+		result_string_pointer = "Image loaded as an OpenGL texture";
+	} else
+	{
+		/*	failed	*/
+		result_string_pointer = "Failed to generate an OpenGL texture name; missing OpenGL context?";
+	}
+	SOIL_free_image_data( img );
+	return tex_id;
+}
+
+int
+	SOIL_save_screenshot
+	(
+		const char *filename,
+		int image_type,
+		int x, int y,
+		int width, int height
+	)
+{
+	unsigned char *pixel_data;
+	int i, j;
+	int save_result;
+
+	/*	error checks	*/
+	if( (width < 1) || (height < 1) )
+	{
+		result_string_pointer = "Invalid screenshot dimensions";
+		return 0;
+	}
+	if( (x < 0) || (y < 0) )
+	{
+		result_string_pointer = "Invalid screenshot location";
+		return 0;
+	}
+	if( filename == NULL )
+	{
+		result_string_pointer = "Invalid screenshot filename";
+		return 0;
+	}
+
+    /*  Get the data from OpenGL	*/
+    pixel_data = (unsigned char*)malloc( 3*width*height );
+    glReadPixels (x, y, width, height, GL_RGB, GL_UNSIGNED_BYTE, pixel_data);
+
+    /*	invert the image	*/
+    for( j = 0; j*2 < height; ++j )
+	{
+		int index1 = j * width * 3;
+		int index2 = (height - 1 - j) * width * 3;
+		for( i = width * 3; i > 0; --i )
+		{
+			unsigned char temp = pixel_data[index1];
+			pixel_data[index1] = pixel_data[index2];
+			pixel_data[index2] = temp;
+			++index1;
+			++index2;
+		}
+	}
+
+    /*	save the image	*/
+    save_result = SOIL_save_image( filename, image_type, width, height, 3, pixel_data);
+
+    /*  And free the memory	*/
+    SOIL_free_image_data( pixel_data );
+	return save_result;
+}
+
+unsigned char*
+	SOIL_load_image
+	(
+		const char *filename,
+		int *width, int *height, int *channels,
+		int force_channels
+	)
+{
+	unsigned char *result = stbi_load( filename,
+			width, height, channels, force_channels );
+	if( result == NULL )
+	{
+		result_string_pointer = stbi_failure_reason();
+	} else
+	{
+		result_string_pointer = "Image loaded";
+	}
+	return result;
+}
+
+unsigned char*
+	SOIL_load_image_from_memory
+	(
+		const unsigned char *const buffer,
+		int buffer_length,
+		int *width, int *height, int *channels,
+		int force_channels
+	)
+{
+	unsigned char *result = stbi_load_from_memory(
+				buffer, buffer_length,
+				width, height, channels,
+				force_channels );
+	if( result == NULL )
+	{
+		result_string_pointer = stbi_failure_reason();
+	} else
+	{
+		result_string_pointer = "Image loaded from memory";
+	}
+	return result;
+}
+
+int
+	SOIL_save_image
+	(
+		const char *filename,
+		int image_type,
+		int width, int height, int channels,
+		const unsigned char *const data
+	)
+{
+	int save_result;
+
+	/*	error check	*/
+	if( (width < 1) || (height < 1) ||
+		(channels < 1) || (channels > 4) ||
+		(data == NULL) ||
+		(filename == NULL) )
+	{
+		return 0;
+	}
+	if( image_type == SOIL_SAVE_TYPE_BMP )
+	{
+		save_result = stbi_write_bmp( filename,
+				width, height, channels, (void*)data );
+	} else
+	if( image_type == SOIL_SAVE_TYPE_TGA )
+	{
+		save_result = stbi_write_tga( filename,
+				width, height, channels, (void*)data );
+	} else
+	if( image_type == SOIL_SAVE_TYPE_DDS )
+	{
+		save_result = save_image_as_DDS( filename,
+				width, height, channels, (const unsigned char *const)data );
+	} else
+	{
+		save_result = 0;
+	}
+	if( save_result == 0 )
+	{
+		result_string_pointer = "Saving the image failed";
+	} else
+	{
+		result_string_pointer = "Image saved";
+	}
+	return save_result;
+}
+
+void
+	SOIL_free_image_data
+	(
+		unsigned char *img_data
+	)
+{
+	free( (void*)img_data );
+}
+
+const char*
+	SOIL_last_result
+	(
+		void
+	)
+{
+	return result_string_pointer;
+}
+
+unsigned int SOIL_direct_load_DDS_from_memory(
+		const unsigned char *const buffer,
+		int buffer_length,
+		unsigned int reuse_texture_ID,
+		int flags,
+		int loading_as_cubemap )
+{
+	/*	variables	*/
+	DDS_header header;
+	unsigned int buffer_index = 0;
+	unsigned int tex_ID = 0;
+	/*	file reading variables	*/
+	unsigned int S3TC_type = 0;
+	unsigned char *DDS_data;
+	unsigned int DDS_main_size;
+	unsigned int DDS_full_size;
+	unsigned int width, height;
+	int mipmaps, cubemap, uncompressed, block_size = 16;
+	unsigned int flag;
+	unsigned int cf_target, ogl_target_start, ogl_target_end;
+	unsigned int opengl_texture_type;
+	int i;
+	/*	1st off, were we even given a buffer?	*/
+	if( NULL == buffer )
+	{
+		/*	we can't do it!	*/
+		result_string_pointer = "NULL buffer";
+		return 0;
+	}
+	if( buffer_length < (int)sizeof( DDS_header ) )
+	{
+		/*	we can't do it!	*/
+		result_string_pointer = "DDS file was too small to contain the DDS header";
+		return 0;
+	}
+	/*	try reading in the header	*/
+	memcpy ( (void*)(&header), (const void *)buffer, sizeof( DDS_header ) );
+	buffer_index = sizeof( DDS_header );
+	/*	guilty until proven innocent	*/
+	result_string_pointer = "Failed to read a known DDS header";
+	/*	validate the header (warning, "goto"'s ahead, shield your eyes!!)	*/
+	flag = ('D'<<0)|('D'<<8)|('S'<<16)|(' '<<24);
+	if( header.dwMagic != flag ) {goto quick_exit;}
+	if( header.dwSize != 124 ) {goto quick_exit;}
+	/*	I need all of these	*/
+	flag = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH | DDSD_PIXELFORMAT;
+	if( (header.dwFlags & flag) != flag ) {goto quick_exit;}
+	/*	According to the MSDN spec, the dwFlags should contain
+		DDSD_LINEARSIZE if it's compressed, or DDSD_PITCH if
+		uncompressed.  Some DDS writers do not conform to the
+		spec, so I need to make my reader more tolerant	*/
+	/*	I need one of these	*/
+	flag = DDPF_FOURCC | DDPF_RGB;
+	if( (header.sPixelFormat.dwFlags & flag) == 0 ) {goto quick_exit;}
+	if( header.sPixelFormat.dwSize != 32 ) {goto quick_exit;}
+	if( (header.sCaps.dwCaps1 & DDSCAPS_TEXTURE) == 0 ) {goto quick_exit;}
+	/*	make sure it is a type we can upload	*/
+	if( (header.sPixelFormat.dwFlags & DDPF_FOURCC) &&
+		!(
+		(header.sPixelFormat.dwFourCC == (('D'<<0)|('X'<<8)|('T'<<16)|('1'<<24))) ||
+		(header.sPixelFormat.dwFourCC == (('D'<<0)|('X'<<8)|('T'<<16)|('3'<<24))) ||
+		(header.sPixelFormat.dwFourCC == (('D'<<0)|('X'<<8)|('T'<<16)|('5'<<24)))
+		) )
+	{
+		goto quick_exit;
+	}
+	/*	OK, validated the header, let's load the image data	*/
+	result_string_pointer = "DDS header loaded and validated";
+	width = header.dwWidth;
+	height = header.dwHeight;
+	uncompressed = 1 - (header.sPixelFormat.dwFlags & DDPF_FOURCC) / DDPF_FOURCC;
+	cubemap = (header.sCaps.dwCaps2 & DDSCAPS2_CUBEMAP) / DDSCAPS2_CUBEMAP;
+	if( uncompressed )
+	{
+		S3TC_type = GL_RGB;
+		block_size = 3;
+		if( header.sPixelFormat.dwFlags & DDPF_ALPHAPIXELS )
+		{
+			S3TC_type = GL_RGBA;
+			block_size = 4;
+		}
+		DDS_main_size = width * height * block_size;
+	} else
+	{
+		/*	can we even upload DXT compressed images directly to OpenGL?	*/
+		if( query_DXT_capability() != SOIL_CAPABILITY_PRESENT )
+		{
+			/*	we can't do it!	*/
+			result_string_pointer = "Direct upload of S3TC images not supported by the OpenGL driver";
+			return 0;
+		}
+		/*	well, we know it is DXT1/3/5, because we checked above	*/
+		switch( (header.sPixelFormat.dwFourCC >> 24) - '0' )
+		{
+		case 1:
+			S3TC_type = SOIL_RGBA_S3TC_DXT1;
+			block_size = 8;
+			break;
+		case 3:
+			S3TC_type = SOIL_RGBA_S3TC_DXT3;
+			block_size = 16;
+			break;
+		case 5:
+			S3TC_type = SOIL_RGBA_S3TC_DXT5;
+			block_size = 16;
+			break;
+		}
+		DDS_main_size = ((width+3)>>2)*((height+3)>>2)*block_size;
+	}
+	if( cubemap )
+	{
+		/* does the user want a cubemap?	*/
+		if( !loading_as_cubemap )
+		{
+			/*	we can't do it!	*/
+			result_string_pointer = "DDS image was a cubemap";
+			return 0;
+		}
+		/*	can we even handle cubemaps with the OpenGL driver?	*/
+		if( query_cubemap_capability() != SOIL_CAPABILITY_PRESENT )
+		{
+			/*	we can't do it!	*/
+			result_string_pointer = "Direct upload of cubemap images not supported by the OpenGL driver";
+			return 0;
+		}
+		ogl_target_start = SOIL_TEXTURE_CUBE_MAP_POSITIVE_X;
+		ogl_target_end =   SOIL_TEXTURE_CUBE_MAP_NEGATIVE_Z;
+		opengl_texture_type = SOIL_TEXTURE_CUBE_MAP;
+	} else
+	{
+		/* does the user want a non-cubemap?	*/
+		if( loading_as_cubemap )
+		{
+			/*	we can't do it!	*/
+			result_string_pointer = "DDS image was not a cubemap";
+			return 0;
+		}
+		ogl_target_start = GL_TEXTURE_2D;
+		ogl_target_end =   GL_TEXTURE_2D;
+		opengl_texture_type = GL_TEXTURE_2D;
+	}
+	if( (header.sCaps.dwCaps1 & DDSCAPS_MIPMAP) && (header.dwMipMapCount > 1) )
+	{
+		int shift_offset;
+		mipmaps = header.dwMipMapCount - 1;
+		DDS_full_size = DDS_main_size;
+		if( uncompressed )
+		{
+			/*	uncompressed DDS, simple MIPmap size calculation	*/
+			shift_offset = 0;
+		} else
+		{
+			/*	compressed DDS, MIPmap size calculation is block based	*/
+			shift_offset = 2;
+		}
+		for( i = 1; i <= mipmaps; ++ i )
+		{
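+			int w, h;	/*	round up to whole blocks, matching the upload loop below	*/
+			w = ((width >> i) + (1 << shift_offset) - 1) >> shift_offset;
+			h = ((height >> i) + (1 << shift_offset) - 1) >> shift_offset;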
+			if( w < 1 )
+			{
+				w = 1;
+			}
+			if( h < 1 )
+			{
+				h = 1;
+			}
+			DDS_full_size += w*h*block_size;
+		}
+	} else
+	{
+		mipmaps = 0;
+		DDS_full_size = DDS_main_size;
+	}
+	DDS_data = (unsigned char*)malloc( DDS_full_size );
+	/*	got the image data RAM, create or use an existing OpenGL texture handle	*/
+	tex_ID = reuse_texture_ID;
+	if( tex_ID == 0 )
+	{
+		glGenTextures( 1, &tex_ID );
+	}
+	/*  bind an OpenGL texture ID	*/
+	glBindTexture( opengl_texture_type, tex_ID );
+	/*	do this for each face of the cubemap!	*/
+	for( cf_target = ogl_target_start; cf_target <= ogl_target_end; ++cf_target )
+	{
+		if( buffer_index + DDS_full_size <= buffer_length )
+		{
+			unsigned int byte_offset = DDS_main_size;
+			memcpy( (void*)DDS_data, (const void*)(&buffer[buffer_index]), DDS_full_size );
+			buffer_index += DDS_full_size;
+			/*	upload the main chunk	*/
+			if( uncompressed )
+			{
+				/*	and remember, uncompressed DDS stores BGR(A),
+					so swap to RGB(A) for ALL MIPmap levels	*/
+				for( i = 0; i < DDS_full_size; i += block_size )
+				{
+					unsigned char temp = DDS_data[i];
+					DDS_data[i] = DDS_data[i+2];
+					DDS_data[i+2] = temp;
+				}
+				glTexImage2D(
+					cf_target, 0,
+					S3TC_type, width, height, 0,
+					S3TC_type, GL_UNSIGNED_BYTE, DDS_data );
+			} else
+			{
+				soilGlCompressedTexImage2D(
+					cf_target, 0,
+					S3TC_type, width, height, 0,
+					DDS_main_size, DDS_data );
+			}
+			/*	upload the mipmaps, if we have them	*/
+			for( i = 1; i <= mipmaps; ++i )
+			{
+				int w, h, mip_size;
+				w = width >> i;
+				h = height >> i;
+				if( w < 1 )
+				{
+					w = 1;
+				}
+				if( h < 1 )
+				{
+					h = 1;
+				}
+				/*	upload this mipmap	*/
+				if( uncompressed )
+				{
+					mip_size = w*h*block_size;
+					glTexImage2D(
+						cf_target, i,
+						S3TC_type, w, h, 0,
+						S3TC_type, GL_UNSIGNED_BYTE, &DDS_data[byte_offset] );
+				} else
+				{
+					mip_size = ((w+3)/4)*((h+3)/4)*block_size;
+					soilGlCompressedTexImage2D(
+						cf_target, i,
+						S3TC_type, w, h, 0,
+						mip_size, &DDS_data[byte_offset] );
+				}
+				/*	and move to the next mipmap	*/
+				byte_offset += mip_size;
+			}
+			/*	it worked!	*/
+			result_string_pointer = "DDS file loaded";
+		} else
+		{
+			glDeleteTextures( 1, & tex_ID );
+			tex_ID = 0;
+			cf_target = ogl_target_end + 1;
+			result_string_pointer = "DDS file was too small for expected image data";
+		}
+	}/* end reading each face */
+	SOIL_free_image_data( DDS_data );
+	if( tex_ID )
+	{
+		/*	did I have MIPmaps?	*/
+		if( mipmaps > 0 )
+		{
+			/*	instruct OpenGL to use the MIPmaps	*/
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR );
+		} else
+		{
+			/*	instruct OpenGL _NOT_ to use the MIPmaps	*/
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MAG_FILTER, GL_LINEAR );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_MIN_FILTER, GL_LINEAR );
+		}
+		/*	does the user want clamping, or wrapping?	*/
+		if( flags & SOIL_FLAG_TEXTURE_REPEATS )
+		{
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_S, GL_REPEAT );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_T, GL_REPEAT );
+			glTexParameteri( opengl_texture_type, SOIL_TEXTURE_WRAP_R, GL_REPEAT );
+		} else
+		{
+			/*	unsigned int clamp_mode = SOIL_CLAMP_TO_EDGE;	*/
+			unsigned int clamp_mode = GL_CLAMP;
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_S, clamp_mode );
+			glTexParameteri( opengl_texture_type, GL_TEXTURE_WRAP_T, clamp_mode );
+			glTexParameteri( opengl_texture_type, SOIL_TEXTURE_WRAP_R, clamp_mode );
+		}
+	}
+
+quick_exit:
+	/*	report success or failure	*/
+	return tex_ID;
+}
+
+unsigned int SOIL_direct_load_DDS(
+		const char *filename,
+		unsigned int reuse_texture_ID,
+		int flags,
+		int loading_as_cubemap )
+{
+	FILE *f;
+	unsigned char *buffer;
+	size_t buffer_length, bytes_read;
+	unsigned int tex_ID = 0;
+	/*	error checks	*/
+	if( NULL == filename )
+	{
+		result_string_pointer = "NULL filename";
+		return 0;
+	}
+	f = fopen( filename, "rb" );
+	if( NULL == f )
+	{
+		/*	the file doesn't seem to exist (or be open-able)	*/
+		result_string_pointer = "Can not find DDS file";
+		return 0;
+	}
+	fseek( f, 0, SEEK_END );
+	buffer_length = ftell( f );
+	fseek( f, 0, SEEK_SET );
+	buffer = (unsigned char *) malloc( buffer_length );
+	if( NULL == buffer )
+	{
+		result_string_pointer = "malloc failed";
+		fclose( f );
+		return 0;
+	}
+	bytes_read = fread( (void*)buffer, 1, buffer_length, f );
+	fclose( f );
+	if( bytes_read < buffer_length )
+	{
+		/*	huh?	*/
+		buffer_length = bytes_read;
+	}
+	/*	now try to do the loading	*/
+	tex_ID = SOIL_direct_load_DDS_from_memory(
+		(const unsigned char *const)buffer, buffer_length,
+		reuse_texture_ID, flags, loading_as_cubemap );
+	SOIL_free_image_data( buffer );
+	return tex_ID;
+}
+
+int query_NPOT_capability( void )
+{
+	/*	check for the capability	*/
+	if( has_NPOT_capability == SOIL_CAPABILITY_UNKNOWN )
+	{
+		/*	we haven't yet checked for the capability, do so	*/
+		if(
+			(NULL == strstr( (char const*)glGetString( GL_EXTENSIONS ),
+				"GL_ARB_texture_non_power_of_two" ) )
+			)
+		{
+			/*	not there, flag the failure	*/
+			has_NPOT_capability = SOIL_CAPABILITY_NONE;
+		} else
+		{
+			/*	it's there!	*/
+			has_NPOT_capability = SOIL_CAPABILITY_PRESENT;
+		}
+	}
+	/*	let the user know if we can do non-power-of-two textures or not	*/
+	return has_NPOT_capability;
+}
+
+int query_tex_rectangle_capability( void )
+{
+	/*	check for the capability	*/
+	if( has_tex_rectangle_capability == SOIL_CAPABILITY_UNKNOWN )
+	{
+		/*	we haven't yet checked for the capability, do so	*/
+		if(
+			(NULL == strstr( (char const*)glGetString( GL_EXTENSIONS ),
+				"GL_ARB_texture_rectangle" ) )
+		&&
+			(NULL == strstr( (char const*)glGetString( GL_EXTENSIONS ),
+				"GL_EXT_texture_rectangle" ) )
+		&&
+			(NULL == strstr( (char const*)glGetString( GL_EXTENSIONS ),
+				"GL_NV_texture_rectangle" ) )
+			)
+		{
+			/*	not there, flag the failure	*/
+			has_tex_rectangle_capability = SOIL_CAPABILITY_NONE;
+		} else
+		{
+			/*	it's there!	*/
+			has_tex_rectangle_capability = SOIL_CAPABILITY_PRESENT;
+		}
+	}
+	/*	let the user know if we can do texture rectangles or not	*/
+	return has_tex_rectangle_capability;
+}
+
+int query_cubemap_capability( void )
+{
+	/*	check for the capability	*/
+	if( has_cubemap_capability == SOIL_CAPABILITY_UNKNOWN )
+	{
+		/*	we haven't yet checked for the capability, do so	*/
+		if(
+			(NULL == strstr( (char const*)glGetString( GL_EXTENSIONS ),
+				"GL_ARB_texture_cube_map" ) )
+		&&
+			(NULL == strstr( (char const*)glGetString( GL_EXTENSIONS ),
+				"GL_EXT_texture_cube_map" ) )
+			)
+		{
+			/*	not there, flag the failure	*/
+			has_cubemap_capability = SOIL_CAPABILITY_NONE;
+		} else
+		{
+			/*	it's there!	*/
+			has_cubemap_capability = SOIL_CAPABILITY_PRESENT;
+		}
+	}
+	/*	let the user know if we can do cubemaps or not	*/
+	return has_cubemap_capability;
+}
+
+int query_DXT_capability( void )
+{
+	/*	check for the capability	*/
+	if( has_DXT_capability == SOIL_CAPABILITY_UNKNOWN )
+	{
+		/*	we haven't yet checked for the capability, do so	*/
+		if( NULL == strstr(
+				(char const*)glGetString( GL_EXTENSIONS ),
+				"GL_EXT_texture_compression_s3tc" ) )
+		{
+			/*	not there, flag the failure	*/
+			has_DXT_capability = SOIL_CAPABILITY_NONE;
+		} else
+		{
+			/*	and find the address of the extension function	*/
+			P_SOIL_GLCOMPRESSEDTEXIMAGE2DPROC ext_addr = NULL;
+			#ifdef WIN32
+				ext_addr = (P_SOIL_GLCOMPRESSEDTEXIMAGE2DPROC)
+						wglGetProcAddress
+						(
+							"glCompressedTexImage2DARB"
+						);
+			#elif defined(__APPLE__) || defined(__APPLE_CC__)
+				/*	I can't test this Apple stuff!	*/
+				CFBundleRef bundle;
+				CFURLRef bundleURL =
+					CFURLCreateWithFileSystemPath(
+						kCFAllocatorDefault,
+						CFSTR("/System/Library/Frameworks/OpenGL.framework"),
+						kCFURLPOSIXPathStyle,
+						true );
+				CFStringRef extensionName =
+					CFStringCreateWithCString(
+						kCFAllocatorDefault,
+						"glCompressedTexImage2DARB",
+						kCFStringEncodingASCII );
+				bundle = CFBundleCreate( kCFAllocatorDefault, bundleURL );
+				assert( bundle != NULL );
+				ext_addr = (P_SOIL_GLCOMPRESSEDTEXIMAGE2DPROC)
+						CFBundleGetFunctionPointerForName
+						(
+							bundle, extensionName
+						);
+				CFRelease( bundleURL );
+				CFRelease( extensionName );
+				CFRelease( bundle );
+			#else
+				ext_addr = (P_SOIL_GLCOMPRESSEDTEXIMAGE2DPROC)
+						glXGetProcAddressARB
+						(
+							(const GLubyte *)"glCompressedTexImage2DARB"
+						);
+			#endif
+			/*	Flag it so no checks needed later	*/
+			if( NULL == ext_addr )
+			{
+				/*	hmm, not good!!  This should not happen, but does on my
+					laptop's VIA chipset.  The GL_EXT_texture_compression_s3tc
+					spec requires that ARB_texture_compression be present too.
+					this means I can upload and have the OpenGL driver do the
+					conversion, but I can't use my own routines or load DDS files
+					from disk and upload them directly [8^(	*/
+				has_DXT_capability = SOIL_CAPABILITY_NONE;
+			} else
+			{
+				/*	all's well!	*/
+				soilGlCompressedTexImage2D = ext_addr;
+				has_DXT_capability = SOIL_CAPABILITY_PRESENT;
+			}
+		}
+	}
+	/*	let the user know if we can do DXT or not	*/
+	return has_DXT_capability;
+}
diff --git a/external/include/SOIL/SOIL.h b/external/include/SOIL/SOIL.h
new file mode 100644
index 0000000..43f634f
--- /dev/null
+++ b/external/include/SOIL/SOIL.h
@@ -0,0 +1,433 @@
+/**
+	@mainpage SOIL
+
+	Jonathan Dummer
+	2007-07-26-10.36
+
+	Simple OpenGL Image Library
+
+	A tiny C library for uploading images as
+	textures into OpenGL.  Also saving and
+	loading of images is supported.
+
+	I'm using Sean's Tool Box image loader as a base:
+	http://www.nothings.org/
+
+	I'm upgrading it to load TGA and DDS files, and a direct
+	path for loading DDS files straight into OpenGL textures,
+	when applicable.
+
+	Image Formats:
+	- BMP		load & save
+	- TGA		load & save
+	- DDS		load & save
+	- PNG		load
+	- JPG		load
+
+	OpenGL Texture Features:
+	- resample to power-of-two sizes
+	- MIPmap generation
+	- compressed texture S3TC formats (if supported)
+	- can pre-multiply alpha for you, for better compositing
+	- can flip image about the y-axis (except pre-compressed DDS files)
+
+	Thanks to:
+	* Sean Barrett - for the awesome stb_image
+	* Dan Venkitachalam - for finding some non-compliant DDS files, and patching some explicit casts
+	* everybody at gamedev.net
+**/
+
+#ifndef HEADER_SIMPLE_OPENGL_IMAGE_LIBRARY
+#define HEADER_SIMPLE_OPENGL_IMAGE_LIBRARY
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+	The format of images that may be loaded (force_channels).
+	SOIL_LOAD_AUTO leaves the image in whatever format it was found.
+	SOIL_LOAD_L forces the image to load as Luminous (greyscale)
+	SOIL_LOAD_LA forces the image to load as Luminous with Alpha
+	SOIL_LOAD_RGB forces the image to load as Red Green Blue
+	SOIL_LOAD_RGBA forces the image to load as Red Green Blue Alpha
+**/
+enum
+{
+	SOIL_LOAD_AUTO = 0,
+	SOIL_LOAD_L = 1,
+	SOIL_LOAD_LA = 2,
+	SOIL_LOAD_RGB = 3,
+	SOIL_LOAD_RGBA = 4
+};
+
+/**
+	Passed in as reuse_texture_ID, will cause SOIL to
+	register a new texture ID using glGenTextures().
+	If the value passed into reuse_texture_ID > 0 then
+	SOIL will just re-use that texture ID (great for
+	reloading image assets in-game!)
+**/
+enum
+{
+	SOIL_CREATE_NEW_ID = 0
+};
+
+/**
+	flags you can pass into SOIL_load_OGL_texture()
+	and SOIL_create_OGL_texture().
+	(note that if SOIL_FLAG_DDS_LOAD_DIRECT is used
+	the rest of the flags with the exception of
+	SOIL_FLAG_TEXTURE_REPEATS will be ignored while
+	loading already-compressed DDS files.)
+
+	SOIL_FLAG_POWER_OF_TWO: force the image to be POT
+	SOIL_FLAG_MIPMAPS: generate mipmaps for the texture
+	SOIL_FLAG_TEXTURE_REPEATS: otherwise will clamp
+	SOIL_FLAG_MULTIPLY_ALPHA: for using (GL_ONE,GL_ONE_MINUS_SRC_ALPHA) blending
+	SOIL_FLAG_INVERT_Y: flip the image vertically
+	SOIL_FLAG_COMPRESS_TO_DXT: if the card can display them, will convert RGB to DXT1, RGBA to DXT5
+	SOIL_FLAG_DDS_LOAD_DIRECT: will load DDS files directly without _ANY_ additional processing
+	SOIL_FLAG_NTSC_SAFE_RGB: clamps RGB components to the range [16,235]
+	SOIL_FLAG_CoCg_Y: Google YCoCg; RGB=>CoYCg, RGBA=>CoCgAY
+	SOIL_FLAG_TEXTURE_RECTANGLE: uses ARB_texture_rectangle ; pixel indexed & no repeat or MIPmaps or cubemaps
+**/
+enum
+{
+	SOIL_FLAG_POWER_OF_TWO = 1,
+	SOIL_FLAG_MIPMAPS = 2,
+	SOIL_FLAG_TEXTURE_REPEATS = 4,
+	SOIL_FLAG_MULTIPLY_ALPHA = 8,
+	SOIL_FLAG_INVERT_Y = 16,
+	SOIL_FLAG_COMPRESS_TO_DXT = 32,
+	SOIL_FLAG_DDS_LOAD_DIRECT = 64,
+	SOIL_FLAG_NTSC_SAFE_RGB = 128,
+	SOIL_FLAG_CoCg_Y = 256,
+	SOIL_FLAG_TEXTURE_RECTANGLE = 512
+};
+
+/**
+	The types of images that may be saved.
+	(TGA supports uncompressed RGB / RGBA)
+	(BMP supports uncompressed RGB)
+	(DDS supports DXT1 and DXT5)
+**/
+enum
+{
+	SOIL_SAVE_TYPE_TGA = 0,
+	SOIL_SAVE_TYPE_BMP = 1,
+	SOIL_SAVE_TYPE_DDS = 2
+};
+
+/**
+	Defines the order of faces in a DDS cubemap.
+	I recommend that you use the same order in single
+	image cubemap files, so they will be interchangeable
+	with DDS cubemaps when using SOIL.
+**/
+#define SOIL_DDS_CUBEMAP_FACE_ORDER "EWUDNS"
+
+/**
+	The types of internal fake HDR representations
+
+	SOIL_HDR_RGBE:		RGB * pow( 2.0, A - 128.0 )
+	SOIL_HDR_RGBdivA:	RGB / A
+	SOIL_HDR_RGBdivA2:	RGB / (A*A)
+**/
+enum
+{
+	SOIL_HDR_RGBE = 0,
+	SOIL_HDR_RGBdivA = 1,
+	SOIL_HDR_RGBdivA2 = 2
+};
+
+/**
+	Loads an image from disk into an OpenGL texture.
+	\param filename the name of the file to upload as a texture
+	\param force_channels 0-image format, 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT | SOIL_FLAG_DDS_LOAD_DIRECT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_load_OGL_texture
+	(
+		const char *filename,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Loads 6 images from disk into an OpenGL cubemap texture.
+	\param x_pos_file the name of the file to upload as the +x cube face
+	\param x_neg_file the name of the file to upload as the -x cube face
+	\param y_pos_file the name of the file to upload as the +y cube face
+	\param y_neg_file the name of the file to upload as the -y cube face
+	\param z_pos_file the name of the file to upload as the +z cube face
+	\param z_neg_file the name of the file to upload as the -z cube face
+	\param force_channels 0-image format, 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT | SOIL_FLAG_DDS_LOAD_DIRECT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_load_OGL_cubemap
+	(
+		const char *x_pos_file,
+		const char *x_neg_file,
+		const char *y_pos_file,
+		const char *y_neg_file,
+		const char *z_pos_file,
+		const char *z_neg_file,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Loads 1 image from disk and splits it into an OpenGL cubemap texture.
+	\param filename the name of the file to upload as a texture
+	\param face_order the order of the faces in the file, any combination of NSWEUD, for North, South, Up, etc.
+	\param force_channels 0-image format, 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT | SOIL_FLAG_DDS_LOAD_DIRECT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_load_OGL_single_cubemap
+	(
+		const char *filename,
+		const char face_order[6],
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Loads an HDR image from disk into an OpenGL texture.
+	\param filename the name of the file to upload as a texture
+	\param fake_HDR_format SOIL_HDR_RGBE, SOIL_HDR_RGBdivA, SOIL_HDR_RGBdivA2
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_load_OGL_HDR_texture
+	(
+		const char *filename,
+		int fake_HDR_format,
+		int rescale_to_max,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Loads an image from RAM into an OpenGL texture.
+	\param buffer the image data in RAM just as if it were still in a file
+	\param buffer_length the size of the buffer in bytes
+	\param force_channels 0-image format, 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT | SOIL_FLAG_DDS_LOAD_DIRECT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_load_OGL_texture_from_memory
+	(
+		const unsigned char *const buffer,
+		int buffer_length,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Loads 6 images from memory into an OpenGL cubemap texture.
+	\param x_pos_buffer the image data in RAM to upload as the +x cube face
+	\param x_pos_buffer_length the size of the above buffer
+	\param x_neg_buffer the image data in RAM to upload as the -x cube face
+	\param x_neg_buffer_length the size of the above buffer
+	\param y_pos_buffer the image data in RAM to upload as the +y cube face
+	\param y_pos_buffer_length the size of the above buffer
+	\param y_neg_buffer the image data in RAM to upload as the -y cube face
+	\param y_neg_buffer_length the size of the above buffer
+	\param z_pos_buffer the image data in RAM to upload as the +z cube face
+	\param z_pos_buffer_length the size of the above buffer
+	\param z_neg_buffer the image data in RAM to upload as the -z cube face
+	\param z_neg_buffer_length the size of the above buffer
+	\param force_channels 0-image format, 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT | SOIL_FLAG_DDS_LOAD_DIRECT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_load_OGL_cubemap_from_memory
+	(
+		const unsigned char *const x_pos_buffer,
+		int x_pos_buffer_length,
+		const unsigned char *const x_neg_buffer,
+		int x_neg_buffer_length,
+		const unsigned char *const y_pos_buffer,
+		int y_pos_buffer_length,
+		const unsigned char *const y_neg_buffer,
+		int y_neg_buffer_length,
+		const unsigned char *const z_pos_buffer,
+		int z_pos_buffer_length,
+		const unsigned char *const z_neg_buffer,
+		int z_neg_buffer_length,
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Loads 1 image from RAM and splits it into an OpenGL cubemap texture.
+	\param buffer the image data in RAM just as if it were still in a file
+	\param buffer_length the size of the buffer in bytes
+	\param face_order the order of the faces in the file, any combination of NSWEUD, for North, South, Up, etc.
+	\param force_channels 0-image format, 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT | SOIL_FLAG_DDS_LOAD_DIRECT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_load_OGL_single_cubemap_from_memory
+	(
+		const unsigned char *const buffer,
+		int buffer_length,
+		const char face_order[6],
+		int force_channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Creates a 2D OpenGL texture from raw image data.  Note that the raw data is
+	_NOT_ freed after the upload (so the user can load various versions).
+	\param data the raw data to be uploaded as an OpenGL texture
+	\param width the width of the image in pixels
+	\param height the height of the image in pixels
+	\param channels the number of channels: 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_create_OGL_texture
+	(
+		const unsigned char *const data,
+		int width, int height, int channels,
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Creates an OpenGL cubemap texture by splitting up 1 image into 6 parts.
+	\param data the raw data to be uploaded as an OpenGL texture
+	\param width the width of the image in pixels
+	\param height the height of the image in pixels
+	\param channels the number of channels: 1-luminous, 2-luminous/alpha, 3-RGB, 4-RGBA
+	\param face_order the order of the faces in the file, any combination of NSWEUD, for North, South, Up, etc.
+	\param reuse_texture_ID 0-generate a new texture ID, otherwise reuse the texture ID (overwriting the old texture)
+	\param flags can be any of SOIL_FLAG_POWER_OF_TWO | SOIL_FLAG_MIPMAPS | SOIL_FLAG_TEXTURE_REPEATS | SOIL_FLAG_MULTIPLY_ALPHA | SOIL_FLAG_INVERT_Y | SOIL_FLAG_COMPRESS_TO_DXT | SOIL_FLAG_DDS_LOAD_DIRECT
+	\return 0-failed, otherwise returns the OpenGL texture handle
+**/
+unsigned int
+	SOIL_create_OGL_single_cubemap
+	(
+		const unsigned char *const data,
+		int width, int height, int channels,
+		const char face_order[6],
+		unsigned int reuse_texture_ID,
+		unsigned int flags
+	);
+
+/**
+	Captures the OpenGL window (RGB) and saves it to disk
+	\return 0 if it failed, otherwise returns 1
+**/
+int
+	SOIL_save_screenshot
+	(
+		const char *filename,
+		int image_type,
+		int x, int y,
+		int width, int height
+	);
+
+/**
+	Loads an image from disk into an array of unsigned chars.
+	Note that *channels returns the original channel count of the
+	image.  If force_channels was other than SOIL_LOAD_AUTO,
+	the resulting image has force_channels, but *channels may be
+	different (if the original image had a different channel
+	count).
+	\return NULL if failed, otherwise a pointer to the pixel data (release it with SOIL_free_image_data())
+**/
+unsigned char*
+	SOIL_load_image
+	(
+		const char *filename,
+		int *width, int *height, int *channels,
+		int force_channels
+	);
+
+/**
+	Loads an image from memory into an array of unsigned chars.
+	Note that *channels returns the original channel count of the
+	image.  If force_channels was other than SOIL_LOAD_AUTO,
+	the resulting image has force_channels, but *channels may be
+	different (if the original image had a different channel
+	count).
+	\return NULL if failed, otherwise a pointer to the pixel data (release it with SOIL_free_image_data())
+**/
+unsigned char*
+	SOIL_load_image_from_memory
+	(
+		const unsigned char *const buffer,
+		int buffer_length,
+		int *width, int *height, int *channels,
+		int force_channels
+	);
+
+/**
+	Saves an image from an array of unsigned chars (RGBA) to disk
+	\return 0 if failed, otherwise returns 1
+**/
+int
+	SOIL_save_image
+	(
+		const char *filename,
+		int image_type,
+		int width, int height, int channels,
+		const unsigned char *const data
+	);
+
+/**
+	Frees the image data (note, this is just C's "free()"...this function is
+	present mostly so C++ programmers don't forget to use "free()" and call
+	"delete []" instead [8^)
+**/
+void
+	SOIL_free_image_data
+	(
+		unsigned char *img_data
+	);
+
+/**
+	This function returns a pointer to a string describing the last thing
+	that happened inside SOIL.  It can be used to determine why an image
+	failed to load.
+**/
+const char*
+	SOIL_last_result
+	(
+		void
+	);
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* HEADER_SIMPLE_OPENGL_IMAGE_LIBRARY	*/
diff --git a/external/include/SOIL/image_DXT.c b/external/include/SOIL/image_DXT.c
new file mode 100644
index 0000000..4206a1b
--- /dev/null
+++ b/external/include/SOIL/image_DXT.c
@@ -0,0 +1,632 @@
+/*
+	Jonathan Dummer
+	2007-07-31-10.32
+
+	simple DXT compression / decompression code
+
+	public domain
+*/
+
+#include "image_DXT.h"
+#include <math.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+
+/*	set this =1 if you want to use the covariance matrix method...
+	which is better than my method of using standard deviations
+	overall, except on the infinitesimal chance that the power
+	method fails for finding the largest eigenvector	*/
+#define USE_COV_MAT	1
+
+/********* Function Prototypes *********/
+/*
+	Takes a 4x4 block of pixels and compresses it into 8 bytes
+	in DXT1 format (color only, no alpha).  Speed is valued
+	over prettiness, at least for now.
+*/
+void compress_DDS_color_block(
+				int channels,
+				const unsigned char *const uncompressed,
+				unsigned char compressed[8] );
+/*
+	Takes a 4x4 block of pixels and compresses the alpha
+	component into 8 bytes for use in DXT5 DDS files.
+	Speed is valued over prettiness, at least for now.
+*/
+void compress_DDS_alpha_block(
+				const unsigned char *const uncompressed,
+				unsigned char compressed[8] );
+
+/********* Actual Exposed Functions *********/
+int
+	save_image_as_DDS
+	(
+		const char *filename,
+		int width, int height, int channels,
+		const unsigned char *const data
+	)
+{
+	/*	variables	*/
+	FILE *fout;
+	unsigned char *DDS_data;
+	DDS_header header;
+	int DDS_size;
+	/*	error check	*/
+	if( (NULL == filename) ||
+		(width < 1) || (height < 1) ||
+		(channels < 1) || (channels > 4) ||
+		(data == NULL ) )
+	{
+		return 0;
+	}
+	/*	Convert the image	*/
+	if( (channels & 1) == 1 )
+	{
+		/*	no alpha, just use DXT1	*/
+		DDS_data = convert_image_to_DXT1( data, width, height, channels, &DDS_size );
+	} else
+	{
+		/*	has alpha, so use DXT5	*/
+		DDS_data = convert_image_to_DXT5( data, width, height, channels, &DDS_size );
+	}
+	/*	save it	*/
+	memset( &header, 0, sizeof( DDS_header ) );
+	header.dwMagic = ('D' << 0) | ('D' << 8) | ('S' << 16) | (' ' << 24);
+	header.dwSize = 124;
+	header.dwFlags = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH | DDSD_PIXELFORMAT | DDSD_LINEARSIZE;
+	header.dwWidth = width;
+	header.dwHeight = height;
+	header.dwPitchOrLinearSize = DDS_size;
+	header.sPixelFormat.dwSize = 32;
+	header.sPixelFormat.dwFlags = DDPF_FOURCC;
+	if( (channels & 1) == 1 )
+	{
+		header.sPixelFormat.dwFourCC = ('D' << 0) | ('X' << 8) | ('T' << 16) | ('1' << 24);
+	} else
+	{
+		header.sPixelFormat.dwFourCC = ('D' << 0) | ('X' << 8) | ('T' << 16) | ('5' << 24);
+	}
+	header.sCaps.dwCaps1 = DDSCAPS_TEXTURE;
+	/*	write it out (bail out if the file cannot be opened)	*/
+	fout = fopen( filename, "wb");
+	if( NULL == fout ) { free( DDS_data ); return 0; }
+	fwrite( &header, sizeof( DDS_header ), 1, fout );
+	fwrite( DDS_data, 1, DDS_size, fout );
+	fclose( fout );
+	free( DDS_data );
+	return 1;
+}
+
+unsigned char* convert_image_to_DXT1(
+		const unsigned char *const uncompressed,
+		int width, int height, int channels,
+		int *out_size )
+{
+	unsigned char *compressed;
+	int i, j, x, y;
+	unsigned char ublock[16*3];
+	unsigned char cblock[8];
+	int index = 0, chan_step = 1;
+	int block_count = 0;
+	/*	error check	*/
+	*out_size = 0;
+	if( (width < 1) || (height < 1) ||
+		(NULL == uncompressed) ||
+		(channels < 1) || (channels > 4) )
+	{
+		return NULL;
+	}
+	/*	for channels == 1 or 2, I do not step forward for R,G,B values	*/
+	if( channels < 3 )
+	{
+		chan_step = 0;
+	}
+	/*	get the RAM for the compressed image
+		(8 bytes per 4x4 pixel block)	*/
+	*out_size = ((width+3) >> 2) * ((height+3) >> 2) * 8;
+	compressed = (unsigned char*)malloc( *out_size );
+	/*	go through each block	*/
+	for( j = 0; j < height; j += 4 )
+	{
+		for( i = 0; i < width; i += 4 )
+		{
+			/*	copy this block into a new one	*/
+			int idx = 0;
+			int mx = 4, my = 4;
+			if( j+4 >= height )
+			{
+				my = height - j;
+			}
+			if( i+4 >= width )
+			{
+				mx = width - i;
+			}
+			for( y = 0; y < my; ++y )
+			{
+				for( x = 0; x < mx; ++x )
+				{
+					ublock[idx++] = uncompressed[(j+y)*width*channels+(i+x)*channels];
+					ublock[idx++] = uncompressed[(j+y)*width*channels+(i+x)*channels+chan_step];
+					ublock[idx++] = uncompressed[(j+y)*width*channels+(i+x)*channels+chan_step+chan_step];
+				}
+				for( x = mx; x < 4; ++x )
+				{
+					ublock[idx++] = ublock[0];
+					ublock[idx++] = ublock[1];
+					ublock[idx++] = ublock[2];
+				}
+			}
+			for( y = my; y < 4; ++y )
+			{
+				for( x = 0; x < 4; ++x )
+				{
+					ublock[idx++] = ublock[0];
+					ublock[idx++] = ublock[1];
+					ublock[idx++] = ublock[2];
+				}
+			}
+			/*	compress the block	*/
+			++block_count;
+			compress_DDS_color_block( 3, ublock, cblock );
+			/*	copy the data from the block into the main block	*/
+			for( x = 0; x < 8; ++x )
+			{
+				compressed[index++] = cblock[x];
+			}
+		}
+	}
+	return compressed;
+}
+
+unsigned char* convert_image_to_DXT5(
+		const unsigned char *const uncompressed,
+		int width, int height, int channels,
+		int *out_size )
+{
+	unsigned char *compressed;
+	int i, j, x, y;
+	unsigned char ublock[16*4];
+	unsigned char cblock[8];
+	int index = 0, chan_step = 1;
+	int block_count = 0, has_alpha;
+	/*	error check	*/
+	*out_size = 0;
+	if( (width < 1) || (height < 1) ||
+		(NULL == uncompressed) ||
+		(channels < 1) || ( channels > 4) )
+	{
+		return NULL;
+	}
+	/*	for channels == 1 or 2, I do not step forward for R,G,B values	*/
+	if( channels < 3 )
+	{
+		chan_step = 0;
+	}
+	/*	# channels = 1 or 3 have no alpha, 2 & 4 do have alpha	*/
+	has_alpha = 1 - (channels & 1);
+	/*	get the RAM for the compressed image
+		(16 bytes per 4x4 pixel block)	*/
+	*out_size = ((width+3) >> 2) * ((height+3) >> 2) * 16;
+	compressed = (unsigned char*)malloc( *out_size );
+	/*	go through each block	*/
+	for( j = 0; j < height; j += 4 )
+	{
+		for( i = 0; i < width; i += 4 )
+		{
+			/*	local variables, and my block counter	*/
+			int idx = 0;
+			int mx = 4, my = 4;
+			if( j+4 >= height )
+			{
+				my = height - j;
+			}
+			if( i+4 >= width )
+			{
+				mx = width - i;
+			}
+			for( y = 0; y < my; ++y )
+			{
+				for( x = 0; x < mx; ++x )
+				{
+					ublock[idx++] = uncompressed[(j+y)*width*channels+(i+x)*channels];
+					ublock[idx++] = uncompressed[(j+y)*width*channels+(i+x)*channels+chan_step];
+					ublock[idx++] = uncompressed[(j+y)*width*channels+(i+x)*channels+chan_step+chan_step];
+					ublock[idx++] =
+						has_alpha * uncompressed[(j+y)*width*channels+(i+x)*channels+channels-1]
+						+ (1-has_alpha)*255;
+				}
+				for( x = mx; x < 4; ++x )
+				{
+					ublock[idx++] = ublock[0];
+					ublock[idx++] = ublock[1];
+					ublock[idx++] = ublock[2];
+					ublock[idx++] = ublock[3];
+				}
+			}
+			for( y = my; y < 4; ++y )
+			{
+				for( x = 0; x < 4; ++x )
+				{
+					ublock[idx++] = ublock[0];
+					ublock[idx++] = ublock[1];
+					ublock[idx++] = ublock[2];
+					ublock[idx++] = ublock[3];
+				}
+			}
+			/*	now compress the alpha block	*/
+			compress_DDS_alpha_block( ublock, cblock );
+			/*	copy the data from the compressed alpha block into the main buffer	*/
+			for( x = 0; x < 8; ++x )
+			{
+				compressed[index++] = cblock[x];
+			}
+			/*	then compress the color block	*/
+			++block_count;
+			compress_DDS_color_block( 4, ublock, cblock );
+			/*	copy the data from the compressed color block into the main buffer	*/
+			for( x = 0; x < 8; ++x )
+			{
+				compressed[index++] = cblock[x];
+			}
+		}
+	}
+	return compressed;
+}
+
+/********* Helper Functions *********/
+int convert_bit_range( int c, int from_bits, int to_bits )
+{
+	int b = (1 << (from_bits - 1)) + c * ((1 << to_bits) - 1);
+	return (b + (b >> from_bits)) >> from_bits;
+}
+
+int rgb_to_565( int r, int g, int b )
+{
+	return
+		(convert_bit_range( r, 8, 5 ) << 11) |
+		(convert_bit_range( g, 8, 6 ) << 05) |
+		(convert_bit_range( b, 8, 5 ) << 00);
+}
+
+void rgb_888_from_565( unsigned int c, int *r, int *g, int *b )
+{
+	*r = convert_bit_range( (c >> 11) & 31, 5, 8 );
+	*g = convert_bit_range( (c >> 05) & 63, 6, 8 );
+	*b = convert_bit_range( (c >> 00) & 31, 5, 8 );
+}
+
+void compute_color_line_STDEV(
+		const unsigned char *const uncompressed,
+		int channels,
+		float point[3], float direction[3] )
+{
+	const float inv_16 = 1.0f / 16.0f;
+	int i;
+	float sum_r = 0.0f, sum_g = 0.0f, sum_b = 0.0f;
+	float sum_rr = 0.0f, sum_gg = 0.0f, sum_bb = 0.0f;
+	float sum_rg = 0.0f, sum_rb = 0.0f, sum_gb = 0.0f;
+	/*	calculate all data needed for the covariance matrix
+		( to compare with _rygdxt code)	*/
+	for( i = 0; i < 16*channels; i += channels )
+	{
+		sum_r += uncompressed[i+0];
+		sum_rr += uncompressed[i+0] * uncompressed[i+0];
+		sum_g += uncompressed[i+1];
+		sum_gg += uncompressed[i+1] * uncompressed[i+1];
+		sum_b += uncompressed[i+2];
+		sum_bb += uncompressed[i+2] * uncompressed[i+2];
+		sum_rg += uncompressed[i+0] * uncompressed[i+1];
+		sum_rb += uncompressed[i+0] * uncompressed[i+2];
+		sum_gb += uncompressed[i+1] * uncompressed[i+2];
+	}
+	/*	convert the sums to averages	*/
+	sum_r *= inv_16;
+	sum_g *= inv_16;
+	sum_b *= inv_16;
+	/*	and convert the squares to the squares of the value - avg_value	*/
+	sum_rr -= 16.0f * sum_r * sum_r;
+	sum_gg -= 16.0f * sum_g * sum_g;
+	sum_bb -= 16.0f * sum_b * sum_b;
+	sum_rg -= 16.0f * sum_r * sum_g;
+	sum_rb -= 16.0f * sum_r * sum_b;
+	sum_gb -= 16.0f * sum_g * sum_b;
+	/*	the point on the color line is the average	*/
+	point[0] = sum_r;
+	point[1] = sum_g;
+	point[2] = sum_b;
+	#if USE_COV_MAT
+	/*
+		The following idea was from ryg.
+		(https://mollyrocket.com/forums/viewtopic.php?t=392)
+		The method worked great (less RMSE than mine) most of
+		the time, but had some issues handling some simple
+		boundary cases, like full green next to full red,
+		which would generate a covariance matrix like this:
+
+		| 1  -1  0 |
+		| -1  1  0 |
+		| 0   0  0 |
+
+		For a given starting vector, the power method can
+		generate all zeros!  So no starting with {1,1,1}
+		as I was doing!  This kind of error is still a
+		slight possibility, but will be very rare.
+	*/
+	/*	use the covariance matrix directly
+		(1st iteration, don't use all 1.0 values!)	*/
+	sum_r = 1.0f;
+	sum_g = 2.718281828f;
+	sum_b = 3.141592654f;
+	direction[0] = sum_r*sum_rr + sum_g*sum_rg + sum_b*sum_rb;
+	direction[1] = sum_r*sum_rg + sum_g*sum_gg + sum_b*sum_gb;
+	direction[2] = sum_r*sum_rb + sum_g*sum_gb + sum_b*sum_bb;
+	/*	2nd iteration, use results from the 1st guy	*/
+	sum_r = direction[0];
+	sum_g = direction[1];
+	sum_b = direction[2];
+	direction[0] = sum_r*sum_rr + sum_g*sum_rg + sum_b*sum_rb;
+	direction[1] = sum_r*sum_rg + sum_g*sum_gg + sum_b*sum_gb;
+	direction[2] = sum_r*sum_rb + sum_g*sum_gb + sum_b*sum_bb;
+	/*	3rd iteration, use results from the 2nd guy	*/
+	sum_r = direction[0];
+	sum_g = direction[1];
+	sum_b = direction[2];
+	direction[0] = sum_r*sum_rr + sum_g*sum_rg + sum_b*sum_rb;
+	direction[1] = sum_r*sum_rg + sum_g*sum_gg + sum_b*sum_gb;
+	direction[2] = sum_r*sum_rb + sum_g*sum_gb + sum_b*sum_bb;
+	#else
+	/*	use my standard deviation method
+		(very robust, a tiny bit slower and less accurate)	*/
+	direction[0] = sqrt( sum_rr );
+	direction[1] = sqrt( sum_gg );
+	direction[2] = sqrt( sum_bb );
+	/*	which has a greater component	*/
+	if( sum_gg > sum_rr )
+	{
+		/*	green has greater component, so base the other signs off of green	*/
+		if( sum_rg < 0.0f )
+		{
+			direction[0] = -direction[0];
+		}
+		if( sum_gb < 0.0f )
+		{
+			direction[2] = -direction[2];
+		}
+	} else
+	{
+		/*	red has a greater component	*/
+		if( sum_rg < 0.0f )
+		{
+			direction[1] = -direction[1];
+		}
+		if( sum_rb < 0.0f )
+		{
+			direction[2] = -direction[2];
+		}
+	}
+	#endif
+}
+
+void LSE_master_colors_max_min(
+		int *cmax, int *cmin,
+		int channels,
+		const unsigned char *const uncompressed )
+{
+	int i, j;
+	/*	the master colors	*/
+	int c0[3], c1[3];
+	/*	used for fitting the line	*/
+	float sum_x[] = { 0.0f, 0.0f, 0.0f };
+	float sum_x2[] = { 0.0f, 0.0f, 0.0f };
+	float dot_max = 1.0f, dot_min = -1.0f;
+	float vec_len2 = 0.0f;
+	float dot;
+	/*	error check	*/
+	if( (channels < 3) || (channels > 4) )
+	{
+		return;
+	}
+	compute_color_line_STDEV( uncompressed, channels, sum_x, sum_x2 );
+	vec_len2 = 1.0f / ( 0.00001f +
+			sum_x2[0]*sum_x2[0] + sum_x2[1]*sum_x2[1] + sum_x2[2]*sum_x2[2] );
+	/*	finding the max and min vector values	*/
+	dot_max =
+			(
+				sum_x2[0] * uncompressed[0] +
+				sum_x2[1] * uncompressed[1] +
+				sum_x2[2] * uncompressed[2]
+			);
+	dot_min = dot_max;
+	for( i = 1; i < 16; ++i )
+	{
+		dot =
+			(
+				sum_x2[0] * uncompressed[i*channels+0] +
+				sum_x2[1] * uncompressed[i*channels+1] +
+				sum_x2[2] * uncompressed[i*channels+2]
+			);
+		if( dot < dot_min )
+		{
+			dot_min = dot;
+		} else if( dot > dot_max )
+		{
+			dot_max = dot;
+		}
+	}
+	/*	and the offset (from the average location)	*/
+	dot = sum_x2[0]*sum_x[0] + sum_x2[1]*sum_x[1] + sum_x2[2]*sum_x[2];
+	dot_min -= dot;
+	dot_max -= dot;
+	/*	post multiply by the scaling factor	*/
+	dot_min *= vec_len2;
+	dot_max *= vec_len2;
+	/*	OK, build the master colors	*/
+	for( i = 0; i < 3; ++i )
+	{
+		/*	color 0	*/
+		c0[i] = (int)(0.5f + sum_x[i] + dot_max * sum_x2[i]);
+		if( c0[i] < 0 )
+		{
+			c0[i] = 0;
+		} else if( c0[i] > 255 )
+		{
+			c0[i] = 255;
+		}
+		/*	color 1	*/
+		c1[i] = (int)(0.5f + sum_x[i] + dot_min * sum_x2[i]);
+		if( c1[i] < 0 )
+		{
+			c1[i] = 0;
+		} else if( c1[i] > 255 )
+		{
+			c1[i] = 255;
+		}
+	}
+	/*	down_sample (with rounding?)	*/
+	i = rgb_to_565( c0[0], c0[1], c0[2] );
+	j = rgb_to_565( c1[0], c1[1], c1[2] );
+	if( i > j )
+	{
+		*cmax = i;
+		*cmin = j;
+	} else
+	{
+		*cmax = j;
+		*cmin = i;
+	}
+}
+
+void
+	compress_DDS_color_block
+	(
+		int channels,
+		const unsigned char *const uncompressed,
+		unsigned char compressed[8]
+	)
+{
+	/*	variables	*/
+	int i;
+	int next_bit;
+	int enc_c0, enc_c1;
+	int c0[4], c1[4];
+	float color_line[] = { 0.0f, 0.0f, 0.0f, 0.0f };
+	float vec_len2 = 0.0f, dot_offset = 0.0f;
+	/*	stupid order	*/
+	int swizzle4[] = { 0, 2, 3, 1 };
+	/*	get the master colors	*/
+	LSE_master_colors_max_min( &enc_c0, &enc_c1, channels, uncompressed );
+	/*	store the 565 color 0 and color 1	*/
+	compressed[0] = (enc_c0 >> 0) & 255;
+	compressed[1] = (enc_c0 >> 8) & 255;
+	compressed[2] = (enc_c1 >> 0) & 255;
+	compressed[3] = (enc_c1 >> 8) & 255;
+	/*	zero out the compressed data	*/
+	compressed[4] = 0;
+	compressed[5] = 0;
+	compressed[6] = 0;
+	compressed[7] = 0;
+	/*	reconstitute the master color vectors	*/
+	rgb_888_from_565( enc_c0, &c0[0], &c0[1], &c0[2] );
+	rgb_888_from_565( enc_c1, &c1[0], &c1[1], &c1[2] );
+	/*	the new vector	*/
+	vec_len2 = 0.0f;
+	for( i = 0; i < 3; ++i )
+	{
+		color_line[i] = (float)(c1[i] - c0[i]);
+		vec_len2 += color_line[i] * color_line[i];
+	}
+	if( vec_len2 > 0.0f )
+	{
+		vec_len2 = 1.0f / vec_len2;
+	}
+	/*	pre-perform the scaling	*/
+	color_line[0] *= vec_len2;
+	color_line[1] *= vec_len2;
+	color_line[2] *= vec_len2;
+	/*	compute the offset (constant) portion of the dot product	*/
+	dot_offset = color_line[0]*c0[0] + color_line[1]*c0[1] + color_line[2]*c0[2];
+	/*	store the rest of the bits	*/
+	next_bit = 8*4;
+	for( i = 0; i < 16; ++i )
+	{
+		/*	find the dot product of this color, to place it on the line
+			(should be [0,1])	*/
+		int next_value = 0;
+		float dot_product =
+			color_line[0] * uncompressed[i*channels+0] +
+			color_line[1] * uncompressed[i*channels+1] +
+			color_line[2] * uncompressed[i*channels+2] -
+			dot_offset;
+		/*	map to [0,3]	*/
+		next_value = (int)( dot_product * 3.0f + 0.5f );
+		if( next_value > 3 )
+		{
+			next_value = 3;
+		} else if( next_value < 0 )
+		{
+			next_value = 0;
+		}
+		/*	OK, store this value	*/
+		compressed[next_bit >> 3] |= swizzle4[ next_value ] << (next_bit & 7);
+		next_bit += 2;
+	}
+	/*	done compressing to DXT1	*/
+}
+
+void
+	compress_DDS_alpha_block
+	(
+		const unsigned char *const uncompressed,
+		unsigned char compressed[8]
+	)
+{
+	/*	variables	*/
+	int i;
+	int next_bit;
+	int a0, a1;
+	float scale_me;
+	/*	stupid order	*/
+	int swizzle8[] = { 1, 7, 6, 5, 4, 3, 2, 0 };
+	/*	get the alpha limits (a0 > a1)	*/
+	a0 = a1 = uncompressed[3];
+	for( i = 4+3; i < 16*4; i += 4 )
+	{
+		if( uncompressed[i] > a0 )
+		{
+			a0 = uncompressed[i];
+		} else if( uncompressed[i] < a1 )
+		{
+			a1 = uncompressed[i];
+		}
+	}
+	/*	store those limits, and zero the rest of the compressed dataset	*/
+	compressed[0] = a0;
+	compressed[1] = a1;
+	/*	zero out the compressed data	*/
+	compressed[2] = 0;
+	compressed[3] = 0;
+	compressed[4] = 0;
+	compressed[5] = 0;
+	compressed[6] = 0;
+	compressed[7] = 0;
+	/*	store all of the alpha values	*/
+	next_bit = 8*2;
+	scale_me = (a0 > a1) ? 7.9999f / (a0 - a1) : 0.0f;	/* avoid dividing by zero for constant alpha */
+	for( i = 3; i < 16*4; i += 4 )
+	{
+		/*	convert this alpha value to a 3 bit number	*/
+		int svalue;
+		int value = (int)((uncompressed[i] - a1) * scale_me);
+		svalue = swizzle8[ value&7 ];
+		/*	OK, store this value, start with the 1st byte	*/
+		compressed[next_bit >> 3] |= svalue << (next_bit & 7);
+		if( (next_bit & 7) > 5 )
+		{
+			/*	spans 2 bytes, fill in the start of the 2nd byte	*/
+			compressed[1 + (next_bit >> 3)] |= svalue >> (8 - (next_bit & 7) );
+		}
+		next_bit += 3;
+	}
+	/*	done compressing the DXT5 alpha block	*/
+}
diff --git a/external/include/SOIL/image_DXT.h b/external/include/SOIL/image_DXT.h
new file mode 100644
index 0000000..75f604f
--- /dev/null
+++ b/external/include/SOIL/image_DXT.h
@@ -0,0 +1,123 @@
+/*
+	Jonathan Dummer
+	2007-07-31-10.32
+
+	simple DXT compression / decompression code
+
+	public domain
+*/
+
+#ifndef HEADER_IMAGE_DXT
+#define HEADER_IMAGE_DXT
+
+/**
+	Converts an image from an array of unsigned chars (RGB or RGBA) to
+	DXT1 or DXT5, then saves the converted image to disk.
+	\return 0 if failed, otherwise returns 1
+**/
+int
+save_image_as_DDS
+(
+    const char *filename,
+    int width, int height, int channels,
+    const unsigned char *const data
+);
+
+/**
+	take an image and convert it to DXT1 (no alpha)
+**/
+unsigned char*
+convert_image_to_DXT1
+(
+    const unsigned char *const uncompressed,
+    int width, int height, int channels,
+    int *out_size
+);
+
+/**
+	take an image and convert it to DXT5 (with alpha)
+**/
+unsigned char*
+convert_image_to_DXT5
+(
+    const unsigned char *const uncompressed,
+    int width, int height, int channels,
+    int *out_size
+);
+
+/**	A bunch of DirectDraw Surface structures and flags **/
+typedef struct
+{
+    unsigned int    dwMagic;
+    unsigned int    dwSize;
+    unsigned int    dwFlags;
+    unsigned int    dwHeight;
+    unsigned int    dwWidth;
+    unsigned int    dwPitchOrLinearSize;
+    unsigned int    dwDepth;
+    unsigned int    dwMipMapCount;
+    unsigned int    dwReserved1[ 11 ];
+
+    /*  DDPIXELFORMAT	*/
+    struct
+    {
+        unsigned int    dwSize;
+        unsigned int    dwFlags;
+        unsigned int    dwFourCC;
+        unsigned int    dwRGBBitCount;
+        unsigned int    dwRBitMask;
+        unsigned int    dwGBitMask;
+        unsigned int    dwBBitMask;
+        unsigned int    dwAlphaBitMask;
+    }
+    sPixelFormat;
+
+    /*  DDCAPS2	*/
+    struct
+    {
+        unsigned int    dwCaps1;
+        unsigned int    dwCaps2;
+        unsigned int    dwDDSX;
+        unsigned int    dwReserved;
+    }
+    sCaps;
+    unsigned int    dwReserved2;
+}
+DDS_header ;
+
+/*	the following constants were copied directly off the MSDN website	*/
+
+/*	The dwFlags member of the original DDSURFACEDESC2 structure
+	can be set to one or more of the following values.	*/
+#define DDSD_CAPS	0x00000001
+#define DDSD_HEIGHT	0x00000002
+#define DDSD_WIDTH	0x00000004
+#define DDSD_PITCH	0x00000008
+#define DDSD_PIXELFORMAT	0x00001000
+#define DDSD_MIPMAPCOUNT	0x00020000
+#define DDSD_LINEARSIZE	0x00080000
+#define DDSD_DEPTH	0x00800000
+
+/*	DirectDraw Pixel Format	*/
+#define DDPF_ALPHAPIXELS	0x00000001
+#define DDPF_FOURCC	0x00000004
+#define DDPF_RGB	0x00000040
+
+/*	The dwCaps1 member of the DDSCAPS2 structure can be
+	set to one or more of the following values.	*/
+#define DDSCAPS_COMPLEX	0x00000008
+#define DDSCAPS_TEXTURE	0x00001000
+#define DDSCAPS_MIPMAP	0x00400000
+
+/*	The dwCaps2 member of the DDSCAPS2 structure can be
+	set to one or more of the following values.		*/
+#define DDSCAPS2_CUBEMAP	0x00000200
+#define DDSCAPS2_CUBEMAP_POSITIVEX	0x00000400
+#define DDSCAPS2_CUBEMAP_NEGATIVEX	0x00000800
+#define DDSCAPS2_CUBEMAP_POSITIVEY	0x00001000
+#define DDSCAPS2_CUBEMAP_NEGATIVEY	0x00002000
+#define DDSCAPS2_CUBEMAP_POSITIVEZ	0x00004000
+#define DDSCAPS2_CUBEMAP_NEGATIVEZ	0x00008000
+#define DDSCAPS2_VOLUME	0x00200000
+
+#endif /* HEADER_IMAGE_DXT	*/
diff --git a/external/include/SOIL/image_helper.c b/external/include/SOIL/image_helper.c
new file mode 100644
index 0000000..d22340f
--- /dev/null
+++ b/external/include/SOIL/image_helper.c
@@ -0,0 +1,435 @@
+/*
+    Jonathan Dummer
+
+    image helper functions
+
+    MIT license
+*/
+
+#include "image_helper.h"
+#include <stdlib.h>
+#include <math.h>
+
+/*	Upscaling the image uses simple bilinear interpolation	*/
+int
+	up_scale_image
+	(
+		const unsigned char* const orig,
+		int width, int height, int channels,
+		unsigned char* resampled,
+		int resampled_width, int resampled_height
+	)
+{
+	float dx, dy;
+	int x, y, c;
+
+    /* error(s) check	*/
+    if ( 	(width < 1) || (height < 1) ||
+            (resampled_width < 2) || (resampled_height < 2) ||
+            (channels < 1) ||
+            (NULL == orig) || (NULL == resampled) )
+    {
+        /*	signify badness	*/
+        return 0;
+    }
+    /*
+		for each given pixel in the new map, find the exact location
+		from the original map which would contribute to this guy
+	*/
+    dx = (width - 1.0f) / (resampled_width - 1.0f);
+    dy = (height - 1.0f) / (resampled_height - 1.0f);
+    for ( y = 0; y < resampled_height; ++y )
+    {
+    	/* find the base y index and fractional offset from that	*/
+    	float sampley = y * dy;
+    	int inty = (int)sampley;
+    	/*	if( inty < 0 ) { inty = 0; } else	*/
+		if( inty > height - 2 ) { inty = height - 2; }
+		sampley -= inty;
+        for ( x = 0; x < resampled_width; ++x )
+        {
+			float samplex = x * dx;
+			int intx = (int)samplex;
+			int base_index;
+			/* find the base x index and fractional offset from that	*/
+			/*	if( intx < 0 ) { intx = 0; } else	*/
+			if( intx > width - 2 ) { intx = width - 2; }
+			samplex -= intx;
+			/*	base index into the original image	*/
+			base_index = (inty * width + intx) * channels;
+            for ( c = 0; c < channels; ++c )
+            {
+            	/*	do the sampling	*/
+				float value = 0.5f;
+				value += orig[base_index]
+							*(1.0f-samplex)*(1.0f-sampley);
+				value += orig[base_index+channels]
+							*(samplex)*(1.0f-sampley);
+				value += orig[base_index+width*channels]
+							*(1.0f-samplex)*(sampley);
+				value += orig[base_index+width*channels+channels]
+							*(samplex)*(sampley);
+				/*	move to the next channel	*/
+				++base_index;
+            	/*	save the new value	*/
+            	resampled[y*resampled_width*channels+x*channels+c] =
+						(unsigned char)(value);
+            }
+        }
+    }
+    /*	done	*/
+    return 1;
+}
+
+int
+	mipmap_image
+	(
+		const unsigned char* const orig,
+		int width, int height, int channels,
+		unsigned char* resampled,
+		int block_size_x, int block_size_y
+	)
+{
+	int mip_width, mip_height;
+	int i, j, c;
+
+	/*	error check	*/
+	if( (width < 1) || (height < 1) ||
+		(channels < 1) || (orig == NULL) ||
+		(resampled == NULL) ||
+		(block_size_x < 1) || (block_size_y < 1) )
+	{
+		/*	nothing to do	*/
+		return 0;
+	}
+	mip_width = width / block_size_x;
+	mip_height = height / block_size_y;
+	if( mip_width < 1 )
+	{
+		mip_width = 1;
+	}
+	if( mip_height < 1 )
+	{
+		mip_height = 1;
+	}
+	for( j = 0; j < mip_height; ++j )
+	{
+		for( i = 0; i < mip_width; ++i )
+		{
+			for( c = 0; c < channels; ++c )
+			{
+				const int index = (j*block_size_y)*width*channels + (i*block_size_x)*channels + c;
+				int sum_value;
+				int u,v;
+				int u_block = block_size_x;
+				int v_block = block_size_y;
+				int block_area;
+				/*	do a bit of checking so we don't over-run the boundaries
+					(necessary for non-square textures!)	*/
+				if( block_size_x * (i+1) > width )
+				{
+					u_block = width - i*block_size_x;
+				}
+				if( block_size_y * (j+1) > height )
+				{
+					v_block = height - j*block_size_y;
+				}
+				block_area = u_block*v_block;
+				/*	for this pixel, see what the average
+					of all the values in the block is.
+					note: start the sum at the rounding value, not at 0	*/
+				sum_value = block_area >> 1;
+				for( v = 0; v < v_block; ++v )
+				for( u = 0; u < u_block; ++u )
+				{
+					sum_value += orig[index + v*width*channels + u*channels];
+				}
+				resampled[j*mip_width*channels + i*channels + c] = sum_value / block_area;
+			}
+		}
+	}
+	return 1;
+}
+
+int
+	scale_image_RGB_to_NTSC_safe
+	(
+		unsigned char* orig,
+		int width, int height, int channels
+	)
+{
+	const float scale_lo = 16.0f - 0.499f;
+	const float scale_hi = 235.0f + 0.499f;
+	int i, j;
+	int nc = channels;
+	unsigned char scale_LUT[256];
+	/*	error check	*/
+	if( (width < 1) || (height < 1) ||
+		(channels < 1) || (orig == NULL) )
+	{
+		/*	nothing to do	*/
+		return 0;
+	}
+	/*	set up the scaling Look Up Table	*/
+	for( i = 0; i < 256; ++i )
+	{
+		scale_LUT[i] = (unsigned char)((scale_hi - scale_lo) * i / 255.0f + scale_lo);
+	}
+	/*	for channels = 2 or 4, ignore the alpha component	*/
+	nc -= 1 - (channels & 1);
+	/*	OK, go through the image and scale any non-alpha components	*/
+	for( i = 0; i < width*height*channels; i += channels )
+	{
+		for( j = 0; j < nc; ++j )
+		{
+			orig[i+j] = scale_LUT[orig[i+j]];
+		}
+	}
+	return 1;
+}
+
+unsigned char clamp_byte( int x ) { return ( (x) < 0 ? (0) : ( (x) > 255 ? 255 : (x) ) ); }
+
+/*
+	This function takes the RGB components of the image
+	and converts them into YCoCg.  3 components will be
+	re-ordered to CoYCg (for optimum DXT1 compression),
+	while 4 components will be ordered CoCgAY (for DXT5
+	compression).
+*/
+int
+	convert_RGB_to_YCoCg
+	(
+		unsigned char* orig,
+		int width, int height, int channels
+	)
+{
+	int i;
+	/*	error check	*/
+	if( (width < 1) || (height < 1) ||
+		(channels < 3) || (channels > 4) ||
+		(orig == NULL) )
+	{
+		/*	nothing to do	*/
+		return -1;
+	}
+	/*	do the conversion	*/
+	if( channels == 3 )
+	{
+		for( i = 0; i < width*height*3; i += 3 )
+		{
+			int r = orig[i+0];
+			int g = (orig[i+1] + 1) >> 1;
+			int b = orig[i+2];
+			int tmp = (2 + r + b) >> 2;
+			/*	Co	*/
+			orig[i+0] = clamp_byte( 128 + ((r - b + 1) >> 1) );
+			/*	Y	*/
+			orig[i+1] = clamp_byte( g + tmp );
+			/*	Cg	*/
+			orig[i+2] = clamp_byte( 128 + g - tmp );
+		}
+	} else
+	{
+		for( i = 0; i < width*height*4; i += 4 )
+		{
+			int r = orig[i+0];
+			int g = (orig[i+1] + 1) >> 1;
+			int b = orig[i+2];
+			unsigned char a = orig[i+3];
+			int tmp = (2 + r + b) >> 2;
+			/*	Co	*/
+			orig[i+0] = clamp_byte( 128 + ((r - b + 1) >> 1) );
+			/*	Cg	*/
+			orig[i+1] = clamp_byte( 128 + g - tmp );
+			/*	Alpha	*/
+			orig[i+2] = a;
+			/*	Y	*/
+			orig[i+3] = clamp_byte( g + tmp );
+		}
+	}
+	/*	done	*/
+	return 0;
+}
+
+/*
+	This function takes the YCoCg components of the image
+	and converts them into RGB.  See above.
+*/
+int
+	convert_YCoCg_to_RGB
+	(
+		unsigned char* orig,
+		int width, int height, int channels
+	)
+{
+	int i;
+	/*	error check	*/
+	if( (width < 1) || (height < 1) ||
+		(channels < 3) || (channels > 4) ||
+		(orig == NULL) )
+	{
+		/*	nothing to do	*/
+		return -1;
+	}
+	/*	do the conversion	*/
+	if( channels == 3 )
+	{
+		for( i = 0; i < width*height*3; i += 3 )
+		{
+			int co = orig[i+0] - 128;
+			int y  = orig[i+1];
+			int cg = orig[i+2] - 128;
+			/*	R	*/
+			orig[i+0] = clamp_byte( y + co - cg );
+			/*	G	*/
+			orig[i+1] = clamp_byte( y + cg );
+			/*	B	*/
+			orig[i+2] = clamp_byte( y - co - cg );
+		}
+	} else
+	{
+		for( i = 0; i < width*height*4; i += 4 )
+		{
+			int co = orig[i+0] - 128;
+			int cg = orig[i+1] - 128;
+			unsigned char a  = orig[i+2];
+			int y  = orig[i+3];
+			/*	R	*/
+			orig[i+0] = clamp_byte( y + co - cg );
+			/*	G	*/
+			orig[i+1] = clamp_byte( y + cg );
+			/*	B	*/
+			orig[i+2] = clamp_byte( y - co - cg );
+			/*	A	*/
+			orig[i+3] = a;
+		}
+	}
+	/*	done	*/
+	return 0;
+}
+
+float
+find_max_RGBE
+(
+	unsigned char *image,
+    int width, int height
+)
+{
+	float max_val = 0.0f;
+	unsigned char *img = image;
+	int i, j;
+	for( i = width * height; i > 0; --i )
+	{
+		/* float scale = powf( 2.0f, img[3] - 128.0f ) / 255.0f; */
+		float scale = ldexp( 1.0f / 255.0f, (int)(img[3]) - 128 );
+		for( j = 0; j < 3; ++j )
+		{
+			if( img[j] * scale > max_val )
+			{
+				max_val = img[j] * scale;
+			}
+		}
+		/* next pixel */
+		img += 4;
+	}
+	return max_val;
+}
+
+int
+RGBE_to_RGBdivA
+(
+    unsigned char *image,
+    int width, int height,
+    int rescale_to_max
+)
+{
+	/* local variables */
+	int i, iv;
+	unsigned char *img = image;
+	float scale = 1.0f;
+	/* error check */
+	if( (!image) || (width < 1) || (height < 1) )
+	{
+		return 0;
+	}
+	/* convert (note: no negative numbers, but 0.0 is possible) */
+	if( rescale_to_max )
+	{
+		scale = 255.0f / find_max_RGBE( image, width, height );
+	}
+	for( i = width * height; i > 0; --i )
+	{
+		/* decode this pixel, and find the max */
+		float r,g,b,e, m;
+		/* e = scale * powf( 2.0f, img[3] - 128.0f ) / 255.0f; */
+		e = scale * ldexp( 1.0f / 255.0f, (int)(img[3]) - 128 );
+		r = e * img[0];
+		g = e * img[1];
+		b = e * img[2];
+		m = (r > g) ? r : g;
+		m = (b > m) ? b : m;
+		/* and encode it into RGBdivA */
+		iv = (m != 0.0f) ? (int)(255.0f / m) : 1;
+		iv = (iv < 1) ? 1 : iv;
+		img[3] = (iv > 255) ? 255 : iv;
+		iv = (int)(img[3] * r + 0.5f);
+		img[0] = (iv > 255) ? 255 : iv;
+		iv = (int)(img[3] * g + 0.5f);
+		img[1] = (iv > 255) ? 255 : iv;
+		iv = (int)(img[3] * b + 0.5f);
+		img[2] = (iv > 255) ? 255 : iv;
+		/* and on to the next pixel */
+		img += 4;
+	}
+	return 1;
+}
+
+int
+RGBE_to_RGBdivA2
+(
+    unsigned char *image,
+    int width, int height,
+    int rescale_to_max
+)
+{
+	/* local variables */
+	int i, iv;
+	unsigned char *img = image;
+	float scale = 1.0f;
+	/* error check */
+	if( (!image) || (width < 1) || (height < 1) )
+	{
+		return 0;
+	}
+	/* convert (note: no negative numbers, but 0.0 is possible) */
+	if( rescale_to_max )
+	{
+		scale = 255.0f * 255.0f / find_max_RGBE( image, width, height );
+	}
+	for( i = width * height; i > 0; --i )
+	{
+		/* decode this pixel, and find the max */
+		float r,g,b,e, m;
+		/* e = scale * powf( 2.0f, img[3] - 128.0f ) / 255.0f; */
+		e = scale * ldexp( 1.0f / 255.0f, (int)(img[3]) - 128 );
+		r = e * img[0];
+		g = e * img[1];
+		b = e * img[2];
+		m = (r > g) ? r : g;
+		m = (b > m) ? b : m;
+		/* and encode it into RGBdivA2 */
+		iv = (m != 0.0f) ? (int)sqrtf( 255.0f * 255.0f / m ) : 1;
+		iv = (iv < 1) ? 1 : iv;
+		img[3] = (iv > 255) ? 255 : iv;
+		iv = (int)(img[3] * img[3] * r / 255.0f + 0.5f);
+		img[0] = (iv > 255) ? 255 : iv;
+		iv = (int)(img[3] * img[3] * g / 255.0f + 0.5f);
+		img[1] = (iv > 255) ? 255 : iv;
+		iv = (int)(img[3] * img[3] * b / 255.0f + 0.5f);
+		img[2] = (iv > 255) ? 255 : iv;
+		/* and on to the next pixel */
+		img += 4;
+	}
+	return 1;
+}
diff --git a/external/include/SOIL/image_helper.h b/external/include/SOIL/image_helper.h
new file mode 100644
index 0000000..3fa2662
--- /dev/null
+++ b/external/include/SOIL/image_helper.h
@@ -0,0 +1,115 @@
+/*
+    Jonathan Dummer
+
+    Image helper functions
+
+    MIT license
+*/
+
+#ifndef HEADER_IMAGE_HELPER
+#define HEADER_IMAGE_HELPER
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+	This function upscales an image.
+	Not to be used to create MIPmaps,
+	but to make it square,
+	or to make it power-of-two sized.
+**/
+int
+	up_scale_image
+	(
+		const unsigned char* const orig,
+		int width, int height, int channels,
+		unsigned char* resampled,
+		int resampled_width, int resampled_height
+	);
+
+/**
+	This function downscales an image.
+	Used for creating MIPmaps,
+	the incoming image should be
+	power-of-two sized.
+**/
+int
+	mipmap_image
+	(
+		const unsigned char* const orig,
+		int width, int height, int channels,
+		unsigned char* resampled,
+		int block_size_x, int block_size_y
+	);
+
+/**
+	This function takes the RGB components of the image
+	and scales each channel from [0,255] to [16,235].
+	This makes the colors "Safe" for display on NTSC
+	displays.  Note that this is _NOT_ a good idea for
+	loading images like normal- or height-maps!
+**/
+int
+	scale_image_RGB_to_NTSC_safe
+	(
+		unsigned char* orig,
+		int width, int height, int channels
+	);
+
+/**
+	This function takes the RGB components of the image
+	and converts them into YCoCg.  3 components will be
+	re-ordered to CoYCg (for optimum DXT1 compression),
+	while 4 components will be ordered CoCgAY (for DXT5
+	compression).
+**/
+int
+	convert_RGB_to_YCoCg
+	(
+		unsigned char* orig,
+		int width, int height, int channels
+	);
+
+/**
+	This function takes the YCoCg components of the image
+	and converts them into RGB.  See above.
+**/
+int
+	convert_YCoCg_to_RGB
+	(
+		unsigned char* orig,
+		int width, int height, int channels
+	);
+
+/**
+	Converts an HDR image from an array
+	of unsigned chars (RGBE) to RGBdivA
+	\return 0 if failed, otherwise returns 1
+**/
+int
+	RGBE_to_RGBdivA
+	(
+		unsigned char *image,
+		int width, int height,
+		int rescale_to_max
+	);
+
+/**
+	Converts an HDR image from an array
+	of unsigned chars (RGBE) to RGBdivA2
+	\return 0 if failed, otherwise returns 1
+**/
+int
+	RGBE_to_RGBdivA2
+	(
+		unsigned char *image,
+		int width, int height,
+		int rescale_to_max
+	);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* HEADER_IMAGE_HELPER	*/
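Review note on scale_image_RGB_to_NTSC_safe: the header only states the mapping from [0,255] to [16,235]. A minimal sketch of one way to do that with integer rounding follows. The names ntsc_safe_value and ntsc_safe_sketch are hypothetical, the rounding rule is an assumption rather than SOIL's actual implementation, and so is the choice to treat the last channel of 2- and 4-channel images as alpha and leave it untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Map one 8-bit value from [0,255] to [16,235], rounding to nearest. */
static unsigned char ntsc_safe_value(unsigned char v)
{
    int out = 16 + (v * 219 + 127) / 255;
    assert(out >= 16 && out <= 235);
    return (unsigned char)out;
}

/* Hypothetical stand-in for scale_image_RGB_to_NTSC_safe: rescales the
   color channels in place and returns 1, or returns 0 on bad arguments. */
static int ntsc_safe_sketch(unsigned char *pixels,
                            int width, int height, int channels)
{
    size_t i, count;
    int c, color_channels;
    if (!pixels || width < 1 || height < 1 || channels < 1 || channels > 4)
        return 0;
    /* Assumption: with 2 or 4 channels the last one is alpha. */
    color_channels = (channels % 2 == 0) ? channels - 1 : channels;
    count = (size_t)width * (size_t)height;
    for (i = 0; i < count; ++i)
        for (c = 0; c < color_channels; ++c)
            pixels[i * channels + c] =
                ntsc_safe_value(pixels[i * channels + c]);
    return 1;
}
```

The mapping compresses contrast, which is why the header warns against applying it to normal maps or height maps: their bytes encode data, not display colors.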
diff --git a/external/include/SOIL/original/stb_image-1.09.c b/external/include/SOIL/original/stb_image-1.09.c
new file mode 100644
index 0000000..ee848ad
--- /dev/null
+++ b/external/include/SOIL/original/stb_image-1.09.c
@@ -0,0 +1,3632 @@
+/* stbi-1.09 - public domain JPEG/PNG reader - http://nothings.org/stb_image.c
+                      when you control the images you're loading
+
+   QUICK NOTES:
+      Primarily of interest to game developers and other people who can
+          avoid problematic images and only need the trivial interface
+
+      JPEG baseline (no JPEG progressive, no oddball channel decimations)
+      PNG non-interlaced
+      BMP non-1bpp, non-RLE
+      TGA (not sure what subset, if a subset)
+      PSD (composited view only, no extra channels)
+      HDR (radiance rgbE format)
+      writes BMP,TGA (define STBI_NO_WRITE to remove code)
+      decoded from memory or through stdio FILE (define STBI_NO_STDIO to remove code)
+        
+   TODO:
+      stbi_info_*
+  
+   history:
+      1.09   Fix format-conversion for PSD code (bad global variables!)
+      1.08   Thatcher Ulrich's PSD code integrated by Nicolas Schulz
+      1.07   attempt to fix C++ warning/errors again
+      1.06   attempt to fix C++ warning/errors again
+      1.05   fix TGA loading to return correct *comp and use good luminance calc
+      1.04   default float alpha is 1, not 255; use 'void *' for stbi_image_free
+      1.03   bugfixes to STBI_NO_STDIO, STBI_NO_HDR
+      1.02   support for (subset of) HDR files, float interface for preferred access to them
+      1.01   fix bug: possible bug in handling right-side up bmps... not sure
+             fix bug: the stbi_bmp_load() and stbi_tga_load() functions didn't work at all
+      1.00   interface to zlib that skips zlib header
+      0.99   correct handling of alpha in palette
+      0.98   TGA loader by lonesock; dynamically add loaders (untested)
+      0.97   jpeg errors on too large a file; also catch another malloc failure
+      0.96   fix detection of invalid v value - particleman@mollyrocket forum
+      0.95   during header scan, seek to markers in case of padding
+      0.94   STBI_NO_STDIO to disable stdio usage; rename all #defines the same
+      0.93   handle jpegtran output; verbose errors
+      0.92   read 4,8,16,24,32-bit BMP files of several formats
+      0.91   output 24-bit Windows 3.0 BMP files
+      0.90   fix a few more warnings; bump version number to approach 1.0
+      0.61   bugfixes due to Marc LeBlanc, Christopher Lloyd
+      0.60   fix compiling as c++
+      0.59   fix warnings: merge Dave Moore's -Wall fixes
+      0.58   fix bug: zlib uncompressed mode len/nlen was wrong endian
+      0.57   fix bug: jpg last huffman symbol before marker was >9 bits but less
+                      than 16 available
+      0.56   fix bug: zlib uncompressed mode len vs. nlen
+      0.55   fix bug: restart_interval not initialized to 0
+      0.54   allow NULL for 'int *comp'
+      0.53   fix bug in png 3->4; speedup png decoding
+      0.52   png handles req_comp=3,4 directly; minor cleanup; jpeg comments
+      0.51   obey req_comp requests, 1-component jpegs return as 1-component,
+             on 'test' only check type, not whether we support this variant
+*/
+
+
+////   begin header file  ////////////////////////////////////////////////////
+//
+// Limitations:
+//    - no progressive/interlaced support (jpeg, png)
+//    - 8-bit samples only (jpeg, png)
+//    - not threadsafe
+//    - channel subsampling of at most 2 in each dimension (jpeg)
+//    - no delayed line count (jpeg) -- IJG doesn't support either
+//
+// Basic usage (see HDR discussion below):
+//    int x,y,n;
+//    unsigned char *data = stbi_load(filename, &x, &y, &n, 0);
+//    // ... process data if not NULL ... 
+//    // ... x = width, y = height, n = # 8-bit components per pixel ...
+//    // ... replace '0' with '1'..'4' to force that many components per pixel
+//    stbi_image_free(data);
+//
+// Standard parameters:
+//    int *x       -- outputs image width in pixels
+//    int *y       -- outputs image height in pixels
+//    int *comp    -- outputs # of image components in image file
+//    int req_comp -- if non-zero, # of image components requested in result
+//
+// The return value from an image loader is an 'unsigned char *' which points
+// to the pixel data. The pixel data consists of *y scanlines of *x pixels,
+// with each pixel consisting of N interleaved 8-bit components; the first
+// pixel pointed to is top-left-most in the image. There is no padding between
+// image scanlines or between pixels, regardless of format. The number of
+// components N is 'req_comp' if req_comp is non-zero, or *comp otherwise.
+// If req_comp is non-zero, *comp has the number of components that _would_
+// have been output otherwise. E.g. if you set req_comp to 4, you will always
+// get RGBA output, but you can check *comp to easily see if it's opaque.
+//
+// An output image with N components has the following components interleaved
+// in this order in each pixel:
+//
+//     N=#comp     components
+//       1           grey
+//       2           grey, alpha
+//       3           red, green, blue
+//       4           red, green, blue, alpha
+//
+// If image loading fails for any reason, the return value will be NULL,
+// and *x, *y, *comp will be unchanged. The function stbi_failure_reason()
+// can be queried for an extremely brief, end-user unfriendly explanation
+// of why the load failed. Define STBI_NO_FAILURE_STRINGS to avoid
+// compiling these strings at all, and STBI_FAILURE_USERMSG to get slightly
+// more user-friendly ones.
+//
+// Paletted PNG and BMP images are automatically depalettized.
+//
+//
+// ===========================================================================
+//
+// HDR image support   (disable by defining STBI_NO_HDR)
+//
+// stb_image now supports loading HDR images in general, and currently
+// the Radiance .HDR file format, although the support is provided
+// generically. You can still load any file through the existing interface;
+// if you attempt to load an HDR file, it will be automatically remapped to
+// LDR, assuming gamma 2.2 and an arbitrary scale factor defaulting to 1;
+// both of these constants can be reconfigured through this interface:
+//
+//     stbi_hdr_to_ldr_gamma(2.2f);
+//     stbi_hdr_to_ldr_scale(1.0f);
+//
+// (note, do not use _inverse_ constants; stb_image will invert them
+// appropriately).
+//
+// Additionally, there is a new, parallel interface for loading files as
+// (linear) floats to preserve the full dynamic range:
+//
+//    float *data = stbi_loadf(filename, &x, &y, &n, 0);
+// 
+// If you load LDR images through this interface, those images will
+// be promoted to floating point values, run through the inverse of
+// constants corresponding to the above:
+//
+//     stbi_ldr_to_hdr_scale(1.0f);
+//     stbi_ldr_to_hdr_gamma(2.2f);
+//
+// Finally, given a filename (or an open file or memory block--see header
+// file for details) containing image data, you can query for the "most
+// appropriate" interface to use (that is, whether the image is HDR or
+// not), using:
+//
+//     stbi_is_hdr(char *filename);
+
+
+#ifndef STBI_NO_STDIO
+#include <stdio.h>
+#endif
+
+#ifndef STBI_NO_HDR
+#include <math.h>  // ldexp
+#include <string.h> // strcmp
+#endif
+
+enum
+{
+   STBI_default = 0, // only used for req_comp
+
+   STBI_grey       = 1,
+   STBI_grey_alpha = 2,
+   STBI_rgb        = 3,
+   STBI_rgb_alpha  = 4,
+};
+
+typedef unsigned char stbi_uc;
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// WRITING API
+
+#if !defined(STBI_NO_WRITE) && !defined(STBI_NO_STDIO)
+// write a BMP/TGA file given tightly packed 'comp' channels (no padding, nor bmp-stride-padding)
+// (you must include the appropriate extension in the filename).
+// returns TRUE on success, FALSE if the file couldn't be opened or written
+extern int      stbi_write_bmp       (char *filename,           int x, int y, int comp, void *data);
+extern int      stbi_write_tga       (char *filename,           int x, int y, int comp, void *data);
+#endif
+
+// PRIMARY API - works on images of any type
+
+// load image by filename, open file, or memory buffer
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_load            (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_load_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+extern int      stbi_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern stbi_uc *stbi_load_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+// for stbi_load_from_file, file pointer is left pointing immediately after image
+
+#ifndef STBI_NO_HDR
+#ifndef STBI_NO_STDIO
+extern float *stbi_loadf            (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern float *stbi_loadf_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+extern float *stbi_loadf_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+
+extern void   stbi_hdr_to_ldr_gamma(float gamma);
+extern void   stbi_hdr_to_ldr_scale(float scale);
+
+extern void   stbi_ldr_to_hdr_gamma(float gamma);
+extern void   stbi_ldr_to_hdr_scale(float scale);
+
+#endif // STBI_NO_HDR
+
+// get a VERY brief reason for failure
+extern char    *stbi_failure_reason  (void);
+
+// free the loaded image -- this is just free()
+extern void     stbi_image_free      (void *retval_from_stbi_load);
+
+// get image dimensions & components without fully decoding
+extern int      stbi_info_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp);
+extern int      stbi_is_hdr_from_memory(stbi_uc *buffer, int len);
+#ifndef STBI_NO_STDIO
+extern int      stbi_info            (char *filename,           int *x, int *y, int *comp);
+extern int      stbi_is_hdr          (char *filename);
+extern int      stbi_is_hdr_from_file(FILE *f);
+#endif
+
+// ZLIB client - used by PNG, available for other purposes
+
+extern char *stbi_zlib_decode_malloc_guesssize(int initial_size, int *outlen);
+extern char *stbi_zlib_decode_malloc(char *buffer, int len, int *outlen);
+extern int   stbi_zlib_decode_buffer(char *obuffer, int olen, char *ibuffer, int ilen);
+
+extern char *stbi_zlib_decode_noheader_malloc(char *buffer, int len, int *outlen);
+extern int   stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, char *ibuffer, int ilen);
+
+
+// TYPE-SPECIFIC ACCESS
+
+// is it a jpeg?
+extern int      stbi_jpeg_test_memory     (stbi_uc *buffer, int len);
+extern stbi_uc *stbi_jpeg_load_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+extern int      stbi_jpeg_info_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_jpeg_load            (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern int      stbi_jpeg_test_file       (FILE *f);
+extern stbi_uc *stbi_jpeg_load_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+
+extern int      stbi_jpeg_info            (char *filename,           int *x, int *y, int *comp);
+extern int      stbi_jpeg_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+
+extern int      stbi_jpeg_dc_only; // only decode DC component
+
+// is it a png?
+extern int      stbi_png_test_memory      (stbi_uc *buffer, int len);
+extern stbi_uc *stbi_png_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_png_load             (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info             (char *filename,           int *x, int *y, int *comp);
+extern int      stbi_png_test_file        (FILE *f);
+extern stbi_uc *stbi_png_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info_from_file   (FILE *f,                  int *x, int *y, int *comp);
+#endif
+
+// is it a bmp?
+extern int      stbi_bmp_test_memory      (stbi_uc *buffer, int len);
+
+extern stbi_uc *stbi_bmp_load             (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_bmp_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_bmp_test_file        (FILE *f);
+extern stbi_uc *stbi_bmp_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it a tga?
+extern int      stbi_tga_test_memory      (stbi_uc *buffer, int len);
+
+extern stbi_uc *stbi_tga_load             (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_tga_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_tga_test_file        (FILE *f);
+extern stbi_uc *stbi_tga_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it a psd?
+extern int      stbi_psd_test_memory      (stbi_uc *buffer, int len);
+
+extern stbi_uc *stbi_psd_load             (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_psd_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_psd_test_file        (FILE *f);
+extern stbi_uc *stbi_psd_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it an hdr?
+extern int      stbi_hdr_test_memory      (stbi_uc *buffer, int len);
+
+extern float *  stbi_hdr_load             (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern float *  stbi_hdr_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_hdr_test_file        (FILE *f);
+extern float *  stbi_hdr_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// define new loaders
+typedef struct
+{
+   int       (*test_memory)(stbi_uc *buffer, int len);
+   stbi_uc * (*load_from_memory)(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+   #ifndef STBI_NO_STDIO
+   int       (*test_file)(FILE *f);
+   stbi_uc * (*load_from_file)(FILE *f, int *x, int *y, int *comp, int req_comp);
+   #endif
+} stbi_loader;
+
+// register a loader by filling out the above structure (you must define ALL functions)
+// returns 1 if added or already added, 0 if not added (too many loaders)
+extern int stbi_register_loader(stbi_loader *loader);
+
+#ifdef __cplusplus
+}
+#endif
+
+//
+//
+////   end header file   /////////////////////////////////////////////////////
+
+#ifndef STBI_NO_STDIO
+#include <stdio.h>
+#endif
+#include <stdlib.h>
+#include <memory.h>
+#include <assert.h>
+#include <stdarg.h>
+
+#ifndef _MSC_VER
+#define __forceinline
+#endif
+
+// implementation:
+typedef unsigned char uint8;
+typedef unsigned short uint16;
+typedef   signed short  int16;
+typedef unsigned int   uint32;
+typedef   signed int    int32;
+typedef unsigned int   uint;
+
+// should produce compiler error if size is wrong
+typedef unsigned char validate_uint32[sizeof(uint32)==4];
+
+#if defined(STBI_NO_STDIO) && !defined(STBI_NO_WRITE)
+#define STBI_NO_WRITE
+#endif
+
+//////////////////////////////////////////////////////////////////////////////
+//
+// Generic API that works on all image types
+//
+
+static char *failure_reason;
+
+char *stbi_failure_reason(void)
+{
+   return failure_reason;
+}
+
+static int e(char *str)
+{
+   failure_reason = str;
+   return 0;
+}
+
+#ifdef STBI_NO_FAILURE_STRINGS
+   #define e(x,y)  0
+#elif defined(STBI_FAILURE_USERMSG)
+   #define e(x,y)  e(y)
+#else
+   #define e(x,y)  e(x)
+#endif
+
+#define epf(x,y)   ((float *) (e(x,y)?NULL:NULL))
+#define epuc(x,y)  ((unsigned char *) (e(x,y)?NULL:NULL))
+
+void stbi_image_free(void *retval_from_stbi_load)
+{
+   free(retval_from_stbi_load);
+}
+
+#define MAX_LOADERS  32
+stbi_loader *loaders[MAX_LOADERS];
+static int max_loaders = 0;
+
+int stbi_register_loader(stbi_loader *loader)
+{
+   int i;
+   for (i=0; i < MAX_LOADERS; ++i) {
+      // already present?
+      if (loaders[i] == loader)
+         return 1;
+      // end of the list?
+      if (loaders[i] == NULL) {
+         loaders[i] = loader;
+         max_loaders = i+1;
+         return 1;
+      }
+   }
+   // no room for it
+   return 0;
+}
+
+#ifndef STBI_NO_HDR
+static float   *ldr_to_hdr(stbi_uc *data, int x, int y, int comp);
+static stbi_uc *hdr_to_ldr(float   *data, int x, int y, int comp);
+#endif
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_load(char *filename, int *x, int *y, int *comp, int req_comp)
+{
+   FILE *f = fopen(filename, "rb");
+   unsigned char *result;
+   if (!f) return epuc("can't fopen", "Unable to open file");
+   result = stbi_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return result;
+}
+
+unsigned char *stbi_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   int i;
+   if (stbi_jpeg_test_file(f))
+      return stbi_jpeg_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_png_test_file(f))
+      return stbi_png_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_bmp_test_file(f))
+      return stbi_bmp_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_psd_test_file(f))
+      return stbi_psd_load_from_file(f,x,y,comp,req_comp);
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_file(f)) {
+      float *hdr = stbi_hdr_load_from_file(f, x,y,comp,req_comp);
+      return hdr ? hdr_to_ldr(hdr, *x, *y, req_comp ? req_comp : *comp) : NULL;
+   }
+   #endif
+   for (i=0; i < max_loaders; ++i)
+      if (loaders[i]->test_file(f))
+         return loaders[i]->load_from_file(f,x,y,comp,req_comp);
+   // test tga last because it's a crappy test!
+   if (stbi_tga_test_file(f))
+      return stbi_tga_load_from_file(f,x,y,comp,req_comp);
+   return epuc("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+unsigned char *stbi_load_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   int i;
+   if (stbi_jpeg_test_memory(buffer,len))
+      return stbi_jpeg_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_png_test_memory(buffer,len))
+      return stbi_png_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_bmp_test_memory(buffer,len))
+      return stbi_bmp_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_psd_test_memory(buffer,len))
+      return stbi_psd_load_from_memory(buffer,len,x,y,comp,req_comp);
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_memory(buffer, len)) {
+      float *hdr = stbi_hdr_load_from_memory(buffer, len,x,y,comp,req_comp);
+      return hdr ? hdr_to_ldr(hdr, *x, *y, req_comp ? req_comp : *comp) : NULL;
+   }
+   #endif
+   for (i=0; i < max_loaders; ++i)
+      if (loaders[i]->test_memory(buffer,len))
+         return loaders[i]->load_from_memory(buffer,len,x,y,comp,req_comp);
+   // test tga last because it's a crappy test!
+   if (stbi_tga_test_memory(buffer,len))
+      return stbi_tga_load_from_memory(buffer,len,x,y,comp,req_comp);
+   return epuc("unknown image type", "Image not of any known type, or corrupt");
+}
+
+#ifndef STBI_NO_HDR
+
+#ifndef STBI_NO_STDIO
+float *stbi_loadf(char *filename, int *x, int *y, int *comp, int req_comp)
+{
+   FILE *f = fopen(filename, "rb");
+   float *result;
+   if (!f) return epf("can't fopen", "Unable to open file");
+   result = stbi_loadf_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return result;
+}
+
+float *stbi_loadf_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_file(f))
+      return stbi_hdr_load_from_file(f,x,y,comp,req_comp);
+   #endif
+   data = stbi_load_from_file(f, x, y, comp, req_comp);
+   if (data)
+      return ldr_to_hdr(data, *x, *y, req_comp ? req_comp : *comp);
+   return epf("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+float *stbi_loadf_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_memory(buffer, len))
+      return stbi_hdr_load_from_memory(buffer, len,x,y,comp,req_comp);
+   #endif
+   data = stbi_load_from_memory(buffer, len, x, y, comp, req_comp);
+   if (data)
+      return ldr_to_hdr(data, *x, *y, req_comp ? req_comp : *comp);
+   return epf("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+// these is-hdr-or-not functions are defined independently of whether
+// STBI_NO_HDR is defined, for API simplicity; if STBI_NO_HDR is defined,
+// they always report false!
+
+extern int      stbi_is_hdr_from_memory(stbi_uc *buffer, int len)
+{
+   #ifndef STBI_NO_HDR
+   return stbi_hdr_test_memory(buffer, len);
+   #else
+   return 0;
+   #endif
+}
+
+#ifndef STBI_NO_STDIO
+extern int      stbi_is_hdr          (char *filename)
+{
+   FILE *f = fopen(filename, "rb");
+   int result=0;
+   if (f) {
+      result = stbi_is_hdr_from_file(f);
+      fclose(f);
+   }
+   return result;
+}
+
+extern int      stbi_is_hdr_from_file(FILE *f)
+{
+   #ifndef STBI_NO_HDR
+   return stbi_hdr_test_file(f);
+   #else
+   return 0;
+   #endif
+}
+
+#endif
+
+// @TODO: get image dimensions & components without fully decoding
+#ifndef STBI_NO_STDIO
+extern int      stbi_info            (char *filename,           int *x, int *y, int *comp);
+extern int      stbi_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_info_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_HDR
+static float h2l_gamma_i=1.0f/2.2f, h2l_scale_i=1.0f;
+static float l2h_gamma=2.2f, l2h_scale=1.0f;
+
+void   stbi_hdr_to_ldr_gamma(float gamma) { h2l_gamma_i = 1/gamma; }
+void   stbi_hdr_to_ldr_scale(float scale) { h2l_scale_i = 1/scale; }
+
+void   stbi_ldr_to_hdr_gamma(float gamma) { l2h_gamma = gamma; }
+void   stbi_ldr_to_hdr_scale(float scale) { l2h_scale = scale; }
+#endif
+
+
+//////////////////////////////////////////////////////////////////////////////
+//
+// Common code used by all image loaders
+//
+
+// image width, height, # components
+static uint32 img_x, img_y;
+static int img_n, img_out_n;
+
+enum
+{
+   SCAN_load=0,
+   SCAN_type,
+   SCAN_header,
+};
+
+// An API for reading either from memory or file.
+#ifndef STBI_NO_STDIO
+static FILE  *img_file;
+#endif
+static uint8 *img_buffer, *img_buffer_end;
+
+#ifndef STBI_NO_STDIO
+static void start_file(FILE *f)
+{
+   img_file = f;
+}
+#endif
+
+static void start_mem(uint8 *buffer, int len)
+{
+#ifndef STBI_NO_STDIO
+   img_file = NULL;
+#endif
+   img_buffer = buffer;
+   img_buffer_end = buffer+len;
+}
+
+static int get8(void)
+{
+#ifndef STBI_NO_STDIO
+   if (img_file) {
+      int c = fgetc(img_file);
+      return c == EOF ? 0 : c;
+   }
+#endif
+   if (img_buffer < img_buffer_end)
+      return *img_buffer++;
+   return 0;
+}
+
+static int at_eof(void)
+{
+#ifndef STBI_NO_STDIO
+   if (img_file)
+      return feof(img_file);
+#endif
+   return img_buffer >= img_buffer_end;   
+}
+
+static uint8 get8u(void)
+{
+   return (uint8) get8();
+}
+
+static void skip(int n)
+{
+#ifndef STBI_NO_STDIO
+   if (img_file)
+      fseek(img_file, n, SEEK_CUR);
+   else
+#endif
+      img_buffer += n;
+}
+
+static int get16(void)
+{
+   int z = get8();
+   return (z << 8) + get8();
+}
+
+static uint32 get32(void)
+{
+   uint32 z = get16();
+   return (z << 16) + get16();
+}
+
+static int get16le(void)
+{
+   int z = get8();
+   return z + (get8() << 8);
+}
+
+static uint32 get32le(void)
+{
+   uint32 z = get16le();
+   return z + ((uint32) get16le() << 16);
+}
+
+static void getn(stbi_uc *buffer, int n)
+{
+#ifndef STBI_NO_STDIO
+   if (img_file) {
+      fread(buffer, 1, n, img_file);
+      return;
+   }
+#endif
+   memcpy(buffer, img_buffer, n);
+   img_buffer += n;
+}
+
+//////////////////////////////////////////////////////////////////////////////
+//
+//  generic converter from built-in img_n to req_comp
+//    individual types do this automatically as much as possible (e.g. jpeg
+//    does all cases internally since it needs to colorspace convert anyway,
+//    and it never has alpha, so very few cases ). png can automatically
+//    interleave an alpha=255 channel, but falls back to this for other cases
+//
+//  assume data buffer is malloced, so malloc a new one and free that one
+//  only failure mode is malloc failing
+
+static uint8 compute_y(int r, int g, int b)
+{
+   return (uint8) (((r*77) + (g*150) +  (29*b)) >> 8);
+}
+
+static unsigned char *convert_format(unsigned char *data, int img_n, int req_comp)
+{
+   uint i,j;
+   unsigned char *good;
+
+   if (req_comp == img_n) return data;
+   assert(req_comp >= 1 && req_comp <= 4);
+
+   good = (unsigned char *) malloc(req_comp * img_x * img_y);
+   if (good == NULL) {
+      free(data);
+      return epuc("outofmem", "Out of memory");
+   }
+
+   for (j=0; j < img_y; ++j) {
+      unsigned char *src  = data + j * img_x * img_n   ;
+      unsigned char *dest = good + j * img_x * req_comp;
+
+      #define COMBO(a,b)  ((a)*8+(b))
+      #define CASE(a,b)   case COMBO(a,b): for(i=0; i < img_x; ++i, src += a, dest += b)
+
+      // convert source image with img_n components to one with req_comp components;
+      // avoid switch per pixel, so use switch per scanline and massive macros
+      switch(COMBO(img_n, req_comp)) {
+         CASE(1,2) dest[0]=src[0], dest[1]=255; break;
+         CASE(1,3) dest[0]=dest[1]=dest[2]=src[0]; break;
+         CASE(1,4) dest[0]=dest[1]=dest[2]=src[0], dest[3]=255; break;
+         CASE(2,1) dest[0]=src[0]; break;
+         CASE(2,3) dest[0]=dest[1]=dest[2]=src[0]; break;
+         CASE(2,4) dest[0]=dest[1]=dest[2]=src[0], dest[3]=src[1]; break;
+         CASE(3,4) dest[0]=src[0],dest[1]=src[1],dest[2]=src[2],dest[3]=255; break;
+         CASE(3,1) dest[0]=compute_y(src[0],src[1],src[2]); break;
+         CASE(3,2) dest[0]=compute_y(src[0],src[1],src[2]), dest[1] = 255; break;
+         CASE(4,1) dest[0]=compute_y(src[0],src[1],src[2]); break;
+         CASE(4,2) dest[0]=compute_y(src[0],src[1],src[2]), dest[1] = src[3]; break;
+         CASE(4,3) dest[0]=src[0],dest[1]=src[1],dest[2]=src[2]; break;
+         default: assert(0);
+      }
+      #undef CASE
+   }
+
+   free(data);
+   img_out_n = req_comp;
+   return good;
+}
+
+#ifndef STBI_NO_HDR
+static float   *ldr_to_hdr(stbi_uc *data, int x, int y, int comp)
+{
+   int i,k,n;
+   float *output = (float *) malloc(x * y * comp * sizeof(float));
+   if (output == NULL) { free(data); return epf("outofmem", "Out of memory"); }
+   // compute number of non-alpha components
+   if (comp & 1) n = comp; else n = comp-1;
+   for (i=0; i < x*y; ++i) {
+      for (k=0; k < n; ++k) {
+         output[i*comp + k] = (float) pow(data[i*comp+k]/255.0f, l2h_gamma) * l2h_scale;
+      }
+      if (k < comp) output[i*comp + k] = data[i*comp+k]/255.0f;
+   }
+   free(data);
+   return output;
+}
+
+#define float2int(x)   ((int) (x))
+static stbi_uc *hdr_to_ldr(float   *data, int x, int y, int comp)
+{
+   int i,k,n;
+   stbi_uc *output = (stbi_uc *) malloc(x * y * comp);
+   if (output == NULL) { free(data); return epuc("outofmem", "Out of memory"); }
+   // compute number of non-alpha components
+   if (comp & 1) n = comp; else n = comp-1;
+   for (i=0; i < x*y; ++i) {
+      for (k=0; k < n; ++k) {
+         float z = (float) pow(data[i*comp+k]*h2l_scale_i, h2l_gamma_i) * 255 + 0.5f;
+         if (z < 0) z = 0;
+         if (z > 255) z = 255;
+         output[i*comp + k] = float2int(z);
+      }
+      if (k < comp) {
+         float z = data[i*comp+k] * 255 + 0.5f;
+         if (z < 0) z = 0;
+         if (z > 255) z = 255;
+         output[i*comp + k] = float2int(z);
+      }
+   }
+   free(data);
+   return output;
+}
+#endif
+
+//////////////////////////////////////////////////////////////////////////////
+//
+//  "baseline" JPEG/JFIF decoder (not actually fully baseline implementation)
+//
+//    simple implementation
+//      - channel subsampling of at most 2 in each dimension
+//      - doesn't support delayed output of y-dimension
+//      - simple interface (only one output format: 8-bit interleaved RGB)
+//      - doesn't try to recover corrupt jpegs
+//      - doesn't allow partial loading, loading multiple at once
+//      - still fast on x86 (copying globals into locals doesn't help x86)
+//      - allocates lots of intermediate memory (full size of all components)
+//        - non-interleaved case requires this anyway
+//        - allows good upsampling (see next)
+//    high-quality
+//      - upsampled channels are bilinearly interpolated, even across blocks
+//      - quality integer IDCT derived from IJG's 'slow'
+//    performance
+//      - fast huffman; reasonable integer IDCT
+//      - uses a lot of intermediate memory, could cache poorly
+//      - load http://nothings.org/remote/anemones.jpg 3 times on 2.8GHz P4
+//          stb_jpeg:   1.34 seconds (MSVC6, default release build)
+//          stb_jpeg:   1.06 seconds (MSVC6, processor = Pentium Pro)
+//          IJL11.dll:  1.08 seconds (compiled by intel)
+//          IJG 1998:   0.98 seconds (MSVC6, makefile provided by IJG)
+//          IJG 1998:   0.95 seconds (MSVC6, makefile + proc=PPro)
+
+int stbi_jpeg_dc_only;
+
+// huffman decoding acceleration
+#define FAST_BITS   9  // larger handles more cases; smaller stomps less cache
+
+typedef struct
+{
+   uint8  fast[1 << FAST_BITS];
+   // weirdly, repacking this into AoS is a 10% speed loss, instead of a win
+   uint16 code[256];
+   uint8  values[256];
+   uint8  size[257];
+   unsigned int maxcode[18];
+   int    delta[17];   // old 'firstsymbol' - old 'firstcode'
+} huffman;
+
+static huffman huff_dc[4];  // baseline is 2 tables, extended is 4
+static huffman huff_ac[4];
+static uint8 dequant[4][64];
+
+static int build_huffman(huffman *h, int *count)
+{
+   int i,j,k=0,code;
+   // build size list for each symbol (from JPEG spec)
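+   for (i=0; i < 16; ++i)
+      for (j=0; j < count[i]; ++j) {
+         // a corrupt DHT segment can claim more than 256 symbols; don't overrun size[]
+         if (k >= 256) return e("bad size list","Corrupt JPEG");
+         h->size[k++] = (uint8) (i+1);
+      }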
+   h->size[k] = 0;
+
+   // compute actual symbols (from jpeg spec)
+   code = 0;
+   k = 0;
+   for(j=1; j <= 16; ++j) {
+      // compute delta to add to code to compute symbol id
+      h->delta[j] = k - code;
+      if (h->size[k] == j) {
+         while (h->size[k] == j)
+            h->code[k++] = (uint16) (code++);
+         if (code-1 >= (1 << j)) return e("bad code lengths","Corrupt JPEG");
+      }
+      // compute largest code + 1 for this size, preshifted as needed later
+      h->maxcode[j] = code << (16-j);
+      code <<= 1;
+   }
+   h->maxcode[j] = 0xffffffff;
+
+   // build non-spec acceleration table; 255 is flag for not-accelerated
+   memset(h->fast, 255, 1 << FAST_BITS);
+   for (i=0; i < k; ++i) {
+      int s = h->size[i];
+      if (s <= FAST_BITS) {
+         int c = h->code[i] << (FAST_BITS-s);
+         int m = 1 << (FAST_BITS-s);
+         for (j=0; j < m; ++j) {
+            h->fast[c+j] = (uint8) i;
+         }
+      }
+   }
+   return 1;
+}
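The code-assignment loop in `build_huffman` follows the canonical Huffman rule: codes of the same length are consecutive integers, and the running code is shifted left by one bit each time the length grows. The following is a minimal standalone sketch of that step only. The name `assign_canonical_codes` and its signature are hypothetical (not part of stb_image), and the `fast`, `maxcode` and `delta` tables are left out.

```c
#include <assert.h>

/* Hypothetical helper, not part of stb_image.
 * counts[i] is the number of codes of length i+1 (as in a JPEG DHT segment).
 * Writes one (code, size) pair per symbol, in order of increasing length.
 * Returns the number of symbols, or -1 if the length counts are impossible. */
int assign_canonical_codes(const int counts[16],
                           unsigned short codes[256],
                           unsigned char sizes[256])
{
   int len, j, k = 0;
   unsigned int code = 0;
   assert(counts && codes && sizes);
   for (len = 1; len <= 16; ++len) {
      for (j = 0; j < counts[len-1]; ++j) {
         if (k >= 256) return -1;               /* too many symbols */
         if (code >= (1u << len)) return -1;    /* code does not fit in len bits */
         sizes[k] = (unsigned char) len;
         codes[k] = (unsigned short) code;
         ++k;
         ++code;
      }
      code <<= 1;   /* moving to the next length appends one bit */
   }
   return k;
}
```

For two codes of length 2 and three of length 3, this should assign `00`, `01`, `100`, `101`, `110`.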
+
+// sizes for components, interleaved MCUs
+static int img_h_max, img_v_max;
+static int img_mcu_x, img_mcu_y;
+static int img_mcu_w, img_mcu_h;
+
+// definition of jpeg image component
+static struct
+{
+   int id;
+   int h,v;
+   int tq;
+   int hd,ha;
+   int dc_pred;
+
+   int x,y,w2,h2;
+   uint8 *data;
+} img_comp[4];
+
+static unsigned long  code_buffer; // jpeg entropy-coded buffer
+static int            code_bits;   // number of valid bits
+static unsigned char  marker;      // marker seen while filling entropy buffer
+static int            nomore;      // flag if we saw a marker so must stop
+ 
+static void grow_buffer_unsafe(void)
+{
+   do {
+      int b = nomore ? 0 : get8();
+      if (b == 0xff) {
+         int c = get8();
+         if (c != 0) {
+            marker = (unsigned char) c;
+            nomore = 1;
+            return;
+         }
+      }
+      code_buffer = (code_buffer << 8) | b;
+      code_bits += 8;
+   } while (code_bits <= 24);
+}
+
+// (1 << n) - 1
+static unsigned long bmask[17]={0,1,3,7,15,31,63,127,255,511,1023,2047,4095,8191,16383,32767,65535};
+
+// decode a jpeg huffman value from the bitstream
+__forceinline static int decode(huffman *h)
+{
+   unsigned int temp;
+   int c,k;
+
+   if (code_bits < 16) grow_buffer_unsafe();
+
+   // look at the top FAST_BITS and determine what symbol ID it is,
+   // if the code is <= FAST_BITS
+   c = (code_buffer >> (code_bits - FAST_BITS)) & ((1 << FAST_BITS)-1);
+   k = h->fast[c];
+   if (k < 255) {
+      if (h->size[k] > code_bits)
+         return -1;
+      code_bits -= h->size[k];
+      return h->values[k];
+   }
+
+   // naive test is to shift the code_buffer down so k bits are
+   // valid, then test against maxcode. To speed this up, we've
+   // preshifted maxcode left so that it has (16-k) 0s at the
+   // end; in other words, regardless of the number of bits, it
+   // wants to be compared against something shifted to have 16;
+   // that way we don't need to shift inside the loop.
+   if (code_bits < 16)
+      temp = (code_buffer << (16 - code_bits)) & 0xffff;
+   else
+      temp = (code_buffer >> (code_bits - 16)) & 0xffff;
+   for (k=FAST_BITS+1 ; ; ++k)
+      if (temp < h->maxcode[k])
+         break;
+   if (k == 17) {
+      // error! code not found
+      code_bits -= 16;
+      return -1;
+   }
+
+   if (k > code_bits)
+      return -1;
+
+   // convert the huffman code to the symbol id
+   c = ((code_buffer >> (code_bits - k)) & bmask[k]) + h->delta[k];
+   assert((((code_buffer) >> (code_bits - h->size[c])) & bmask[h->size[c]]) == h->code[c]);
+
+   // convert the id to a symbol
+   code_bits -= k;
+   return h->values[c];
+}
+
+// combined JPEG 'receive' and JPEG 'extend', since baseline
+// always extends everything it receives.
+__forceinline static int extend_receive(int n)
+{
+   unsigned int m = 1 << (n-1);
+   unsigned int k;
+   if (code_bits < n) grow_buffer_unsafe();
+   k = (code_buffer >> (code_bits - n)) & bmask[n];
+   code_bits -= n;
+   // the following test is probably a random branch that won't
+   // predict well. I tried to table accelerate it but failed.
+   // maybe it's compiling as a conditional move?
+   if (k < m)
+      return (int) k - (1 << n) + 1;   // same value as (-1 << n) + k + 1, without left-shifting a negative
+   else
+      return k;
+}
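The sign-extension rule used by `extend_receive` can be seen in isolation: an `n`-bit raw value whose top bit is clear stands for a negative coefficient, and one whose top bit is set stands for itself. A minimal sketch, assuming the bits have already been pulled from the stream; the name `jpeg_extend` is hypothetical and is not a function in this file.

```c
#include <assert.h>

/* Hypothetical helper: map an n-bit raw value v to a signed coefficient.
 * Values in the lower half of the range [0, 2^n) are negative:
 *   v in [0, 2^(n-1))     ->  v - 2^n + 1   (i.e. -(2^n - 1) .. -2^(n-1))
 *   v in [2^(n-1), 2^n)   ->  v */
int jpeg_extend(unsigned int v, int n)
{
   assert(n >= 1 && n <= 16);
   assert(v < (1u << n));
   if (v < (1u << (n - 1)))
      return (int) v - (1 << n) + 1;
   return (int) v;
}
```

With `n = 3`, the raw values 0..3 map to -7..-4 and 4..7 map to themselves.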
+
+// given a value that's at position X in the zigzag stream,
+// where does it appear in the 8x8 matrix coded as row-major?
+static uint8 dezigzag[64+15] =
+{
+    0,  1,  8, 16,  9,  2,  3, 10,
+   17, 24, 32, 25, 18, 11,  4,  5,
+   12, 19, 26, 33, 40, 48, 41, 34,
+   27, 20, 13,  6,  7, 14, 21, 28,
+   35, 42, 49, 56, 57, 50, 43, 36,
+   29, 22, 15, 23, 30, 37, 44, 51,
+   58, 59, 52, 45, 38, 31, 39, 46,
+   53, 60, 61, 54, 47, 55, 62, 63,
+   // let corrupt input sample past end
+   63, 63, 63, 63, 63, 63, 63, 63,
+   63, 63, 63, 63, 63, 63, 63
+};
+
+// decode one 64-entry block: the DC difference, then run-length coded AC values
+static int decode_block(short data[64], huffman *hdc, huffman *hac, int b)
+{
+   int diff,dc,k;
+   int t = decode(hdc);
+   if (t < 0) return e("bad huffman code","Corrupt JPEG");
+
+   // 0 all the ac values now so we can do it 32-bits at a time
+   memset(data,0,64*sizeof(data[0]));
+
+   diff = t ? extend_receive(t) : 0;
+   dc = img_comp[b].dc_pred + diff;
+   img_comp[b].dc_pred = dc;
+   data[0] = (short) dc;
+
+   // decode AC components, see JPEG spec
+   k = 1;
+   do {
+      int r,s;
+      int rs = decode(hac);
+      if (rs < 0) return e("bad huffman code","Corrupt JPEG");
+      s = rs & 15;
+      r = rs >> 4;
+      if (s == 0) {
+         if (rs != 0xf0) break; // end block
+         k += 16;
+      } else {
+         k += r;
+         // decode into unzigzag'd location
+         data[dezigzag[k++]] = (short) extend_receive(s);
+      }
+   } while (k < 64);
+   return 1;
+}
+
+// take a -128..127 value and clamp it and convert to 0..255
+__forceinline static uint8 clamp(int x)
+{
+   x += 128;
+   // trick to use a single test to catch both cases
+   if ((unsigned int) x > 255) {
+      if (x < 0) return 0;
+      if (x > 255) return 255;
+   }
+   return (uint8) x;
+}
+
+#define f2f(x)  ((int) (((x) * 4096 + 0.5)))
+#define fsh(x)  ((x) << 12)
+
+// derived from jidctint -- DCT_ISLOW
+#define IDCT_1D(s0,s1,s2,s3,s4,s5,s6,s7)       \
+   int t0,t1,t2,t3,p1,p2,p3,p4,p5,x0,x1,x2,x3; \
+   p2 = s2;                                    \
+   p3 = s6;                                    \
+   p1 = (p2+p3) * f2f(0.5411961f);             \
+   t2 = p1 + p3*f2f(-1.847759065f);            \
+   t3 = p1 + p2*f2f( 0.765366865f);            \
+   p2 = s0;                                    \
+   p3 = s4;                                    \
+   t0 = fsh(p2+p3);                            \
+   t1 = fsh(p2-p3);                            \
+   x0 = t0+t3;                                 \
+   x3 = t0-t3;                                 \
+   x1 = t1+t2;                                 \
+   x2 = t1-t2;                                 \
+   t0 = s7;                                    \
+   t1 = s5;                                    \
+   t2 = s3;                                    \
+   t3 = s1;                                    \
+   p3 = t0+t2;                                 \
+   p4 = t1+t3;                                 \
+   p1 = t0+t3;                                 \
+   p2 = t1+t2;                                 \
+   p5 = (p3+p4)*f2f( 1.175875602f);            \
+   t0 = t0*f2f( 0.298631336f);                 \
+   t1 = t1*f2f( 2.053119869f);                 \
+   t2 = t2*f2f( 3.072711026f);                 \
+   t3 = t3*f2f( 1.501321110f);                 \
+   p1 = p5 + p1*f2f(-0.899976223f);            \
+   p2 = p5 + p2*f2f(-2.562915447f);            \
+   p3 = p3*f2f(-1.961570560f);                 \
+   p4 = p4*f2f(-0.390180644f);                 \
+   t3 += p1+p4;                                \
+   t2 += p2+p3;                                \
+   t1 += p2+p4;                                \
+   t0 += p1+p3;
+
+// .344 seconds on 3*anemones.jpg
+static void idct_block(uint8 *out, int out_stride, short data[64], uint8 *dequantize)
+{
+   int i,val[64],*v=val;
+   uint8 *o,*dq = dequantize;
+   short *d = data;
+
+   if (stbi_jpeg_dc_only) {
+      // ok, I don't really know why this is right, but it seems to be:
+      int z = 128 + ((d[0] * dq[0]) >> 3);
+      for (i=0; i < 8; ++i) {
+         out[0] = out[1] = out[2] = out[3] = out[4] = out[5] = out[6] = out[7] = z;
+         out += out_stride;
+      }
+      return;
+   }
+
+   // columns
+   for (i=0; i < 8; ++i,++d,++dq, ++v) {
+      // if all zeroes, shortcut -- this avoids dequantizing 0s and IDCTing
+      if (d[ 8]==0 && d[16]==0 && d[24]==0 && d[32]==0
+           && d[40]==0 && d[48]==0 && d[56]==0) {
+         //    no shortcut                 0     seconds
+         //    (1|2|3|4|5|6|7)==0          0     seconds
+         //    all separate               -0.047 seconds
+         //    1 && 2|3 && 4|5 && 6|7:    -0.047 seconds
+         int dcterm = d[0] * dq[0] << 2;
+         v[0] = v[8] = v[16] = v[24] = v[32] = v[40] = v[48] = v[56] = dcterm;
+      } else {
+         IDCT_1D(d[ 0]*dq[ 0],d[ 8]*dq[ 8],d[16]*dq[16],d[24]*dq[24],
+                 d[32]*dq[32],d[40]*dq[40],d[48]*dq[48],d[56]*dq[56])
+         // constants scaled things up by 1<<12; let's bring them back
+         // down, but keep 2 extra bits of precision
+         x0 += 512; x1 += 512; x2 += 512; x3 += 512;
+         v[ 0] = (x0+t3) >> 10;
+         v[56] = (x0-t3) >> 10;
+         v[ 8] = (x1+t2) >> 10;
+         v[48] = (x1-t2) >> 10;
+         v[16] = (x2+t1) >> 10;
+         v[40] = (x2-t1) >> 10;
+         v[24] = (x3+t0) >> 10;
+         v[32] = (x3-t0) >> 10;
+      }
+   }
+
+   for (i=0, v=val, o=out; i < 8; ++i,v+=8,o+=out_stride) {
+      // no fast case since the first 1D IDCT spread components out
+      IDCT_1D(v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7])
+      // constants scaled things up by 1<<12, plus we had 1<<2 from first
+      // loop, plus horizontal and vertical each scale by sqrt(8) so together
+      // we've got an extra 1<<3, so 1<<17 total we need to remove.
+      x0 += 65536; x1 += 65536; x2 += 65536; x3 += 65536;
+      o[0] = clamp((x0+t3) >> 17);
+      o[7] = clamp((x0-t3) >> 17);
+      o[1] = clamp((x1+t2) >> 17);
+      o[6] = clamp((x1-t2) >> 17);
+      o[2] = clamp((x2+t1) >> 17);
+      o[5] = clamp((x2-t1) >> 17);
+      o[3] = clamp((x3+t0) >> 17);
+      o[4] = clamp((x3-t0) >> 17);
+   }
+}
+
+#define MARKER_none  0xff
+// if there's a pending marker from the entropy stream, return that
+// otherwise, fetch from the stream and get a marker. if there's no
+// marker, return 0xff, which is never a valid marker value
+static uint8 get_marker(void)
+{
+   uint8 x;
+   if (marker != MARKER_none) { x = marker; marker = MARKER_none; return x; }
+   x = get8u();
+   if (x != 0xff) return MARKER_none;
+   while (x == 0xff)
+      x = get8u();
+   return x;
+}
+
+// in each scan, we'll have scan_n components, and the order
+// of the components is specified by order[]
+static int scan_n, order[4];
+static int restart_interval, todo;
+#define RESTART(x)     ((x) >= 0xd0 && (x) <= 0xd7)
+
+// after a restart interval, reset the entropy decoder and
+// the dc prediction
+static void reset(void)
+{
+   code_bits = 0;
+   code_buffer = 0;
+   nomore = 0;
+   img_comp[0].dc_pred = img_comp[1].dc_pred = img_comp[2].dc_pred = 0;
+   marker = MARKER_none;
+   todo = restart_interval ? restart_interval : 0x7fffffff;
+   // no more than 1<<31 MCUs if no restart_interval? that's plenty safe,
+   // since we don't even allow 1<<30 pixels
+}
+
+static int parse_entropy_coded_data(void)
+{
+   reset();
+   if (scan_n == 1) {
+      int i,j;
+      short data[64];
+      int n = order[0];
+      // non-interleaved data, we just need to process one block at a time,
+      // in trivial scanline order
+      // number of blocks to do just depends on how many actual "pixels" this
+      // component has, independent of interleaved MCU blocking and such
+      int w = (img_comp[n].x+7) >> 3;
+      int h = (img_comp[n].y+7) >> 3;
+      for (j=0; j < h; ++j) {
+         for (i=0; i < w; ++i) {
+            if (!decode_block(data, huff_dc+img_comp[n].hd, huff_ac+img_comp[n].ha, n)) return 0;
+            idct_block(img_comp[n].data+img_comp[n].w2*j*8+i*8, img_comp[n].w2, data, dequant[img_comp[n].tq]);
+            // every data block is an MCU, so countdown the restart interval
+            if (--todo <= 0) {
+               if (code_bits < 24) grow_buffer_unsafe();
+               // if it's NOT a restart, then just bail, so we get corrupt data
+               // rather than no data
+               if (!RESTART(marker)) return 1;
+               reset();
+            }
+         }
+      }
+   } else { // interleaved!
+      int i,j,k,x,y;
+      short data[64];
+      for (j=0; j < img_mcu_y; ++j) {
+         for (i=0; i < img_mcu_x; ++i) {
+            // scan an interleaved mcu... process scan_n components in order
+            for (k=0; k < scan_n; ++k) {
+               int n = order[k];
+               // scan out an mcu's worth of this component; that's just determined
+               // by the basic H and V specified for the component
+               for (y=0; y < img_comp[n].v; ++y) {
+                  for (x=0; x < img_comp[n].h; ++x) {
+                     int x2 = (i*img_comp[n].h + x)*8;
+                     int y2 = (j*img_comp[n].v + y)*8;
+                     if (!decode_block(data, huff_dc+img_comp[n].hd, huff_ac+img_comp[n].ha, n)) return 0;
+                     idct_block(img_comp[n].data+img_comp[n].w2*y2+x2, img_comp[n].w2, data, dequant[img_comp[n].tq]);
+                  }
+               }
+            }
+            // after all interleaved components, that's an interleaved MCU,
+            // so now count down the restart interval
+            if (--todo <= 0) {
+               if (code_bits < 24) grow_buffer_unsafe();
+               // if it's NOT a restart, then just bail, so we get corrupt data
+               // rather than no data
+               if (!RESTART(marker)) return 1;
+               reset();
+            }
+         }
+      }
+   }
+   return 1;
+}
+
+static int process_marker(int m)
+{
+   int L;
+   switch (m) {
+      case MARKER_none: // no marker found
+         return e("expected marker","Corrupt JPEG");
+
+      case 0xC2: // SOF - progressive
+         return e("progressive jpeg","JPEG format not supported (progressive)");
+
+      case 0xDD: // DRI - specify restart interval
+         if (get16() != 4) return e("bad DRI len","Corrupt JPEG");
+         restart_interval = get16();
+         return 1;
+
+      case 0xDB: // DQT - define quantization table
+         L = get16()-2;
+         while (L > 0) {
+            int z = get8();
+            int p = z >> 4;
+            int t = z & 15,i;
+            if (p != 0) return e("bad DQT type","Corrupt JPEG");
+            if (t > 3) return e("bad DQT table","Corrupt JPEG");
+            for (i=0; i < 64; ++i)
+               dequant[t][dezigzag[i]] = get8u();
+            L -= 65;
+         }
+         return L==0;
+
+      case 0xC4: // DHT - define huffman table
+         L = get16()-2;
+         while (L > 0) {
+            uint8 *v;
+            int sizes[16],i,m=0;
+            int z = get8();
+            int tc = z >> 4;
+            int th = z & 15;
+            if (tc > 1 || th > 3) return e("bad DHT header","Corrupt JPEG");
+            for (i=0; i < 16; ++i) {
+               sizes[i] = get8();
+               m += sizes[i];
+            }
+            L -= 17;
+            if (tc == 0) {
+               if (!build_huffman(huff_dc+th, sizes)) return 0;
+               v = huff_dc[th].values;
+            } else {
+               if (!build_huffman(huff_ac+th, sizes)) return 0;
+               v = huff_ac[th].values;
+            }
+            for (i=0; i < m; ++i)
+               v[i] = get8u();
+            L -= m;
+         }
+         return L==0;
+   }
+   // check for comment block or APP blocks
+   if ((m >= 0xE0 && m <= 0xEF) || m == 0xFE) {
+      skip(get16()-2);
+      return 1;
+   }
+   return 0;
+}
+
+// after we see SOS
+static int process_scan_header(void)
+{
+   int i;
+   int Ls = get16();
+   scan_n = get8();
+   if (scan_n < 1 || scan_n > 4 || scan_n > (int) img_n) return e("bad SOS component count","Corrupt JPEG");
+   if (Ls != 6+2*scan_n) return e("bad SOS len","Corrupt JPEG");
+   for (i=0; i < scan_n; ++i) {
+      int id = get8(), which;
+      int z = get8();
+      for (which = 0; which < img_n; ++which)
+         if (img_comp[which].id == id)
+            break;
+      if (which == img_n) return e("bad component id","Corrupt JPEG");
+      img_comp[which].hd = z >> 4;   if (img_comp[which].hd > 3) return e("bad DC huff","Corrupt JPEG");
+      img_comp[which].ha = z & 15;   if (img_comp[which].ha > 3) return e("bad AC huff","Corrupt JPEG");
+      order[i] = which;
+   }
+   if (get8() != 0) return e("bad SOS","Corrupt JPEG");
+   get8(); // should be 63, but might be 0
+   if (get8() != 0) return e("bad SOS","Corrupt JPEG");
+
+   return 1;
+}
+
+static int process_frame_header(int scan)
+{
+   int Lf,p,i,z, h_max=1,v_max=1;
+   Lf = get16();         if (Lf < 11) return e("bad SOF len","Corrupt JPEG"); // JPEG
+   p  = get8();          if (p != 8) return e("only 8-bit","JPEG format not supported: 8-bit only"); // JPEG baseline
+   img_y = get16();      if (img_y == 0) return e("no header height", "JPEG format not supported: delayed height"); // Legal, but we don't handle it--but neither does IJG
+   img_x = get16();      if (img_x == 0) return e("0 width","Corrupt JPEG"); // JPEG requires
+   img_n = get8();
+   if (img_n != 3 && img_n != 1) return e("bad component count","Corrupt JPEG");    // JFIF requires
+
+   if (Lf != 8+3*img_n) return e("bad SOF len","Corrupt JPEG");
+
+   for (i=0; i < img_n; ++i) {
+      img_comp[i].id = get8();
+      if (img_comp[i].id != i+1)   // JFIF requires
+         if (img_comp[i].id != i)  // jpegtran outputs non-JFIF-compliant files!
+            return e("bad component ID","Corrupt JPEG");
+      z = get8();
+      img_comp[i].h = (z >> 4);  if (!img_comp[i].h || img_comp[i].h > 4) return e("bad H","Corrupt JPEG");
+      img_comp[i].v = z & 15;    if (!img_comp[i].v || img_comp[i].v > 4) return e("bad V","Corrupt JPEG");
+      img_comp[i].tq = get8();   if (img_comp[i].tq > 3) return e("bad TQ","Corrupt JPEG");
+   }
+
+   if (scan != SCAN_load) return 1;
+
+   if ((1 << 30) / img_x / img_n < img_y) return e("too large", "Image too large to decode");
+
+   for (i=0; i < img_n; ++i) {
+      if (img_comp[i].h > h_max) h_max = img_comp[i].h;
+      if (img_comp[i].v > v_max) v_max = img_comp[i].v;
+   }
+
+   // compute interleaved mcu info
+   img_h_max = h_max;
+   img_v_max = v_max;
+   img_mcu_w = h_max * 8;
+   img_mcu_h = v_max * 8;
+   img_mcu_x = (img_x + img_mcu_w-1) / img_mcu_w;
+   img_mcu_y = (img_y + img_mcu_h-1) / img_mcu_h;
+
+   for (i=0; i < img_n; ++i) {
+      // number of effective pixels (e.g. for non-interleaved MCU)
+      img_comp[i].x = (img_x * img_comp[i].h + h_max-1) / h_max;
+      img_comp[i].y = (img_y * img_comp[i].v + v_max-1) / v_max;
+      // to simplify generation, we'll allocate enough memory to decode
+      // the bogus oversized data from using interleaved MCUs and their
+      // big blocks (e.g. a 16x16 iMCU on an image of width 33); we won't
+      // discard the extra data until colorspace conversion
+      img_comp[i].w2 = img_mcu_x * img_comp[i].h * 8;
+      img_comp[i].h2 = img_mcu_y * img_comp[i].v * 8;
+      img_comp[i].data = (uint8 *) malloc(img_comp[i].w2 * img_comp[i].h2);
+      if (img_comp[i].data == NULL) {
+         for(--i; i >= 0; --i) {
+            free(img_comp[i].data);
+            img_comp[i].data = NULL; // cleanup_jpeg() runs after a failed load; avoid a double free
+         }
+         return e("outofmem", "Out of memory");
+      }
+   }
+
+   return 1;
+}
+
+// use comparisons since in some cases we handle more than one case (e.g. SOF)
+#define DNL(x)         ((x) == 0xdc)
+#define SOI(x)         ((x) == 0xd8)
+#define EOI(x)         ((x) == 0xd9)
+#define SOF(x)         ((x) == 0xc0 || (x) == 0xc1)
+#define SOS(x)         ((x) == 0xda)
+
+static int decode_jpeg_header(int scan)
+{
+   int m;
+   marker = MARKER_none; // initialize cached marker to empty
+   m = get_marker();
+   if (!SOI(m)) return e("no SOI","Corrupt JPEG");
+   if (scan == SCAN_type) return 1;
+   m = get_marker();
+   while (!SOF(m)) {
+      if (!process_marker(m)) return 0;
+      m = get_marker();
+      while (m == MARKER_none) {
+         // some files have extra padding after their blocks, so ok, we'll scan
+         if (at_eof()) return e("no SOF", "Corrupt JPEG");
+         m = get_marker();
+      }
+   }
+   if (!process_frame_header(scan)) return 0;
+   return 1;
+}
+
+static int decode_jpeg_image(void)
+{
+   int m;
+   restart_interval = 0;
+   if (!decode_jpeg_header(SCAN_load)) return 0;
+   m = get_marker();
+   while (!EOI(m)) {
+      if (SOS(m)) {
+         if (!process_scan_header()) return 0;
+         if (!parse_entropy_coded_data()) return 0;
+      } else {
+         if (!process_marker(m)) return 0;
+      }
+      m = get_marker();
+   }
+   return 1;
+}
+
+// static jfif-centered resampling with cross-block smoothing
+// here by cross-block smoothing what I mean is that the resampling
+// is bilerp and crosses blocks; I dunno what IJG means
+
+#define div4(x) ((uint8) ((x) >> 2))
+
+static void resample_v_2(uint8 *out1, uint8 *input, int w, int h, int s)
+{
+   // need to generate two samples vertically for every one in input
+   uint8 *above;
+   uint8 *below;
+   uint8 *source;
+   uint8 *out2;
+   int i,j;
+   source = input;
+   out2 = out1+w;
+   for (j=0; j < h; ++j) {
+      above = source;
+      source = input + j*s;
+      below = source + s; if (j == h-1) below = source;
+      for (i=0; i < w; ++i) {
+         int n = source[i]*3;
+         out1[i] = div4(above[i] + n);
+         out2[i] = div4(below[i] + n);
+      }
+      out1 += w*2;
+      out2 += w*2;
+   }
+}
+
+static void resample_h_2(uint8 *out, uint8 *input, int w, int h, int s)
+{
+   // need to generate two samples horizontally for every one in input
+   int i,j;
+   if (w == 1) {
+      for (j=0; j < h; ++j)
+         out[j*2+0] = out[j*2+1] = input[j*s];
+      return;
+   }
+   for (j=0; j < h; ++j) {
+      out[0] = input[0];
+      out[1] = div4(input[0]*3 + input[1]);
+      for (i=1; i < w-1; ++i) {
+         int n = input[i]*3;
+         out[i*2+0] = div4(input[i-1] + n);
+         out[i*2+1] = div4(input[i+1] + n);
+      }
+      out[w*2-2] = div4(input[w-2]*3 + input[w-1]);
+      out[w*2-1] = input[w-1];
+      out += w*2;
+      input += s;
+   }
+}
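The 2x resamplers above weight the nearer input sample 3/4 and the farther one 1/4. A minimal one-row sketch of that filter, assuming edge samples are handled by repeating the border value (which differs slightly from the edge handling in `resample_h_2`); the name `upsample_row_2x` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: double the width of one row with a 3:1 linear filter.
 * out must have room for 2*w bytes. At the borders the missing neighbour is
 * replaced by the border sample itself, so out[0] == in[0] and
 * out[2*w-1] == in[w-1]. */
void upsample_row_2x(unsigned char *out, const unsigned char *in, int w)
{
   int i;
   assert(w >= 1);
   for (i = 0; i < w; ++i) {
      int left  = in[i > 0     ? i - 1 : i];
      int right = in[i < w - 1 ? i + 1 : i];
      int n = in[i] * 3;                       /* near sample, weight 3 */
      out[i*2 + 0] = (unsigned char) ((n + left)  >> 2);
      out[i*2 + 1] = (unsigned char) ((n + right) >> 2);
   }
}
```

Each input sample `i` produces the output pair `2*i` and `2*i+1`, so the whole output row is covered.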
+
+// .172 seconds on 3*anemones.jpg
+static void resample_hv_2(uint8 *out, uint8 *input, int w, int h, int s)
+{
+   // need to generate 2x2 samples for every one in input
+   int i,j;
+   int os = w*2;
+   // generate edge samples... @TODO lerp them!
+   // input rows are s bytes apart (s can be larger than w)
+   for (i=0; i < w; ++i) {
+      out[i*2+0] = out[i*2+1] = input[i];
+      out[i*2+(2*h-1)*os+0] = out[i*2+(2*h-1)*os+1] = input[i+(h-1)*s];
+   }
+   for (j=0; j < h; ++j) {
+      out[j*os*2+0] = out[j*os*2+os+0] = input[j*s];
+      out[j*os*2+os-1] = out[j*os*2+os+os-1] = input[j*s+w-1];
+   }
+   // now generate interior samples; i & j point to top left of input
+   for (j=0; j < h-1; ++j) {
+      uint8 *in1 = input+j*s;
+      uint8 *in2 = in1 + s;
+      uint8 *out1 = out + (j*2+1)*os + 1;
+      uint8 *out2 = out1 + os;
+      for (i=0; i < w-1; ++i) {
+         int p00 = in1[0], p01=in1[1], p10=in2[0], p11=in2[1];
+         int p00_3 = p00*3, p01_3 = p01*3, p10_3 = p10*3, p11_3 = p11*3;
+
+         #define div16(x)  ((uint8) ((x) >> 4))
+
+         out1[0] = div16(p00*9 + p01_3 + p10_3 + p11);
+         out1[1] = div16(p01*9 + p00_3 + p11_3 + p10);
+         out2[0] = div16(p10*9 + p11_3 + p00_3 + p01);
+         out2[1] = div16(p11*9 + p10_3 + p01_3 + p00);
+         out1 += 2;
+         out2 += 2;                                                         
+         ++in1;
+         ++in2;
+      }
+   }
+}
+
+#define float2fixed(x)  ((int) ((x) * 65536 + 0.5))
+
+// 0.38 seconds on 3*anemones.jpg   (0.25 with processor = Pro)
+// VC6 without processor=Pro is generating multiple LEAs per multiply!
+static void YCbCr_to_RGB_row(uint8 *out, uint8 *y, uint8 *pcb, uint8 *pcr, int count, int step)
+{
+   int i;
+   for (i=0; i < count; ++i) {
+      int y_fixed = (y[i] << 16) + 32768; // rounding
+      int r,g,b;
+      int cr = pcr[i] - 128;
+      int cb = pcb[i] - 128;
+      r = y_fixed + cr*float2fixed(1.40200f);
+      g = y_fixed - cr*float2fixed(0.71414f) - cb*float2fixed(0.34414f);
+      b = y_fixed                            + cb*float2fixed(1.77200f);
+      r >>= 16;
+      g >>= 16;
+      b >>= 16;
+      if ((unsigned) r > 255) { if (r < 0) r = 0; else r = 255; }
+      if ((unsigned) g > 255) { if (g < 0) g = 0; else g = 255; }
+      if ((unsigned) b > 255) { if (b < 0) b = 0; else b = 255; }
+      out[0] = (uint8)r;
+      out[1] = (uint8)g;
+      out[2] = (uint8)b;
+      if (step == 4) out[3] = 255;
+      out += step;
+   }
+}
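The row converter above works in 16.16 fixed point: constants are scaled by 65536, half a unit is added for rounding, and the result is shifted back down. A minimal single-pixel sketch of the same arithmetic, using the coefficients from `YCbCr_to_RGB_row`. The names `FIXED16`, `fixed16_to_byte` and `ycbcr_to_rgb_pixel` are hypothetical, and the clamp here tests for negatives before shifting, which is an assumption made for portability rather than the original's approach.

```c
#include <assert.h>

/* Hypothetical names; same scaling idea as float2fixed() in the source. */
#define FIXED16(x) ((int) ((x) * 65536 + 0.5))

/* Convert a 16.16 fixed-point value to a byte, clamping to 0..255.
 * Negative values are rejected before the shift so that no negative
 * number is ever right-shifted. */
static unsigned char fixed16_to_byte(int v)
{
   if (v < 0) return 0;
   v >>= 16;
   return (unsigned char) (v > 255 ? 255 : v);
}

/* y, cb, cr are 0..255 samples; rgb receives R, G, B. */
void ycbcr_to_rgb_pixel(unsigned char rgb[3], int y, int cb, int cr)
{
   int y_fixed;
   assert(rgb != 0);
   y_fixed = (y << 16) + 32768;   /* +0.5 in 16.16 for rounding */
   cb -= 128;
   cr -= 128;
   rgb[0] = fixed16_to_byte(y_fixed + cr * FIXED16(1.40200));
   rgb[1] = fixed16_to_byte(y_fixed - cr * FIXED16(0.71414) - cb * FIXED16(0.34414));
   rgb[2] = fixed16_to_byte(y_fixed + cb * FIXED16(1.77200));
}
```

With `cb == cr == 128` both chroma terms vanish, so a grey input comes back as the same grey level in all three channels.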
+
+// clean up the temporary component buffers
+static void cleanup_jpeg(void)
+{
+   int i;
+   for (i=0; i < img_n; ++i) {
+      if (img_comp[i].data) {
+         free(img_comp[i].data);
+         img_comp[i].data = NULL;
+      }
+   }
+}
+
+static uint8 *load_jpeg_image(int *out_x, int *out_y, int *comp, int req_comp)
+{
+   int i, n;
+   // validate req_comp
+   if (req_comp < 0 || req_comp > 4) return epuc("bad req_comp", "Internal error");
+
+   // load a jpeg image from whichever source
+   if (!decode_jpeg_image()) { cleanup_jpeg(); return NULL; }
+
+   // determine actual number of components to generate
+   n = req_comp ? req_comp : img_n;
+
+   // resample components to full size... memory wasteful, but this
+   // lets us bilerp across blocks while upsampling
+   for (i=0; i < img_n; ++i) {
+      // if we're outputting fewer than 3 components, we're grey not RGB;
+      // in that case, don't bother upsampling Cb or Cr
+      if (n < 3 && i) continue;
+
+      // check if the component scale is less than max; if so it needs upsampling
+      if (img_comp[i].h != img_h_max || img_comp[i].v != img_v_max) {
+         int stride = img_x;
+         // allocate final size; make sure it's big enough for upsampling off
+         // the edges with upsample up to 4x4 (although we only support 2x2
+         // currently)
+         uint8 *new_data = (uint8 *) malloc((img_x+3)*(img_y+3));
+         if (new_data == NULL) {
+            cleanup_jpeg();
+            return epuc("outofmem", "Out of memory (image too large?)");
+         }
+         if (img_comp[i].h*2 == img_h_max && img_comp[i].v*2 == img_v_max) {
+            int tx = (img_x+1)>>1;
+            resample_hv_2(new_data, img_comp[i].data, tx,(img_y+1)>>1, img_comp[i].w2);
+            stride = tx*2;
+         } else if (img_comp[i].h == img_h_max && img_comp[i].v*2 == img_v_max) {
+            resample_v_2(new_data, img_comp[i].data, img_x,(img_y+1)>>1, img_comp[i].w2);
+         } else if (img_comp[i].h*2 == img_h_max && img_comp[i].v == img_v_max) {
+            int tx = (img_x+1)>>1;
+            resample_h_2(new_data, img_comp[i].data, tx,img_y, img_comp[i].w2);
+            stride = tx*2;
+         } else {
+            // @TODO resample uncommon sampling pattern with nearest neighbor
+            free(new_data);
+            cleanup_jpeg();
+            return epuc("uncommon H or V", "JPEG not supported: atypical downsampling mode");
+         }
+         img_comp[i].w2 = stride;
+         free(img_comp[i].data);
+         img_comp[i].data = new_data;
+      }
+   }
+
+   // now convert components to output image
+   {
+      uint32 i,j;
+      uint8 *output = (uint8 *) malloc(n * img_x * img_y + 1);
+      if (output == NULL) { cleanup_jpeg(); return epuc("outofmem", "Out of memory"); }
+      if (n >= 3) { // output STBI_rgb_*
+         for (j=0; j < img_y; ++j) {
+            uint8 *y  = img_comp[0].data + j*img_comp[0].w2;
+            uint8 *out = output + n * img_x * j;
+            if (img_n == 3) {
+               uint8 *cb = img_comp[1].data + j*img_comp[1].w2;
+               uint8 *cr = img_comp[2].data + j*img_comp[2].w2;
+               YCbCr_to_RGB_row(out, y, cb, cr, img_x, n);
+            } else {
+               for (i=0; i < img_x; ++i) {
+                  out[0] = out[1] = out[2] = y[i];
+                  out[3] = 255; // not used if n == 3
+                  out += n;
+               }
+            }
+         }
+      } else {      // output STBI_grey_*
+         for (j=0; j < img_y; ++j) {
+            uint8 *y  = img_comp[0].data + j*img_comp[0].w2;
+            uint8 *out = output + n * img_x * j;
+            if (n == 1)
+               for (i=0; i < img_x; ++i) *out++ = *y++;
+            else
+               for (i=0; i < img_x; ++i) *out++ = *y++, *out++ = 255;
+         }
+      }
+      cleanup_jpeg();
+      *out_x = img_x;
+      *out_y = img_y;
+      if (comp) *comp  = img_n; // report original components, not output
+      return output;
+   }
+}
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_jpeg_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   start_file(f);
+   return load_jpeg_image(x,y,comp,req_comp);
+}
+
+unsigned char *stbi_jpeg_load(char *filename, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_jpeg_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+#endif
+
+unsigned char *stbi_jpeg_load_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   start_mem(buffer,len);
+   return load_jpeg_image(x,y,comp,req_comp);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_jpeg_test_file(FILE *f)
+{
+   int n,r;
+   n = ftell(f);
+   start_file(f);
+   r = decode_jpeg_header(SCAN_type);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_jpeg_test_memory(unsigned char *buffer, int len)
+{
+   start_mem(buffer,len);
+   return decode_jpeg_header(SCAN_type);
+}
+
+// @TODO:
+#ifndef STBI_NO_STDIO
+extern int      stbi_jpeg_info            (char *filename,           int *x, int *y, int *comp);
+extern int      stbi_jpeg_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_jpeg_info_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp);
+
+// public domain zlib decode    v0.2  Sean Barrett 2006-11-18
+//    simple implementation
+//      - all input must be provided in an upfront buffer
+//      - all output is written to a single output buffer (can malloc/realloc)
+//    performance
+//      - fast huffman
+
+// fast-way is faster to check than jpeg huffman, but slow way is slower
+#define ZFAST_BITS  9 // accelerate all cases in default tables
+#define ZFAST_MASK  ((1 << ZFAST_BITS) - 1)
+
+// zlib-style huffman encoding
+// (jpeg packs from left, zlib from right, so can't share code)
+typedef struct
+{
+   uint16 fast[1 << ZFAST_BITS];
+   uint16 firstcode[16];
+   int maxcode[17];
+   uint16 firstsymbol[16];
+   uint8  size[288];
+   uint16 value[288]; 
+} zhuffman;
+
+__forceinline static int bitreverse16(int n)
+{
+  n = ((n & 0xAAAA) >>  1) | ((n & 0x5555) << 1);
+  n = ((n & 0xCCCC) >>  2) | ((n & 0x3333) << 2);
+  n = ((n & 0xF0F0) >>  4) | ((n & 0x0F0F) << 4);
+  n = ((n & 0xFF00) >>  8) | ((n & 0x00FF) << 8);
+  return n;
+}
+
+__forceinline static int bit_reverse(int v, int bits)
+{
+   assert(bits <= 16);
+   // to bit reverse n bits, reverse 16 and shift
+   // e.g. 11 bits, bit reverse and shift away 5
+   return bitreverse16(v) >> (16-bits);
+}
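The bit-reversal trick used by `bit_reverse` can be checked in isolation. The following is a minimal sketch, not part of the library; the names `reverse16` and `reverse_low_bits` are hypothetical stand-ins for the functions above.

```c
#include <assert.h>

/* Reverse all 16 low bits of n by swapping progressively larger groups. */
static int reverse16(int n)
{
   n = ((n & 0xAAAA) >> 1) | ((n & 0x5555) << 1);  /* swap adjacent bits */
   n = ((n & 0xCCCC) >> 2) | ((n & 0x3333) << 2);  /* swap bit pairs     */
   n = ((n & 0xF0F0) >> 4) | ((n & 0x0F0F) << 4);  /* swap nibbles       */
   n = ((n & 0xFF00) >> 8) | ((n & 0x00FF) << 8);  /* swap bytes         */
   return n;
}

/* Reverse only the low `bits` bits of v: reverse all 16, then shift the
   result down so the reversed field sits in the low bits again. */
static int reverse_low_bits(int v, int bits)
{
   assert(bits >= 1 && bits <= 16);
   return reverse16(v) >> (16 - bits);
}
```

For example, reversing the 3-bit value `110` gives `011`, so `reverse_low_bits(6, 3)` evaluates to 3.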
+
+static int zbuild_huffman(zhuffman *z, uint8 *sizelist, int num)
+{
+   int i,k=0;
+   int code, next_code[16], sizes[17];
+
+   // DEFLATE spec for generating codes
+   memset(sizes, 0, sizeof(sizes));
+   memset(z->fast, 255, sizeof(z->fast));
+   for (i=0; i < num; ++i) 
+      ++sizes[sizelist[i]];
+   sizes[0] = 0;
+   for (i=1; i < 16; ++i)
+      assert(sizes[i] <= (1 << i));
+   code = 0;
+   for (i=1; i < 16; ++i) {
+      next_code[i] = code;
+      z->firstcode[i] = (uint16) code;
+      z->firstsymbol[i] = (uint16) k;
+      code = (code + sizes[i]);
+      if (sizes[i])
+         if (code-1 >= (1 << i)) return e("bad codelengths","Corrupt PNG");
+      z->maxcode[i] = code << (16-i); // preshift for inner loop
+      code <<= 1;
+      k += sizes[i];
+   }
+   z->maxcode[16] = 0x10000; // sentinel
+   for (i=0; i < num; ++i) {
+      int s = sizelist[i];
+      if (s) {
+         int c = next_code[s] - z->firstcode[s] + z->firstsymbol[s];
+         z->size[c] = (uint8)s;
+         z->value[c] = (uint16)i;
+         if (s <= ZFAST_BITS) {
+            int k = bit_reverse(next_code[s],s);
+            while (k < (1 << ZFAST_BITS)) {
+               z->fast[k] = (uint16) c;
+               k += (1 << s);
+            }
+         }
+         ++next_code[s];
+      }
+   }
+   return 1;
+}
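The code assignment that `zbuild_huffman` performs follows the canonical Huffman rule from the DEFLATE specification: codes of the same length are consecutive, and shorter codes sort before longer ones. The sketch below shows only that rule, without the fast lookup table; `assign_canonical_codes` and `MAX_BITS` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_BITS 15

/* Given one code length per symbol (0 = symbol unused), write the
   canonical code for each symbol into codes[]. */
static void assign_canonical_codes(const unsigned char *lengths, int num,
                                   unsigned short *codes)
{
   int count[MAX_BITS + 1];
   int next_code[MAX_BITS + 1];
   int i, bits, code = 0;

   /* step 1: count how many symbols use each length */
   memset(count, 0, sizeof(count));
   for (i = 0; i < num; ++i) {
      assert(lengths[i] <= MAX_BITS);
      ++count[lengths[i]];
   }
   count[0] = 0;

   /* step 2: compute the first code of each length */
   for (bits = 1; bits <= MAX_BITS; ++bits) {
      code = (code + count[bits - 1]) << 1;
      next_code[bits] = code;
   }

   /* step 3: hand out consecutive codes in symbol order */
   for (i = 0; i < num; ++i)
      codes[i] = lengths[i] ? (unsigned short) next_code[lengths[i]]++ : 0;
}
```

With the lengths (3,3,3,3,3,2,4,4) this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111. The decoder above additionally bit-reverses each code before filling `fast[]`, because zlib packs bits starting from the least significant bit.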
+
+// zlib-from-memory implementation for PNG reading
+//    because PNG allows splitting the zlib stream arbitrarily,
+//    and it's annoying structurally to have PNG call ZLIB call PNG,
+//    we require PNG read all the IDATs and combine them into a single
+//    memory buffer
+
+static uint8 *zbuffer, *zbuffer_end;
+
+__forceinline static int zget8(void)
+{
+   if (zbuffer >= zbuffer_end) return 0;
+   return *zbuffer++;
+}
+
+//static unsigned long code_buffer;
+static int           num_bits;
+
+static void fill_bits(void)
+{
+   do {
+      assert(code_buffer < (1U << num_bits));
+      code_buffer |= zget8() << num_bits;
+      num_bits += 8;
+   } while (num_bits <= 24);
+}
+
+__forceinline static unsigned int zreceive(int n)
+{
+   unsigned int k;
+   if (num_bits < n) fill_bits();
+   k = code_buffer & ((1 << n) - 1);
+   code_buffer >>= n;
+   num_bits -= n;
+   return k;   
+}
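`fill_bits` and `zreceive` together form an LSB-first bit reader over global state. A self-contained sketch of the same idea, assuming a hypothetical `bit_reader` struct in place of the globals, might look like this:

```c
#include <assert.h>

typedef struct {
   const unsigned char *cur, *end;  /* unread input bytes           */
   unsigned long buffer;            /* pending bits, next bit in bit 0 */
   int count;                       /* number of valid bits in buffer  */
} bit_reader;

/* Return the next n bits (0 <= n <= 16), least significant bit first. */
static unsigned int read_bits(bit_reader *r, int n)
{
   unsigned int k;
   assert(n >= 0 && n <= 16);
   while (r->count < n) {
      /* past the end of input, feed zero bytes, as zget8 does */
      unsigned long byte = r->cur < r->end ? *r->cur++ : 0;
      r->buffer |= byte << r->count;
      r->count += 8;
   }
   k = (unsigned int) (r->buffer & ((1UL << n) - 1));
   r->buffer >>= n;
   r->count -= n;
   return k;
}
```

Unlike `fill_bits`, this sketch refills one byte at a time and only as far as the current request needs; the library version tops the buffer up past 24 bits so the Huffman decoder can peek 16 bits at once.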
+
+__forceinline static int zhuffman_decode(zhuffman *z)
+{
+   int b,s,k;
+   if (num_bits < 16) fill_bits();
+   b = z->fast[code_buffer & ZFAST_MASK];
+   if (b < 0xffff) {
+      s = z->size[b];
+      code_buffer >>= s;
+      num_bits -= s;
+      return z->value[b];
+   }
+
+   // not resolved by fast table, so compute it the slow way
+   // use jpeg approach, which requires MSbits at top
+   k = bit_reverse(code_buffer, 16);
+   for (s=ZFAST_BITS+1; ; ++s)
+      if (k < z->maxcode[s])
+         break;
+   if (s == 16) return -1; // invalid code!
+   // code size is s, so:
+   b = (k >> (16-s)) - z->firstcode[s] + z->firstsymbol[s];
+   assert(z->size[b] == s);
+   code_buffer >>= s;
+   num_bits -= s;
+   return z->value[b];
+}
+
+static char *zout;
+static char *zout_start;
+static char *zout_end;
+static int   z_expandable;
+
+static int expand(int n)  // need to make room for n bytes
+{
+   char *q;
+   int cur, limit;
+   if (!z_expandable) return e("output buffer limit","Corrupt PNG");
+   cur   = (int) (zout     - zout_start);
+   limit = (int) (zout_end - zout_start);
+   while (cur + n > limit)
+      limit *= 2;
+   q = (char *) realloc(zout_start, limit);
+   if (q == NULL) return e("outofmem", "Out of memory");
+   zout_start = q;
+   zout       = q + cur;
+   zout_end   = q + limit;
+   return 1;
+}
+
+static zhuffman z_length, z_distance;
+
+static int length_base[31] = {
+   3,4,5,6,7,8,9,10,11,13,
+   15,17,19,23,27,31,35,43,51,59,
+   67,83,99,115,131,163,195,227,258,0,0 };
+
+static int length_extra[31]= 
+{ 0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,0,0,0 };
+
+static int dist_base[32] = { 1,2,3,4,5,7,9,13,17,25,33,49,65,97,129,193,
+257,385,513,769,1025,1537,2049,3073,4097,6145,8193,12289,16385,24577,0,0};
+
+static int dist_extra[32] =
+{ 0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13};
+
+static int parse_huffman_block(void)
+{
+   for(;;) {
+      int z = zhuffman_decode(&z_length);
+      if (z < 256) {
+         if (z < 0) return e("bad huffman code","Corrupt PNG"); // error in huffman codes
+         if (zout >= zout_end) if (!expand(1)) return 0;
+         *zout++ = (char) z;
+      } else {
+         uint8 *p;
+         int len,dist;
+         if (z == 256) return 1;
+         z -= 257;
+         len = length_base[z];
+         if (length_extra[z]) len += zreceive(length_extra[z]);
+         z = zhuffman_decode(&z_distance);
+         if (z < 0) return e("bad huffman code","Corrupt PNG");
+         dist = dist_base[z];
+         if (dist_extra[z]) dist += zreceive(dist_extra[z]);
+         if (zout - zout_start < dist) return e("bad dist","Corrupt PNG");
+         if (zout + len > zout_end) if (!expand(len)) return 0;
+         p = (uint8 *) (zout - dist);
+         while (len--)
+            *zout++ = *p++;
+      }
+   }
+}
+
+static int compute_huffman_codes(void)
+{
+   static uint8 length_dezigzag[19] = { 16,17,18,0,8,7,9,6,10,5,11,4,12,3,13,2,14,1,15 };
+   static zhuffman z_codelength; // static just to save stack space
+   uint8 lencodes[286+32+137];//padding for maximum single op
+   uint8 codelength_sizes[19];
+   int i,n;
+
+   int hlit  = zreceive(5) + 257;
+   int hdist = zreceive(5) + 1;
+   int hclen = zreceive(4) + 4;
+
+   memset(codelength_sizes, 0, sizeof(codelength_sizes));
+   for (i=0; i < hclen; ++i) {
+      int s = zreceive(3);
+      codelength_sizes[length_dezigzag[i]] = (uint8) s;
+   }
+   if (!zbuild_huffman(&z_codelength, codelength_sizes, 19)) return 0;
+
+   n = 0;
+   while (n < hlit + hdist) {
+      int c = zhuffman_decode(&z_codelength);
+      assert(c >= 0 && c < 19);
+      if (c < 16)
+         lencodes[n++] = (uint8) c;
+      else if (c == 16) {
+         if (n == 0) return e("bad codelengths","Corrupt PNG"); // no previous length to repeat
+         c = zreceive(2)+3;
+         memset(lencodes+n, lencodes[n-1], c);
+         n += c;
+      } else if (c == 17) {
+         c = zreceive(3)+3;
+         memset(lencodes+n, 0, c);
+         n += c;
+      } else {
+         assert(c == 18);
+         c = zreceive(7)+11;
+         memset(lencodes+n, 0, c);
+         n += c;
+      }
+   }
+   if (n != hlit+hdist) return e("bad codelengths","Corrupt PNG");
+   if (!zbuild_huffman(&z_length, lencodes, hlit)) return 0;
+   if (!zbuild_huffman(&z_distance, lencodes+hlit, hdist)) return 0;
+   return 1;
+}
+
+static int parse_uncompressed_block(void)
+{
+   uint8 header[4];
+   int len,nlen,k;
+   if (num_bits & 7)
+      zreceive(num_bits & 7); // discard
+   // drain the bit-packed data into header
+   k = 0;
+   while (num_bits > 0) {
+      header[k++] = (uint8) (code_buffer & 255); // wtf this warns?
+      code_buffer >>= 8;
+      num_bits -= 8;
+   }
+   assert(num_bits == 0);
+   // now fill header the normal way
+   while (k < 4)
+      header[k++] = (uint8) zget8();
+   len  = header[1] * 256 + header[0];
+   nlen = header[3] * 256 + header[2];
+   if (nlen != (len ^ 0xffff)) return e("zlib corrupt","Corrupt PNG");
+   if (zbuffer + len > zbuffer_end) return e("read past buffer","Corrupt PNG");
+   if (zout + len > zout_end)
+      if (!expand(len)) return 0;
+   memcpy(zout, zbuffer, len);
+   zbuffer += len;
+   zout += len;
+   return 1;
+}
+
+static int parse_zlib_header(void)
+{
+   int cmf   = zget8();
+   int cm    = cmf & 15;
+   /* int cinfo = cmf >> 4; */
+   int flg   = zget8();
+   if ((cmf*256+flg) % 31 != 0) return e("bad zlib header","Corrupt PNG"); // zlib spec
+   if (flg & 32) return e("no preset dict","Corrupt PNG"); // preset dictionary not allowed in png
+   if (cm != 8) return e("bad compression","Corrupt PNG"); // DEFLATE required for png
+   // window = 1 << (8 + cinfo)... but who cares, we fully buffer output
+   return 1;
+}
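The three header checks in `parse_zlib_header` can be tried against known byte pairs. This is a minimal sketch with a hypothetical helper name, `zlib_header_ok_for_png`, that takes the two bytes directly instead of reading them from the stream.

```c
#include <assert.h>

/* Return 1 if the CMF/FLG byte pair is acceptable for a PNG zlib stream. */
static int zlib_header_ok_for_png(unsigned int cmf, unsigned int flg)
{
   assert(cmf <= 255 && flg <= 255);
   if ((cmf * 256 + flg) % 31 != 0) return 0;  /* FCHECK: value must be a multiple of 31 */
   if (flg & 32)                    return 0;  /* FDICT: preset dictionary not allowed   */
   if ((cmf & 15) != 8)             return 0;  /* CM: compression method must be DEFLATE */
   return 1;
}
```

The common header `0x78 0x9C` passes: 0x789C is 30876, which is 31 * 996, the dictionary bit is clear, and the low nibble of 0x78 is 8.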
+
+static uint8 default_length[288], default_distance[32];
+static void init_defaults(void)
+{
+   int i;   // use <= to match clearly with spec
+   for (i=0; i <= 143; ++i)     default_length[i]   = 8;
+   for (   ; i <= 255; ++i)     default_length[i]   = 9;
+   for (   ; i <= 279; ++i)     default_length[i]   = 7;
+   for (   ; i <= 287; ++i)     default_length[i]   = 8;
+
+   for (i=0; i <=  31; ++i)     default_distance[i] = 5;
+}
+
+static int parse_zlib(int parse_header)
+{
+   int final, type;
+   if (parse_header)
+      if (!parse_zlib_header()) return 0;
+   num_bits = 0;
+   code_buffer = 0;
+   do {
+      final = zreceive(1);
+      type = zreceive(2);
+      if (type == 0) {
+         if (!parse_uncompressed_block()) return 0;
+      } else if (type == 3) {
+         return 0;
+      } else {
+         if (type == 1) {
+            // use fixed code lengths
+            if (!default_length[0]) init_defaults();
+            if (!zbuild_huffman(&z_length  , default_length  , 288)) return 0;
+            if (!zbuild_huffman(&z_distance, default_distance,  32)) return 0;
+         } else {
+            if (!compute_huffman_codes()) return 0;
+         }
+         if (!parse_huffman_block()) return 0;
+      }
+   } while (!final);
+   return 1;
+}
+
+static int do_zlib(char *obuf, int olen, int exp, int parse_header)
+{
+   zout_start = obuf;
+   zout       = obuf;
+   zout_end   = obuf + olen;
+   z_expandable = exp;
+
+   return parse_zlib(parse_header);
+}
+
+char *stbi_zlib_decode_malloc_guesssize(int initial_size, int *outlen)
+{
+   char *p = (char *) malloc(initial_size);
+   if (p == NULL) return NULL;
+   if (do_zlib(p, initial_size, 1, 1)) {
+      *outlen = (int) (zout - zout_start);
+      return zout_start;
+   } else {
+      free(zout_start);
+      return NULL;
+   }
+}
+
+char *stbi_zlib_decode_malloc(char *buffer, int len, int *outlen)
+{
+   zbuffer = (uint8 *) buffer;
+   zbuffer_end = (uint8 *) buffer+len;
+   return stbi_zlib_decode_malloc_guesssize(16384, outlen);
+}
+
+int stbi_zlib_decode_buffer(char *obuffer, int olen, char *ibuffer, int ilen)
+{
+   zbuffer = (uint8 *) ibuffer;
+   zbuffer_end = (uint8 *) ibuffer + ilen;
+   if (do_zlib(obuffer, olen, 0, 1))
+      return (int) (zout - zout_start);
+   else
+      return -1;
+}
+
+char *stbi_zlib_decode_noheader_malloc(char *buffer, int len, int *outlen)
+{
+   char *p = (char *) malloc(16384);
+   if (p == NULL) return NULL;
+   zbuffer = (uint8 *) buffer;
+   zbuffer_end = (uint8 *) buffer+len;
+   if (do_zlib(p, 16384, 1, 0)) {
+      *outlen = (int) (zout - zout_start);
+      return zout_start;
+   } else {
+      free(zout_start);
+      return NULL;
+   }
+}
+
+int stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, char *ibuffer, int ilen)
+{
+   zbuffer = (uint8 *) ibuffer;
+   zbuffer_end = (uint8 *) ibuffer + ilen;
+   if (do_zlib(obuffer, olen, 0, 0))
+      return (int) (zout - zout_start);
+   else
+      return -1;
+}
+
+// public domain "baseline" PNG decoder   v0.10  Sean Barrett 2006-11-18
+//    simple implementation
+//      - only 8-bit samples
+//      - no CRC checking
+//      - allocates lots of intermediate memory
+//        - avoids problem of streaming data between subsystems
+//        - avoids explicit window management
+//    performance
+//      - uses stb_zlib, a PD zlib implementation with fast huffman decoding
+
+
+typedef struct
+{
+   unsigned long length;
+   unsigned long type;
+} chunk;
+
+#define PNG_TYPE(a,b,c,d)  (((a) << 24) + ((b) << 16) + ((c) << 8) + (d))
+
+static chunk get_chunk_header(void)
+{
+   chunk c;
+   c.length = get32();
+   c.type   = get32();
+   return c;
+}
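`PNG_TYPE` packs the four chunk-name letters into one integer with the first letter in the top byte, which is why the parser later tests bit 29 to tell critical chunks from ancillary ones: that bit is the lowercase bit of the first letter. A small sketch, using the hypothetical helpers `pack_chunk_type` and `chunk_is_ancillary`:

```c
#include <assert.h>

/* Pack a four-character chunk name, first letter in the top byte. */
static unsigned long pack_chunk_type(const char *name)
{
   const unsigned char *p = (const unsigned char *) name;
   assert(name != 0);
   return ((unsigned long) p[0] << 24) | ((unsigned long) p[1] << 16) |
          ((unsigned long) p[2] <<  8) |  (unsigned long) p[3];
}

/* Bit 5 of the first letter (bit 29 of the packed value) is set for
   lowercase letters, which marks the chunk as ancillary (safe to skip). */
static int chunk_is_ancillary(unsigned long type)
{
   return (type & (1UL << 29)) != 0;
}
```

So `IHDR` and `IDAT` are critical, while a chunk such as `tEXt` can be skipped by a decoder that does not understand it.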
+
+static int check_png_header(void)
+{
+   static uint8 png_sig[8] = { 137,80,78,71,13,10,26,10 };
+   int i;
+   for (i=0; i < 8; ++i)
+      if (get8() != png_sig[i]) return e("bad png sig","Not a PNG");
+   return 1;
+}
+
+static uint8 *idata, *expanded, *out;
+
+enum {
+   F_none=0, F_sub=1, F_up=2, F_avg=3, F_paeth=4,
+   F_avg_first, F_paeth_first,
+};
+
+static uint8 first_row_filter[5] =
+{
+   F_none, F_sub, F_none, F_avg_first, F_paeth_first
+};
+
+static int paeth(int a, int b, int c)
+{
+   int p = a + b - c;
+   int pa = abs(p-a);
+   int pb = abs(p-b);
+   int pc = abs(p-c);
+   if (pa <= pb && pa <= pc) return a;
+   if (pb <= pc) return b;
+   return c;
+}
+
+// create the png data from post-deflated data
+static int create_png_image(uint8 *raw, uint32 raw_len, int out_n)
+{
+   uint32 i,j,stride = img_x*out_n;
+   int k;
+   assert(out_n == img_n || out_n == img_n+1);
+   out = (uint8 *) malloc(img_x * img_y * out_n);
+   if (!out) return e("outofmem", "Out of memory");
+   if (raw_len != (img_n * img_x + 1) * img_y) return e("not enough pixels","Corrupt PNG");
+   for (j=0; j < img_y; ++j) {
+      uint8 *cur = out + stride*j;
+      uint8 *prior = cur - stride;
+      int filter = *raw++;
+      if (filter > 4) return e("invalid filter","Corrupt PNG");
+      // if first row, use special filter that doesn't sample previous row
+      if (j == 0) filter = first_row_filter[filter];
+      // handle first pixel explicitly
+      for (k=0; k < img_n; ++k) {
+         switch(filter) {
+            case F_none       : cur[k] = raw[k]; break;
+            case F_sub        : cur[k] = raw[k]; break;
+            case F_up         : cur[k] = raw[k] + prior[k]; break;
+            case F_avg        : cur[k] = raw[k] + (prior[k]>>1); break;
+            case F_paeth      : cur[k] = (uint8) (raw[k] + paeth(0,prior[k],0)); break;
+            case F_avg_first  : cur[k] = raw[k]; break;
+            case F_paeth_first: cur[k] = raw[k]; break;
+         }
+      }
+      if (img_n != out_n) cur[img_n] = 255;
+      raw += img_n;
+      cur += out_n;
+      prior += out_n;
+      // this is a little gross, so that we don't switch per-pixel or per-component
+      if (img_n == out_n) {
+         #define CASE(f) \
+             case f:     \
+                for (i=1; i < img_x; ++i, raw+=img_n,cur+=img_n,prior+=img_n) \
+                   for (k=0; k < img_n; ++k)
+         switch(filter) {
+            CASE(F_none)  cur[k] = raw[k]; break;
+            CASE(F_sub)   cur[k] = raw[k] + cur[k-img_n]; break;
+            CASE(F_up)    cur[k] = raw[k] + prior[k]; break;
+            CASE(F_avg)   cur[k] = raw[k] + ((prior[k] + cur[k-img_n])>>1); break;
+            CASE(F_paeth)  cur[k] = (uint8) (raw[k] + paeth(cur[k-img_n],prior[k],prior[k-img_n])); break;
+            CASE(F_avg_first)    cur[k] = raw[k] + (cur[k-img_n] >> 1); break;
+            CASE(F_paeth_first)  cur[k] = (uint8) (raw[k] + paeth(cur[k-img_n],0,0)); break;
+         }
+         #undef CASE
+      } else {
+         assert(img_n+1 == out_n);
+         #define CASE(f) \
+             case f:     \
+                for (i=1; i < img_x; ++i, cur[img_n]=255,raw+=img_n,cur+=out_n,prior+=out_n) \
+                   for (k=0; k < img_n; ++k)
+         switch(filter) {
+            CASE(F_none)  cur[k] = raw[k]; break;
+            CASE(F_sub)   cur[k] = raw[k] + cur[k-out_n]; break;
+            CASE(F_up)    cur[k] = raw[k] + prior[k]; break;
+            CASE(F_avg)   cur[k] = raw[k] + ((prior[k] + cur[k-out_n])>>1); break;
+            CASE(F_paeth)  cur[k] = (uint8) (raw[k] + paeth(cur[k-out_n],prior[k],prior[k-out_n])); break;
+            CASE(F_avg_first)    cur[k] = raw[k] + (cur[k-out_n] >> 1); break;
+            CASE(F_paeth_first)  cur[k] = (uint8) (raw[k] + paeth(cur[k-out_n],0,0)); break;
+         }
+         #undef CASE
+      }
+   }
+   return 1;
+}
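The filter reversal in `create_png_image` is spread across the `CASE` macro and the first-pixel special case. The sketch below expresses the same five PNG filters per byte, in the most direct form; it is an illustration under simplifying assumptions (no alpha padding, a zero-filled `prior` row for the first scanline), and `unfilter_row` and `paeth_predict` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

static int paeth_predict(int a, int b, int c)
{
   int p = a + b - c;
   int pa = abs(p - a), pb = abs(p - b), pc = abs(p - c);
   if (pa <= pb && pa <= pc) return a;
   if (pb <= pc) return b;
   return c;
}

/* Undo one PNG filter on a scanline of `len` bytes with `bpp` bytes per
   pixel. `prior` is the previous reconstructed row (all zeros for the
   first row). Returns 0 for an unknown filter type. */
static int unfilter_row(int filter, const unsigned char *raw,
                        const unsigned char *prior, unsigned char *cur,
                        int len, int bpp)
{
   int i;
   assert(bpp >= 1);
   if (filter < 0 || filter > 4) return 0;
   for (i = 0; i < len; ++i) {
      int a = i >= bpp ? cur[i - bpp]   : 0;  /* left neighbour       */
      int b = prior[i];                       /* neighbour above      */
      int c = i >= bpp ? prior[i - bpp] : 0;  /* neighbour above-left */
      int pred;
      switch (filter) {
         case 0:  pred = 0;                      break;  /* None    */
         case 1:  pred = a;                      break;  /* Sub     */
         case 2:  pred = b;                      break;  /* Up      */
         case 3:  pred = (a + b) >> 1;           break;  /* Average */
         default: pred = paeth_predict(a, b, c); break;  /* Paeth   */
      }
      cur[i] = (unsigned char) (raw[i] + pred);  /* arithmetic is modulo 256 */
   }
   return 1;
}
```

This switches on the filter for every byte, which is exactly what the library code avoids by hoisting the switch outside the pixel loop; the per-byte form is slower but easier to compare against the PNG specification.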
+
+static int compute_transparency(uint8 tc[3], int out_n)
+{
+   uint32 i, pixel_count = img_x * img_y;
+   uint8 *p = out;
+
+   // compute color-based transparency, assuming we've
+   // already got 255 as the alpha value in the output
+   assert(out_n == 2 || out_n == 4);
+
+   p = out;
+   if (out_n == 2) {
+      for (i=0; i < pixel_count; ++i) {
+         p[1] = (p[0] == tc[0] ? 0 : 255);
+         p += 2;
+      }
+   } else {
+      for (i=0; i < pixel_count; ++i) {
+         if (p[0] == tc[0] && p[1] == tc[1] && p[2] == tc[2])
+            p[3] = 0;
+         p += 4;
+      }
+   }
+   return 1;
+}
+
+static int expand_palette(uint8 *palette, int len, int pal_img_n)
+{
+   uint32 i, pixel_count = img_x * img_y;
+   uint8 *p, *temp_out, *orig = out;
+
+   p = (uint8 *) malloc(pixel_count * pal_img_n);
+   if (p == NULL) return e("outofmem", "Out of memory");
+
+   // between here and free(out) below, exiting would leak
+   temp_out = p;
+
+   if (pal_img_n == 3) {
+      for (i=0; i < pixel_count; ++i) {
+         int n = orig[i]*4;
+         p[0] = palette[n  ];
+         p[1] = palette[n+1];
+         p[2] = palette[n+2];
+         p += 3;
+      }
+   } else {
+      for (i=0; i < pixel_count; ++i) {
+         int n = orig[i]*4;
+         p[0] = palette[n  ];
+         p[1] = palette[n+1];
+         p[2] = palette[n+2];
+         p[3] = palette[n+3];
+         p += 4;
+      }
+   }
+   free(out);
+   out = temp_out;
+   return 1;
+}
+
+static int parse_png_file(int scan, int req_comp)
+{
+   uint8 palette[1024], pal_img_n=0;
+   uint8 has_trans=0, tc[3];
+   uint32 ioff=0, idata_limit=0, i, pal_len=0;
+   int first=1,k;
+
+   if (!check_png_header()) return 0;
+
+   if (scan == SCAN_type) return 1;
+
+   for(;;first=0) {
+      chunk c = get_chunk_header();
+      if (first && c.type != PNG_TYPE('I','H','D','R'))
+         return e("first not IHDR","Corrupt PNG");
+      switch (c.type) {
+         case PNG_TYPE('I','H','D','R'): {
+            int depth,color,interlace,comp,filter;
+            if (!first) return e("multiple IHDR","Corrupt PNG");
+            if (c.length != 13) return e("bad IHDR len","Corrupt PNG");
+            img_x = get32(); if (img_x > (1 << 24)) return e("too large","Very large image (corrupt?)");
+            img_y = get32(); if (img_y > (1 << 24)) return e("too large","Very large image (corrupt?)");
+            depth = get8();  if (depth != 8)        return e("8bit only","PNG not supported: 8-bit only");
+            color = get8();  if (color > 6)         return e("bad ctype","Corrupt PNG");
+            if (color == 3) pal_img_n = 3; else if (color & 1) return e("bad ctype","Corrupt PNG");
+            comp  = get8();  if (comp) return e("bad comp method","Corrupt PNG");
+            filter= get8();  if (filter) return e("bad filter method","Corrupt PNG");
+            interlace = get8(); if (interlace) return e("interlaced","PNG not supported: interlaced mode");
+            if (!img_x || !img_y) return e("0-pixel image","Corrupt PNG");
+            if (!pal_img_n) {
+               img_n = (color & 2 ? 3 : 1) + (color & 4 ? 1 : 0);
+               if ((1 << 30) / img_x / img_n < img_y) return e("too large", "Image too large to decode");
+               if (scan == SCAN_header) return 1;
+            } else {
+               // if paletted, then pal_img_n is our final components, and
+               // img_n is # components to decompress/filter.
+               img_n = 1;
+               if ((1 << 30) / img_x / 4 < img_y) return e("too large","Corrupt PNG");
+               // if SCAN_header, have to scan to see if we have a tRNS
+            }
+            break;
+         }
+
+         case PNG_TYPE('P','L','T','E'):  {
+            if (c.length > 256*3) return e("invalid PLTE","Corrupt PNG");
+            pal_len = c.length / 3;
+            if (pal_len * 3 != c.length) return e("invalid PLTE","Corrupt PNG");
+            for (i=0; i < pal_len; ++i) {
+               palette[i*4+0] = get8u();
+               palette[i*4+1] = get8u();
+               palette[i*4+2] = get8u();
+               palette[i*4+3] = 255;
+            }
+            break;
+         }
+
+         case PNG_TYPE('t','R','N','S'): {
+            if (idata) return e("tRNS after IDAT","Corrupt PNG");
+            if (pal_img_n) {
+               if (scan == SCAN_header) { img_n = 4; return 1; }
+               if (pal_len == 0) return e("tRNS before PLTE","Corrupt PNG");
+               if (c.length > pal_len) return e("bad tRNS len","Corrupt PNG");
+               pal_img_n = 4;
+               for (i=0; i < c.length; ++i)
+                  palette[i*4+3] = get8u();
+            } else {
+               if (!(img_n & 1)) return e("tRNS with alpha","Corrupt PNG");
+               if (c.length != (uint32) img_n*2) return e("bad tRNS len","Corrupt PNG");
+               has_trans = 1;
+               for (k=0; k < img_n; ++k)
+                  tc[k] = (uint8) get16(); // non 8-bit images will be larger
+            }
+            break;
+         }
+
+         case PNG_TYPE('I','D','A','T'): {
+            if (pal_img_n && !pal_len) return e("no PLTE","Corrupt PNG");
+            if (scan == SCAN_header) { img_n = pal_img_n; return 1; }
+            if (ioff + c.length > idata_limit) {
+               uint8 *p;
+               if (idata_limit == 0) idata_limit = c.length > 4096 ? c.length : 4096;
+               while (ioff + c.length > idata_limit)
+                  idata_limit *= 2;
+               p = (uint8 *) realloc(idata, idata_limit); if (p == NULL) return e("outofmem", "Out of memory");
+               idata = p;
+            }
+            #ifndef STBI_NO_STDIO
+            if (img_file)
+            {
+               if (fread(idata+ioff,1,c.length,img_file) != c.length) return e("outofdata","Corrupt PNG");
+            }
+            else
+            #endif
+            {
+               memcpy(idata+ioff, img_buffer, c.length);
+               img_buffer += c.length;
+            }
+            ioff += c.length;
+            break;
+         }
+
+         case PNG_TYPE('I','E','N','D'): {
+            uint32 raw_len;
+            if (scan != SCAN_load) return 1;
+            if (idata == NULL) return e("no IDAT","Corrupt PNG");
+            expanded = (uint8 *) stbi_zlib_decode_malloc((char *) idata, ioff, (int *) &raw_len);
+            if (expanded == NULL) return 0; // zlib should set error
+            free(idata); idata = NULL;
+            if ((req_comp == img_n+1 && req_comp != 3 && !pal_img_n) || has_trans)
+               img_out_n = img_n+1;
+            else
+               img_out_n = img_n;
+            if (!create_png_image(expanded, raw_len, img_out_n)) return 0;
+            if (has_trans)
+               if (!compute_transparency(tc, img_out_n)) return 0;
+            if (pal_img_n) {
+               // pal_img_n == 3 or 4
+               img_n = pal_img_n; // record the actual colors we had
+               img_out_n = pal_img_n;
+               if (req_comp >= 3) img_out_n = req_comp;
+               if (!expand_palette(palette, pal_len, img_out_n))
+                  return 0;
+            }
+            free(expanded); expanded = NULL;
+            return 1;
+         }
+
+         default:
+            // if critical, fail
+            if ((c.type & (1 << 29)) == 0) {
+               #ifndef STBI_NO_FAILURE_STRINGS
+               static char invalid_chunk[] = "XXXX chunk not known";
+               invalid_chunk[0] = (uint8) (c.type >> 24);
+               invalid_chunk[1] = (uint8) (c.type >> 16);
+               invalid_chunk[2] = (uint8) (c.type >>  8);
+               invalid_chunk[3] = (uint8) (c.type >>  0);
+               #endif
+               return e(invalid_chunk, "PNG not supported: unknown chunk type");
+            }
+            skip(c.length);
+            break;
+      }
+      // end of chunk, read and skip CRC
+      get8(); get8(); get8(); get8();
+   }
+}
+
+static unsigned char *do_png(int *x, int *y, int *n, int req_comp)
+{
+   unsigned char *result=NULL;
+   if (req_comp < 0 || req_comp > 4) return epuc("bad req_comp", "Internal error");
+   if (parse_png_file(SCAN_load, req_comp)) {
+      result = out;
+      out = NULL;
+      if (req_comp && req_comp != img_out_n) {
+         result = convert_format(result, img_out_n, req_comp);
+         if (result == NULL) return result;
+      }
+      *x = img_x;
+      *y = img_y;
+      if (n) *n = img_n;
+   }
+   free(out);      out      = NULL;
+   free(expanded); expanded = NULL;
+   free(idata);    idata    = NULL;
+
+   return result;
+}
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_png_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   start_file(f);
+   return do_png(x,y,comp,req_comp);
+}
+
+unsigned char *stbi_png_load(char *filename, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_png_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+#endif
+
+unsigned char *stbi_png_load_from_memory(unsigned char *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   start_mem(buffer,len);
+   return do_png(x,y,comp,req_comp);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_png_test_file(FILE *f)
+{
+   int n,r;
+   n = ftell(f);
+   start_file(f);
+   r = parse_png_file(SCAN_type,STBI_default);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_png_test_memory(unsigned char *buffer, int len)
+{
+   start_mem(buffer, len);
+   return parse_png_file(SCAN_type,STBI_default);
+}
+
+// TODO: load header from png
+#ifndef STBI_NO_STDIO
+extern int      stbi_png_info             (char *filename,           int *x, int *y, int *comp);
+extern int      stbi_png_info_from_file   (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_png_info_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp);
+
+// Microsoft/Windows BMP image
+
+static int bmp_test(void)
+{
+   int sz;
+   if (get8() != 'B') return 0;
+   if (get8() != 'M') return 0;
+   get32le(); // discard filesize
+   get16le(); // discard reserved
+   get16le(); // discard reserved
+   get32le(); // discard data offset
+   sz = get32le();
+   if (sz == 12 || sz == 40 || sz == 56 || sz == 108) return 1;
+   return 0;
+}
+
+#ifndef STBI_NO_STDIO
+int      stbi_bmp_test_file        (FILE *f)
+{
+   int r,n = ftell(f);
+   start_file(f);
+   r = bmp_test();
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int      stbi_bmp_test_memory      (stbi_uc *buffer, int len)
+{
+   start_mem(buffer, len);
+   return bmp_test();
+}
+
+// returns 0..31 for the highest set bit
+static int high_bit(unsigned int z)
+{
+   int n=0;
+   if (z == 0) return -1;
+   if (z >= 0x10000) n += 16, z >>= 16;
+   if (z >= 0x00100) n +=  8, z >>=  8;
+   if (z >= 0x00010) n +=  4, z >>=  4;
+   if (z >= 0x00004) n +=  2, z >>=  2;
+   if (z >= 0x00002) n +=  1, z >>=  1;
+   return n;
+}
+
+static int bitcount(unsigned int a)
+{
+   a = (a & 0x55555555) + ((a >>  1) & 0x55555555); // max 2
+   a = (a & 0x33333333) + ((a >>  2) & 0x33333333); // max 4
+   a = (a + (a >> 4)) & 0x0f0f0f0f; // max 8 per 4, now 8 bits
+   a = (a + (a >> 8)); // max 16 per 8 bits
+   a = (a + (a >> 16)); // max 32 per 8 bits
+   return a & 0xff;
+}
+
+static int shiftsigned(int v, int shift, int bits)
+{
+   int result;
+   int z=0;
+
+   if (shift < 0) v <<= -shift;
+   else v >>= shift;
+   result = v;
+
+   z = bits;
+   while (z < 8) {
+      result += v >> z;
+      z += bits;
+   }
+   return result;
+}
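`high_bit`, `bitcount` and `shiftsigned` work together to turn a masked bitfield of any width into an 8-bit channel value. The sketch below combines them into one hypothetical function, `extract_channel`, using plain loops instead of the branch-free versions above; it assumes the mask is a contiguous run of bits.

```c
#include <assert.h>

static int highest_set_bit(unsigned int z)
{
   int n = -1;
   while (z) { ++n; z >>= 1; }
   return n;
}

static int count_set_bits(unsigned int z)
{
   int n = 0;
   while (z) { n += (int) (z & 1); z >>= 1; }
   return n;
}

/* Extract the channel selected by `mask` from pixel v and expand it to
   the range 0..255 by replicating the field's bits downward. */
static int extract_channel(unsigned int v, unsigned int mask)
{
   int shift, bits, z, value, result;
   assert(mask != 0);
   shift = highest_set_bit(mask) - 7;  /* shift that moves the top bit to bit 7 */
   bits  = count_set_bits(mask);       /* width of this channel's field         */
   v &= mask;
   value = (int) (shift < 0 ? v << -shift : v >> shift);
   result = value;
   for (z = bits; z < 8; z += bits)
      result += value >> z;            /* fill the low bits with copies */
   return result;
}
```

For a 5-bit red channel with mask `0x7C00`, a fully set field maps to 255 rather than 248, because the top bits are copied into the three empty low bits. Note that the width passed to the expansion step is that of the channel's own mask.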
+
+static stbi_uc *bmp_load(int *x, int *y, int *comp, int req_comp)
+{
+   unsigned int mr=0,mg=0,mb=0,ma=0;
+   stbi_uc pal[256][4];
+   int psize=0,i,j,compress=0,width;
+   int bpp, flip_vertically, pad, target, offset, hsz;
+   if (get8() != 'B' || get8() != 'M') return epuc("not BMP", "Corrupt BMP");
+   get32le(); // discard filesize
+   get16le(); // discard reserved
+   get16le(); // discard reserved
+   offset = get32le();
+   hsz = get32le();
+   if (hsz != 12 && hsz != 40 && hsz != 56 && hsz != 108) return epuc("unknown BMP", "BMP type not supported: unknown");
+   failure_reason = "bad BMP";
+   if (hsz == 12) {
+      img_x = get16le();
+      img_y = get16le();
+   } else {
+      img_x = get32le();
+      img_y = get32le();
+   }
+   if (get16le() != 1) return 0;
+   bpp = get16le();
+   if (bpp == 1) return epuc("monochrome", "BMP type not supported: 1-bit");
+   flip_vertically = ((int) img_y) > 0;
+   img_y = abs((int) img_y);
+   if (hsz == 12) {
+      if (bpp < 24)
+         psize = (offset - 14 - 24) / 3;
+   } else {
+      compress = get32le();
+      if (compress == 1 || compress == 2) return epuc("BMP RLE", "BMP type not supported: RLE");
+      get32le(); // discard sizeof
+      get32le(); // discard hres
+      get32le(); // discard vres
+      get32le(); // discard colorsused
+      get32le(); // discard max important
+      if (hsz == 40 || hsz == 56) {
+         if (hsz == 56) {
+            get32le();
+            get32le();
+            get32le();
+            get32le();
+         }
+         if (bpp == 16 || bpp == 32) {
+            mr = mg = mb = 0;
+            if (compress == 0) {
+               if (bpp == 32) {
+                  mr = 0xff << 16;
+                  mg = 0xff <<  8;
+                  mb = 0xff <<  0;
+               } else {
+                  mr = 31 << 10;
+                  mg = 31 <<  5;
+                  mb = 31 <<  0;
+               }
+            } else if (compress == 3) {
+               mr = get32le();
+               mg = get32le();
+               mb = get32le();
+               // not documented, but generated by photoshop and handled by mspaint
+               if (mr == mg && mg == mb) {
+                  // ?!?!?
+                  return NULL;
+               }
+            } else
+               return NULL;
+         }
+      } else {
+         assert(hsz == 108);
+         mr = get32le();
+         mg = get32le();
+         mb = get32le();
+         ma = get32le();
+         get32le(); // discard color space
+         for (i=0; i < 12; ++i)
+            get32le(); // discard color space parameters
+      }
+      if (bpp < 16)
+         psize = (offset - 14 - hsz) >> 2;
+   }
+   img_n = ma ? 4 : 3;
+   if (req_comp && req_comp >= 3) // we can directly decode 3 or 4
+      target = req_comp;
+   else
+      target = img_n; // if they want monochrome, we'll post-convert
+   out = (stbi_uc *) malloc(target * img_x * img_y);
+   if (!out) return epuc("outofmem", "Out of memory");
+   if (bpp < 16) {
+      int z=0;
+      if (psize == 0 || psize > 256) return epuc("invalid", "Corrupt BMP");
+      for (i=0; i < psize; ++i) {
+         pal[i][2] = get8();
+         pal[i][1] = get8();
+         pal[i][0] = get8();
+         if (hsz != 12) get8();
+         pal[i][3] = 255;
+      }
+      skip(offset - 14 - hsz - psize * (hsz == 12 ? 3 : 4));
+      if (bpp == 4) width = (img_x + 1) >> 1;
+      else if (bpp == 8) width = img_x;
+      else return epuc("bad bpp", "Corrupt BMP");
+      pad = (-width)&3;
+      for (j=0; j < (int) img_y; ++j) {
+         for (i=0; i < (int) img_x; i += 2) {
+            int v=get8(),v2=0;
+            if (bpp == 4) {
+               v2 = v & 15;
+               v >>= 4;
+            }
+            out[z++] = pal[v][0];
+            out[z++] = pal[v][1];
+            out[z++] = pal[v][2];
+            if (target == 4) out[z++] = 255;
+            if (i+1 == (int) img_x) break;
+            v = (bpp == 8) ? get8() : v2;
+            out[z++] = pal[v][0];
+            out[z++] = pal[v][1];
+            out[z++] = pal[v][2];
+            if (target == 4) out[z++] = 255;
+         }
+         skip(pad);
+      }
+   } else {
+      int rshift=0,gshift=0,bshift=0,ashift=0,rcount=0,gcount=0,bcount=0,acount=0;
+      int z = 0;
+      int easy=0;
+      skip(offset - 14 - hsz);
+      if (bpp == 24) width = 3 * img_x;
+      else if (bpp == 16) width = 2*img_x;
+      else /* bpp = 32 and pad = 0 */ width=0;
+      pad = (-width) & 3;
+      if (bpp == 24) {
+         easy = 1;
+      } else if (bpp == 32) {
+         if (mb == 0xff && mg == 0xff00 && mr == 0x00ff0000 && ma == 0xff000000)
+            easy = 2;
+      }
+      if (!easy) {
+         if (!mr || !mg || !mb) return epuc("bad masks", "Corrupt BMP");
+         // right shift amt to put high bit in position #7
+         rshift = high_bit(mr)-7; rcount = bitcount(mr);
+         gshift = high_bit(mg)-7; gcount = bitcount(mg);
+         bshift = high_bit(mb)-7; bcount = bitcount(mb);
+         ashift = high_bit(ma)-7; acount = bitcount(ma);
+      }
+      for (j=0; j < (int) img_y; ++j) {
+         if (easy) {
+            for (i=0; i < (int) img_x; ++i) {
+               int a;
+               out[z+2] = get8();
+               out[z+1] = get8();
+               out[z+0] = get8();
+               z += 3;
+               a = (easy == 2 ? get8() : 255);
+               if (target == 4) out[z++] = a;
+            }
+         } else {
+            for (i=0; i < (int) img_x; ++i) {
+               unsigned long v = (bpp == 16 ? get16le() : get32le());
+               int a;
+               out[z++] = shiftsigned(v & mr, rshift, rcount);
+               out[z++] = shiftsigned(v & mg, gshift, gcount);
+               out[z++] = shiftsigned(v & mb, bshift, bcount);
+               a = (ma ? shiftsigned(v & ma, ashift, acount) : 255);
+               if (target == 4) out[z++] = a; 
+            }
+         }
+         skip(pad);
+      }
+   }
+   if (flip_vertically) {
+      stbi_uc t;
+      for (j=0; j < (int) img_y>>1; ++j) {
+         stbi_uc *p1 = out +      j     *img_x*target;
+         stbi_uc *p2 = out + (img_y-1-j)*img_x*target;
+         for (i=0; i < (int) img_x*target; ++i) {
+            t = p1[i], p1[i] = p2[i], p2[i] = t;
+         }
+      }
+   }
+
+   if (req_comp && req_comp != target) {
+      out = convert_format(out, target, req_comp);
+      if (out == NULL) return out; // convert_format frees input on failure
+   }
+
+   *x = img_x;
+   *y = img_y;
+   if (comp) *comp = img_n; // components in the file, not the requested count
+   return out;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_bmp_load             (char *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_bmp_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_bmp_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp)
+{
+   start_file(f);
+   return bmp_load(x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_bmp_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   start_mem(buffer, len);
+   return bmp_load(x,y,comp,req_comp);
+}
+
+// Targa Truevision - TGA
+// by Jonathan Dummer
+
+static int tga_test(void)
+{
+	int sz;
+	get8u();		//	discard Offset
+	sz = get8u();	//	color type
+	if( sz > 1 ) return 0;	//	only RGB or indexed allowed
+	sz = get8u();	//	image type
+	if( (sz != 1) && (sz != 2) && (sz != 3) && (sz != 9) && (sz != 10) && (sz != 11) ) return 0;	//	only RGB or grey allowed, +/- RLE
+	get16();		//	discard palette start
+	get16();		//	discard palette length
+	get8();			//	discard bits per palette color entry
+	get16();		//	discard x origin
+	get16();		//	discard y origin
+	if( get16() < 1 ) return 0;		//	test width
+	if( get16() < 1 ) return 0;		//	test height
+	sz = get8();	//	bits per pixel
+	if( (sz != 8) && (sz != 16) && (sz != 24) && (sz != 32) ) return 0;	//	only RGB or RGBA or grey allowed
+	return 1;		//	seems to have passed everything
+}
+
+#ifndef STBI_NO_STDIO
+int      stbi_tga_test_file        (FILE *f)
+{
+   int r,n = ftell(f);
+   start_file(f);
+   r = tga_test();
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int      stbi_tga_test_memory      (stbi_uc *buffer, int len)
+{
+   start_mem(buffer, len);
+   return tga_test();
+}
+
+static stbi_uc *tga_load(int *x, int *y, int *comp, int req_comp)
+{
+	//	read in the TGA header stuff
+	int tga_offset = get8u();
+	int tga_indexed = get8u();
+	int tga_image_type = get8u();
+	int tga_is_RLE = 0;
+	int tga_palette_start = get16le();
+	int tga_palette_len = get16le();
+	int tga_palette_bits = get8u();
+	int tga_x_origin = get16le();
+	int tga_y_origin = get16le();
+	int tga_width = get16le();
+	int tga_height = get16le();
+	int tga_bits_per_pixel = get8u();
+	int tga_inverted = get8u();
+	//	image data
+	unsigned char *tga_data;
+	unsigned char *tga_palette = NULL;
+	int i, j;
+	unsigned char raw_data[4];
+	unsigned char trans_data[4];
+	int RLE_count = 0;
+	int RLE_repeating = 0;
+	int read_next_pixel = 1;
+	//	do a tiny bit of preprocessing
+	if( tga_image_type >= 8 )
+	{
+		tga_image_type -= 8;
+		tga_is_RLE = 1;
+	}
+	/* int tga_alpha_bits = tga_inverted & 15; */
+	tga_inverted = 1 - ((tga_inverted >> 5) & 1);
+
+	//	error check
+	if( //(tga_indexed) ||
+		(tga_width < 1) || (tga_height < 1) ||
+		(tga_image_type < 1) || (tga_image_type > 3) ||
+		((tga_bits_per_pixel != 8) && (tga_bits_per_pixel != 16) &&
+		(tga_bits_per_pixel != 24) && (tga_bits_per_pixel != 32))
+		)
+	{
+		return NULL;
+	}
+
+	//	If I'm paletted, then I'll use the number of bits from the palette
+	if( tga_indexed )
+	{
+		tga_bits_per_pixel = tga_palette_bits;
+	}
+
+	//	tga info
+	*x = tga_width;
+	*y = tga_height;
+	if( (req_comp < 1) || (req_comp > 4) )
+	{
+		//	just use whatever the file was
+		req_comp = tga_bits_per_pixel / 8;
+		if( comp ) *comp = req_comp;
+	} else
+	{
+		//	caller forces the output component count; still report the file's own count
+		if( comp ) *comp = tga_bits_per_pixel/8;
+	}
+	tga_data = (unsigned char*)malloc( tga_width * tga_height * req_comp );
+	if( tga_data == NULL ) return epuc("outofmem", "Out of memory");
+
+	//	skip to the data's starting position (offset usually = 0)
+	skip( tga_offset );
+	//	do I need to load a palette?
+	if( tga_indexed )
+	{
+		//	any data to skip? (offset usually = 0)
+		skip( tga_palette_start );
+		//	load the palette
+		tga_palette = (unsigned char*)malloc( tga_palette_len * tga_palette_bits / 8 );
+		getn( tga_palette, tga_palette_len * tga_palette_bits / 8 );
+	}
+	//	load the data
+	for( i = 0; i < tga_width * tga_height; ++i )
+	{
+		//	if I'm in RLE mode, do I need to get a RLE chunk?
+		if( tga_is_RLE )
+		{
+			if( RLE_count == 0 )
+			{
+				//	yep, get the next byte as a RLE command
+				int RLE_cmd = get8u();
+				RLE_count = 1 + (RLE_cmd & 127);
+				RLE_repeating = RLE_cmd >> 7;
+				read_next_pixel = 1;
+			} else if( !RLE_repeating )
+			{
+				read_next_pixel = 1;
+			}
+		} else
+		{
+			read_next_pixel = 1;
+		}
+		//	OK, if I need to read a pixel, do it now
+		if( read_next_pixel )
+		{
+			//	load however much data we did have
+			if( tga_indexed )
+			{
+				//	read in 1 byte, then perform the lookup
+				int pal_idx = get8u();
+				if( pal_idx >= tga_palette_len )
+				{
+					//	invalid index
+					pal_idx = 0;
+				}
+				pal_idx *= tga_bits_per_pixel / 8;
+				for( j = 0; j*8 < tga_bits_per_pixel; ++j )
+				{
+					raw_data[j] = tga_palette[pal_idx+j];
+				}
+			} else
+			{
+				//	read in the data raw
+				for( j = 0; j*8 < tga_bits_per_pixel; ++j )
+				{
+					raw_data[j] = get8u();
+				}
+			}
+			//	convert raw to the intermediate format
+			switch( tga_bits_per_pixel )
+			{
+			case 8:
+				//	Luminance => RGBA
+				trans_data[0] = raw_data[0];
+				trans_data[1] = raw_data[0];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = 255;
+				break;
+			case 16:
+				//	Luminance,Alpha => RGBA
+				trans_data[0] = raw_data[0];
+				trans_data[1] = raw_data[0];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = raw_data[1];
+				break;
+			case 24:
+				//	BGR => RGBA
+				trans_data[0] = raw_data[2];
+				trans_data[1] = raw_data[1];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = 255;
+				break;
+			case 32:
+				//	BGRA => RGBA
+				trans_data[0] = raw_data[2];
+				trans_data[1] = raw_data[1];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = raw_data[3];
+				break;
+			}
+			//	clear the reading flag for the next pixel
+			read_next_pixel = 0;
+		} // end of reading a pixel
+		//	convert to final format
+		switch( req_comp )
+		{
+		case 1:
+			//	RGBA => Luminance
+			tga_data[i*req_comp+0] = compute_y(trans_data[0],trans_data[1],trans_data[2]);
+			break;
+		case 2:
+			//	RGBA => Luminance,Alpha
+			tga_data[i*req_comp+0] = compute_y(trans_data[0],trans_data[1],trans_data[2]);
+			tga_data[i*req_comp+1] = trans_data[3];
+			break;
+		case 3:
+			//	RGBA => RGB
+			tga_data[i*req_comp+0] = trans_data[0];
+			tga_data[i*req_comp+1] = trans_data[1];
+			tga_data[i*req_comp+2] = trans_data[2];
+			break;
+		case 4:
+			//	RGBA => RGBA
+			tga_data[i*req_comp+0] = trans_data[0];
+			tga_data[i*req_comp+1] = trans_data[1];
+			tga_data[i*req_comp+2] = trans_data[2];
+			tga_data[i*req_comp+3] = trans_data[3];
+			break;
+		}
+		//	in case we're in RLE mode, keep counting down
+		--RLE_count;
+	}
+	//	do I need to invert the image?
+	if( tga_inverted )
+	{
+		for( j = 0; j*2 < tga_height; ++j )
+		{
+			int index1 = j * tga_width * req_comp;
+			int index2 = (tga_height - 1 - j) * tga_width * req_comp;
+			for( i = tga_width * req_comp; i > 0; --i )
+			{
+				unsigned char temp = tga_data[index1];
+				tga_data[index1] = tga_data[index2];
+				tga_data[index2] = temp;
+				++index1;
+				++index2;
+			}
+		}
+	}
+	//	clear my palette, if I had one
+	if( tga_palette != NULL )
+	{
+		free( tga_palette );
+	}
+	//	the things I do to get rid of an error message, and yet keep
+	//	Microsoft's C compilers happy... [8^(
+	tga_palette_start = tga_palette_len = tga_palette_bits =
+			tga_x_origin = tga_y_origin = 0;
+	//	OK, done
+	return tga_data;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_tga_load             (char *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_tga_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_tga_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp)
+{
+   start_file(f);
+   return tga_load(x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_tga_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   start_mem(buffer, len);
+   return tga_load(x,y,comp,req_comp);
+}
+
+
+// *************************************************************************************************
+// Photoshop PSD loader -- PD by Thatcher Ulrich, integration by Nicholas Schulz, tweaked by STB
+
+static int psd_test(void)
+{
+	if (get32() != 0x38425053) return 0;	// "8BPS"
+	else return 1;
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_psd_test_file(FILE *f)
+{
+   int r,n = ftell(f);
+   start_file(f);
+   r = psd_test();
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_psd_test_memory(stbi_uc *buffer, int len)
+{
+   start_mem(buffer, len);
+   return psd_test();
+}
+
+static stbi_uc *psd_load(int *x, int *y, int *comp, int req_comp)
+{
+	int	pixelCount;
+	int channelCount, compression;
+	int channel, i, count, len;
+   int w,h;
+
+	// Check identifier
+	if (get32() != 0x38425053)	// "8BPS"
+		return epuc("not PSD", "Corrupt PSD image");
+
+	// Check file type version.
+	if (get16() != 1)
+		return epuc("wrong version", "Unsupported version of PSD image");
+
+	// Skip 6 reserved bytes.
+	skip( 6 );
+
+	// Read the number of channels (R, G, B, A, etc).
+	channelCount = get16();
+	if (channelCount < 0 || channelCount > 16)
+		return epuc("wrong channel count", "Unsupported number of channels in PSD image");
+
+	// Read the rows and columns of the image.
+   h = get32();
+   w = get32();
+	
+	// Make sure the depth is 8 bits.
+	if (get16() != 8)
+		return epuc("unsupported bit depth", "PSD bit depth is not 8 bit");
+
+	// Make sure the color mode is RGB.
+	// Valid options are:
+	//   0: Bitmap
+	//   1: Grayscale
+	//   2: Indexed color
+	//   3: RGB color
+	//   4: CMYK color
+	//   7: Multichannel
+	//   8: Duotone
+	//   9: Lab color
+	if (get16() != 3)
+		return epuc("wrong color format", "PSD is not in RGB color format");
+
+	// Skip the Mode Data.  (It's the palette for indexed color; other info for other modes.)
+	skip(get32() );
+
+	// Skip the image resources.  (resolution, pen tool paths, etc)
+	skip( get32() );
+
+	// Skip the reserved data.
+	skip( get32() );
+
+	// Find out if the data is compressed.
+	// Known values:
+	//   0: no compression
+	//   1: RLE compressed
+	compression = get16();
+	if (compression > 1)
+		return epuc("unknown compression type", "PSD has an unknown compression format");
+
+	// Create the destination image.
+	out = (stbi_uc *) malloc(4 * w*h);
+	if (!out) return epuc("outofmem", "Out of memory");
+   pixelCount = w*h;
+
+	// Initialize the data to zero.
+	//memset( out, 0, pixelCount * 4 );
+	
+	// Finally, the image data.
+	if (compression) {
+		// RLE as used by .PSD and .TIFF
+		// Loop until you get the number of unpacked bytes you are expecting:
+		//     Read the next source byte into n.
+		//     If n is between 0 and 127 inclusive, copy the next n+1 bytes literally.
+		//     Else if n is between -127 and -1 inclusive, copy the next byte -n+1 times.
+		//     Else if n is -128 (byte value 128), noop.
+		// Endloop
+
+		// The RLE-compressed data is preceded by a 2-byte data count for each row in the data,
+		// which we're going to just skip.
+		skip( h * channelCount * 2 );
+
+		// Read the RLE data by channel.
+		for (channel = 0; channel < 4; channel++) {
+			uint8 *p;
+			
+         p = out+channel;
+			if (channel >= channelCount) {
+				// Fill this channel with default data.
+				for (i = 0; i < pixelCount; i++) *p = (channel == 3 ? 255 : 0), p += 4;
+			} else {
+				// Read the RLE data.
+				count = 0;
+				while (count < pixelCount) {
+					len = get8();
+					if (len == 128) {
+						// No-op.
+					} else if (len < 128) {
+						// Copy next len+1 bytes literally.
+						len++;
+						count += len;
+						while (len) {
+							*p = get8();
+                     p += 4;
+							len--;
+						}
+					} else if (len > 128) {
+						uint32	val;
+						// Next -len+1 bytes in the dest are replicated from next source byte.
+						// (Interpret len as a negative 8-bit int.)
+						len ^= 0x0FF;
+						len += 2;
+                  val = get8();
+						count += len;
+						while (len) {
+							*p = val;
+                     p += 4;
+							len--;
+						}
+					}
+				}
+			}
+		}
+		
+	} else {
+		// We're at the raw image data.  It's each channel in order (Red, Green, Blue, Alpha, ...)
+		// where each channel consists of an 8-bit value for each pixel in the image.
+		
+		// Read the data by channel.
+		for (channel = 0; channel < 4; channel++) {
+			uint8 *p;
+			
+         p = out + channel;
+			if (channel >= channelCount) {
+				// Fill this channel with default data.
+				for (i = 0; i < pixelCount; i++) *p = channel == 3 ? 255 : 0, p += 4;
+			} else {
+				// Read the data.
+				count = 0;
+				for (i = 0; i < pixelCount; i++)
+					*p = get8(), p += 4;
+			}
+		}
+	}
+
+	if (req_comp && req_comp != 4) {
+      img_x = w;
+      img_y = h;
+		out = convert_format(out, 4, req_comp);
+		if (out == NULL) return out; // convert_format frees input on failure
+	}
+
+	if (comp) *comp = channelCount;
+	*y = h;
+	*x = w;
+	
+	return out;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_psd_load(char *filename, int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_psd_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_psd_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   start_file(f);
+   return psd_load(x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_psd_load_from_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   start_mem(buffer, len);
+   return psd_load(x,y,comp,req_comp);
+}
+
+
+// *************************************************************************************************
+// Radiance RGBE HDR loader
+// originally by Nicolas Schulz
+#ifndef STBI_NO_HDR
+static int hdr_test(void)
+{
+   const char *signature = "#?RADIANCE\n";
+   int i;
+   for (i=0; signature[i]; ++i)
+      if (get8() != signature[i])
+         return 0;
+	return 1;
+}
+
+int stbi_hdr_test_memory(stbi_uc *buffer, int len)
+{
+	start_mem(buffer, len);
+	return hdr_test();
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_hdr_test_file(FILE *f)
+{
+   int r,n = ftell(f);
+   start_file(f);
+   r = hdr_test();
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+#define HDR_BUFLEN  1024
+static char *hdr_gettoken(char *buffer)
+{
+   int len=0;
+	char *s = buffer, c = '\0';
+
+   c = get8();
+
+	while (!at_eof() && c != '\n') {
+		buffer[len++] = c;
+      if (len == HDR_BUFLEN-1) {
+         // flush to end of line
+         while (!at_eof() && get8() != '\n')
+            ;
+         break;
+      }
+      c = get8();
+	}
+
+   buffer[len] = 0;
+	return buffer;
+}
+
+static void hdr_convert(float *output, stbi_uc *input, int req_comp)
+{
+	if( input[3] != 0 ) {
+      float f1;
+		// Exponent
+		f1 = (float) ldexp(1.0f, input[3] - (int)(128 + 8));
+      if (req_comp <= 2)
+         output[0] = (input[0] + input[1] + input[2]) * f1 / 3;
+      else {
+         output[0] = input[0] * f1;
+         output[1] = input[1] * f1;
+         output[2] = input[2] * f1;
+      }
+      if (req_comp == 2) output[1] = 1;
+      if (req_comp == 4) output[3] = 1;
+	} else {
+      switch (req_comp) {
+         case 4: output[3] = 1; /* fallthrough */
+         case 3: output[0] = output[1] = output[2] = 0;
+                 break;
+         case 2: output[1] = 1; /* fallthrough */
+         case 1: output[0] = 0;
+                 break;
+      }
+	}
+}
+
+
+static float *hdr_load(int *x, int *y, int *comp, int req_comp)
+{
+   char buffer[HDR_BUFLEN];
+	char *token;
+	int valid = 0;
+	int width, height;
+   stbi_uc *scanline;
+	float *hdr_data;
+	int len;
+	unsigned char count, value;
+	int i, j, k, c1,c2, z;
+
+
+	// Check identifier
+	if (strcmp(hdr_gettoken(buffer), "#?RADIANCE") != 0)
+		return epf("not HDR", "Corrupt HDR image");
+	
+	// Parse header
+	while(1) {
+		token = hdr_gettoken(buffer);
+      if (token[0] == 0) break;
+		if (strcmp(token, "FORMAT=32-bit_rle_rgbe") == 0) valid = 1;
+   }
+
+	if (!valid)    return epf("unsupported format", "Unsupported HDR format");
+
+   // Parse width and height
+   // can't use sscanf() if we're not using stdio!
+   token = hdr_gettoken(buffer);
+   if (strncmp(token, "-Y ", 3))  return epf("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   height = strtol(token, &token, 10);
+   while (*token == ' ') ++token;
+   if (strncmp(token, "+X ", 3))  return epf("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   width = strtol(token, NULL, 10);
+
+	*x = width;
+	*y = height;
+
+   if (comp) *comp = 3;
+	if (req_comp == 0) req_comp = 3;
+
+	// Read data
+	hdr_data = (float *) malloc(height * width * req_comp * sizeof(float));
+	if (!hdr_data) return epf("outofmem", "Out of memory");
+
+	// Load image data
+   // image data is stored as some number of sca
+	if( width < 8 || width >= 32768) {
+		// Read flat data
+      for (j=0; j < height; ++j) {
+         for (i=0; i < width; ++i) {
+            stbi_uc rgbe[4];
+           main_decode_loop:
+            getn(rgbe, 4);
+            hdr_convert(hdr_data + j * width * req_comp + i * req_comp, rgbe, req_comp);
+         }
+      }
+	} else {
+		// Read RLE-encoded data
+		scanline = NULL;
+
+		for (j = 0; j < height; ++j) {
+         c1 = get8();
+         c2 = get8();
+         len = get8();
+         if (c1 != 2 || c2 != 2 || (len & 0x80)) {
+            // not run-length encoded, so we have to actually use THIS data as a decoded
+            // pixel (note this can't be a valid pixel--one of RGB must be >= 128)
+            stbi_uc rgbe[4] = { c1,c2,len, get8() };
+            hdr_convert(hdr_data, rgbe, req_comp);
+            i = 1;
+            j = 0;
+            free(scanline);
+            goto main_decode_loop; // yes, this is fucking insane; blame the fucking insane format
+         }
+         len <<= 8;
+         len |= get8();
+         if (len != width) { free(hdr_data); free(scanline); return epf("invalid decoded scanline length", "corrupt HDR"); }
+         if (scanline == NULL) scanline = (stbi_uc *) malloc(width * 4);
+				
+			for (k = 0; k < 4; ++k) {
+				i = 0;
+				while (i < width) {
+					count = get8();
+					if (count > 128) {
+						// Run
+						value = get8();
+                  count -= 128;
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = value;
+					} else {
+						// Dump
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = get8();
+					}
+				}
+			}
+         for (i=0; i < width; ++i)
+            hdr_convert(hdr_data+(j*width + i)*req_comp, scanline + i*4, req_comp);
+		}
+      free(scanline);
+	}
+
+   return hdr_data;
+}
+
+#ifndef STBI_NO_STDIO
+float *stbi_hdr_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   start_file(f);
+   return hdr_load(x,y,comp,req_comp);
+}
+#endif
+
+float *stbi_hdr_load_from_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   start_mem(buffer, len);
+   return hdr_load(x,y,comp,req_comp);
+}
+
+#endif // STBI_NO_HDR
+
+/////////////////////// write image ///////////////////////
+
+#ifndef STBI_NO_WRITE
+
+static void write8(FILE *f, int x) { uint8 z = (uint8) x; fwrite(&z,1,1,f); }
+
+static void writefv(FILE *f, char *fmt, va_list v)
+{
+   while (*fmt) {
+      switch (*fmt++) {
+         case ' ': break;
+         case '1': { uint8 x = va_arg(v, int); write8(f,x); break; }
+         case '2': { int16 x = va_arg(v, int); write8(f,x); write8(f,x>>8); break; }
+         case '4': { int32 x = va_arg(v, int); write8(f,x); write8(f,x>>8); write8(f,x>>16); write8(f,x>>24); break; }
+         default:
+            assert(0);
+            // no va_end here: the caller that ran va_start owns the va_end
+            return;
+      }
+   }
+}
+
+static void writef(FILE *f, char *fmt, ...)
+{
+   va_list v;
+   va_start(v, fmt);
+   writefv(f,fmt,v);
+   va_end(v);
+}
+
+static void write_pixels(FILE *f, int rgb_dir, int vdir, int x, int y, int comp, void *data, int write_alpha, int scanline_pad)
+{
+   uint8 bg[3] = { 255, 0, 255}, px[3];
+   uint32 zero = 0;
+   int i,j,k, j_end;
+
+   if (vdir < 0) 
+      j_end = -1, j = y-1;
+   else
+      j_end =  y, j = 0;
+
+   for (; j != j_end; j += vdir) {
+      for (i=0; i < x; ++i) {
+         uint8 *d = (uint8 *) data + (j*x+i)*comp;
+         if (write_alpha < 0)
+            fwrite(&d[comp-1], 1, 1, f);
+         switch (comp) {
+            case 1:
+            case 2: writef(f, "111", d[0],d[0],d[0]);
+                    break;
+            case 4:
+               if (!write_alpha) {
+                  for (k=0; k < 3; ++k)
+                     px[k] = bg[k] + ((d[k] - bg[k]) * d[3])/255;
+                  writef(f, "111", px[1-rgb_dir],px[1],px[1+rgb_dir]);
+                  break;
+               }
+               /* FALLTHROUGH */
+            case 3:
+               writef(f, "111", d[1-rgb_dir],d[1],d[1+rgb_dir]);
+               break;
+         }
+         if (write_alpha > 0)
+            fwrite(&d[comp-1], 1, 1, f);
+      }
+      fwrite(&zero,scanline_pad,1,f);
+   }
+}
+
+static int outfile(char *filename, int rgb_dir, int vdir, int x, int y, int comp, void *data, int alpha, int pad, char *fmt, ...)
+{
+   FILE *f = fopen(filename, "wb");
+   if (f) {
+      va_list v;
+      va_start(v, fmt);
+      writefv(f, fmt, v);
+      va_end(v);
+      write_pixels(f,rgb_dir,vdir,x,y,comp,data,alpha,pad);
+      fclose(f);
+   }
+   return f != NULL;
+}
+
+int stbi_write_bmp(char *filename, int x, int y, int comp, void *data)
+{
+   int pad = (-x*3) & 3;
+   return outfile(filename,-1,-1,x,y,comp,data,0,pad,
+           "11 4 22 4" "4 44 22 444444",
+           'B', 'M', 14+40+(x*3+pad)*y, 0,0, 14+40,  // file header
+            40, x,y, 1,24, 0,0,0,0,0,0);             // bitmap header
+}
+
+int stbi_write_tga(char *filename, int x, int y, int comp, void *data)
+{
+   int has_alpha = !(comp & 1);
+   return outfile(filename, -1,-1, x, y, comp, data, has_alpha, 0,
+                  "111 221 2222 11", 0,0,2, 0,0,0, 0,0,x,y, 24+8*has_alpha, 8*has_alpha);
+}
+
+// any other image formats that do interleaved rgb data?
+//    PNG: requires adler32,crc32 -- significant amount of code
+//    PSD: no, channels output separately
+//    TIFF: no, stripwise-interleaved... i think
+
+#endif // STBI_NO_WRITE
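
Review note (not part of the patch): `stbi_write_bmp` above pads every 24-bit row to a multiple of four bytes with `(-x*3) & 3` and writes the total file size as `14+40+(x*3+pad)*y`. The sketch below isolates that arithmetic so it can be checked by hand. The helper names `bmp_row_pad` and `bmp_file_size` are hypothetical and do not exist in stb_image; the code assumes two's-complement integers, as the original expression does.

```c
#include <assert.h>

/* Hypothetical helper: padding bytes appended to one 24-bit BMP row so
   that the row length becomes a multiple of 4. Mirrors the expression
   (-x*3) & 3 used by stbi_write_bmp; assumes two's-complement ints. */
static int bmp_row_pad(int width)
{
   assert(width > 0);
   return (-width * 3) & 3;
}

/* Hypothetical helper: total size in bytes of a 24-bit BMP that has a
   14-byte file header followed by a 40-byte bitmap info header, which
   is the layout stbi_write_bmp emits. */
static long bmp_file_size(int width, int height)
{
   assert(height > 0);
   return 14L + 40L + (long)(width * 3 + bmp_row_pad(width)) * height;
}
```

For a 2x2 image, each row holds 6 bytes of pixel data plus 2 bytes of padding, so `bmp_file_size(2, 2)` comes to 70 bytes. A row whose pixel data is already a multiple of four bytes (for example a width of 4) gets no padding.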
diff --git a/external/include/SOIL/original/stb_image-1.16.c b/external/include/SOIL/original/stb_image-1.16.c
new file mode 100644
index 0000000..cfa8dc8
--- /dev/null
+++ b/external/include/SOIL/original/stb_image-1.16.c
@@ -0,0 +1,3821 @@
+/* stbi-1.16 - public domain JPEG/PNG reader - http://nothings.org/stb_image.c
+                      when you control the images you're loading
+
+   QUICK NOTES:
+      Primarily of interest to game developers and other people who can
+          avoid problematic images and only need the trivial interface
+
+      JPEG baseline (no JPEG progressive, no oddball channel decimations)
+      PNG non-interlaced
+      BMP non-1bpp, non-RLE
+      TGA (not sure what subset, if a subset)
+      PSD (composited view only, no extra channels)
+      HDR (radiance rgbE format)
+      writes BMP,TGA (define STBI_NO_WRITE to remove code)
+      decoded from memory or through stdio FILE (define STBI_NO_STDIO to remove code)
+      supports installable dequantizing-IDCT, YCbCr-to-RGB conversion (define STBI_SIMD)
+        
+   TODO:
+      stbi_info_*
+  
+   history:
+      1.16   major bugfix - convert_format converted one too many pixels
+      1.15   initialize some fields for thread safety
+      1.14   fix threadsafe conversion bug; header-file-only version (#define STBI_HEADER_FILE_ONLY before including)
+      1.13   threadsafe
+      1.12   const qualifiers in the API
+      1.11   Support installable IDCT, colorspace conversion routines
+      1.10   Fixes for 64-bit (don't use "unsigned long")
+             optimized upsampling by Fabian "ryg" Giesen
+      1.09   Fix format-conversion for PSD code (bad global variables!)
+      1.08   Thatcher Ulrich's PSD code integrated by Nicolas Schulz
+      1.07   attempt to fix C++ warning/errors again
+      1.06   attempt to fix C++ warning/errors again
+      1.05   fix TGA loading to return correct *comp and use good luminance calc
+      1.04   default float alpha is 1, not 255; use 'void *' for stbi_image_free
+      1.03   bugfixes to STBI_NO_STDIO, STBI_NO_HDR
+      1.02   support for (subset of) HDR files, float interface for preferred access to them
+      1.01   fix bug: possible bug in handling right-side up bmps... not sure
+             fix bug: the stbi_bmp_load() and stbi_tga_load() functions didn't work at all
+      1.00   interface to zlib that skips zlib header
+      0.99   correct handling of alpha in palette
+      0.98   TGA loader by lonesock; dynamically add loaders (untested)
+      0.97   jpeg errors on too large a file; also catch another malloc failure
+      0.96   fix detection of invalid v value - particleman@mollyrocket forum
+      0.95   during header scan, seek to markers in case of padding
+      0.94   STBI_NO_STDIO to disable stdio usage; rename all #defines the same
+      0.93   handle jpegtran output; verbose errors
+      0.92   read 4,8,16,24,32-bit BMP files of several formats
+      0.91   output 24-bit Windows 3.0 BMP files
+      0.90   fix a few more warnings; bump version number to approach 1.0
+      0.61   bugfixes due to Marc LeBlanc, Christopher Lloyd
+      0.60   fix compiling as c++
+      0.59   fix warnings: merge Dave Moore's -Wall fixes
+      0.58   fix bug: zlib uncompressed mode len/nlen was wrong endian
+      0.57   fix bug: jpg last huffman symbol before marker was >9 bits but less
+                      than 16 available
+      0.56   fix bug: zlib uncompressed mode len vs. nlen
+      0.55   fix bug: restart_interval not initialized to 0
+      0.54   allow NULL for 'int *comp'
+      0.53   fix bug in png 3->4; speedup png decoding
+      0.52   png handles req_comp=3,4 directly; minor cleanup; jpeg comments
+      0.51   obey req_comp requests, 1-component jpegs return as 1-component,
+             on 'test' only check type, not whether we support this variant
+*/
+
+
+#ifndef STBI_INCLUDE_STB_IMAGE_H
+#define STBI_INCLUDE_STB_IMAGE_H
+
+////   begin header file  ////////////////////////////////////////////////////
+//
+// Limitations:
+//    - no progressive/interlaced support (jpeg, png)
+//    - 8-bit samples only (jpeg, png)
+//    - not threadsafe
+//    - channel subsampling of at most 2 in each dimension (jpeg)
+//    - no delayed line count (jpeg) -- IJG doesn't support either
+//
+// Basic usage (see HDR discussion below):
+//    int x,y,n;
+//    unsigned char *data = stbi_load(filename, &x, &y, &n, 0);
+//    // ... process data if not NULL ... 
+//    // ... x = width, y = height, n = # 8-bit components per pixel ...
+//    // ... replace '0' with '1'..'4' to force that many components per pixel
+//    stbi_image_free(data);
+//
+// Standard parameters:
+//    int *x       -- outputs image width in pixels
+//    int *y       -- outputs image height in pixels
+//    int *comp    -- outputs # of image components in image file
+//    int req_comp -- if non-zero, # of image components requested in result
+//
+// The return value from an image loader is an 'unsigned char *' which points
+// to the pixel data. The pixel data consists of *y scanlines of *x pixels,
+// with each pixel consisting of N interleaved 8-bit components; the first
+// pixel pointed to is top-left-most in the image. There is no padding between
+// image scanlines or between pixels, regardless of format. The number of
+// components N is 'req_comp' if req_comp is non-zero, or *comp otherwise.
+// If req_comp is non-zero, *comp has the number of components that _would_
+// have been output otherwise. E.g. if you set req_comp to 4, you will always
+// get RGBA output, but you can check *comp to easily see if it's opaque.
+//
+// An output image with N components has the following components interleaved
+// in this order in each pixel:
+//
+//     N=#comp     components
+//       1           grey
+//       2           grey, alpha
+//       3           red, green, blue
+//       4           red, green, blue, alpha
+//
+// If image loading fails for any reason, the return value will be NULL,
+// and *x, *y, *comp will be unchanged. The function stbi_failure_reason()
+// can be queried for an extremely brief, end-user unfriendly explanation
+// of why the load failed. Define STBI_NO_FAILURE_STRINGS to avoid
+// compiling these strings at all, and STBI_FAILURE_USERMSG to get slightly
+// more user-friendly ones.
+//
+// Paletted PNG and BMP images are automatically depalettized.
+//
+//
+// ===========================================================================
+//
+// HDR image support   (disable by defining STBI_NO_HDR)
+//
+// stb_image now supports loading HDR images in general, and currently
+// the Radiance .HDR file format, although the support is provided
+// generically. You can still load any file through the existing interface;
+// if you attempt to load an HDR file, it will be automatically remapped to
+// LDR, assuming gamma 2.2 and an arbitrary scale factor defaulting to 1;
+// both of these constants can be reconfigured through this interface:
+//
+//     stbi_hdr_to_ldr_gamma(2.2f);
+//     stbi_hdr_to_ldr_scale(1.0f);
+//
+// (note, do not use _inverse_ constants; stb_image will invert them
+// appropriately).
+//
+// Additionally, there is a new, parallel interface for loading files as
+// (linear) floats to preserve the full dynamic range:
+//
+//    float *data = stbi_loadf(filename, &x, &y, &n, 0);
+// 
+// If you load LDR images through this interface, those images will
+// be promoted to floating point values, run through the inverse of
+// constants corresponding to the above:
+//
+//     stbi_ldr_to_hdr_scale(1.0f);
+//     stbi_ldr_to_hdr_gamma(2.2f);
+//
+// Finally, given a filename (or an open file or memory block--see header
+// file for details) containing image data, you can query for the "most
+// appropriate" interface to use (that is, whether the image is HDR or
+// not), using:
+//
+//     stbi_is_hdr(char const *filename);
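+//
+// A minimal sketch of choosing between the two interfaces (assuming
+// stdio is available and 'filename' names a readable file; the variable
+// names are illustrative):
+//
+//     int x,y,n;
+//     if (stbi_is_hdr(filename)) {
+//        float *hdr = stbi_loadf(filename, &x, &y, &n, 0);
+//        // ... use hdr if non-NULL, then stbi_image_free(hdr)
+//     } else {
+//        unsigned char *ldr = stbi_load(filename, &x, &y, &n, 0);
+//        // ... use ldr if non-NULL, then stbi_image_free(ldr)
+//     }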
+
+#ifndef STBI_NO_STDIO
+#include <stdio.h>
+#endif
+
+#define STBI_VERSION 1
+
+enum
+{
+   STBI_default = 0, // only used for req_comp
+
+   STBI_grey       = 1,
+   STBI_grey_alpha = 2,
+   STBI_rgb        = 3,
+   STBI_rgb_alpha  = 4,
+};
+
+typedef unsigned char stbi_uc;
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// WRITING API
+
+#if !defined(STBI_NO_WRITE) && !defined(STBI_NO_STDIO)
+// write a BMP/TGA file given tightly packed 'comp' channels (no padding, nor bmp-stride-padding)
+// (you must include the appropriate extension in the filename).
+// returns TRUE (non-zero) on success, FALSE (0) if the file couldn't be opened or a write error occurred
+extern int      stbi_write_bmp       (char const *filename,     int x, int y, int comp, void *data);
+extern int      stbi_write_tga       (char const *filename,     int x, int y, int comp, void *data);
+#endif
+
+// PRIMARY API - works on images of any type
+
+// load image by filename, open file, or memory buffer
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_load            (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_load_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+extern int      stbi_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern stbi_uc *stbi_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+// for stbi_load_from_file, file pointer is left pointing immediately after image
+
+#ifndef STBI_NO_HDR
+#ifndef STBI_NO_STDIO
+extern float *stbi_loadf            (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern float *stbi_loadf_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+extern float *stbi_loadf_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+
+extern void   stbi_hdr_to_ldr_gamma(float gamma);
+extern void   stbi_hdr_to_ldr_scale(float scale);
+
+extern void   stbi_ldr_to_hdr_gamma(float gamma);
+extern void   stbi_ldr_to_hdr_scale(float scale);
+
+#endif // STBI_NO_HDR
+
+// get a VERY brief reason for failure
+// NOT THREADSAFE
+extern char    *stbi_failure_reason  (void); 
+
+// free the loaded image -- this is just free()
+extern void     stbi_image_free      (void *retval_from_stbi_load);
+
+// get image dimensions & components without fully decoding
+extern int      stbi_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+extern int      stbi_is_hdr_from_memory(stbi_uc const *buffer, int len);
+#ifndef STBI_NO_STDIO
+extern int      stbi_info            (char const *filename,     int *x, int *y, int *comp);
+extern int      stbi_is_hdr          (char const *filename);
+extern int      stbi_is_hdr_from_file(FILE *f);
+#endif
+
+// ZLIB client - used by PNG, available for other purposes
+
+extern char *stbi_zlib_decode_malloc_guesssize(const char *buffer, int len, int initial_size, int *outlen);
+extern char *stbi_zlib_decode_malloc(const char *buffer, int len, int *outlen);
+extern int   stbi_zlib_decode_buffer(char *obuffer, int olen, const char *ibuffer, int ilen);
+
+extern char *stbi_zlib_decode_noheader_malloc(const char *buffer, int len, int *outlen);
+extern int   stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, const char *ibuffer, int ilen);
+
+// TYPE-SPECIFIC ACCESS
+
+// is it a jpeg?
+extern int      stbi_jpeg_test_memory     (stbi_uc const *buffer, int len);
+extern stbi_uc *stbi_jpeg_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+extern int      stbi_jpeg_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_jpeg_load            (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern int      stbi_jpeg_test_file       (FILE *f);
+extern stbi_uc *stbi_jpeg_load_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+
+extern int      stbi_jpeg_info            (char const *filename,     int *x, int *y, int *comp);
+extern int      stbi_jpeg_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+
+// is it a png?
+extern int      stbi_png_test_memory      (stbi_uc const *buffer, int len);
+extern stbi_uc *stbi_png_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_png_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info             (char const *filename,     int *x, int *y, int *comp);
+extern int      stbi_png_test_file        (FILE *f);
+extern stbi_uc *stbi_png_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info_from_file   (FILE *f,                  int *x, int *y, int *comp);
+#endif
+
+// is it a bmp?
+extern int      stbi_bmp_test_memory      (stbi_uc const *buffer, int len);
+
+extern stbi_uc *stbi_bmp_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_bmp_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_bmp_test_file        (FILE *f);
+extern stbi_uc *stbi_bmp_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it a tga?
+extern int      stbi_tga_test_memory      (stbi_uc const *buffer, int len);
+
+extern stbi_uc *stbi_tga_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_tga_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_tga_test_file        (FILE *f);
+extern stbi_uc *stbi_tga_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it a psd?
+extern int      stbi_psd_test_memory      (stbi_uc const *buffer, int len);
+
+extern stbi_uc *stbi_psd_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_psd_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_psd_test_file        (FILE *f);
+extern stbi_uc *stbi_psd_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it an hdr?
+extern int      stbi_hdr_test_memory      (stbi_uc const *buffer, int len);
+
+extern float *  stbi_hdr_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern float *  stbi_hdr_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_hdr_test_file        (FILE *f);
+extern float *  stbi_hdr_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// define new loaders
+typedef struct
+{
+   int       (*test_memory)(stbi_uc const *buffer, int len);
+   stbi_uc * (*load_from_memory)(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+   #ifndef STBI_NO_STDIO
+   int       (*test_file)(FILE *f);
+   stbi_uc * (*load_from_file)(FILE *f, int *x, int *y, int *comp, int req_comp);
+   #endif
+} stbi_loader;
+
+// register a loader by filling out the above structure (you must define ALL functions)
+// returns 1 if added or already added, 0 if not added (too many loaders)
+// NOT THREADSAFE
+extern int stbi_register_loader(stbi_loader *loader);
+
+// define faster low-level operations (typically SIMD support)
+#if STBI_SIMD
+typedef void (*stbi_idct_8x8)(uint8 *out, int out_stride, short data[64], unsigned short *dequantize);
+// compute an integer IDCT on "input"
+//     input[x] = data[x] * dequantize[x]
+//     write results to 'out': 64 samples, each run of 8 spaced by 'out_stride'
+//                             CLAMP results to 0..255
+typedef void (*stbi_YCbCr_to_RGB_run)(uint8 *output, uint8 const *y, uint8 const *cb, uint8 const *cr, int count, int step);
+// compute a conversion from YCbCr to RGB
+//     'count' pixels
+//     write pixels to 'output'; each pixel is 'step' bytes (either 3 or 4; if 4, write '255' as 4th), order R,G,B
+//     y: Y input channel
+//     cb: Cb input channel; scale/biased to be 0..255
+//     cr: Cr input channel; scale/biased to be 0..255
+
+extern void stbi_install_idct(stbi_idct_8x8 func);
+extern void stbi_install_YCbCr_to_RGB(stbi_YCbCr_to_RGB_run func);
+#endif // STBI_SIMD
+
+#ifdef __cplusplus
+}
+#endif
+
+//
+//
+////   end header file   /////////////////////////////////////////////////////
+#endif // STBI_INCLUDE_STB_IMAGE_H
+
+#ifndef STBI_HEADER_FILE_ONLY
+
+#ifndef STBI_NO_HDR
+#include <math.h>  // ldexp
+#include <string.h> // strcmp
+#endif
+
+#ifndef STBI_NO_STDIO
+#include <stdio.h>
+#endif
+#include <stdlib.h>
+#include <memory.h>
+#include <assert.h>
+#include <stdarg.h>
+
+#ifndef _MSC_VER
+  #ifdef __cplusplus
+  #define __forceinline inline
+  #else
+  #define __forceinline
+  #endif
+#endif
+
+
+// implementation:
+typedef unsigned char uint8;
+typedef unsigned short uint16;
+typedef   signed short  int16;
+typedef unsigned int   uint32;
+typedef   signed int    int32;
+typedef unsigned int   uint;
+
+// should produce compiler error if size is wrong
+typedef unsigned char validate_uint32[sizeof(uint32)==4];
+
+#if defined(STBI_NO_STDIO) && !defined(STBI_NO_WRITE)
+#define STBI_NO_WRITE
+#endif
+
+//////////////////////////////////////////////////////////////////////////////
+//
+// Generic API that works on all image types
+//
+
+// this is not threadsafe
+static char *failure_reason;
+
+char *stbi_failure_reason(void)
+{
+   return failure_reason;
+}
+
+static int e(char *str)
+{
+   failure_reason = str;
+   return 0;
+}
+
+#ifdef STBI_NO_FAILURE_STRINGS
+   #define e(x,y)  0
+#elif defined(STBI_FAILURE_USERMSG)
+   #define e(x,y)  e(y)
+#else
+   #define e(x,y)  e(x)
+#endif
+
+#define epf(x,y)   ((float *) (e(x,y)?NULL:NULL))
+#define epuc(x,y)  ((unsigned char *) (e(x,y)?NULL:NULL))
+
+void stbi_image_free(void *retval_from_stbi_load)
+{
+   free(retval_from_stbi_load);
+}
+
+#define MAX_LOADERS  32
+stbi_loader *loaders[MAX_LOADERS];
+static int max_loaders = 0;
+
+int stbi_register_loader(stbi_loader *loader)
+{
+   int i;
+   for (i=0; i < MAX_LOADERS; ++i) {
+      // already present?
+      if (loaders[i] == loader)
+         return 1;
+      // end of the list?
+      if (loaders[i] == NULL) {
+         loaders[i] = loader;
+         max_loaders = i+1;
+         return 1;
+      }
+   }
+   // no room for it
+   return 0;
+}
+
+#ifndef STBI_NO_HDR
+static float   *ldr_to_hdr(stbi_uc *data, int x, int y, int comp);
+static stbi_uc *hdr_to_ldr(float   *data, int x, int y, int comp);
+#endif
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   FILE *f = fopen(filename, "rb");
+   unsigned char *result;
+   if (!f) return epuc("can't fopen", "Unable to open file");
+   result = stbi_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return result;
+}
+
+unsigned char *stbi_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   int i;
+   if (stbi_jpeg_test_file(f))
+      return stbi_jpeg_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_png_test_file(f))
+      return stbi_png_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_bmp_test_file(f))
+      return stbi_bmp_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_psd_test_file(f))
+      return stbi_psd_load_from_file(f,x,y,comp,req_comp);
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_file(f)) {
+      float *hdr = stbi_hdr_load_from_file(f, x,y,comp,req_comp);
+      return hdr ? hdr_to_ldr(hdr, *x, *y, req_comp ? req_comp : *comp) : NULL; // don't convert if the HDR load failed
+   }
+   #endif
+   for (i=0; i < max_loaders; ++i)
+      if (loaders[i]->test_file(f))
+         return loaders[i]->load_from_file(f,x,y,comp,req_comp);
+   // test tga last because it's a crappy test!
+   if (stbi_tga_test_file(f))
+      return stbi_tga_load_from_file(f,x,y,comp,req_comp);
+   return epuc("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+unsigned char *stbi_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   int i;
+   if (stbi_jpeg_test_memory(buffer,len))
+      return stbi_jpeg_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_png_test_memory(buffer,len))
+      return stbi_png_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_bmp_test_memory(buffer,len))
+      return stbi_bmp_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_psd_test_memory(buffer,len))
+      return stbi_psd_load_from_memory(buffer,len,x,y,comp,req_comp);
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_memory(buffer, len)) {
+      float *hdr = stbi_hdr_load_from_memory(buffer, len,x,y,comp,req_comp);
+      return hdr ? hdr_to_ldr(hdr, *x, *y, req_comp ? req_comp : *comp) : NULL; // don't convert if the HDR load failed
+   }
+   #endif
+   for (i=0; i < max_loaders; ++i)
+      if (loaders[i]->test_memory(buffer,len))
+         return loaders[i]->load_from_memory(buffer,len,x,y,comp,req_comp);
+   // test tga last because it's a crappy test!
+   if (stbi_tga_test_memory(buffer,len))
+      return stbi_tga_load_from_memory(buffer,len,x,y,comp,req_comp);
+   return epuc("unknown image type", "Image not of any known type, or corrupt");
+}
+
+#ifndef STBI_NO_HDR
+
+#ifndef STBI_NO_STDIO
+float *stbi_loadf(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   FILE *f = fopen(filename, "rb");
+   float *result;
+   if (!f) return epf("can't fopen", "Unable to open file");
+   result = stbi_loadf_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return result;
+}
+
+float *stbi_loadf_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_file(f))
+      return stbi_hdr_load_from_file(f,x,y,comp,req_comp);
+   #endif
+   data = stbi_load_from_file(f, x, y, comp, req_comp);
+   if (data)
+      return ldr_to_hdr(data, *x, *y, req_comp ? req_comp : *comp);
+   return epf("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+float *stbi_loadf_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_memory(buffer, len))
+      return stbi_hdr_load_from_memory(buffer, len,x,y,comp,req_comp);
+   #endif
+   data = stbi_load_from_memory(buffer, len, x, y, comp, req_comp);
+   if (data)
+      return ldr_to_hdr(data, *x, *y, req_comp ? req_comp : *comp);
+   return epf("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+// these is-hdr-or-not functions are defined independent of whether STBI_NO_HDR is
+// defined, for API simplicity; if STBI_NO_HDR is defined, it always
+// reports false!
+
+int stbi_is_hdr_from_memory(stbi_uc const *buffer, int len)
+{
+   #ifndef STBI_NO_HDR
+   return stbi_hdr_test_memory(buffer, len);
+   #else
+   return 0;
+   #endif
+}
+
+#ifndef STBI_NO_STDIO
+extern int      stbi_is_hdr          (char const *filename)
+{
+   FILE *f = fopen(filename, "rb");
+   int result=0;
+   if (f) {
+      result = stbi_is_hdr_from_file(f);
+      fclose(f);
+   }
+   return result;
+}
+
+extern int      stbi_is_hdr_from_file(FILE *f)
+{
+   #ifndef STBI_NO_HDR
+   return stbi_hdr_test_file(f);
+   #else
+   return 0;
+   #endif
+}
+
+#endif
+
+// @TODO: get image dimensions & components without fully decoding
+#ifndef STBI_NO_STDIO
+extern int      stbi_info            (char const *filename,           int *x, int *y, int *comp);
+extern int      stbi_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_HDR
+static float h2l_gamma_i=1.0f/2.2f, h2l_scale_i=1.0f;
+static float l2h_gamma=2.2f, l2h_scale=1.0f;
+
+void   stbi_hdr_to_ldr_gamma(float gamma) { h2l_gamma_i = 1/gamma; }
+void   stbi_hdr_to_ldr_scale(float scale) { h2l_scale_i = 1/scale; }
+
+void   stbi_ldr_to_hdr_gamma(float gamma) { l2h_gamma = gamma; }
+void   stbi_ldr_to_hdr_scale(float scale) { l2h_scale = scale; }
+#endif
+
+
+//////////////////////////////////////////////////////////////////////////////
+//
+// Common code used by all image loaders
+//
+
+enum
+{
+   SCAN_load=0,
+   SCAN_type,
+   SCAN_header,
+};
+
+typedef struct
+{
+   uint32 img_x, img_y;
+   int img_n, img_out_n;
+
+   #ifndef STBI_NO_STDIO
+   FILE  *img_file;
+   #endif
+   uint8 *img_buffer, *img_buffer_end;
+} stbi;
+
+#ifndef STBI_NO_STDIO
+static void start_file(stbi *s, FILE *f)
+{
+   s->img_file = f;
+}
+#endif
+
+static void start_mem(stbi *s, uint8 const *buffer, int len)
+{
+#ifndef STBI_NO_STDIO
+   s->img_file = NULL;
+#endif
+   s->img_buffer = (uint8 *) buffer;
+   s->img_buffer_end = (uint8 *) buffer+len;
+}
+
+__forceinline static int get8(stbi *s)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file) {
+      int c = fgetc(s->img_file);
+      return c == EOF ? 0 : c;
+   }
+#endif
+   if (s->img_buffer < s->img_buffer_end)
+      return *s->img_buffer++;
+   return 0;
+}
+
+__forceinline static int at_eof(stbi *s)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file)
+      return feof(s->img_file);
+#endif
+   return s->img_buffer >= s->img_buffer_end;   
+}
+
+__forceinline static uint8 get8u(stbi *s)
+{
+   return (uint8) get8(s);
+}
+
+static void skip(stbi *s, int n)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file)
+      fseek(s->img_file, n, SEEK_CUR);
+   else
+#endif
+      s->img_buffer += n;
+}
+
+static int get16(stbi *s)
+{
+   int z = get8(s);
+   return (z << 8) + get8(s);
+}
+
+static uint32 get32(stbi *s)
+{
+   uint32 z = get16(s);
+   return (z << 16) + get16(s);
+}
+
+static int get16le(stbi *s)
+{
+   int z = get8(s);
+   return z + (get8(s) << 8);
+}
+
+static uint32 get32le(stbi *s)
+{
+   uint32 z = get16le(s);
+   return z + ((uint32) get16le(s) << 16); // cast first so the shift is done unsigned
+}
+
+static void getn(stbi *s, stbi_uc *buffer, int n)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file) {
+      fread(buffer, 1, n, s->img_file);
+      return;
+   }
+#endif
+   memcpy(buffer, s->img_buffer, n);
+   s->img_buffer += n;
+}
+
+//////////////////////////////////////////////////////////////////////////////
+//
+//  generic converter from built-in img_n to req_comp
+//    individual types do this automatically as much as possible (e.g. jpeg
+//    does all cases internally since it needs to colorspace convert anyway,
+//    and it never has alpha, so very few cases ). png can automatically
+//    interleave an alpha=255 channel, but falls back to this for other cases
+//
+//  assume data buffer is malloced, so malloc a new one and free that one
+//  only failure mode is malloc failing
+
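+// integer luma approximation: the weights 77/256, 150/256 and 29/256
+// approximate 0.299, 0.587 and 0.114, and sum to exactly 256
+static uint8 compute_y(int r, int g, int b)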
+{
+   return (uint8) (((r*77) + (g*150) +  (29*b)) >> 8);
+}
+
+static unsigned char *convert_format(unsigned char *data, int img_n, int req_comp, uint x, uint y)
+{
+   int i,j;
+   unsigned char *good;
+
+   if (req_comp == img_n) return data;
+   assert(req_comp >= 1 && req_comp <= 4);
+
+   good = (unsigned char *) malloc(req_comp * x * y);
+   if (good == NULL) {
+      free(data);
+      return epuc("outofmem", "Out of memory");
+   }
+
+   for (j=0; j < (int) y; ++j) {
+      unsigned char *src  = data + j * x * img_n   ;
+      unsigned char *dest = good + j * x * req_comp;
+
+      #define COMBO(a,b)  ((a)*8+(b))
+      #define CASE(a,b)   case COMBO(a,b): for(i=x-1; i >= 0; --i, src += a, dest += b)
+      // convert source image with img_n components to one with req_comp components;
+      // avoid switch per pixel, so use switch per scanline and massive macros
+      switch(COMBO(img_n, req_comp)) {
+         CASE(1,2) dest[0]=src[0], dest[1]=255; break;
+         CASE(1,3) dest[0]=dest[1]=dest[2]=src[0]; break;
+         CASE(1,4) dest[0]=dest[1]=dest[2]=src[0], dest[3]=255; break;
+         CASE(2,1) dest[0]=src[0]; break;
+         CASE(2,3) dest[0]=dest[1]=dest[2]=src[0]; break;
+         CASE(2,4) dest[0]=dest[1]=dest[2]=src[0], dest[3]=src[1]; break;
+         CASE(3,4) dest[0]=src[0],dest[1]=src[1],dest[2]=src[2],dest[3]=255; break;
+         CASE(3,1) dest[0]=compute_y(src[0],src[1],src[2]); break;
+         CASE(3,2) dest[0]=compute_y(src[0],src[1],src[2]), dest[1] = 255; break;
+         CASE(4,1) dest[0]=compute_y(src[0],src[1],src[2]); break;
+         CASE(4,2) dest[0]=compute_y(src[0],src[1],src[2]), dest[1] = src[3]; break;
+         CASE(4,3) dest[0]=src[0],dest[1]=src[1],dest[2]=src[2]; break;
+         default: assert(0);
+      }
+      #undef CASE
+   }
+
+   free(data);
+   return good;
+}
+
+#ifndef STBI_NO_HDR
+static float   *ldr_to_hdr(stbi_uc *data, int x, int y, int comp)
+{
+   int i,k,n;
+   float *output = (float *) malloc(x * y * comp * sizeof(float));
+   if (output == NULL) { free(data); return epf("outofmem", "Out of memory"); }
+   // compute number of non-alpha components
+   if (comp & 1) n = comp; else n = comp-1;
+   for (i=0; i < x*y; ++i) {
+      for (k=0; k < n; ++k) {
+         output[i*comp + k] = (float) pow(data[i*comp+k]/255.0f, l2h_gamma) * l2h_scale;
+      }
+      if (k < comp) output[i*comp + k] = data[i*comp+k]/255.0f;
+   }
+   free(data);
+   return output;
+}
+
+#define float2int(x)   ((int) (x))
+static stbi_uc *hdr_to_ldr(float   *data, int x, int y, int comp)
+{
+   int i,k,n;
+   stbi_uc *output = (stbi_uc *) malloc(x * y * comp);
+   if (output == NULL) { free(data); return epuc("outofmem", "Out of memory"); }
+   // compute number of non-alpha components
+   if (comp & 1) n = comp; else n = comp-1;
+   for (i=0; i < x*y; ++i) {
+      for (k=0; k < n; ++k) {
+         float z = (float) pow(data[i*comp+k]*h2l_scale_i, h2l_gamma_i) * 255 + 0.5f;
+         if (z < 0) z = 0;
+         if (z > 255) z = 255;
+         output[i*comp + k] = float2int(z);
+      }
+      if (k < comp) {
+         float z = data[i*comp+k] * 255 + 0.5f;
+         if (z < 0) z = 0;
+         if (z > 255) z = 255;
+         output[i*comp + k] = float2int(z);
+      }
+   }
+   free(data);
+   return output;
+}
+#endif
+
+//////////////////////////////////////////////////////////////////////////////
+//
+//  "baseline" JPEG/JFIF decoder (not actually fully baseline implementation)
+//
+//    simple implementation
+//      - channel subsampling of at most 2 in each dimension
+//      - doesn't support delayed output of y-dimension
+//      - simple interface (only one output format: 8-bit interleaved RGB)
+//      - doesn't try to recover corrupt jpegs
+//      - doesn't allow partial loading, loading multiple at once
+//      - still fast on x86 (copying globals into locals doesn't help x86)
+//      - allocates lots of intermediate memory (full size of all components)
+//        - non-interleaved case requires this anyway
+//        - allows good upsampling (see next)
+//    high-quality
+//      - upsampled channels are bilinearly interpolated, even across blocks
+//      - quality integer IDCT derived from IJG's 'slow'
+//    performance
+//      - fast huffman; reasonable integer IDCT
+//      - uses a lot of intermediate memory, could cache poorly
+//      - load http://nothings.org/remote/anemones.jpg 3 times on 2.8GHz P4
+//          stb_jpeg:   1.34 seconds (MSVC6, default release build)
+//          stb_jpeg:   1.06 seconds (MSVC6, processor = Pentium Pro)
+//          IJL11.dll:  1.08 seconds (compiled by intel)
+//          IJG 1998:   0.98 seconds (MSVC6, makefile provided by IJG)
+//          IJG 1998:   0.95 seconds (MSVC6, makefile + proc=PPro)
+
+// huffman decoding acceleration
+#define FAST_BITS   9  // larger handles more cases; smaller stomps less cache
+
+typedef struct
+{
+   uint8  fast[1 << FAST_BITS];
+   // weirdly, repacking this into AoS is a 10% speed loss, instead of a win
+   uint16 code[256];
+   uint8  values[256];
+   uint8  size[257];
+   unsigned int maxcode[18];
+   int    delta[17];   // old 'firstsymbol' - old 'firstcode'
+} huffman;
+
+typedef struct
+{
+   #if STBI_SIMD
+   unsigned short dequant2[4][64];
+   #endif
+   stbi s;
+   huffman huff_dc[4];
+   huffman huff_ac[4];
+   uint8 dequant[4][64];
+
+// sizes for components, interleaved MCUs
+   int img_h_max, img_v_max;
+   int img_mcu_x, img_mcu_y;
+   int img_mcu_w, img_mcu_h;
+
+// definition of jpeg image component
+   struct
+   {
+      int id;
+      int h,v;
+      int tq;
+      int hd,ha;
+      int dc_pred;
+
+      int x,y,w2,h2;
+      uint8 *data;
+      void *raw_data;
+      uint8 *linebuf;
+   } img_comp[4];
+
+   uint32         code_buffer; // jpeg entropy-coded buffer
+   int            code_bits;   // number of valid bits
+   unsigned char  marker;      // marker seen while filling entropy buffer
+   int            nomore;      // flag if we saw a marker so must stop
+
+   int scan_n, order[4];
+   int restart_interval, todo;
+} jpeg;
+
+static int build_huffman(huffman *h, int *count)
+{
+   int i,j,k=0,code;
+   // build size list for each symbol (from JPEG spec)
+   for (i=0; i < 16; ++i)
+      for (j=0; j < count[i]; ++j)
+         h->size[k++] = (uint8) (i+1);
+   h->size[k] = 0;
+
+   // compute actual symbols (from jpeg spec)
+   code = 0;
+   k = 0;
+   for(j=1; j <= 16; ++j) {
+      // compute delta to add to code to compute symbol id
+      h->delta[j] = k - code;
+      if (h->size[k] == j) {
+         while (h->size[k] == j)
+            h->code[k++] = (uint16) (code++);
+         if (code-1 >= (1 << j)) return e("bad code lengths","Corrupt JPEG");
+      }
+      // compute largest code + 1 for this size, preshifted as needed later
+      h->maxcode[j] = code << (16-j);
+      code <<= 1;
+   }
+   h->maxcode[j] = 0xffffffff;
+
+   // build non-spec acceleration table; 255 is flag for not-accelerated
+   memset(h->fast, 255, 1 << FAST_BITS);
+   for (i=0; i < k; ++i) {
+      int s = h->size[i];
+      if (s <= FAST_BITS) {
+         int c = h->code[i] << (FAST_BITS-s);
+         int m = 1 << (FAST_BITS-s);
+         for (j=0; j < m; ++j) {
+            h->fast[c+j] = (uint8) i;
+         }
+      }
+   }
+   return 1;
+}
+
+static void grow_buffer_unsafe(jpeg *j)
+{
+   do {
+      int b = j->nomore ? 0 : get8(&j->s);
+      if (b == 0xff) {
+         int c = get8(&j->s);
+         if (c != 0) {
+            j->marker = (unsigned char) c;
+            j->nomore = 1;
+            return;
+         }
+      }
+      j->code_buffer = (j->code_buffer << 8) | b;
+      j->code_bits += 8;
+   } while (j->code_bits <= 24);
+}
+
+// (1 << n) - 1
+static uint32 bmask[17]={0,1,3,7,15,31,63,127,255,511,1023,2047,4095,8191,16383,32767,65535};
+
+// decode a jpeg huffman value from the bitstream
+__forceinline static int decode(jpeg *j, huffman *h)
+{
+   unsigned int temp;
+   int c,k;
+
+   if (j->code_bits < 16) grow_buffer_unsafe(j);
+
+   // look at the top FAST_BITS bits and determine what symbol ID it is,
+   // if the code length is <= FAST_BITS
+   c = (j->code_buffer >> (j->code_bits - FAST_BITS)) & ((1 << FAST_BITS)-1);
+   k = h->fast[c];
+   if (k < 255) {
+      if (h->size[k] > j->code_bits)
+         return -1;
+      j->code_bits -= h->size[k];
+      return h->values[k];
+   }
+
+   // naive test is to shift the code_buffer down so k bits are
+   // valid, then test against maxcode. To speed this up, we've
+   // preshifted maxcode left so that it has (16-k) 0s at the
+   // end; in other words, regardless of the number of bits, it
+   // wants to be compared against something shifted to have 16;
+   // that way we don't need to shift inside the loop.
+   if (j->code_bits < 16)
+      temp = (j->code_buffer << (16 - j->code_bits)) & 0xffff;
+   else
+      temp = (j->code_buffer >> (j->code_bits - 16)) & 0xffff;
+   for (k=FAST_BITS+1 ; ; ++k)
+      if (temp < h->maxcode[k])
+         break;
+   if (k == 17) {
+      // error! code not found
+      j->code_bits -= 16;
+      return -1;
+   }
+
+   if (k > j->code_bits)
+      return -1;
+
+   // convert the huffman code to the symbol id
+   c = ((j->code_buffer >> (j->code_bits - k)) & bmask[k]) + h->delta[k];
+   assert((((j->code_buffer) >> (j->code_bits - h->size[c])) & bmask[h->size[c]]) == h->code[c]);
+
+   // convert the id to a symbol
+   j->code_bits -= k;
+   return h->values[c];
+}
+
+// combined JPEG 'receive' and JPEG 'extend', since baseline
+// always extends everything it receives.
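+// Worked example (illustrative only): for n=3, raw bits 010 give k=2,
+// which is below m=4, so the result is (-1 << 3) + 2 + 1 = -5; raw bits
+// 101 give k=5, which is not below m, so the result is 5.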
+__forceinline static int extend_receive(jpeg *j, int n)
+{
+   unsigned int m = 1 << (n-1);
+   unsigned int k;
+   if (j->code_bits < n) grow_buffer_unsafe(j);
+   k = (j->code_buffer >> (j->code_bits - n)) & bmask[n];
+   j->code_bits -= n;
+   // the following test is probably a random branch that won't
+   // predict well. I tried to table accelerate it but failed.
+   // maybe it's compiling as a conditional move?
+   if (k < m)
+      return (-1 << n) + k + 1;
+   else
+      return k;
+}
+
+// given a value that's at position X in the zigzag stream,
+// where does it appear in the 8x8 matrix coded as row-major?
+static uint8 dezigzag[64+15] =
+{
+    0,  1,  8, 16,  9,  2,  3, 10,
+   17, 24, 32, 25, 18, 11,  4,  5,
+   12, 19, 26, 33, 40, 48, 41, 34,
+   27, 20, 13,  6,  7, 14, 21, 28,
+   35, 42, 49, 56, 57, 50, 43, 36,
+   29, 22, 15, 23, 30, 37, 44, 51,
+   58, 59, 52, 45, 38, 31, 39, 46,
+   53, 60, 61, 54, 47, 55, 62, 63,
+   // let corrupt input sample past end
+   63, 63, 63, 63, 63, 63, 63, 63,
+   63, 63, 63, 63, 63, 63, 63
+};
+
+// decode one 64-entry block--
+static int decode_block(jpeg *j, short data[64], huffman *hdc, huffman *hac, int b)
+{
+   int diff,dc,k;
+   int t = decode(j, hdc);
+   if (t < 0) return e("bad huffman code","Corrupt JPEG");
+
+   // 0 all the ac values now so we can do it 32-bits at a time
+   memset(data,0,64*sizeof(data[0]));
+
+   diff = t ? extend_receive(j, t) : 0;
+   dc = j->img_comp[b].dc_pred + diff;
+   j->img_comp[b].dc_pred = dc;
+   data[0] = (short) dc;
+
+   // decode AC components, see JPEG spec
+   k = 1;
+   do {
+      int r,s;
+      int rs = decode(j, hac);
+      if (rs < 0) return e("bad huffman code","Corrupt JPEG");
+      s = rs & 15;
+      r = rs >> 4;
+      if (s == 0) {
+         if (rs != 0xf0) break; // end block
+         k += 16;
+      } else {
+         k += r;
+         // decode into unzigzag'd location
+         data[dezigzag[k++]] = (short) extend_receive(j,s);
+      }
+   } while (k < 64);
+   return 1;
+}
+
+// take a -128..127 value and clamp it and convert to 0..255
+__forceinline static uint8 clamp(int x)
+{
+   x += 128;
+   // trick to use a single test to catch both cases
+   if ((unsigned int) x > 255) {
+      if (x < 0) return 0;
+      if (x > 255) return 255;
+   }
+   return (uint8) x;
+}
+
+#define f2f(x)  ((int) (((x) * 4096 + 0.5)))
+#define fsh(x)  ((x) << 12)
+
+// derived from jidctint -- DCT_ISLOW
+#define IDCT_1D(s0,s1,s2,s3,s4,s5,s6,s7)       \
+   int t0,t1,t2,t3,p1,p2,p3,p4,p5,x0,x1,x2,x3; \
+   p2 = s2;                                    \
+   p3 = s6;                                    \
+   p1 = (p2+p3) * f2f(0.5411961f);             \
+   t2 = p1 + p3*f2f(-1.847759065f);            \
+   t3 = p1 + p2*f2f( 0.765366865f);            \
+   p2 = s0;                                    \
+   p3 = s4;                                    \
+   t0 = fsh(p2+p3);                            \
+   t1 = fsh(p2-p3);                            \
+   x0 = t0+t3;                                 \
+   x3 = t0-t3;                                 \
+   x1 = t1+t2;                                 \
+   x2 = t1-t2;                                 \
+   t0 = s7;                                    \
+   t1 = s5;                                    \
+   t2 = s3;                                    \
+   t3 = s1;                                    \
+   p3 = t0+t2;                                 \
+   p4 = t1+t3;                                 \
+   p1 = t0+t3;                                 \
+   p2 = t1+t2;                                 \
+   p5 = (p3+p4)*f2f( 1.175875602f);            \
+   t0 = t0*f2f( 0.298631336f);                 \
+   t1 = t1*f2f( 2.053119869f);                 \
+   t2 = t2*f2f( 3.072711026f);                 \
+   t3 = t3*f2f( 1.501321110f);                 \
+   p1 = p5 + p1*f2f(-0.899976223f);            \
+   p2 = p5 + p2*f2f(-2.562915447f);            \
+   p3 = p3*f2f(-1.961570560f);                 \
+   p4 = p4*f2f(-0.390180644f);                 \
+   t3 += p1+p4;                                \
+   t2 += p2+p3;                                \
+   t1 += p2+p4;                                \
+   t0 += p1+p3;
+
+#if !STBI_SIMD
+// .344 seconds on 3*anemones.jpg
+static void idct_block(uint8 *out, int out_stride, short data[64], uint8 *dequantize)
+{
+   int i,val[64],*v=val;
+   uint8 *o,*dq = dequantize;
+   short *d = data;
+
+   // columns
+   for (i=0; i < 8; ++i,++d,++dq, ++v) {
+      // if all zeroes, shortcut -- this avoids dequantizing 0s and IDCTing
+      if (d[ 8]==0 && d[16]==0 && d[24]==0 && d[32]==0
+           && d[40]==0 && d[48]==0 && d[56]==0) {
+         //    no shortcut                 0     seconds
+         //    (1|2|3|4|5|6|7)==0          0     seconds
+         //    all separate               -0.047 seconds
+         //    1 && 2|3 && 4|5 && 6|7:    -0.047 seconds
+         int dcterm = d[0] * dq[0] << 2;
+         v[0] = v[8] = v[16] = v[24] = v[32] = v[40] = v[48] = v[56] = dcterm;
+      } else {
+         IDCT_1D(d[ 0]*dq[ 0],d[ 8]*dq[ 8],d[16]*dq[16],d[24]*dq[24],
+                 d[32]*dq[32],d[40]*dq[40],d[48]*dq[48],d[56]*dq[56])
+         // constants scaled things up by 1<<12; let's bring them back
+         // down, but keep 2 extra bits of precision
+         x0 += 512; x1 += 512; x2 += 512; x3 += 512;
+         v[ 0] = (x0+t3) >> 10;
+         v[56] = (x0-t3) >> 10;
+         v[ 8] = (x1+t2) >> 10;
+         v[48] = (x1-t2) >> 10;
+         v[16] = (x2+t1) >> 10;
+         v[40] = (x2-t1) >> 10;
+         v[24] = (x3+t0) >> 10;
+         v[32] = (x3-t0) >> 10;
+      }
+   }
+
+   for (i=0, v=val, o=out; i < 8; ++i,v+=8,o+=out_stride) {
+      // no fast case since the first 1D IDCT spread components out
+      IDCT_1D(v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7])
+      // constants scaled things up by 1<<12, plus we had 1<<2 from first
+      // loop, plus horizontal and vertical each scale by sqrt(8) so together
+      // we've got an extra 1<<3, so 1<<17 total we need to remove.
+      x0 += 65536; x1 += 65536; x2 += 65536; x3 += 65536;
+      o[0] = clamp((x0+t3) >> 17);
+      o[7] = clamp((x0-t3) >> 17);
+      o[1] = clamp((x1+t2) >> 17);
+      o[6] = clamp((x1-t2) >> 17);
+      o[2] = clamp((x2+t1) >> 17);
+      o[5] = clamp((x2-t1) >> 17);
+      o[3] = clamp((x3+t0) >> 17);
+      o[4] = clamp((x3-t0) >> 17);
+   }
+}
+#else
+static void idct_block(uint8 *out, int out_stride, short data[64], unsigned short *dequantize)
+{
+   int i,val[64],*v=val;
+   uint8 *o;
+   unsigned short *dq = dequantize;
+   short *d = data;
+
+   // columns
+   for (i=0; i < 8; ++i,++d,++dq, ++v) {
+      // if all zeroes, shortcut -- this avoids dequantizing 0s and IDCTing
+      if (d[ 8]==0 && d[16]==0 && d[24]==0 && d[32]==0
+           && d[40]==0 && d[48]==0 && d[56]==0) {
+         //    no shortcut                 0     seconds
+         //    (1|2|3|4|5|6|7)==0          0     seconds
+         //    all separate               -0.047 seconds
+         //    1 && 2|3 && 4|5 && 6|7:    -0.047 seconds
+         int dcterm = d[0] * dq[0] << 2;
+         v[0] = v[8] = v[16] = v[24] = v[32] = v[40] = v[48] = v[56] = dcterm;
+      } else {
+         IDCT_1D(d[ 0]*dq[ 0],d[ 8]*dq[ 8],d[16]*dq[16],d[24]*dq[24],
+                 d[32]*dq[32],d[40]*dq[40],d[48]*dq[48],d[56]*dq[56])
+         // constants scaled things up by 1<<12; let's bring them back
+         // down, but keep 2 extra bits of precision
+         x0 += 512; x1 += 512; x2 += 512; x3 += 512;
+         v[ 0] = (x0+t3) >> 10;
+         v[56] = (x0-t3) >> 10;
+         v[ 8] = (x1+t2) >> 10;
+         v[48] = (x1-t2) >> 10;
+         v[16] = (x2+t1) >> 10;
+         v[40] = (x2-t1) >> 10;
+         v[24] = (x3+t0) >> 10;
+         v[32] = (x3-t0) >> 10;
+      }
+   }
+
+   for (i=0, v=val, o=out; i < 8; ++i,v+=8,o+=out_stride) {
+      // no fast case since the first 1D IDCT spread components out
+      IDCT_1D(v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7])
+      // constants scaled things up by 1<<12, plus we had 1<<2 from first
+      // loop, plus horizontal and vertical each scale by sqrt(8) so together
+      // we've got an extra 1<<3, so 1<<17 total we need to remove.
+      x0 += 65536; x1 += 65536; x2 += 65536; x3 += 65536;
+      o[0] = clamp((x0+t3) >> 17);
+      o[7] = clamp((x0-t3) >> 17);
+      o[1] = clamp((x1+t2) >> 17);
+      o[6] = clamp((x1-t2) >> 17);
+      o[2] = clamp((x2+t1) >> 17);
+      o[5] = clamp((x2-t1) >> 17);
+      o[3] = clamp((x3+t0) >> 17);
+      o[4] = clamp((x3-t0) >> 17);
+   }
+}
+static stbi_idct_8x8 stbi_idct_installed = idct_block;
+
+extern void stbi_install_idct(stbi_idct_8x8 func)
+{
+   stbi_idct_installed = func;
+}
+#endif
+
+#define MARKER_none  0xff
+// if there's a pending marker from the entropy stream, return that
+// otherwise, fetch from the stream and get a marker. if there's no
+// marker, return 0xff, which is never a valid marker value
+static uint8 get_marker(jpeg *j)
+{
+   uint8 x;
+   if (j->marker != MARKER_none) { x = j->marker; j->marker = MARKER_none; return x; }
+   x = get8u(&j->s);
+   if (x != 0xff) return MARKER_none;
+   while (x == 0xff)
+      x = get8u(&j->s);
+   return x;
+}
+
+// in each scan, we'll have scan_n components, and the order
+// of the components is specified by order[]
+#define RESTART(x)     ((x) >= 0xd0 && (x) <= 0xd7)
+
+// after a restart interval, reset the entropy decoder and
+// the dc prediction
+static void reset(jpeg *j)
+{
+   j->code_bits = 0;
+   j->code_buffer = 0;
+   j->nomore = 0;
+   j->img_comp[0].dc_pred = j->img_comp[1].dc_pred = j->img_comp[2].dc_pred = 0;
+   j->marker = MARKER_none;
+   j->todo = j->restart_interval ? j->restart_interval : 0x7fffffff;
+   // no more than 1<<31 MCUs if no restart_interval? that's plenty safe,
+   // since we don't even allow 1<<30 pixels
+}
+
+static int parse_entropy_coded_data(jpeg *z)
+{
+   reset(z);
+   if (z->scan_n == 1) {
+      int i,j;
+      #if STBI_SIMD
+      __declspec(align(16))
+      #endif
+      short data[64];
+      int n = z->order[0];
+      // non-interleaved data, we just need to process one block at a time,
+      // in trivial scanline order
+      // number of blocks to do just depends on how many actual "pixels" this
+      // component has, independent of interleaved MCU blocking and such
+      int w = (z->img_comp[n].x+7) >> 3;
+      int h = (z->img_comp[n].y+7) >> 3;
+      for (j=0; j < h; ++j) {
+         for (i=0; i < w; ++i) {
+            if (!decode_block(z, data, z->huff_dc+z->img_comp[n].hd, z->huff_ac+z->img_comp[n].ha, n)) return 0;
+            #if STBI_SIMD
+            stbi_idct_installed(z->img_comp[n].data+z->img_comp[n].w2*j*8+i*8, z->img_comp[n].w2, data, z->dequant2[z->img_comp[n].tq]);
+            #else
+            idct_block(z->img_comp[n].data+z->img_comp[n].w2*j*8+i*8, z->img_comp[n].w2, data, z->dequant[z->img_comp[n].tq]);
+            #endif
+            // every data block is an MCU, so count down the restart interval
+            if (--z->todo <= 0) {
+               if (z->code_bits < 24) grow_buffer_unsafe(z);
+               // if it's NOT a restart, then just bail, so we get corrupt data
+               // rather than no data
+               if (!RESTART(z->marker)) return 1;
+               reset(z);
+            }
+         }
+      }
+   } else { // interleaved!
+      int i,j,k,x,y;
+      short data[64];
+      for (j=0; j < z->img_mcu_y; ++j) {
+         for (i=0; i < z->img_mcu_x; ++i) {
+            // scan an interleaved mcu... process scan_n components in order
+            for (k=0; k < z->scan_n; ++k) {
+               int n = z->order[k];
+               // scan out an mcu's worth of this component; that's just determined
+               // by the basic H and V specified for the component
+               for (y=0; y < z->img_comp[n].v; ++y) {
+                  for (x=0; x < z->img_comp[n].h; ++x) {
+                     int x2 = (i*z->img_comp[n].h + x)*8;
+                     int y2 = (j*z->img_comp[n].v + y)*8;
+                     if (!decode_block(z, data, z->huff_dc+z->img_comp[n].hd, z->huff_ac+z->img_comp[n].ha, n)) return 0;
+                     #if STBI_SIMD
+                     stbi_idct_installed(z->img_comp[n].data+z->img_comp[n].w2*y2+x2, z->img_comp[n].w2, data, z->dequant2[z->img_comp[n].tq]);
+                     #else
+                     idct_block(z->img_comp[n].data+z->img_comp[n].w2*y2+x2, z->img_comp[n].w2, data, z->dequant[z->img_comp[n].tq]);
+                     #endif
+                  }
+               }
+            }
+            // after all interleaved components, that's an interleaved MCU,
+            // so now count down the restart interval
+            if (--z->todo <= 0) {
+               if (z->code_bits < 24) grow_buffer_unsafe(z);
+               // if it's NOT a restart, then just bail, so we get corrupt data
+               // rather than no data
+               if (!RESTART(z->marker)) return 1;
+               reset(z);
+            }
+         }
+      }
+   }
+   return 1;
+}
+
+static int process_marker(jpeg *z, int m)
+{
+   int L;
+   switch (m) {
+      case MARKER_none: // no marker found
+         return e("expected marker","Corrupt JPEG");
+
+      case 0xC2: // SOF - progressive
+         return e("progressive jpeg","JPEG format not supported (progressive)");
+
+      case 0xDD: // DRI - specify restart interval
+         if (get16(&z->s) != 4) return e("bad DRI len","Corrupt JPEG");
+         z->restart_interval = get16(&z->s);
+         return 1;
+
+      case 0xDB: // DQT - define quantization table
+         L = get16(&z->s)-2;
+         while (L > 0) {
+            int q = get8(&z->s);
+            int p = q >> 4;
+            int t = q & 15,i;
+            if (p != 0) return e("bad DQT type","Corrupt JPEG");
+            if (t > 3) return e("bad DQT table","Corrupt JPEG");
+            for (i=0; i < 64; ++i)
+               z->dequant[t][dezigzag[i]] = get8u(&z->s);
+            #if STBI_SIMD
+            for (i=0; i < 64; ++i)
+               z->dequant2[t][i] = z->dequant[t][i];
+            #endif
+            L -= 65;
+         }
+         return L==0;
+
+      case 0xC4: // DHT - define huffman table
+         L = get16(&z->s)-2;
+         while (L > 0) {
+            uint8 *v;
+            int sizes[16],i,n=0; // n = total symbol count (avoids shadowing parameter m)
+            int q = get8(&z->s);
+            int tc = q >> 4;
+            int th = q & 15;
+            if (tc > 1 || th > 3) return e("bad DHT header","Corrupt JPEG");
+            for (i=0; i < 16; ++i) {
+               sizes[i] = get8(&z->s);
+               n += sizes[i];
+            }
+            // a JPEG huffman table defines at most 256 symbols; reject corrupt counts
+            if (n > 256) return e("bad DHT header","Corrupt JPEG");
+            L -= 17;
+            if (tc == 0) {
+               if (!build_huffman(z->huff_dc+th, sizes)) return 0;
+               v = z->huff_dc[th].values;
+            } else {
+               if (!build_huffman(z->huff_ac+th, sizes)) return 0;
+               v = z->huff_ac[th].values;
+            }
+            for (i=0; i < n; ++i)
+               v[i] = get8u(&z->s);
+            L -= n;
+         }
+         return L==0;
+   }
+   // check for comment block or APP blocks
+   if ((m >= 0xE0 && m <= 0xEF) || m == 0xFE) {
+      skip(&z->s, get16(&z->s)-2);
+      return 1;
+   }
+   return 0;
+}
+
+// after we see SOS
+static int process_scan_header(jpeg *z)
+{
+   int i;
+   int Ls = get16(&z->s);
+   z->scan_n = get8(&z->s);
+   if (z->scan_n < 1 || z->scan_n > 4 || z->scan_n > (int) z->s.img_n) return e("bad SOS component count","Corrupt JPEG");
+   if (Ls != 6+2*z->scan_n) return e("bad SOS len","Corrupt JPEG");
+   for (i=0; i < z->scan_n; ++i) {
+      int id = get8(&z->s), which;
+      int q = get8(&z->s);
+      for (which = 0; which < z->s.img_n; ++which)
+         if (z->img_comp[which].id == id)
+            break;
+      if (which == z->s.img_n) return 0;
+      z->img_comp[which].hd = q >> 4;   if (z->img_comp[which].hd > 3) return e("bad DC huff","Corrupt JPEG");
+      z->img_comp[which].ha = q & 15;   if (z->img_comp[which].ha > 3) return e("bad AC huff","Corrupt JPEG");
+      z->order[i] = which;
+   }
+   if (get8(&z->s) != 0) return e("bad SOS","Corrupt JPEG");
+   get8(&z->s); // should be 63, but might be 0
+   if (get8(&z->s) != 0) return e("bad SOS","Corrupt JPEG");
+
+   return 1;
+}
+
+static int process_frame_header(jpeg *z, int scan)
+{
+   stbi *s = &z->s;
+   int Lf,p,i,q, h_max=1,v_max=1,c;
+   Lf = get16(s);         if (Lf < 11) return e("bad SOF len","Corrupt JPEG"); // JPEG
+   p  = get8(s);          if (p != 8) return e("only 8-bit","JPEG format not supported: 8-bit only"); // JPEG baseline
+   s->img_y = get16(s);   if (s->img_y == 0) return e("no header height", "JPEG format not supported: delayed height"); // Legal, but we don't handle it--but neither does IJG
+   s->img_x = get16(s);   if (s->img_x == 0) return e("0 width","Corrupt JPEG"); // JPEG requires
+   c = get8(s);
+   if (c != 3 && c != 1) return e("bad component count","Corrupt JPEG");    // JFIF requires
+   s->img_n = c;
+   for (i=0; i < c; ++i) {
+      z->img_comp[i].data = NULL;
+      z->img_comp[i].linebuf = NULL;
+   }
+
+   if (Lf != 8+3*s->img_n) return e("bad SOF len","Corrupt JPEG");
+
+   for (i=0; i < s->img_n; ++i) {
+      z->img_comp[i].id = get8(s);
+      if (z->img_comp[i].id != i+1)   // JFIF requires
+         if (z->img_comp[i].id != i)  // some version of jpegtran outputs non-JFIF-compliant files!
+            return e("bad component ID","Corrupt JPEG");
+      q = get8(s);
+      z->img_comp[i].h = (q >> 4);  if (!z->img_comp[i].h || z->img_comp[i].h > 4) return e("bad H","Corrupt JPEG");
+      z->img_comp[i].v = q & 15;    if (!z->img_comp[i].v || z->img_comp[i].v > 4) return e("bad V","Corrupt JPEG");
+      z->img_comp[i].tq = get8(s);  if (z->img_comp[i].tq > 3) return e("bad TQ","Corrupt JPEG");
+   }
+
+   if (scan != SCAN_load) return 1;
+
+   if ((1 << 30) / s->img_x / s->img_n < s->img_y) return e("too large", "Image too large to decode");
+
+   for (i=0; i < s->img_n; ++i) {
+      if (z->img_comp[i].h > h_max) h_max = z->img_comp[i].h;
+      if (z->img_comp[i].v > v_max) v_max = z->img_comp[i].v;
+   }
+
+   // compute interleaved mcu info
+   z->img_h_max = h_max;
+   z->img_v_max = v_max;
+   z->img_mcu_w = h_max * 8;
+   z->img_mcu_h = v_max * 8;
+   z->img_mcu_x = (s->img_x + z->img_mcu_w-1) / z->img_mcu_w;
+   z->img_mcu_y = (s->img_y + z->img_mcu_h-1) / z->img_mcu_h;
+
+   for (i=0; i < s->img_n; ++i) {
+      // number of effective pixels (e.g. for non-interleaved MCU)
+      z->img_comp[i].x = (s->img_x * z->img_comp[i].h + h_max-1) / h_max;
+      z->img_comp[i].y = (s->img_y * z->img_comp[i].v + v_max-1) / v_max;
+      // to simplify generation, we'll allocate enough memory to decode
+      // the bogus oversized data from using interleaved MCUs and their
+      // big blocks (e.g. a 16x16 iMCU on an image of width 33); we won't
+      // discard the extra data until colorspace conversion
+      z->img_comp[i].w2 = z->img_mcu_x * z->img_comp[i].h * 8;
+      z->img_comp[i].h2 = z->img_mcu_y * z->img_comp[i].v * 8;
+      z->img_comp[i].raw_data = malloc(z->img_comp[i].w2 * z->img_comp[i].h2+15);
+      if (z->img_comp[i].raw_data == NULL) {
+         for(--i; i >= 0; --i) {
+            free(z->img_comp[i].raw_data);
+            z->img_comp[i].data = NULL;
+         }
+         return e("outofmem", "Out of memory");
+      }
+      // align blocks for installable-idct using mmx/sse
+      z->img_comp[i].data = (uint8*) (((size_t) z->img_comp[i].raw_data + 15) & ~15);
+      z->img_comp[i].linebuf = NULL;
+   }
+
+   return 1;
+}
+
+// use comparisons since in some cases we handle more than one case (e.g. SOF)
+#define DNL(x)         ((x) == 0xdc)
+#define SOI(x)         ((x) == 0xd8)
+#define EOI(x)         ((x) == 0xd9)
+#define SOF(x)         ((x) == 0xc0 || (x) == 0xc1)
+#define SOS(x)         ((x) == 0xda)
+
+static int decode_jpeg_header(jpeg *z, int scan)
+{
+   int m;
+   z->marker = MARKER_none; // initialize cached marker to empty
+   m = get_marker(z);
+   if (!SOI(m)) return e("no SOI","Corrupt JPEG");
+   if (scan == SCAN_type) return 1;
+   m = get_marker(z);
+   while (!SOF(m)) {
+      if (!process_marker(z,m)) return 0;
+      m = get_marker(z);
+      while (m == MARKER_none) {
+         // some files have extra padding after their blocks, so ok, we'll scan
+         if (at_eof(&z->s)) return e("no SOF", "Corrupt JPEG");
+         m = get_marker(z);
+      }
+   }
+   if (!process_frame_header(z, scan)) return 0;
+   return 1;
+}
+
+static int decode_jpeg_image(jpeg *j)
+{
+   int m;
+   j->restart_interval = 0;
+   if (!decode_jpeg_header(j, SCAN_load)) return 0;
+   m = get_marker(j);
+   while (!EOI(m)) {
+      if (SOS(m)) {
+         if (!process_scan_header(j)) return 0;
+         if (!parse_entropy_coded_data(j)) return 0;
+      } else {
+         if (!process_marker(j, m)) return 0;
+      }
+      m = get_marker(j);
+   }
+   return 1;
+}
+
+// static jfif-centered resampling (across block boundaries)
+
+typedef uint8 *(*resample_row_func)(uint8 *out, uint8 *in0, uint8 *in1,
+                                    int w, int hs);
+
+#define div4(x) ((uint8) ((x) >> 2))
+
+static uint8 *resample_row_1(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   return in_near;
+}
+
+static uint8* resample_row_v_2(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // need to generate two samples vertically for every one in input
+   int i;
+   for (i=0; i < w; ++i)
+      out[i] = div4(3*in_near[i] + in_far[i] + 2);
+   return out;
+}
+
+static uint8*  resample_row_h_2(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // need to generate two samples horizontally for every one in input
+   int i;
+   uint8 *input = in_near;
+   if (w == 1) {
+      // if only one sample, can't do any interpolation
+      out[0] = out[1] = input[0];
+      return out;
+   }
+
+   out[0] = input[0];
+   out[1] = div4(input[0]*3 + input[1] + 2);
+   for (i=1; i < w-1; ++i) {
+      int n = 3*input[i]+2;
+      out[i*2+0] = div4(n+input[i-1]);
+      out[i*2+1] = div4(n+input[i+1]);
+   }
+   out[i*2+0] = div4(input[w-2]*3 + input[w-1] + 2);
+   out[i*2+1] = input[w-1];
+   return out;
+}
+
+#define div16(x) ((uint8) ((x) >> 4))
+
+static uint8 *resample_row_hv_2(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // need to generate 2x2 samples for every one in input
+   int i,t0,t1;
+   if (w == 1) {
+      out[0] = out[1] = div4(3*in_near[0] + in_far[0] + 2);
+      return out;
+   }
+
+   t1 = 3*in_near[0] + in_far[0];
+   out[0] = div4(t1+2);
+   for (i=1; i < w; ++i) {
+      t0 = t1;
+      t1 = 3*in_near[i]+in_far[i];
+      out[i*2-1] = div16(3*t0 + t1 + 8);
+      out[i*2  ] = div16(3*t1 + t0 + 8);
+   }
+   out[w*2-1] = div4(t1+2);
+   return out;
+}
+
+static uint8 *resample_row_generic(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // resample with nearest-neighbor
+   int i,j;
+   for (i=0; i < w; ++i)
+      for (j=0; j < hs; ++j)
+         out[i*hs+j] = in_near[i];
+   return out;
+}
+
+#define float2fixed(x)  ((int) ((x) * 65536 + 0.5))
+
+// 0.38 seconds on 3*anemones.jpg   (0.25 with processor = Pro)
+// VC6 without processor=Pro is generating multiple LEAs per multiply!
+static void YCbCr_to_RGB_row(uint8 *out, uint8 *y, uint8 *pcb, uint8 *pcr, int count, int step)
+{
+   int i;
+   for (i=0; i < count; ++i) {
+      int y_fixed = (y[i] << 16) + 32768; // rounding
+      int r,g,b;
+      int cr = pcr[i] - 128;
+      int cb = pcb[i] - 128;
+      r = y_fixed + cr*float2fixed(1.40200f);
+      g = y_fixed - cr*float2fixed(0.71414f) - cb*float2fixed(0.34414f);
+      b = y_fixed                            + cb*float2fixed(1.77200f);
+      r >>= 16;
+      g >>= 16;
+      b >>= 16;
+      if ((unsigned) r > 255) { if (r < 0) r = 0; else r = 255; }
+      if ((unsigned) g > 255) { if (g < 0) g = 0; else g = 255; }
+      if ((unsigned) b > 255) { if (b < 0) b = 0; else b = 255; }
+      out[0] = (uint8)r;
+      out[1] = (uint8)g;
+      out[2] = (uint8)b;
+      out[3] = 255;
+      out += step;
+   }
+}
+
+#if STBI_SIMD
+static stbi_YCbCr_to_RGB_run stbi_YCbCr_installed = YCbCr_to_RGB_row;
+
+void stbi_install_YCbCr_to_RGB(stbi_YCbCr_to_RGB_run func)
+{
+   stbi_YCbCr_installed = func;
+}
+#endif
+
+
+// clean up the temporary component buffers
+static void cleanup_jpeg(jpeg *j)
+{
+   int i;
+   for (i=0; i < j->s.img_n; ++i) {
+      if (j->img_comp[i].data) {
+         free(j->img_comp[i].raw_data);
+         j->img_comp[i].data = NULL;
+      }
+      if (j->img_comp[i].linebuf) {
+         free(j->img_comp[i].linebuf);
+         j->img_comp[i].linebuf = NULL;
+      }
+   }
+}
+
+typedef struct
+{
+   resample_row_func resample;
+   uint8 *line0,*line1;
+   int hs,vs;   // expansion factor in each axis
+   int w_lores; // horizontal pixels pre-expansion 
+   int ystep;   // how far through vertical expansion we are
+   int ypos;    // which pre-expansion row we're on
+} stbi_resample;
+
+static uint8 *load_jpeg_image(jpeg *z, int *out_x, int *out_y, int *comp, int req_comp)
+{
+   int n, decode_n;
+   // validate req_comp
+   if (req_comp < 0 || req_comp > 4) return epuc("bad req_comp", "Internal error");
+   z->s.img_n = 0;
+
+   // load a jpeg image from whichever source
+   if (!decode_jpeg_image(z)) { cleanup_jpeg(z); return NULL; }
+
+   // determine actual number of components to generate
+   n = req_comp ? req_comp : z->s.img_n;
+
+   if (z->s.img_n == 3 && n < 3)
+      decode_n = 1;
+   else
+      decode_n = z->s.img_n;
+
+   // resample and color-convert
+   {
+      int k;
+      uint i,j;
+      uint8 *output;
+      uint8 *coutput[4];
+
+      stbi_resample res_comp[4];
+
+      for (k=0; k < decode_n; ++k) {
+         stbi_resample *r = &res_comp[k];
+
+         // allocate line buffer big enough for upsampling off the edges
+         // with upsample factor of 4
+         z->img_comp[k].linebuf = (uint8 *) malloc(z->s.img_x + 3);
+         if (!z->img_comp[k].linebuf) { cleanup_jpeg(z); return epuc("outofmem", "Out of memory"); }
+
+         r->hs      = z->img_h_max / z->img_comp[k].h;
+         r->vs      = z->img_v_max / z->img_comp[k].v;
+         r->ystep   = r->vs >> 1;
+         r->w_lores = (z->s.img_x + r->hs-1) / r->hs;
+         r->ypos    = 0;
+         r->line0   = r->line1 = z->img_comp[k].data;
+
+         if      (r->hs == 1 && r->vs == 1) r->resample = resample_row_1;
+         else if (r->hs == 1 && r->vs == 2) r->resample = resample_row_v_2;
+         else if (r->hs == 2 && r->vs == 1) r->resample = resample_row_h_2;
+         else if (r->hs == 2 && r->vs == 2) r->resample = resample_row_hv_2;
+         else                               r->resample = resample_row_generic;
+      }
+
+      // can't error after this, so this is safe
+      output = (uint8 *) malloc(n * z->s.img_x * z->s.img_y + 1);
+      if (!output) { cleanup_jpeg(z); return epuc("outofmem", "Out of memory"); }
+
+      // now go ahead and resample
+      for (j=0; j < z->s.img_y; ++j) {
+         uint8 *out = output + n * z->s.img_x * j;
+         for (k=0; k < decode_n; ++k) {
+            stbi_resample *r = &res_comp[k];
+            int y_bot = r->ystep >= (r->vs >> 1);
+            coutput[k] = r->resample(z->img_comp[k].linebuf,
+                                     y_bot ? r->line1 : r->line0,
+                                     y_bot ? r->line0 : r->line1,
+                                     r->w_lores, r->hs);
+            if (++r->ystep >= r->vs) {
+               r->ystep = 0;
+               r->line0 = r->line1;
+               if (++r->ypos < z->img_comp[k].y)
+                  r->line1 += z->img_comp[k].w2;
+            }
+         }
+         if (n >= 3) {
+            uint8 *y = coutput[0];
+            if (z->s.img_n == 3) {
+               #if STBI_SIMD
+               stbi_YCbCr_installed(out, y, coutput[1], coutput[2], z->s.img_x, n);
+               #else
+               YCbCr_to_RGB_row(out, y, coutput[1], coutput[2], z->s.img_x, n);
+               #endif
+            } else
+               for (i=0; i < z->s.img_x; ++i) {
+                  out[0] = out[1] = out[2] = y[i];
+                  out[3] = 255; // not used if n==3
+                  out += n;
+               }
+         } else {
+            uint8 *y = coutput[0];
+            if (n == 1)
+               for (i=0; i < z->s.img_x; ++i) out[i] = y[i];
+            else
+               for (i=0; i < z->s.img_x; ++i) *out++ = y[i], *out++ = 255;
+         }
+      }
+      cleanup_jpeg(z);
+      *out_x = z->s.img_x;
+      *out_y = z->s.img_y;
+      if (comp) *comp  = z->s.img_n; // report original components, not output
+      return output;
+   }
+}
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_jpeg_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   jpeg j;
+   start_file(&j.s, f);
+   return load_jpeg_image(&j, x,y,comp,req_comp);
+}
+
+unsigned char *stbi_jpeg_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_jpeg_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+#endif
+
+unsigned char *stbi_jpeg_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   jpeg j;
+   start_mem(&j.s, buffer,len);
+   return load_jpeg_image(&j, x,y,comp,req_comp);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_jpeg_test_file(FILE *f)
+{
+   int r;
+   long n; // ftell returns long, and fseek takes a long offset
+   jpeg j;
+   n = ftell(f);
+   start_file(&j.s, f);
+   r = decode_jpeg_header(&j, SCAN_type);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_jpeg_test_memory(stbi_uc const *buffer, int len)
+{
+   jpeg j;
+   start_mem(&j.s, buffer,len);
+   return decode_jpeg_header(&j, SCAN_type);
+}
+
+// @TODO:
+#ifndef STBI_NO_STDIO
+extern int      stbi_jpeg_info            (char const *filename,           int *x, int *y, int *comp);
+extern int      stbi_jpeg_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_jpeg_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+// public domain zlib decode    v0.2  Sean Barrett 2006-11-18
+//    simple implementation
+//      - all input must be provided in an upfront buffer
+//      - all output is written to a single output buffer (can malloc/realloc)
+//    performance
+//      - fast huffman
+
+// fast-way is faster to check than jpeg huffman, but slow way is slower
+#define ZFAST_BITS  9 // accelerate all cases in default tables
+#define ZFAST_MASK  ((1 << ZFAST_BITS) - 1)
+
+// zlib-style huffman encoding
+// (jpeg packs from left, zlib from right, so can't share code)
+typedef struct
+{
+   uint16 fast[1 << ZFAST_BITS];
+   uint16 firstcode[16];
+   int maxcode[17];
+   uint16 firstsymbol[16];
+   uint8  size[288];
+   uint16 value[288]; 
+} zhuffman;
+
+__forceinline static int bitreverse16(int n)
+{
+  n = ((n & 0xAAAA) >>  1) | ((n & 0x5555) << 1);
+  n = ((n & 0xCCCC) >>  2) | ((n & 0x3333) << 2);
+  n = ((n & 0xF0F0) >>  4) | ((n & 0x0F0F) << 4);
+  n = ((n & 0xFF00) >>  8) | ((n & 0x00FF) << 8);
+  return n;
+}
+
+__forceinline static int bit_reverse(int v, int bits)
+{
+   assert(bits <= 16);
+   // to bit reverse n bits, reverse 16 and shift
+   // e.g. 11 bits, bit reverse and shift away 5
+   return bitreverse16(v) >> (16-bits);
+}
+
+static int zbuild_huffman(zhuffman *z, uint8 *sizelist, int num)
+{
+   int i,k=0;
+   int code, next_code[16], sizes[17];
+
+   // DEFLATE spec for generating codes
+   memset(sizes, 0, sizeof(sizes));
+   memset(z->fast, 255, sizeof(z->fast));
+   for (i=0; i < num; ++i) 
+      ++sizes[sizelist[i]];
+   sizes[0] = 0;
+   for (i=1; i < 16; ++i)
+      assert(sizes[i] <= (1 << i));
+   code = 0;
+   for (i=1; i < 16; ++i) {
+      next_code[i] = code;
+      z->firstcode[i] = (uint16) code;
+      z->firstsymbol[i] = (uint16) k;
+      code = (code + sizes[i]);
+      if (sizes[i])
+         if (code-1 >= (1 << i)) return e("bad codelengths","Corrupt PNG");
+      z->maxcode[i] = code << (16-i); // preshift for inner loop
+      code <<= 1;
+      k += sizes[i];
+   }
+   z->maxcode[16] = 0x10000; // sentinel
+   for (i=0; i < num; ++i) {
+      int s = sizelist[i];
+      if (s) {
+         int c = next_code[s] - z->firstcode[s] + z->firstsymbol[s];
+         z->size[c] = (uint8)s;
+         z->value[c] = (uint16)i;
+         if (s <= ZFAST_BITS) {
+            int k = bit_reverse(next_code[s],s);
+            while (k < (1 << ZFAST_BITS)) {
+               z->fast[k] = (uint16) c;
+               k += (1 << s);
+            }
+         }
+         ++next_code[s];
+      }
+   }
+   return 1;
+}
+
+// zlib-from-memory implementation for PNG reading
+//    because PNG allows splitting the zlib stream arbitrarily,
+//    and it's annoying structurally to have PNG call ZLIB call PNG,
+//    we require PNG read all the IDATs and combine them into a single
+//    memory buffer
+
+typedef struct
+{
+   uint8 *zbuffer, *zbuffer_end;
+   int num_bits;
+   uint32 code_buffer;
+
+   char *zout;
+   char *zout_start;
+   char *zout_end;
+   int   z_expandable;
+
+   zhuffman z_length, z_distance;
+} zbuf;
+
+__forceinline static int zget8(zbuf *z)
+{
+   if (z->zbuffer >= z->zbuffer_end) return 0;
+   return *z->zbuffer++;
+}
+
+static void fill_bits(zbuf *z)
+{
+   do {
+      assert(z->code_buffer < (1U << z->num_bits));
+      z->code_buffer |= (uint32) zget8(z) << z->num_bits; // cast so a shift by 24 can't overflow a signed int
+      z->num_bits += 8;
+   } while (z->num_bits <= 24);
+}
+
+__forceinline static unsigned int zreceive(zbuf *z, int n)
+{
+   unsigned int k;
+   if (z->num_bits < n) fill_bits(z);
+   k = z->code_buffer & ((1 << n) - 1);
+   z->code_buffer >>= n;
+   z->num_bits -= n;
+   return k;   
+}
+
+__forceinline static int zhuffman_decode(zbuf *a, zhuffman *z)
+{
+   int b,s,k;
+   if (a->num_bits < 16) fill_bits(a);
+   b = z->fast[a->code_buffer & ZFAST_MASK];
+   if (b < 0xffff) {
+      s = z->size[b];
+      a->code_buffer >>= s;
+      a->num_bits -= s;
+      return z->value[b];
+   }
+
+   // not resolved by fast table, so compute it the slow way
+   // use jpeg approach, which requires MSbits at top
+   k = bit_reverse(a->code_buffer, 16);
+   for (s=ZFAST_BITS+1; ; ++s)
+      if (k < z->maxcode[s])
+         break;
+   if (s == 16) return -1; // invalid code!
+   // code size is s, so:
+   b = (k >> (16-s)) - z->firstcode[s] + z->firstsymbol[s];
+   assert(z->size[b] == s);
+   a->code_buffer >>= s;
+   a->num_bits -= s;
+   return z->value[b];
+}
+
+static int expand(zbuf *z, int n)  // need to make room for n bytes
+{
+   char *q;
+   int cur, limit;
+   if (!z->z_expandable) return e("output buffer limit","Corrupt PNG");
+   cur   = (int) (z->zout     - z->zout_start);
+   limit = (int) (z->zout_end - z->zout_start);
+   while (cur + n > limit)
+      limit *= 2;
+   q = (char *) realloc(z->zout_start, limit);
+   if (q == NULL) return e("outofmem", "Out of memory");
+   z->zout_start = q;
+   z->zout       = q + cur;
+   z->zout_end   = q + limit;
+   return 1;
+}
+
+static int length_base[31] = {
+   3,4,5,6,7,8,9,10,11,13,
+   15,17,19,23,27,31,35,43,51,59,
+   67,83,99,115,131,163,195,227,258,0,0 };
+
+static int length_extra[31]= 
+{ 0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,0,0,0 };
+
+static int dist_base[32] = { 1,2,3,4,5,7,9,13,17,25,33,49,65,97,129,193,
+257,385,513,769,1025,1537,2049,3073,4097,6145,8193,12289,16385,24577,0,0};
+
+static int dist_extra[32] =
+{ 0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13};
+
+static int parse_huffman_block(zbuf *a)
+{
+   for(;;) {
+      int z = zhuffman_decode(a, &a->z_length);
+      if (z < 256) {
+         if (z < 0) return e("bad huffman code","Corrupt PNG"); // error in huffman codes
+         if (a->zout >= a->zout_end) if (!expand(a, 1)) return 0;
+         *a->zout++ = (char) z;
+      } else {
+         uint8 *p;
+         int len,dist;
+         if (z == 256) return 1;
+         z -= 257;
+         len = length_base[z];
+         if (length_extra[z]) len += zreceive(a, length_extra[z]);
+         z = zhuffman_decode(a, &a->z_distance);
+         if (z < 0) return e("bad huffman code","Corrupt PNG");
+         dist = dist_base[z];
+         if (dist_extra[z]) dist += zreceive(a, dist_extra[z]);
+         if (a->zout - a->zout_start < dist) return e("bad dist","Corrupt PNG");
+         if (a->zout + len > a->zout_end) if (!expand(a, len)) return 0;
+         p = (uint8 *) (a->zout - dist);
+         while (len--)
+            *a->zout++ = *p++;
+      }
+   }
+}
+
+static int compute_huffman_codes(zbuf *a)
+{
+   static uint8 length_dezigzag[19] = { 16,17,18,0,8,7,9,6,10,5,11,4,12,3,13,2,14,1,15 };
+   static zhuffman z_codelength; // static just to save stack space
+   uint8 lencodes[286+32+137];//padding for maximum single op
+   uint8 codelength_sizes[19];
+   int i,n;
+
+   int hlit  = zreceive(a,5) + 257;
+   int hdist = zreceive(a,5) + 1;
+   int hclen = zreceive(a,4) + 4;
+
+   memset(codelength_sizes, 0, sizeof(codelength_sizes));
+   for (i=0; i < hclen; ++i) {
+      int s = zreceive(a,3);
+      codelength_sizes[length_dezigzag[i]] = (uint8) s;
+   }
+   if (!zbuild_huffman(&z_codelength, codelength_sizes, 19)) return 0;
+
+   n = 0;
+   while (n < hlit + hdist) {
+      int c = zhuffman_decode(a, &z_codelength);
+      if (c < 0 || c >= 19) return e("bad codelengths","Corrupt PNG"); // corrupt input must fail cleanly, not assert
+      if (c < 16)
+         lencodes[n++] = (uint8) c;
+      else if (c == 16) {
+         if (n == 0) return e("bad codelengths","Corrupt PNG"); // no previous length to repeat
+         c = zreceive(a,2)+3;
+         memset(lencodes+n, lencodes[n-1], c);
+         n += c;
+      } else if (c == 17) {
+         c = zreceive(a,3)+3;
+         memset(lencodes+n, 0, c);
+         n += c;
+      } else {
+         assert(c == 18);
+         c = zreceive(a,7)+11;
+         memset(lencodes+n, 0, c);
+         n += c;
+      }
+   }
+   if (n != hlit+hdist) return e("bad codelengths","Corrupt PNG");
+   if (!zbuild_huffman(&a->z_length, lencodes, hlit)) return 0;
+   if (!zbuild_huffman(&a->z_distance, lencodes+hlit, hdist)) return 0;
+   return 1;
+}
+
+static int parse_uncompressed_block(zbuf *a)
+{
+   uint8 header[4];
+   int len,nlen,k;
+   if (a->num_bits & 7)
+      zreceive(a, a->num_bits & 7); // discard
+   // drain the bit-packed data into header
+   k = 0;
+   while (a->num_bits > 0) {
+      header[k++] = (uint8) (a->code_buffer & 255); // some compilers still warn here despite the mask and cast
+      a->code_buffer >>= 8;
+      a->num_bits -= 8;
+   }
+   assert(a->num_bits == 0);
+   // now fill header the normal way
+   while (k < 4)
+      header[k++] = (uint8) zget8(a);
+   len  = header[1] * 256 + header[0];
+   nlen = header[3] * 256 + header[2];
+   if (nlen != (len ^ 0xffff)) return e("zlib corrupt","Corrupt PNG");
+   if (a->zbuffer + len > a->zbuffer_end) return e("read past buffer","Corrupt PNG");
+   if (a->zout + len > a->zout_end)
+      if (!expand(a, len)) return 0;
+   memcpy(a->zout, a->zbuffer, len);
+   a->zbuffer += len;
+   a->zout += len;
+   return 1;
+}
+
+static int parse_zlib_header(zbuf *a)
+{
+   int cmf   = zget8(a);
+   int cm    = cmf & 15;
+   /* int cinfo = cmf >> 4; */
+   int flg   = zget8(a);
+   if ((cmf*256+flg) % 31 != 0) return e("bad zlib header","Corrupt PNG"); // zlib spec
+   if (flg & 32) return e("no preset dict","Corrupt PNG"); // preset dictionary not allowed in png
+   if (cm != 8) return e("bad compression","Corrupt PNG"); // DEFLATE required for png
+   // window = 1 << (8 + cinfo)... but who cares, we fully buffer output
+   return 1;
+}
+
+// @TODO: should statically initialize these for optimal thread safety
+static uint8 default_length[288], default_distance[32];
+static void init_defaults(void)
+{
+   int i;   // use <= to match clearly with spec
+   for (i=0; i <= 143; ++i)     default_length[i]   = 8;
+   for (   ; i <= 255; ++i)     default_length[i]   = 9;
+   for (   ; i <= 279; ++i)     default_length[i]   = 7;
+   for (   ; i <= 287; ++i)     default_length[i]   = 8;
+
+   for (i=0; i <=  31; ++i)     default_distance[i] = 5;
+}
+
+static int parse_zlib(zbuf *a, int parse_header)
+{
+   int final, type;
+   if (parse_header)
+      if (!parse_zlib_header(a)) return 0;
+   a->num_bits = 0;
+   a->code_buffer = 0;
+   do {
+      final = zreceive(a,1);
+      type = zreceive(a,2);
+      if (type == 0) {
+         if (!parse_uncompressed_block(a)) return 0;
+      } else if (type == 3) {
+         return 0;
+      } else {
+         if (type == 1) {
+            // use fixed code lengths
+            if (!default_distance[31]) init_defaults();
+            if (!zbuild_huffman(&a->z_length  , default_length  , 288)) return 0;
+            if (!zbuild_huffman(&a->z_distance, default_distance,  32)) return 0;
+         } else {
+            if (!compute_huffman_codes(a)) return 0;
+         }
+         if (!parse_huffman_block(a)) return 0;
+      }
+   } while (!final);
+   return 1;
+}
+
+static int do_zlib(zbuf *a, char *obuf, int olen, int exp, int parse_header)
+{
+   a->zout_start = obuf;
+   a->zout       = obuf;
+   a->zout_end   = obuf + olen;
+   a->z_expandable = exp;
+
+   return parse_zlib(a, parse_header);
+}
+
+char *stbi_zlib_decode_malloc_guesssize(const char *buffer, int len, int initial_size, int *outlen)
+{
+   zbuf a;
+   char *p = (char *) malloc(initial_size);
+   if (p == NULL) return NULL;
+   a.zbuffer = (uint8 *) buffer;
+   a.zbuffer_end = (uint8 *) buffer + len;
+   if (do_zlib(&a, p, initial_size, 1, 1)) {
+      if (outlen) *outlen = (int) (a.zout - a.zout_start);
+      return a.zout_start;
+   } else {
+      free(a.zout_start);
+      return NULL;
+   }
+}
+
+char *stbi_zlib_decode_malloc(char const *buffer, int len, int *outlen)
+{
+   return stbi_zlib_decode_malloc_guesssize(buffer, len, 16384, outlen);
+}
+
+int stbi_zlib_decode_buffer(char *obuffer, int olen, char const *ibuffer, int ilen)
+{
+   zbuf a;
+   a.zbuffer = (uint8 *) ibuffer;
+   a.zbuffer_end = (uint8 *) ibuffer + ilen;
+   if (do_zlib(&a, obuffer, olen, 0, 1))
+      return (int) (a.zout - a.zout_start);
+   else
+      return -1;
+}
+
+char *stbi_zlib_decode_noheader_malloc(char const *buffer, int len, int *outlen)
+{
+   zbuf a;
+   char *p = (char *) malloc(16384);
+   if (p == NULL) return NULL;
+   a.zbuffer = (uint8 *) buffer;
+   a.zbuffer_end = (uint8 *) buffer+len;
+   if (do_zlib(&a, p, 16384, 1, 0)) {
+      if (outlen) *outlen = (int) (a.zout - a.zout_start);
+      return a.zout_start;
+   } else {
+      free(a.zout_start);
+      return NULL;
+   }
+}
+
+int stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, const char *ibuffer, int ilen)
+{
+   zbuf a;
+   a.zbuffer = (uint8 *) ibuffer;
+   a.zbuffer_end = (uint8 *) ibuffer + ilen;
+   if (do_zlib(&a, obuffer, olen, 0, 0))
+      return (int) (a.zout - a.zout_start);
+   else
+      return -1;
+}
+
+// public domain "baseline" PNG decoder   v0.10  Sean Barrett 2006-11-18
+//    simple implementation
+//      - only 8-bit samples
+//      - no CRC checking
+//      - allocates lots of intermediate memory
+//        - avoids problem of streaming data between subsystems
+//        - avoids explicit window management
+//    performance
+//      - uses stb_zlib, a PD zlib implementation with fast huffman decoding
+
+
+typedef struct
+{
+   uint32 length;
+   uint32 type;
+} chunk;
+
+#define PNG_TYPE(a,b,c,d)  (((a) << 24) + ((b) << 16) + ((c) << 8) + (d))
+
+static chunk get_chunk_header(stbi *s)
+{
+   chunk c;
+   c.length = get32(s);
+   c.type   = get32(s);
+   return c;
+}
+
+static int check_png_header(stbi *s)
+{
+   static uint8 png_sig[8] = { 137,80,78,71,13,10,26,10 };
+   int i;
+   for (i=0; i < 8; ++i)
+      if (get8(s) != png_sig[i]) return e("bad png sig","Not a PNG");
+   return 1;
+}
+
+typedef struct
+{
+   stbi s;
+   uint8 *idata, *expanded, *out;
+} png;
+
+
+enum {
+   F_none=0, F_sub=1, F_up=2, F_avg=3, F_paeth=4,
+   F_avg_first, F_paeth_first
+};
+
+static uint8 first_row_filter[5] =
+{
+   F_none, F_sub, F_none, F_avg_first, F_paeth_first
+};
+
+static int paeth(int a, int b, int c)
+{
+   int p = a + b - c;
+   int pa = abs(p-a);
+   int pb = abs(p-b);
+   int pc = abs(p-c);
+   if (pa <= pb && pa <= pc) return a;
+   if (pb <= pc) return b;
+   return c;
+}
+
+// create the png data from post-deflated data
+static int create_png_image(png *a, uint8 *raw, uint32 raw_len, int out_n)
+{
+   stbi *s = &a->s;
+   uint32 i,j,stride = s->img_x*out_n;
+   int k;
+   int img_n = s->img_n; // copy it into a local for later
+   assert(out_n == s->img_n || out_n == s->img_n+1);
+   a->out = (uint8 *) malloc(s->img_x * s->img_y * out_n);
+   if (!a->out) return e("outofmem", "Out of memory");
+   if (raw_len != (img_n * s->img_x + 1) * s->img_y) return e("not enough pixels","Corrupt PNG");
+   for (j=0; j < s->img_y; ++j) {
+      uint8 *cur = a->out + stride*j;
+      uint8 *prior = cur - stride;
+      int filter = *raw++;
+      if (filter > 4) return e("invalid filter","Corrupt PNG");
+      // if first row, use special filter that doesn't sample previous row
+      if (j == 0) filter = first_row_filter[filter];
+      // handle first pixel explicitly
+      for (k=0; k < img_n; ++k) {
+         switch(filter) {
+            case F_none       : cur[k] = raw[k]; break;
+            case F_sub        : cur[k] = raw[k]; break;
+            case F_up         : cur[k] = raw[k] + prior[k]; break;
+            case F_avg        : cur[k] = raw[k] + (prior[k]>>1); break;
+            case F_paeth      : cur[k] = (uint8) (raw[k] + paeth(0,prior[k],0)); break;
+            case F_avg_first  : cur[k] = raw[k]; break;
+            case F_paeth_first: cur[k] = raw[k]; break;
+         }
+      }
+      if (img_n != out_n) cur[img_n] = 255;
+      raw += img_n;
+      cur += out_n;
+      prior += out_n;
+      // this is a little gross, so that we don't switch per-pixel or per-component
+      if (img_n == out_n) {
+         #define CASE(f) \
+             case f:     \
+                for (i=s->img_x-1; i >= 1; --i, raw+=img_n,cur+=img_n,prior+=img_n) \
+                   for (k=0; k < img_n; ++k)
+         switch(filter) {
+            CASE(F_none)  cur[k] = raw[k]; break;
+            CASE(F_sub)   cur[k] = raw[k] + cur[k-img_n]; break;
+            CASE(F_up)    cur[k] = raw[k] + prior[k]; break;
+            CASE(F_avg)   cur[k] = raw[k] + ((prior[k] + cur[k-img_n])>>1); break;
+            CASE(F_paeth)  cur[k] = (uint8) (raw[k] + paeth(cur[k-img_n],prior[k],prior[k-img_n])); break;
+            CASE(F_avg_first)    cur[k] = raw[k] + (cur[k-img_n] >> 1); break;
+            CASE(F_paeth_first)  cur[k] = (uint8) (raw[k] + paeth(cur[k-img_n],0,0)); break;
+         }
+         #undef CASE
+      } else {
+         assert(img_n+1 == out_n);
+         #define CASE(f) \
+             case f:     \
+                for (i=s->img_x-1; i >= 1; --i, cur[img_n]=255,raw+=img_n,cur+=out_n,prior+=out_n) \
+                   for (k=0; k < img_n; ++k)
+         switch(filter) {
+            CASE(F_none)  cur[k] = raw[k]; break;
+            CASE(F_sub)   cur[k] = raw[k] + cur[k-out_n]; break;
+            CASE(F_up)    cur[k] = raw[k] + prior[k]; break;
+            CASE(F_avg)   cur[k] = raw[k] + ((prior[k] + cur[k-out_n])>>1); break;
+            CASE(F_paeth)  cur[k] = (uint8) (raw[k] + paeth(cur[k-out_n],prior[k],prior[k-out_n])); break;
+            CASE(F_avg_first)    cur[k] = raw[k] + (cur[k-out_n] >> 1); break;
+            CASE(F_paeth_first)  cur[k] = (uint8) (raw[k] + paeth(cur[k-out_n],0,0)); break;
+         }
+         #undef CASE
+      }
+   }
+   return 1;
+}
+
+static int compute_transparency(png *z, uint8 tc[3], int out_n)
+{
+   stbi *s = &z->s;
+   uint32 i, pixel_count = s->img_x * s->img_y;
+   uint8 *p = z->out;
+
+   // compute color-based transparency, assuming we've
+   // already got 255 as the alpha value in the output
+   assert(out_n == 2 || out_n == 4);
+
+   if (out_n == 2) {
+      for (i=0; i < pixel_count; ++i) {
+         p[1] = (p[0] == tc[0] ? 0 : 255);
+         p += 2;
+      }
+   } else {
+      for (i=0; i < pixel_count; ++i) {
+         if (p[0] == tc[0] && p[1] == tc[1] && p[2] == tc[2])
+            p[3] = 0;
+         p += 4;
+      }
+   }
+   return 1;
+}
+
+static int expand_palette(png *a, uint8 *palette, int len, int pal_img_n)
+{
+   uint32 i, pixel_count = a->s.img_x * a->s.img_y;
+   uint8 *p, *temp_out, *orig = a->out;
+
+   p = (uint8 *) malloc(pixel_count * pal_img_n);
+   if (p == NULL) return e("outofmem", "Out of memory");
+
+   // between here and free(out) below, exiting would leak
+   temp_out = p;
+
+   if (pal_img_n == 3) {
+      for (i=0; i < pixel_count; ++i) {
+         int n = orig[i]*4;
+         p[0] = palette[n  ];
+         p[1] = palette[n+1];
+         p[2] = palette[n+2];
+         p += 3;
+      }
+   } else {
+      for (i=0; i < pixel_count; ++i) {
+         int n = orig[i]*4;
+         p[0] = palette[n  ];
+         p[1] = palette[n+1];
+         p[2] = palette[n+2];
+         p[3] = palette[n+3];
+         p += 4;
+      }
+   }
+   free(a->out);
+   a->out = temp_out;
+   return 1;
+}
+
+static int parse_png_file(png *z, int scan, int req_comp)
+{
+   uint8 palette[1024], pal_img_n=0;
+   uint8 has_trans=0, tc[3];
+   uint32 ioff=0, idata_limit=0, i, pal_len=0;
+   int first=1,k;
+   stbi *s = &z->s;
+
+   if (!check_png_header(s)) return 0;
+
+   if (scan == SCAN_type) return 1;
+
+   for(;;first=0) {
+      chunk c = get_chunk_header(s);
+      if (first && c.type != PNG_TYPE('I','H','D','R'))
+         return e("first not IHDR","Corrupt PNG");
+      switch (c.type) {
+         case PNG_TYPE('I','H','D','R'): {
+            int depth,color,interlace,comp,filter;
+            if (!first) return e("multiple IHDR","Corrupt PNG");
+            if (c.length != 13) return e("bad IHDR len","Corrupt PNG");
+            s->img_x = get32(s); if (s->img_x > (1 << 24)) return e("too large","Very large image (corrupt?)");
+            s->img_y = get32(s); if (s->img_y > (1 << 24)) return e("too large","Very large image (corrupt?)");
+            depth = get8(s);  if (depth != 8)        return e("8bit only","PNG not supported: 8-bit only");
+            color = get8(s);  if (color > 6)         return e("bad ctype","Corrupt PNG");
+            if (color == 3) pal_img_n = 3; else if (color & 1) return e("bad ctype","Corrupt PNG");
+            comp  = get8(s);  if (comp) return e("bad comp method","Corrupt PNG");
+            filter= get8(s);  if (filter) return e("bad filter method","Corrupt PNG");
+            interlace = get8(s); if (interlace) return e("interlaced","PNG not supported: interlaced mode");
+            if (!s->img_x || !s->img_y) return e("0-pixel image","Corrupt PNG");
+            if (!pal_img_n) {
+               s->img_n = (color & 2 ? 3 : 1) + (color & 4 ? 1 : 0);
+               if ((1 << 30) / s->img_x / s->img_n < s->img_y) return e("too large", "Image too large to decode");
+               if (scan == SCAN_header) return 1;
+            } else {
+               // if paletted, then pal_n is our final components, and
+               // img_n is # components to decompress/filter.
+               s->img_n = 1;
+               if ((1 << 30) / s->img_x / 4 < s->img_y) return e("too large","Corrupt PNG");
+               // if SCAN_header, have to scan to see if we have a tRNS
+            }
+            break;
+         }
+
+         case PNG_TYPE('P','L','T','E'):  {
+            if (c.length > 256*3) return e("invalid PLTE","Corrupt PNG");
+            pal_len = c.length / 3;
+            if (pal_len * 3 != c.length) return e("invalid PLTE","Corrupt PNG");
+            for (i=0; i < pal_len; ++i) {
+               palette[i*4+0] = get8u(s);
+               palette[i*4+1] = get8u(s);
+               palette[i*4+2] = get8u(s);
+               palette[i*4+3] = 255;
+            }
+            break;
+         }
+
+         case PNG_TYPE('t','R','N','S'): {
+            if (z->idata) return e("tRNS after IDAT","Corrupt PNG");
+            if (pal_img_n) {
+               if (scan == SCAN_header) { s->img_n = 4; return 1; }
+               if (pal_len == 0) return e("tRNS before PLTE","Corrupt PNG");
+               if (c.length > pal_len) return e("bad tRNS len","Corrupt PNG");
+               pal_img_n = 4;
+               for (i=0; i < c.length; ++i)
+                  palette[i*4+3] = get8u(s);
+            } else {
+               if (!(s->img_n & 1)) return e("tRNS with alpha","Corrupt PNG");
+               if (c.length != (uint32) s->img_n*2) return e("bad tRNS len","Corrupt PNG");
+               has_trans = 1;
+               for (k=0; k < s->img_n; ++k)
+                  tc[k] = (uint8) get16(s); // non 8-bit images will be larger
+            }
+            break;
+         }
+
+         case PNG_TYPE('I','D','A','T'): {
+            if (pal_img_n && !pal_len) return e("no PLTE","Corrupt PNG");
+            if (scan == SCAN_header) { s->img_n = pal_img_n; return 1; }
+            if (ioff + c.length > idata_limit) {
+               uint8 *p;
+               if (idata_limit == 0) idata_limit = c.length > 4096 ? c.length : 4096;
+               while (ioff + c.length > idata_limit)
+                  idata_limit *= 2;
+               p = (uint8 *) realloc(z->idata, idata_limit); if (p == NULL) return e("outofmem", "Out of memory");
+               z->idata = p;
+            }
+            #ifndef STBI_NO_STDIO
+            if (s->img_file)
+            {
+               if (fread(z->idata+ioff,1,c.length,s->img_file) != c.length) return e("outofdata","Corrupt PNG");
+            }
+            else
+            #endif
+            {
+               memcpy(z->idata+ioff, s->img_buffer, c.length);
+               s->img_buffer += c.length;
+            }
+            ioff += c.length;
+            break;
+         }
+
+         case PNG_TYPE('I','E','N','D'): {
+            uint32 raw_len;
+            if (scan != SCAN_load) return 1;
+            if (z->idata == NULL) return e("no IDAT","Corrupt PNG");
+            z->expanded = (uint8 *) stbi_zlib_decode_malloc((char *) z->idata, ioff, (int *) &raw_len);
+            if (z->expanded == NULL) return 0; // zlib should set error
+            free(z->idata); z->idata = NULL;
+            if ((req_comp == s->img_n+1 && req_comp != 3 && !pal_img_n) || has_trans)
+               s->img_out_n = s->img_n+1;
+            else
+               s->img_out_n = s->img_n;
+            if (!create_png_image(z, z->expanded, raw_len, s->img_out_n)) return 0;
+            if (has_trans)
+               if (!compute_transparency(z, tc, s->img_out_n)) return 0;
+            if (pal_img_n) {
+               // pal_img_n == 3 or 4
+               s->img_n = pal_img_n; // record the actual colors we had
+               s->img_out_n = pal_img_n;
+               if (req_comp >= 3) s->img_out_n = req_comp;
+               if (!expand_palette(z, palette, pal_len, s->img_out_n))
+                  return 0;
+            }
+            free(z->expanded); z->expanded = NULL;
+            return 1;
+         }
+
+         default:
+            // if critical, fail
+            if ((c.type & (1 << 29)) == 0) {
+               #ifndef STBI_NO_FAILURE_STRINGS
+               // not threadsafe
+               static char invalid_chunk[] = "XXXX chunk not known";
+               invalid_chunk[0] = (uint8) (c.type >> 24);
+               invalid_chunk[1] = (uint8) (c.type >> 16);
+               invalid_chunk[2] = (uint8) (c.type >>  8);
+               invalid_chunk[3] = (uint8) (c.type >>  0);
+               #endif
+               return e(invalid_chunk, "PNG not supported: unknown chunk type");
+            }
+            skip(s, c.length);
+            break;
+      }
+      // end of chunk, read and skip CRC
+      get32(s);
+   }
+}
+
+static unsigned char *do_png(png *p, int *x, int *y, int *n, int req_comp)
+{
+   unsigned char *result=NULL;
+   p->expanded = NULL;
+   p->idata = NULL;
+   p->out = NULL;
+   if (req_comp < 0 || req_comp > 4) return epuc("bad req_comp", "Internal error");
+   if (parse_png_file(p, SCAN_load, req_comp)) {
+      result = p->out;
+      p->out = NULL;
+      if (req_comp && req_comp != p->s.img_out_n) {
+         result = convert_format(result, p->s.img_out_n, req_comp, p->s.img_x, p->s.img_y);
+         p->s.img_out_n = req_comp;
+         if (result == NULL) return result;
+      }
+      *x = p->s.img_x;
+      *y = p->s.img_y;
+      if (n) *n = p->s.img_n;
+   }
+   free(p->out);      p->out      = NULL;
+   free(p->expanded); p->expanded = NULL;
+   free(p->idata);    p->idata    = NULL;
+
+   return result;
+}
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_png_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   png p;
+   start_file(&p.s, f);
+   return do_png(&p, x,y,comp,req_comp);
+}
+
+unsigned char *stbi_png_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_png_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+#endif
+
+unsigned char *stbi_png_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   png p;
+   start_mem(&p.s, buffer,len);
+   return do_png(&p, x,y,comp,req_comp);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_png_test_file(FILE *f)
+{
+   png p;
+   int n,r;
+   n = ftell(f);
+   start_file(&p.s, f);
+   r = parse_png_file(&p, SCAN_type,STBI_default);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_png_test_memory(stbi_uc const *buffer, int len)
+{
+   png p;
+   start_mem(&p.s, buffer, len);
+   return parse_png_file(&p, SCAN_type,STBI_default);
+}
+
+// TODO: load header from png
+#ifndef STBI_NO_STDIO
+extern int      stbi_png_info             (char const *filename,           int *x, int *y, int *comp);
+extern int      stbi_png_info_from_file   (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_png_info_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+// Microsoft/Windows BMP image
+
+static int bmp_test(stbi *s)
+{
+   int sz;
+   if (get8(s) != 'B') return 0;
+   if (get8(s) != 'M') return 0;
+   get32le(s); // discard filesize
+   get16le(s); // discard reserved
+   get16le(s); // discard reserved
+   get32le(s); // discard data offset
+   sz = get32le(s);
+   if (sz == 12 || sz == 40 || sz == 56 || sz == 108) return 1;
+   return 0;
+}
+
+#ifndef STBI_NO_STDIO
+int      stbi_bmp_test_file        (FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s,f);
+   r = bmp_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int      stbi_bmp_test_memory      (stbi_uc const *buffer, int len)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return bmp_test(&s);
+}
+
+// returns 0..31 for the highest set bit
+static int high_bit(unsigned int z)
+{
+   int n=0;
+   if (z == 0) return -1;
+   if (z >= 0x10000) n += 16, z >>= 16;
+   if (z >= 0x00100) n +=  8, z >>=  8;
+   if (z >= 0x00010) n +=  4, z >>=  4;
+   if (z >= 0x00004) n +=  2, z >>=  2;
+   if (z >= 0x00002) n +=  1, z >>=  1;
+   return n;
+}
+
+static int bitcount(unsigned int a)
+{
+   a = (a & 0x55555555) + ((a >>  1) & 0x55555555); // max 2
+   a = (a & 0x33333333) + ((a >>  2) & 0x33333333); // max 4
+   a = (a + (a >> 4)) & 0x0f0f0f0f; // max 8 per 4, now 8 bits
+   a = (a + (a >> 8)); // max 16 per 8 bits
+   a = (a + (a >> 16)); // max 32 per 8 bits
+   return a & 0xff;
+}
+
+static int shiftsigned(int v, int shift, int bits)
+{
+   int result;
+   int z=0;
+
+   if (shift < 0) v <<= -shift;
+   else v >>= shift;
+   result = v;
+
+   z = bits;
+   while (z < 8) {
+      result += v >> z;
+      z += bits;
+   }
+   return result;
+}
+
+static stbi_uc *bmp_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+   uint8 *out;
+   unsigned int mr=0,mg=0,mb=0,ma=0;
+   stbi_uc pal[256][4];
+   int psize=0,i,j,compress=0,width;
+   int bpp, flip_vertically, pad, target, offset, hsz;
+   if (get8(s) != 'B' || get8(s) != 'M') return epuc("not BMP", "Corrupt BMP");
+   get32le(s); // discard filesize
+   get16le(s); // discard reserved
+   get16le(s); // discard reserved
+   offset = get32le(s);
+   hsz = get32le(s);
+   if (hsz != 12 && hsz != 40 && hsz != 56 && hsz != 108) return epuc("unknown BMP", "BMP type not supported: unknown");
+   failure_reason = "bad BMP";
+   if (hsz == 12) {
+      s->img_x = get16le(s);
+      s->img_y = get16le(s);
+   } else {
+      s->img_x = get32le(s);
+      s->img_y = get32le(s);
+   }
+   if (get16le(s) != 1) return 0;
+   bpp = get16le(s);
+   if (bpp == 1) return epuc("monochrome", "BMP type not supported: 1-bit");
+   flip_vertically = ((int) s->img_y) > 0;
+   s->img_y = abs((int) s->img_y);
+   if (hsz == 12) {
+      if (bpp < 24)
+         psize = (offset - 14 - 24) / 3;
+   } else {
+      compress = get32le(s);
+      if (compress == 1 || compress == 2) return epuc("BMP RLE", "BMP type not supported: RLE");
+      get32le(s); // discard sizeof
+      get32le(s); // discard hres
+      get32le(s); // discard vres
+      get32le(s); // discard colorsused
+      get32le(s); // discard max important
+      if (hsz == 40 || hsz == 56) {
+         if (hsz == 56) {
+            get32le(s);
+            get32le(s);
+            get32le(s);
+            get32le(s);
+         }
+         if (bpp == 16 || bpp == 32) {
+            mr = mg = mb = 0;
+            if (compress == 0) {
+               if (bpp == 32) {
+                  mr = 0xff << 16;
+                  mg = 0xff <<  8;
+                  mb = 0xff <<  0;
+               } else {
+                  mr = 31 << 10;
+                  mg = 31 <<  5;
+                  mb = 31 <<  0;
+               }
+            } else if (compress == 3) {
+               mr = get32le(s);
+               mg = get32le(s);
+               mb = get32le(s);
+               // not documented, but generated by photoshop and handled by mspaint
+               if (mr == mg && mg == mb) {
+                  // identical masks leave the channels indistinguishable; reject
+                  return NULL;
+               }
+            } else
+               return NULL;
+         }
+      } else {
+         assert(hsz == 108);
+         mr = get32le(s);
+         mg = get32le(s);
+         mb = get32le(s);
+         ma = get32le(s);
+         get32le(s); // discard color space
+         for (i=0; i < 12; ++i)
+            get32le(s); // discard color space parameters
+      }
+      if (bpp < 16)
+         psize = (offset - 14 - hsz) >> 2;
+   }
+   s->img_n = ma ? 4 : 3;
+   if (req_comp && req_comp >= 3) // we can directly decode 3 or 4
+      target = req_comp;
+   else
+      target = s->img_n; // if they want monochrome, we'll post-convert
+   out = (stbi_uc *) malloc(target * s->img_x * s->img_y);
+   if (!out) return epuc("outofmem", "Out of memory");
+   if (bpp < 16) {
+      int z=0;
+      if (psize == 0 || psize > 256) { free(out); return epuc("invalid", "Corrupt BMP"); }
+      for (i=0; i < psize; ++i) {
+         pal[i][2] = get8(s);
+         pal[i][1] = get8(s);
+         pal[i][0] = get8(s);
+         if (hsz != 12) get8(s);
+         pal[i][3] = 255;
+      }
+      skip(s, offset - 14 - hsz - psize * (hsz == 12 ? 3 : 4));
+      if (bpp == 4) width = (s->img_x + 1) >> 1;
+      else if (bpp == 8) width = s->img_x;
+      else { free(out); return epuc("bad bpp", "Corrupt BMP"); }
+      pad = (-width)&3;
+      for (j=0; j < (int) s->img_y; ++j) {
+         for (i=0; i < (int) s->img_x; i += 2) {
+            int v=get8(s),v2=0;
+            if (bpp == 4) {
+               v2 = v & 15;
+               v >>= 4;
+            }
+            out[z++] = pal[v][0];
+            out[z++] = pal[v][1];
+            out[z++] = pal[v][2];
+            if (target == 4) out[z++] = 255;
+            if (i+1 == (int) s->img_x) break;
+            v = (bpp == 8) ? get8(s) : v2;
+            out[z++] = pal[v][0];
+            out[z++] = pal[v][1];
+            out[z++] = pal[v][2];
+            if (target == 4) out[z++] = 255;
+         }
+         skip(s, pad);
+      }
+   } else {
+      int rshift=0,gshift=0,bshift=0,ashift=0,rcount=0,gcount=0,bcount=0,acount=0;
+      int z = 0;
+      int easy=0;
+      skip(s, offset - 14 - hsz);
+      if (bpp == 24) width = 3 * s->img_x;
+      else if (bpp == 16) width = 2*s->img_x;
+      else /* bpp = 32 and pad = 0 */ width=0;
+      pad = (-width) & 3;
+      if (bpp == 24) {
+         easy = 1;
+      } else if (bpp == 32) {
+         if (mb == 0xff && mg == 0xff00 && mr == 0x00ff0000 && ma == 0xff000000)
+            easy = 2;
+      }
+      if (!easy) {
+         if (!mr || !mg || !mb) { free(out); return epuc("bad masks", "Corrupt BMP"); }
+         // right shift amt to put high bit in position #7
+         rshift = high_bit(mr)-7; rcount = bitcount(mr);
+         gshift = high_bit(mg)-7; gcount = bitcount(mg);
+         bshift = high_bit(mb)-7; bcount = bitcount(mb);
+         ashift = high_bit(ma)-7; acount = bitcount(ma);
+      }
+      for (j=0; j < (int) s->img_y; ++j) {
+         if (easy) {
+            for (i=0; i < (int) s->img_x; ++i) {
+               int a;
+               out[z+2] = get8(s);
+               out[z+1] = get8(s);
+               out[z+0] = get8(s);
+               z += 3;
+               a = (easy == 2 ? get8(s) : 255);
+               if (target == 4) out[z++] = a;
+            }
+         } else {
+            for (i=0; i < (int) s->img_x; ++i) {
+               uint32 v = (bpp == 16 ? get16le(s) : get32le(s));
+               int a;
+               out[z++] = shiftsigned(v & mr, rshift, rcount);
+               out[z++] = shiftsigned(v & mg, gshift, gcount);
+               out[z++] = shiftsigned(v & mb, bshift, bcount);
+               a = (ma ? shiftsigned(v & ma, ashift, acount) : 255);
+               if (target == 4) out[z++] = a; 
+            }
+         }
+         skip(s, pad);
+      }
+   }
+   if (flip_vertically) {
+      stbi_uc t;
+      for (j=0; j < (int) s->img_y>>1; ++j) {
+         stbi_uc *p1 = out +      j     *s->img_x*target;
+         stbi_uc *p2 = out + (s->img_y-1-j)*s->img_x*target;
+         for (i=0; i < (int) s->img_x*target; ++i) {
+            t = p1[i], p1[i] = p2[i], p2[i] = t;
+         }
+      }
+   }
+
+   if (req_comp && req_comp != target) {
+      out = convert_format(out, target, req_comp, s->img_x, s->img_y);
+      if (out == NULL) return out; // convert_format frees input on failure
+   }
+
+   *x = s->img_x;
+   *y = s->img_y;
+   if (comp) *comp = target;
+   return out;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_bmp_load             (char const *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_bmp_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_bmp_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s, f);
+   return bmp_load(&s, x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_bmp_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return bmp_load(&s, x,y,comp,req_comp);
+}
+
+// Targa Truevision - TGA
+// by Jonathan Dummer
+
+static int tga_test(stbi *s)
+{
+	int sz;
+	get8u(s);		//	discard Offset
+	sz = get8u(s);	//	color type
+	if( sz > 1 ) return 0;	//	only RGB or indexed allowed
+	sz = get8u(s);	//	image type
+	if( (sz != 1) && (sz != 2) && (sz != 3) && (sz != 9) && (sz != 10) && (sz != 11) ) return 0;	//	only RGB or grey allowed, +/- RLE
+	get16(s);		//	discard palette start
+	get16(s);		//	discard palette length
+	get8(s);			//	discard bits per palette color entry
+	get16(s);		//	discard x origin
+	get16(s);		//	discard y origin
+	if( get16(s) < 1 ) return 0;		//	test width
+	if( get16(s) < 1 ) return 0;		//	test height
+	sz = get8(s);	//	bits per pixel
+	if( (sz != 8) && (sz != 16) && (sz != 24) && (sz != 32) ) return 0;	//	only RGB or RGBA or grey allowed
+	return 1;		//	seems to have passed everything
+}
+
+#ifndef STBI_NO_STDIO
+int      stbi_tga_test_file        (FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s, f);
+   r = tga_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int      stbi_tga_test_memory      (stbi_uc const *buffer, int len)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return tga_test(&s);
+}
+
+static stbi_uc *tga_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+	//	read in the TGA header stuff
+	int tga_offset = get8u(s);
+	int tga_indexed = get8u(s);
+	int tga_image_type = get8u(s);
+	int tga_is_RLE = 0;
+	int tga_palette_start = get16le(s);
+	int tga_palette_len = get16le(s);
+	int tga_palette_bits = get8u(s);
+	int tga_x_origin = get16le(s);
+	int tga_y_origin = get16le(s);
+	int tga_width = get16le(s);
+	int tga_height = get16le(s);
+	int tga_bits_per_pixel = get8u(s);
+	int tga_inverted = get8u(s);
+	//	image data
+	unsigned char *tga_data;
+	unsigned char *tga_palette = NULL;
+	int i, j;
+	unsigned char raw_data[4];
+	unsigned char trans_data[4];
+	int RLE_count = 0;
+	int RLE_repeating = 0;
+	int read_next_pixel = 1;
+	//	do a tiny bit of processing
+	if( tga_image_type >= 8 )
+	{
+		tga_image_type -= 8;
+		tga_is_RLE = 1;
+	}
+	/* int tga_alpha_bits = tga_inverted & 15; */
+	tga_inverted = 1 - ((tga_inverted >> 5) & 1);
+
+	//	error check
+	if( //(tga_indexed) ||
+		(tga_width < 1) || (tga_height < 1) ||
+		(tga_image_type < 1) || (tga_image_type > 3) ||
+		((tga_bits_per_pixel != 8) && (tga_bits_per_pixel != 16) &&
+		(tga_bits_per_pixel != 24) && (tga_bits_per_pixel != 32))
+		)
+	{
+		return NULL;
+	}
+
+	//	If I'm paletted, then I'll use the number of bits from the palette
+	if( tga_indexed )
+	{
+		tga_bits_per_pixel = tga_palette_bits;
+	}
+
+	//	tga info
+	*x = tga_width;
+	*y = tga_height;
+	if( (req_comp < 1) || (req_comp > 4) )
+	{
+		//	just use whatever the file was
+		req_comp = tga_bits_per_pixel / 8;
+		*comp = req_comp;
+	} else
+	{
+		//	force a new number of components
+		*comp = tga_bits_per_pixel/8;
+	}
+	tga_data = (unsigned char*)malloc( tga_width * tga_height * req_comp );
+
+	//	skip to the data's starting position (offset usually = 0)
+	skip(s, tga_offset );
+	//	do I need to load a palette?
+	if( tga_indexed )
+	{
+		//	any data to skip? (offset usually = 0)
+		skip(s, tga_palette_start );
+		//	load the palette
+		tga_palette = (unsigned char*)malloc( tga_palette_len * tga_palette_bits / 8 );
+		getn(s, tga_palette, tga_palette_len * tga_palette_bits / 8 );
+	}
+	//	load the data
+	for( i = 0; i < tga_width * tga_height; ++i )
+	{
+		//	if I'm in RLE mode, do I need to get a RLE chunk?
+		if( tga_is_RLE )
+		{
+			if( RLE_count == 0 )
+			{
+				//	yep, get the next byte as a RLE command
+				int RLE_cmd = get8u(s);
+				RLE_count = 1 + (RLE_cmd & 127);
+				RLE_repeating = RLE_cmd >> 7;
+				read_next_pixel = 1;
+			} else if( !RLE_repeating )
+			{
+				read_next_pixel = 1;
+			}
+		} else
+		{
+			read_next_pixel = 1;
+		}
+		//	OK, if I need to read a pixel, do it now
+		if( read_next_pixel )
+		{
+			//	load however much data we did have
+			if( tga_indexed )
+			{
+				//	read in 1 byte, then perform the lookup
+				int pal_idx = get8u(s);
+				if( pal_idx >= tga_palette_len )
+				{
+					//	invalid index
+					pal_idx = 0;
+				}
+				pal_idx *= tga_bits_per_pixel / 8;
+				for( j = 0; j*8 < tga_bits_per_pixel; ++j )
+				{
+					raw_data[j] = tga_palette[pal_idx+j];
+				}
+			} else
+			{
+				//	read in the data raw
+				for( j = 0; j*8 < tga_bits_per_pixel; ++j )
+				{
+					raw_data[j] = get8u(s);
+				}
+			}
+			//	convert raw to the intermediate format
+			switch( tga_bits_per_pixel )
+			{
+			case 8:
+				//	Luminance => RGBA
+				trans_data[0] = raw_data[0];
+				trans_data[1] = raw_data[0];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = 255;
+				break;
+			case 16:
+				//	Luminance,Alpha => RGBA
+				trans_data[0] = raw_data[0];
+				trans_data[1] = raw_data[0];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = raw_data[1];
+				break;
+			case 24:
+				//	BGR => RGBA
+				trans_data[0] = raw_data[2];
+				trans_data[1] = raw_data[1];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = 255;
+				break;
+			case 32:
+				//	BGRA => RGBA
+				trans_data[0] = raw_data[2];
+				trans_data[1] = raw_data[1];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = raw_data[3];
+				break;
+			}
+			//	clear the reading flag for the next pixel
+			read_next_pixel = 0;
+		} // end of reading a pixel
+		//	convert to final format
+		switch( req_comp )
+		{
+		case 1:
+			//	RGBA => Luminance
+			tga_data[i*req_comp+0] = compute_y(trans_data[0],trans_data[1],trans_data[2]);
+			break;
+		case 2:
+			//	RGBA => Luminance,Alpha
+			tga_data[i*req_comp+0] = compute_y(trans_data[0],trans_data[1],trans_data[2]);
+			tga_data[i*req_comp+1] = trans_data[3];
+			break;
+		case 3:
+			//	RGBA => RGB
+			tga_data[i*req_comp+0] = trans_data[0];
+			tga_data[i*req_comp+1] = trans_data[1];
+			tga_data[i*req_comp+2] = trans_data[2];
+			break;
+		case 4:
+			//	RGBA => RGBA
+			tga_data[i*req_comp+0] = trans_data[0];
+			tga_data[i*req_comp+1] = trans_data[1];
+			tga_data[i*req_comp+2] = trans_data[2];
+			tga_data[i*req_comp+3] = trans_data[3];
+			break;
+		}
+		//	in case we're in RLE mode, keep counting down
+		--RLE_count;
+	}
+	//	do I need to invert the image?
+	if( tga_inverted )
+	{
+		for( j = 0; j*2 < tga_height; ++j )
+		{
+			int index1 = j * tga_width * req_comp;
+			int index2 = (tga_height - 1 - j) * tga_width * req_comp;
+			for( i = tga_width * req_comp; i > 0; --i )
+			{
+				unsigned char temp = tga_data[index1];
+				tga_data[index1] = tga_data[index2];
+				tga_data[index2] = temp;
+				++index1;
+				++index2;
+			}
+		}
+	}
+	//	clear my palette, if I had one
+	if( tga_palette != NULL )
+	{
+		free( tga_palette );
+	}
+	//	the things I do to get rid of an error message, and yet keep
+	//	Microsoft's C compilers happy... [8^(
+	tga_palette_start = tga_palette_len = tga_palette_bits =
+			tga_x_origin = tga_y_origin = 0;
+	//	OK, done
+	return tga_data;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_tga_load             (char const *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_tga_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_tga_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s, f);
+   return tga_load(&s, x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_tga_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return tga_load(&s, x,y,comp,req_comp);
+}
+
+
+// *************************************************************************************************
+// Photoshop PSD loader -- PD by Thatcher Ulrich, integration by Nicholas Schulz, tweaked by STB
+
+static int psd_test(stbi *s)
+{
+	if (get32(s) != 0x38425053) return 0;	// "8BPS"
+	else return 1;
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_psd_test_file(FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s, f);
+   r = psd_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_psd_test_memory(stbi_uc const *buffer, int len)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return psd_test(&s);
+}
+
+static stbi_uc *psd_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+	int	pixelCount;
+	int channelCount, compression;
+	int channel, i, count, len;
+   int w,h;
+   uint8 *out;
+
+	// Check identifier
+	if (get32(s) != 0x38425053)	// "8BPS"
+		return epuc("not PSD", "Corrupt PSD image");
+
+	// Check file type version.
+	if (get16(s) != 1)
+		return epuc("wrong version", "Unsupported version of PSD image");
+
+	// Skip 6 reserved bytes.
+	skip(s, 6 );
+
+	// Read the number of channels (R, G, B, A, etc).
+	channelCount = get16(s);
+	if (channelCount < 0 || channelCount > 16)
+		return epuc("wrong channel count", "Unsupported number of channels in PSD image");
+
+	// Read the rows and columns of the image.
+   h = get32(s);
+   w = get32(s);
+	
+	// Make sure the depth is 8 bits.
+	if (get16(s) != 8)
+		return epuc("unsupported bit depth", "PSD bit depth is not 8 bit");
+
+	// Make sure the color mode is RGB.
+	// Valid options are:
+	//   0: Bitmap
+	//   1: Grayscale
+	//   2: Indexed color
+	//   3: RGB color
+	//   4: CMYK color
+	//   7: Multichannel
+	//   8: Duotone
+	//   9: Lab color
+	if (get16(s) != 3)
+		return epuc("wrong color format", "PSD is not in RGB color format");
+
+	// Skip the Mode Data.  (It's the palette for indexed color; other info for other modes.)
+	skip(s,get32(s) );
+
+	// Skip the image resources.  (resolution, pen tool paths, etc)
+	skip(s, get32(s) );
+
+	// Skip the reserved data.
+	skip(s, get32(s) );
+
+	// Find out if the data is compressed.
+	// Known values:
+	//   0: no compression
+	//   1: RLE compressed
+	compression = get16(s);
+	if (compression > 1)
+		return epuc("bad compression", "PSD has an unknown compression format");
+
+	// Create the destination image.
+	out = (stbi_uc *) malloc(4 * w*h);
+	if (!out) return epuc("outofmem", "Out of memory");
+   pixelCount = w*h;
+
+	// Initialize the data to zero.
+	//memset( out, 0, pixelCount * 4 );
+	
+	// Finally, the image data.
+	if (compression) {
+		// RLE as used by .PSD and .TIFF
+		// Loop until you get the number of unpacked bytes you are expecting:
+		//     Read the next source byte into n.
+		//     If n is between 0 and 127 inclusive, copy the next n+1 bytes literally.
+		//     Else if n is between -127 and -1 inclusive, copy the next byte -n+1 times.
+		//     Else if n is 128, noop.
+		// Endloop
+
+		// The RLE-compressed data is preceded by a 2-byte data count for each row in the data,
+		// which we're going to just skip.
+		skip(s, h * channelCount * 2 );
+
+		// Read the RLE data by channel.
+		for (channel = 0; channel < 4; channel++) {
+			uint8 *p;
+			
+         p = out+channel;
+			if (channel >= channelCount) {
+				// Fill this channel with default data.
+				for (i = 0; i < pixelCount; i++) *p = (channel == 3 ? 255 : 0), p += 4;
+			} else {
+				// Read the RLE data.
+				count = 0;
+				while (count < pixelCount) {
+					len = get8(s);
+					if (len == 128) {
+						// No-op.
+					} else if (len < 128) {
+						// Copy next len+1 bytes literally.
+						len++;
+						count += len;
+						while (len) {
+							*p = get8(s);
+                     p += 4;
+							len--;
+						}
+					} else if (len > 128) {
+						uint32	val;
+						// Next -len+1 bytes in the dest are replicated from next source byte.
+						// (Interpret len as a negative 8-bit int.)
+						len ^= 0x0FF;
+						len += 2;
+                  val = get8(s);
+						count += len;
+						while (len) {
+							*p = val;
+                     p += 4;
+							len--;
+						}
+					}
+				}
+			}
+		}
+		
+	} else {
+		// We're at the raw image data.  It's each channel in order (Red, Green, Blue, Alpha, ...)
+		// where each channel consists of an 8-bit value for each pixel in the image.
+		
+		// Read the data by channel.
+		for (channel = 0; channel < 4; channel++) {
+			uint8 *p;
+			
+         p = out + channel;
+			if (channel >= channelCount) {
+				// Fill this channel with default data.
+				for (i = 0; i < pixelCount; i++) *p = channel == 3 ? 255 : 0, p += 4;
+			} else {
+				// Read the data.
+				count = 0;
+				for (i = 0; i < pixelCount; i++)
+					*p = get8(s), p += 4;
+			}
+		}
+	}
+
+	if (req_comp && req_comp != 4) {
+		out = convert_format(out, 4, req_comp, w, h);
+		if (out == NULL) return out; // convert_format frees input on failure
+	}
+
+	if (comp) *comp = channelCount;
+	*y = h;
+	*x = w;
+	
+	return out;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_psd_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_psd_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_psd_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s, f);
+   return psd_load(&s, x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_psd_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return psd_load(&s, x,y,comp,req_comp);
+}
+
+
+// *************************************************************************************************
+// Radiance RGBE HDR loader
+// originally by Nicolas Schulz
+#ifndef STBI_NO_HDR
+static int hdr_test(stbi *s)
+{
+   char *signature = "#?RADIANCE\n";
+   int i;
+   for (i=0; signature[i]; ++i)
+      if (get8(s) != signature[i])
+         return 0;
+	return 1;
+}
+
+int stbi_hdr_test_memory(stbi_uc const *buffer, int len)
+{
+   stbi s;
+	start_mem(&s, buffer, len);
+	return hdr_test(&s);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_hdr_test_file(FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s, f);
+   r = hdr_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+#define HDR_BUFLEN  1024
+static char *hdr_gettoken(stbi *z, char *buffer)
+{
+   int len=0;
+	char *s = buffer, c = '\0';
+
+   c = get8(z);
+
+	while (!at_eof(z) && c != '\n') {
+		buffer[len++] = c;
+      if (len == HDR_BUFLEN-1) {
+         // flush to end of line
+         while (!at_eof(z) && get8(z) != '\n')
+            ;
+         break;
+      }
+      c = get8(z);
+	}
+
+   buffer[len] = 0;
+	return buffer;
+}
+
+static void hdr_convert(float *output, stbi_uc *input, int req_comp)
+{
+	if( input[3] != 0 ) {
+      float f1;
+		// Exponent
+		f1 = (float) ldexp(1.0f, input[3] - (int)(128 + 8));
+      if (req_comp <= 2)
+         output[0] = (input[0] + input[1] + input[2]) * f1 / 3;
+      else {
+         output[0] = input[0] * f1;
+         output[1] = input[1] * f1;
+         output[2] = input[2] * f1;
+      }
+      if (req_comp == 2) output[1] = 1;
+      if (req_comp == 4) output[3] = 1;
+	} else {
+      switch (req_comp) {
+         case 4: output[3] = 1; /* fallthrough */
+         case 3: output[0] = output[1] = output[2] = 0;
+                 break;
+         case 2: output[1] = 1; /* fallthrough */
+         case 1: output[0] = 0;
+                 break;
+      }
+	}
+}
+
+
+static float *hdr_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+   char buffer[HDR_BUFLEN];
+	char *token;
+	int valid = 0;
+	int width, height;
+   stbi_uc *scanline;
+	float *hdr_data;
+	int len;
+	unsigned char count, value;
+	int i, j, k, c1,c2, z;
+
+
+	// Check identifier
+	if (strcmp(hdr_gettoken(s,buffer), "#?RADIANCE") != 0)
+		return epf("not HDR", "Corrupt HDR image");
+	
+	// Parse header
+	while(1) {
+		token = hdr_gettoken(s,buffer);
+      if (token[0] == 0) break;
+		if (strcmp(token, "FORMAT=32-bit_rle_rgbe") == 0) valid = 1;
+   }
+
+	if (!valid)    return epf("unsupported format", "Unsupported HDR format");
+
+   // Parse width and height
+   // can't use sscanf() if we're not using stdio!
+   token = hdr_gettoken(s,buffer);
+   if (strncmp(token, "-Y ", 3))  return epf("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   height = strtol(token, &token, 10);
+   while (*token == ' ') ++token;
+   if (strncmp(token, "+X ", 3))  return epf("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   width = strtol(token, NULL, 10);
+
+	*x = width;
+	*y = height;
+
+   *comp = 3;
+	if (req_comp == 0) req_comp = 3;
+
+	// Read data
+	hdr_data = (float *) malloc(height * width * req_comp * sizeof(float));
+
+	// Load image data
+   // image data is stored as some number of sca
+	if( width < 8 || width >= 32768) {
+		// Read flat data
+      for (j=0; j < height; ++j) {
+         for (i=0; i < width; ++i) {
+            stbi_uc rgbe[4];
+           main_decode_loop:
+            getn(s, rgbe, 4);
+            hdr_convert(hdr_data + j * width * req_comp + i * req_comp, rgbe, req_comp);
+         }
+      }
+	} else {
+		// Read RLE-encoded data
+		scanline = NULL;
+
+		for (j = 0; j < height; ++j) {
+         c1 = get8(s);
+         c2 = get8(s);
+         len = get8(s);
+         if (c1 != 2 || c2 != 2 || (len & 0x80)) {
+            // not run-length encoded, so we have to actually use THIS data as a decoded
+            // pixel (note this can't be a valid pixel--one of RGB must be >= 128)
+            stbi_uc rgbe[4] = { c1,c2,len, get8(s) };
+            hdr_convert(hdr_data, rgbe, req_comp);
+            i = 1;
+            j = 0;
+            free(scanline);
+            goto main_decode_loop; // yes, this is fucking insane; blame the fucking insane format
+         }
+         len <<= 8;
+         len |= get8(s);
+         if (len != width) { free(hdr_data); free(scanline); return epf("invalid decoded scanline length", "corrupt HDR"); }
+         if (scanline == NULL) scanline = (stbi_uc *) malloc(width * 4);
+				
+			for (k = 0; k < 4; ++k) {
+				i = 0;
+				while (i < width) {
+					count = get8(s);
+					if (count > 128) {
+						// Run
+						value = get8(s);
+                  count -= 128;
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = value;
+					} else {
+						// Dump
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = get8(s);
+					}
+				}
+			}
+         for (i=0; i < width; ++i)
+            hdr_convert(hdr_data+(j*width + i)*req_comp, scanline + i*4, req_comp);
+		}
+      free(scanline);
+	}
+
+   return hdr_data;
+}
+
+#ifndef STBI_NO_STDIO
+float *stbi_hdr_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s,f);
+   return hdr_load(&s,x,y,comp,req_comp);
+}
+#endif
+
+float *stbi_hdr_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s,buffer, len);
+   return hdr_load(&s,x,y,comp,req_comp);
+}
+
+#endif // STBI_NO_HDR
+
+/////////////////////// write image ///////////////////////
+
+#ifndef STBI_NO_WRITE
+
+static void write8(FILE *f, int x) { uint8 z = (uint8) x; fwrite(&z,1,1,f); }
+
+static void writefv(FILE *f, char *fmt, va_list v)
+{
+   while (*fmt) {
+      switch (*fmt++) {
+         case ' ': break;
+         case '1': { uint8 x = va_arg(v, int); write8(f,x); break; }
+         case '2': { int16 x = va_arg(v, int); write8(f,x); write8(f,x>>8); break; }
+         case '4': { int32 x = va_arg(v, int); write8(f,x); write8(f,x>>8); write8(f,x>>16); write8(f,x>>24); break; }
+         default:
+            assert(0);
+            va_end(v);
+            return;
+      }
+   }
+}
+
+static void writef(FILE *f, char *fmt, ...)
+{
+   va_list v;
+   va_start(v, fmt);
+   writefv(f,fmt,v);
+   va_end(v);
+}
+
+static void write_pixels(FILE *f, int rgb_dir, int vdir, int x, int y, int comp, void *data, int write_alpha, int scanline_pad)
+{
+   uint8 bg[3] = { 255, 0, 255}, px[3];
+   uint32 zero = 0;
+   int i,j,k, j_end;
+
+   if (vdir < 0) 
+      j_end = -1, j = y-1;
+   else
+      j_end =  y, j = 0;
+
+   for (; j != j_end; j += vdir) {
+      for (i=0; i < x; ++i) {
+         uint8 *d = (uint8 *) data + (j*x+i)*comp;
+         if (write_alpha < 0)
+            fwrite(&d[comp-1], 1, 1, f);
+         switch (comp) {
+            case 1:
+            case 2: writef(f, "111", d[0],d[0],d[0]);
+                    break;
+            case 4:
+               if (!write_alpha) {
+                  for (k=0; k < 3; ++k)
+                     px[k] = bg[k] + ((d[k] - bg[k]) * d[3])/255;
+                  writef(f, "111", px[1-rgb_dir],px[1],px[1+rgb_dir]);
+                  break;
+               }
+               /* FALLTHROUGH */
+            case 3:
+               writef(f, "111", d[1-rgb_dir],d[1],d[1+rgb_dir]);
+               break;
+         }
+         if (write_alpha > 0)
+            fwrite(&d[comp-1], 1, 1, f);
+      }
+      fwrite(&zero,scanline_pad,1,f);
+   }
+}
+
+static int outfile(char const *filename, int rgb_dir, int vdir, int x, int y, int comp, void *data, int alpha, int pad, char *fmt, ...)
+{
+   FILE *f = fopen(filename, "wb");
+   if (f) {
+      va_list v;
+      va_start(v, fmt);
+      writefv(f, fmt, v);
+      va_end(v);
+      write_pixels(f,rgb_dir,vdir,x,y,comp,data,alpha,pad);
+      fclose(f);
+   }
+   return f != NULL;
+}
+
+int stbi_write_bmp(char const *filename, int x, int y, int comp, void *data)
+{
+   int pad = (-x*3) & 3;
+   return outfile(filename,-1,-1,x,y,comp,data,0,pad,
+           "11 4 22 4" "4 44 22 444444",
+           'B', 'M', 14+40+(x*3+pad)*y, 0,0, 14+40,  // file header
+            40, x,y, 1,24, 0,0,0,0,0,0);             // bitmap header
+}
+
+int stbi_write_tga(char const *filename, int x, int y, int comp, void *data)
+{
+   int has_alpha = !(comp & 1);
+   return outfile(filename, -1,-1, x, y, comp, data, has_alpha, 0,
+                  "111 221 2222 11", 0,0,2, 0,0,0, 0,0,x,y, 24+8*has_alpha, 8*has_alpha);
+}
+
+// any other image formats that do interleaved rgb data?
+//    PNG: requires adler32,crc32 -- significant amount of code
+//    PSD: no, channels output separately
+//    TIFF: no, stripwise-interleaved... i think
+
+#endif // STBI_NO_WRITE
+
+#endif // STBI_HEADER_FILE_ONLY
+
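Review note on the PSD loader above: the comment block in psd_load describes the RLE scheme (a header byte n, then either n+1 literal bytes or one byte repeated -n+1 times, with 128 as a no-op). The sketch below shows that scheme in isolation. It is not part of this patch: the name packbits_decode is hypothetical, it reads from a plain memory buffer rather than the stbi stream, and it writes contiguously instead of with the stride of 4 that psd_load uses. Unlike the loader, it also bounds-checks both input and output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not in stb_image): decode PackBits-style RLE.
 * Decodes until exactly dst_len bytes have been produced.
 * Returns dst_len on success, or (size_t)-1 if the input runs out
 * or a run would overflow the destination. */
static size_t packbits_decode(const uint8_t *src, size_t src_len,
                              uint8_t *dst, size_t dst_len)
{
    size_t in = 0, out = 0;

    while (out < dst_len) {
        unsigned n;
        size_t run, k;

        if (in >= src_len) return (size_t)-1;   /* truncated input */
        n = src[in++];

        if (n == 128) {
            continue;                           /* no-op header byte */
        } else if (n < 128) {
            /* literal run: copy the next n+1 bytes unchanged */
            run = (size_t)n + 1;
            if (run > src_len - in || run > dst_len - out) return (size_t)-1;
            for (k = 0; k < run; ++k) dst[out++] = src[in++];
        } else {
            /* repeat run: n is -(256-n) as a signed byte, so the next
             * byte is written 257-n times (same as len ^= 0xFF; len += 2) */
            uint8_t value;
            run = 257u - n;
            if (in >= src_len || run > dst_len - out) return (size_t)-1;
            value = src[in++];
            for (k = 0; k < run; ++k) dst[out++] = value;
        }
    }
    return out;
}
```

For example, the input bytes 02 'a' 'b' 'c' FE 'x' 80 00 'z' decode to the seven bytes "abcxxxz": a three-byte literal, a run of three 'x', a no-op, and a one-byte literal.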
diff --git a/external/include/SOIL/stb_image_aug.c b/external/include/SOIL/stb_image_aug.c
new file mode 100644
index 0000000..bb088fc
--- /dev/null
+++ b/external/include/SOIL/stb_image_aug.c
@@ -0,0 +1,3682 @@
+/* stbi-1.16 - public domain JPEG/PNG reader - http://nothings.org/stb_image.c
+                      when you control the images you're loading
+
+   QUICK NOTES:
+      Primarily of interest to game developers and other people who can
+          avoid problematic images and only need the trivial interface
+
+      JPEG baseline (no JPEG progressive, no oddball channel decimations)
+      PNG non-interlaced
+      BMP non-1bpp, non-RLE
+      TGA (not sure what subset, if a subset)
+      PSD (composited view only, no extra channels)
+      HDR (radiance rgbE format)
+      writes BMP,TGA (define STBI_NO_WRITE to remove code)
+      decoded from memory or through stdio FILE (define STBI_NO_STDIO to remove code)
+      supports installable dequantizing-IDCT, YCbCr-to-RGB conversion (define STBI_SIMD)
+
+   TODO:
+      stbi_info_*
+
+   history:
+      1.16   major bugfix - convert_format converted one too many pixels
+      1.15   initialize some fields for thread safety
+      1.14   fix threadsafe conversion bug; header-file-only version (#define STBI_HEADER_FILE_ONLY before including)
+      1.13   threadsafe
+      1.12   const qualifiers in the API
+      1.11   Support installable IDCT, colorspace conversion routines
+      1.10   Fixes for 64-bit (don't use "unsigned long")
+             optimized upsampling by Fabian "ryg" Giesen
+      1.09   Fix format-conversion for PSD code (bad global variables!)
+      1.08   Thatcher Ulrich's PSD code integrated by Nicolas Schulz
+      1.07   attempt to fix C++ warning/errors again
+      1.06   attempt to fix C++ warning/errors again
+      1.05   fix TGA loading to return correct *comp and use good luminance calc
+      1.04   default float alpha is 1, not 255; use 'void *' for stbi_image_free
+      1.03   bugfixes to STBI_NO_STDIO, STBI_NO_HDR
+      1.02   support for (subset of) HDR files, float interface for preferred access to them
+      1.01   fix bug: possible bug in handling right-side up bmps... not sure
+             fix bug: the stbi_bmp_load() and stbi_tga_load() functions didn't work at all
+      1.00   interface to zlib that skips zlib header
+      0.99   correct handling of alpha in palette
+      0.98   TGA loader by lonesock; dynamically add loaders (untested)
+      0.97   jpeg errors on too large a file; also catch another malloc failure
+      0.96   fix detection of invalid v value - particleman@mollyrocket forum
+      0.95   during header scan, seek to markers in case of padding
+      0.94   STBI_NO_STDIO to disable stdio usage; rename all #defines the same
+      0.93   handle jpegtran output; verbose errors
+      0.92   read 4,8,16,24,32-bit BMP files of several formats
+      0.91   output 24-bit Windows 3.0 BMP files
+      0.90   fix a few more warnings; bump version number to approach 1.0
+      0.61   bugfixes due to Marc LeBlanc, Christopher Lloyd
+      0.60   fix compiling as c++
+      0.59   fix warnings: merge Dave Moore's -Wall fixes
+      0.58   fix bug: zlib uncompressed mode len/nlen was wrong endian
+      0.57   fix bug: jpg last huffman symbol before marker was >9 bits but less
+                      than 16 available
+      0.56   fix bug: zlib uncompressed mode len vs. nlen
+      0.55   fix bug: restart_interval not initialized to 0
+      0.54   allow NULL for 'int *comp'
+      0.53   fix bug in png 3->4; speedup png decoding
+      0.52   png handles req_comp=3,4 directly; minor cleanup; jpeg comments
+      0.51   obey req_comp requests, 1-component jpegs return as 1-component,
+             on 'test' only check type, not whether we support this variant
+*/
+
+#include "stb_image_aug.h"
+
+#ifndef STBI_NO_HDR
+#include <math.h>  // ldexp
+#include <string.h> // strcmp
+#endif
+
+#ifndef STBI_NO_STDIO
+#include <stdio.h>
+#endif
+#include <stdlib.h>
+#include <memory.h>
+#include <assert.h>
+#include <stdarg.h>
+
+#ifndef _MSC_VER
+  #ifdef __cplusplus
+  #define __forceinline inline
+  #else
+  #define __forceinline
+  #endif
+#endif
+
+
+// implementation:
+typedef unsigned char uint8;
+typedef unsigned short uint16;
+typedef   signed short  int16;
+typedef unsigned int   uint32;
+typedef   signed int    int32;
+typedef unsigned int   uint;
+
+// should produce compiler error if size is wrong
+typedef unsigned char validate_uint32[sizeof(uint32)==4];
+
+#if defined(STBI_NO_STDIO) && !defined(STBI_NO_WRITE)
+#define STBI_NO_WRITE
+#endif
+
+#ifndef STBI_NO_DDS
+#include "stbi_DDS_aug.h"
+#endif
+
+//	I (JLD) want full messages for SOIL
+#define STBI_FAILURE_USERMSG 1
+
+//////////////////////////////////////////////////////////////////////////////
+//
+// Generic API that works on all image types
+//
+
+// this is not threadsafe
+static char *failure_reason;
+
+char *stbi_failure_reason(void)
+{
+   return failure_reason;
+}
+
+static int e(char *str)
+{
+   failure_reason = str;
+   return 0;
+}
+
+#ifdef STBI_NO_FAILURE_STRINGS
+   #define e(x,y)  0
+#elif defined(STBI_FAILURE_USERMSG)
+   #define e(x,y)  e(y)
+#else
+   #define e(x,y)  e(x)
+#endif
+
+#define epf(x,y)   ((float *) (e(x,y)?NULL:NULL))
+#define epuc(x,y)  ((unsigned char *) (e(x,y)?NULL:NULL))
+
+void stbi_image_free(void *retval_from_stbi_load)
+{
+   free(retval_from_stbi_load);
+}
+
+#define MAX_LOADERS  32
+stbi_loader *loaders[MAX_LOADERS];
+static int max_loaders = 0;
+
+int stbi_register_loader(stbi_loader *loader)
+{
+   int i;
+   for (i=0; i < MAX_LOADERS; ++i) {
+      // already present?
+      if (loaders[i] == loader)
+         return 1;
+      // end of the list?
+      if (loaders[i] == NULL) {
+         loaders[i] = loader;
+         max_loaders = i+1;
+         return 1;
+      }
+   }
+   // no room for it
+   return 0;
+}
+
+#ifndef STBI_NO_HDR
+static float   *ldr_to_hdr(stbi_uc *data, int x, int y, int comp);
+static stbi_uc *hdr_to_ldr(float   *data, int x, int y, int comp);
+#endif
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   FILE *f = fopen(filename, "rb");
+   unsigned char *result;
+   if (!f) return epuc("can't fopen", "Unable to open file");
+   result = stbi_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return result;
+}
+
+unsigned char *stbi_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   int i;
+   if (stbi_jpeg_test_file(f))
+      return stbi_jpeg_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_png_test_file(f))
+      return stbi_png_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_bmp_test_file(f))
+      return stbi_bmp_load_from_file(f,x,y,comp,req_comp);
+   if (stbi_psd_test_file(f))
+      return stbi_psd_load_from_file(f,x,y,comp,req_comp);
+   #ifndef STBI_NO_DDS
+   if (stbi_dds_test_file(f))
+      return stbi_dds_load_from_file(f,x,y,comp,req_comp);
+   #endif
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_file(f)) {
+      float *hdr = stbi_hdr_load_from_file(f, x,y,comp,req_comp);
+      return hdr_to_ldr(hdr, *x, *y, req_comp ? req_comp : *comp);
+   }
+   #endif
+   for (i=0; i < max_loaders; ++i)
+      if (loaders[i]->test_file(f))
+         return loaders[i]->load_from_file(f,x,y,comp,req_comp);
+   // test tga last because it's a crappy test!
+   if (stbi_tga_test_file(f))
+      return stbi_tga_load_from_file(f,x,y,comp,req_comp);
+   return epuc("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+unsigned char *stbi_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   int i;
+   if (stbi_jpeg_test_memory(buffer,len))
+      return stbi_jpeg_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_png_test_memory(buffer,len))
+      return stbi_png_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_bmp_test_memory(buffer,len))
+      return stbi_bmp_load_from_memory(buffer,len,x,y,comp,req_comp);
+   if (stbi_psd_test_memory(buffer,len))
+      return stbi_psd_load_from_memory(buffer,len,x,y,comp,req_comp);
+   #ifndef STBI_NO_DDS
+   if (stbi_dds_test_memory(buffer,len))
+      return stbi_dds_load_from_memory(buffer,len,x,y,comp,req_comp);
+   #endif
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_memory(buffer, len)) {
+      float *hdr = stbi_hdr_load_from_memory(buffer, len,x,y,comp,req_comp);
+      return hdr_to_ldr(hdr, *x, *y, req_comp ? req_comp : *comp);
+   }
+   #endif
+   for (i=0; i < max_loaders; ++i)
+      if (loaders[i]->test_memory(buffer,len))
+         return loaders[i]->load_from_memory(buffer,len,x,y,comp,req_comp);
+   // test tga last because it's a crappy test!
+   if (stbi_tga_test_memory(buffer,len))
+      return stbi_tga_load_from_memory(buffer,len,x,y,comp,req_comp);
+   return epuc("unknown image type", "Image not of any known type, or corrupt");
+}
+
+#ifndef STBI_NO_HDR
+
+#ifndef STBI_NO_STDIO
+float *stbi_loadf(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   FILE *f = fopen(filename, "rb");
+   float *result;
+   if (!f) return epf("can't fopen", "Unable to open file");
+   result = stbi_loadf_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return result;
+}
+
+float *stbi_loadf_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_file(f))
+      return stbi_hdr_load_from_file(f,x,y,comp,req_comp);
+   #endif
+   data = stbi_load_from_file(f, x, y, comp, req_comp);
+   if (data)
+      return ldr_to_hdr(data, *x, *y, req_comp ? req_comp : *comp);
+   return epf("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+float *stbi_loadf_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   #ifndef STBI_NO_HDR
+   if (stbi_hdr_test_memory(buffer, len))
+      return stbi_hdr_load_from_memory(buffer, len,x,y,comp,req_comp);
+   #endif
+   data = stbi_load_from_memory(buffer, len, x, y, comp, req_comp);
+   if (data)
+      return ldr_to_hdr(data, *x, *y, req_comp ? req_comp : *comp);
+   return epf("unknown image type", "Image not of any known type, or corrupt");
+}
+#endif
+
+// the is-hdr-or-not queries are defined regardless of whether STBI_NO_HDR
+// is defined, for API simplicity; if STBI_NO_HDR is defined, they always
+// report false!
+
+int stbi_is_hdr_from_memory(stbi_uc const *buffer, int len)
+{
+   #ifndef STBI_NO_HDR
+   return stbi_hdr_test_memory(buffer, len);
+   #else
+   return 0;
+   #endif
+}
+
+#ifndef STBI_NO_STDIO
+extern int      stbi_is_hdr          (char const *filename)
+{
+   FILE *f = fopen(filename, "rb");
+   int result=0;
+   if (f) {
+      result = stbi_is_hdr_from_file(f);
+      fclose(f);
+   }
+   return result;
+}
+
+extern int      stbi_is_hdr_from_file(FILE *f)
+{
+   #ifndef STBI_NO_HDR
+   return stbi_hdr_test_file(f);
+   #else
+   return 0;
+   #endif
+}
+
+#endif
+
+// @TODO: get image dimensions & components without fully decoding
+#ifndef STBI_NO_STDIO
+extern int      stbi_info            (char const *filename,           int *x, int *y, int *comp);
+extern int      stbi_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_HDR
+static float h2l_gamma_i=1.0f/2.2f, h2l_scale_i=1.0f;
+static float l2h_gamma=2.2f, l2h_scale=1.0f;
+
+void   stbi_hdr_to_ldr_gamma(float gamma) { h2l_gamma_i = 1/gamma; }
+void   stbi_hdr_to_ldr_scale(float scale) { h2l_scale_i = 1/scale; }
+
+void   stbi_ldr_to_hdr_gamma(float gamma) { l2h_gamma = gamma; }
+void   stbi_ldr_to_hdr_scale(float scale) { l2h_scale = scale; }
+#endif
+
+
+//////////////////////////////////////////////////////////////////////////////
+//
+// Common code used by all image loaders
+//
+
+enum
+{
+   SCAN_load=0,
+   SCAN_type,
+   SCAN_header,
+};
+
+typedef struct
+{
+   uint32 img_x, img_y;
+   int img_n, img_out_n;
+
+   #ifndef STBI_NO_STDIO
+   FILE  *img_file;
+   #endif
+   uint8 *img_buffer, *img_buffer_end;
+} stbi;
+
+#ifndef STBI_NO_STDIO
+static void start_file(stbi *s, FILE *f)
+{
+   s->img_file = f;
+}
+#endif
+
+static void start_mem(stbi *s, uint8 const *buffer, int len)
+{
+#ifndef STBI_NO_STDIO
+   s->img_file = NULL;
+#endif
+   s->img_buffer = (uint8 *) buffer;
+   s->img_buffer_end = (uint8 *) buffer+len;
+}
+
+__forceinline static int get8(stbi *s)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file) {
+      int c = fgetc(s->img_file);
+      return c == EOF ? 0 : c;
+   }
+#endif
+   if (s->img_buffer < s->img_buffer_end)
+      return *s->img_buffer++;
+   return 0;
+}
+
+__forceinline static int at_eof(stbi *s)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file)
+      return feof(s->img_file);
+#endif
+   return s->img_buffer >= s->img_buffer_end;
+}
+
+__forceinline static uint8 get8u(stbi *s)
+{
+   return (uint8) get8(s);
+}
+
+static void skip(stbi *s, int n)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file)
+      fseek(s->img_file, n, SEEK_CUR);
+   else
+#endif
+      s->img_buffer += n;
+}
+
+static int get16(stbi *s)
+{
+   int z = get8(s);
+   return (z << 8) + get8(s);
+}
+
+static uint32 get32(stbi *s)
+{
+   uint32 z = get16(s);
+   return (z << 16) + get16(s);
+}
+
+static int get16le(stbi *s)
+{
+   int z = get8(s);
+   return z + (get8(s) << 8);
+}
+
+static uint32 get32le(stbi *s)
+{
+   uint32 z = get16le(s);
+   return z + ((uint32) get16le(s) << 16);
+}
+
+static void getn(stbi *s, stbi_uc *buffer, int n)
+{
+#ifndef STBI_NO_STDIO
+   if (s->img_file) {
+      fread(buffer, 1, n, s->img_file);
+      return;
+   }
+#endif
+   memcpy(buffer, s->img_buffer, n);
+   s->img_buffer += n;
+}
+
+//////////////////////////////////////////////////////////////////////////////
+//
+//  generic converter from built-in img_n to req_comp
+//    individual types do this automatically as much as possible (e.g. jpeg
+//    does all cases internally since it needs to colorspace convert anyway,
+//    and it never has alpha, so very few cases). png can automatically
+//    interleave an alpha=255 channel, but falls back to this for other cases
+//
+//  assume data buffer is malloced, so malloc a new one and free that one
+//  only failure mode is malloc failing
+
+static uint8 compute_y(int r, int g, int b)
+{
+   return (uint8) (((r*77) + (g*150) +  (29*b)) >> 8);
+}
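+
+// Worked check of compute_y (an illustrative note, not part of the original
+// source): the weights 77, 150 and 29 sum to 256, so they approximate the
+// common luma weights 0.299, 0.587 and 0.114 scaled by 256, and the >> 8
+// divides that scale back out. For example:
+//    compute_y(255,255,255) = (255*256) >> 8 = 255
+//    compute_y(255,  0,  0) = 19635 >> 8     = 76
+//    compute_y(  0,255,  0) = 38250 >> 8     = 149
+//    compute_y(  0,  0,255) = 7395 >> 8      = 28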
+
+static unsigned char *convert_format(unsigned char *data, int img_n, int req_comp, uint x, uint y)
+{
+   int i,j;
+   unsigned char *good;
+
+   if (req_comp == img_n) return data;
+   assert(req_comp >= 1 && req_comp <= 4);
+
+   good = (unsigned char *) malloc(req_comp * x * y);
+   if (good == NULL) {
+      free(data);
+      return epuc("outofmem", "Out of memory");
+   }
+
+   for (j=0; j < (int) y; ++j) {
+      unsigned char *src  = data + j * x * img_n   ;
+      unsigned char *dest = good + j * x * req_comp;
+
+      #define COMBO(a,b)  ((a)*8+(b))
+      #define CASE(a,b)   case COMBO(a,b): for(i=x-1; i >= 0; --i, src += a, dest += b)
+      // convert source image with img_n components to one with req_comp components;
+      // avoid switch per pixel, so use switch per scanline and massive macros
+      switch(COMBO(img_n, req_comp)) {
+         CASE(1,2) dest[0]=src[0], dest[1]=255; break;
+         CASE(1,3) dest[0]=dest[1]=dest[2]=src[0]; break;
+         CASE(1,4) dest[0]=dest[1]=dest[2]=src[0], dest[3]=255; break;
+         CASE(2,1) dest[0]=src[0]; break;
+         CASE(2,3) dest[0]=dest[1]=dest[2]=src[0]; break;
+         CASE(2,4) dest[0]=dest[1]=dest[2]=src[0], dest[3]=src[1]; break;
+         CASE(3,4) dest[0]=src[0],dest[1]=src[1],dest[2]=src[2],dest[3]=255; break;
+         CASE(3,1) dest[0]=compute_y(src[0],src[1],src[2]); break;
+         CASE(3,2) dest[0]=compute_y(src[0],src[1],src[2]), dest[1] = 255; break;
+         CASE(4,1) dest[0]=compute_y(src[0],src[1],src[2]); break;
+         CASE(4,2) dest[0]=compute_y(src[0],src[1],src[2]), dest[1] = src[3]; break;
+         CASE(4,3) dest[0]=src[0],dest[1]=src[1],dest[2]=src[2]; break;
+         default: assert(0);
+      }
+      #undef CASE
+   }
+
+   free(data);
+   return good;
+}
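+
+// Illustrative note on the COMBO key used in convert_format: both component
+// counts lie in 1..4, so a*8+b is distinct for every (img_n, req_comp) pair
+// and a single switch per scanline can select the conversion. For example:
+//    COMBO(1,2) = 1*8+2 = 10
+//    COMBO(3,4) = 3*8+4 = 28
+//    COMBO(4,3) = 4*8+3 = 35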
+
+#ifndef STBI_NO_HDR
+static float   *ldr_to_hdr(stbi_uc *data, int x, int y, int comp)
+{
+   int i,k,n;
+   float *output = (float *) malloc(x * y * comp * sizeof(float));
+   if (output == NULL) { free(data); return epf("outofmem", "Out of memory"); }
+   // compute number of non-alpha components
+   if (comp & 1) n = comp; else n = comp-1;
+   for (i=0; i < x*y; ++i) {
+      for (k=0; k < n; ++k) {
+         output[i*comp + k] = (float) pow(data[i*comp+k]/255.0f, l2h_gamma) * l2h_scale;
+      }
+      if (k < comp) output[i*comp + k] = data[i*comp+k]/255.0f;
+   }
+   free(data);
+   return output;
+}
+
+#define float2int(x)   ((int) (x))
+static stbi_uc *hdr_to_ldr(float   *data, int x, int y, int comp)
+{
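+   int i,k,n;
+   stbi_uc *output;
+   // a failed HDR load hands us NULL; pass the failure through untouched
+   if (data == NULL) return NULL;
+   output = (stbi_uc *) malloc(x * y * comp);
+   if (output == NULL) { free(data); return epuc("outofmem", "Out of memory"); }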
+   // compute number of non-alpha components
+   if (comp & 1) n = comp; else n = comp-1;
+   for (i=0; i < x*y; ++i) {
+      for (k=0; k < n; ++k) {
+         float z = (float) pow(data[i*comp+k]*h2l_scale_i, h2l_gamma_i) * 255 + 0.5f;
+         if (z < 0) z = 0;
+         if (z > 255) z = 255;
+         output[i*comp + k] = float2int(z);
+      }
+      if (k < comp) {
+         float z = data[i*comp+k] * 255 + 0.5f;
+         if (z < 0) z = 0;
+         if (z > 255) z = 255;
+         output[i*comp + k] = float2int(z);
+      }
+   }
+   free(data);
+   return output;
+}
+#endif
+
+//////////////////////////////////////////////////////////////////////////////
+//
+//  "baseline" JPEG/JFIF decoder (not actually fully baseline implementation)
+//
+//    simple implementation
+//      - channel subsampling of at most 2 in each dimension
+//      - doesn't support delayed output of y-dimension
+//      - simple interface (only one output format: 8-bit interleaved RGB)
+//      - doesn't try to recover corrupt jpegs
+//      - doesn't allow partial loading, loading multiple at once
+//      - still fast on x86 (copying globals into locals doesn't help x86)
+//      - allocates lots of intermediate memory (full size of all components)
+//        - non-interleaved case requires this anyway
+//        - allows good upsampling (see next)
+//    high-quality
+//      - upsampled channels are bilinearly interpolated, even across blocks
+//      - quality integer IDCT derived from IJG's 'slow'
+//    performance
+//      - fast huffman; reasonable integer IDCT
+//      - uses a lot of intermediate memory, could cache poorly
+//      - load http://nothings.org/remote/anemones.jpg 3 times on 2.8GHz P4
+//          stb_jpeg:   1.34 seconds (MSVC6, default release build)
+//          stb_jpeg:   1.06 seconds (MSVC6, processor = Pentium Pro)
+//          IJL11.dll:  1.08 seconds (compiled by intel)
+//          IJG 1998:   0.98 seconds (MSVC6, makefile provided by IJG)
+//          IJG 1998:   0.95 seconds (MSVC6, makefile + proc=PPro)
+
+// huffman decoding acceleration
+#define FAST_BITS   9  // larger handles more cases; smaller stomps less cache
+
+typedef struct
+{
+   uint8  fast[1 << FAST_BITS];
+   // weirdly, repacking this into AoS is a 10% speed loss, instead of a win
+   uint16 code[256];
+   uint8  values[256];
+   uint8  size[257];
+   unsigned int maxcode[18];
+   int    delta[17];   // old 'firstsymbol' - old 'firstcode'
+} huffman;
+
+typedef struct
+{
+   #if STBI_SIMD
+   unsigned short dequant2[4][64];
+   #endif
+   stbi s;
+   huffman huff_dc[4];
+   huffman huff_ac[4];
+   uint8 dequant[4][64];
+
+// sizes for components, interleaved MCUs
+   int img_h_max, img_v_max;
+   int img_mcu_x, img_mcu_y;
+   int img_mcu_w, img_mcu_h;
+
+// definition of jpeg image component
+   struct
+   {
+      int id;
+      int h,v;
+      int tq;
+      int hd,ha;
+      int dc_pred;
+
+      int x,y,w2,h2;
+      uint8 *data;
+      void *raw_data;
+      uint8 *linebuf;
+   } img_comp[4];
+
+   uint32         code_buffer; // jpeg entropy-coded buffer
+   int            code_bits;   // number of valid bits
+   unsigned char  marker;      // marker seen while filling entropy buffer
+   int            nomore;      // flag if we saw a marker so must stop
+
+   int scan_n, order[4];
+   int restart_interval, todo;
+} jpeg;
+
+static int build_huffman(huffman *h, int *count)
+{
+   int i,j,k=0,code;
+   // build size list for each symbol (from JPEG spec)
+   for (i=0; i < 16; ++i)
+      for (j=0; j < count[i]; ++j)
+         h->size[k++] = (uint8) (i+1);
+   h->size[k] = 0;
+
+   // compute actual symbols (from jpeg spec)
+   code = 0;
+   k = 0;
+   for(j=1; j <= 16; ++j) {
+      // compute delta to add to code to compute symbol id
+      h->delta[j] = k - code;
+      if (h->size[k] == j) {
+         while (h->size[k] == j)
+            h->code[k++] = (uint16) (code++);
+         if (code-1 >= (1 << j)) return e("bad code lengths","Corrupt JPEG");
+      }
+      // compute largest code + 1 for this size, preshifted as needed later
+      h->maxcode[j] = code << (16-j);
+      code <<= 1;
+   }
+   h->maxcode[j] = 0xffffffff;
+
+   // build non-spec acceleration table; 255 is flag for not-accelerated
+   memset(h->fast, 255, 1 << FAST_BITS);
+   for (i=0; i < k; ++i) {
+      int s = h->size[i];
+      if (s <= FAST_BITS) {
+         int c = h->code[i] << (FAST_BITS-s);
+         int m = 1 << (FAST_BITS-s);
+         for (j=0; j < m; ++j) {
+            h->fast[c+j] = (uint8) i;
+         }
+      }
+   }
+   return 1;
+}
+
+static void grow_buffer_unsafe(jpeg *j)
+{
+   do {
+      int b = j->nomore ? 0 : get8(&j->s);
+      if (b == 0xff) {
+         int c = get8(&j->s);
+         if (c != 0) {
+            j->marker = (unsigned char) c;
+            j->nomore = 1;
+            return;
+         }
+      }
+      j->code_buffer = (j->code_buffer << 8) | b;
+      j->code_bits += 8;
+   } while (j->code_bits <= 24);
+}
+
+// (1 << n) - 1
+static uint32 bmask[17]={0,1,3,7,15,31,63,127,255,511,1023,2047,4095,8191,16383,32767,65535};
+
+// decode a jpeg huffman value from the bitstream
+__forceinline static int decode(jpeg *j, huffman *h)
+{
+   unsigned int temp;
+   int c,k;
+
+   if (j->code_bits < 16) grow_buffer_unsafe(j);
+
+   // look at the top FAST_BITS bits and determine what symbol ID they encode,
+   // if the code length is <= FAST_BITS
+   c = (j->code_buffer >> (j->code_bits - FAST_BITS)) & ((1 << FAST_BITS)-1);
+   k = h->fast[c];
+   if (k < 255) {
+      if (h->size[k] > j->code_bits)
+         return -1;
+      j->code_bits -= h->size[k];
+      return h->values[k];
+   }
+
+   // naive test is to shift the code_buffer down so k bits are
+   // valid, then test against maxcode. To speed this up, we've
+   // preshifted maxcode left so that it has (16-k) 0s at the
+   // end; in other words, regardless of the number of bits, it
+   // wants to be compared against something shifted to have 16;
+   // that way we don't need to shift inside the loop.
+   if (j->code_bits < 16)
+      temp = (j->code_buffer << (16 - j->code_bits)) & 0xffff;
+   else
+      temp = (j->code_buffer >> (j->code_bits - 16)) & 0xffff;
+   for (k=FAST_BITS+1 ; ; ++k)
+      if (temp < h->maxcode[k])
+         break;
+   if (k == 17) {
+      // error! code not found
+      j->code_bits -= 16;
+      return -1;
+   }
+
+   if (k > j->code_bits)
+      return -1;
+
+   // convert the huffman code to the symbol id
+   c = ((j->code_buffer >> (j->code_bits - k)) & bmask[k]) + h->delta[k];
+   assert((((j->code_buffer) >> (j->code_bits - h->size[c])) & bmask[h->size[c]]) == h->code[c]);
+
+   // convert the id to a symbol
+   j->code_bits -= k;
+   return h->values[c];
+}
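+
+// Worked example of the preshifted comparison in decode() (illustrative;
+// the table value here is hypothetical): suppose the largest 10-bit code
+// plus one is 0x3F0. build_huffman stores maxcode[10] = 0x3F0 << (16-10) =
+// 0xFC00. decode() left-aligns the next 16 bits of input in temp, so
+// temp < 0xFC00 holds exactly when the first 10 bits of temp are < 0x3F0,
+// and no per-length shift is needed inside the search loop.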
+
+// combined JPEG 'receive' and JPEG 'extend', since baseline
+// always extends everything it receives.
+__forceinline static int extend_receive(jpeg *j, int n)
+{
+   unsigned int m = 1 << (n-1);
+   unsigned int k;
+   if (j->code_bits < n) grow_buffer_unsafe(j);
+   k = (j->code_buffer >> (j->code_bits - n)) & bmask[n];
+   j->code_bits -= n;
+   // the following test is probably a random branch that won't
+   // predict well. I tried to table accelerate it but failed.
+   // maybe it's compiling as a conditional move?
+   if (k < m)
+      return (int) k - (1 << n) + 1; // same value as (-1 << n) + k + 1, without shifting a negative
+   else
+      return k;
+}
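+
+// Worked example for extend_receive (illustrative; the values follow from
+// the code above): with n=3, m = 1 << 2 = 4.
+//    bits 101 -> k=5, k >= m, result is 5
+//    bits 010 -> k=2, k <  m, result is 2 - 8 + 1 = -5
+// so 3-bit inputs cover -7..-4 and 4..7, matching the JPEG 'extend' step.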
+
+// given a value that's at position X in the zigzag stream,
+// where does it appear in the 8x8 matrix coded as row-major?
+static uint8 dezigzag[64+15] =
+{
+    0,  1,  8, 16,  9,  2,  3, 10,
+   17, 24, 32, 25, 18, 11,  4,  5,
+   12, 19, 26, 33, 40, 48, 41, 34,
+   27, 20, 13,  6,  7, 14, 21, 28,
+   35, 42, 49, 56, 57, 50, 43, 36,
+   29, 22, 15, 23, 30, 37, 44, 51,
+   58, 59, 52, 45, 38, 31, 39, 46,
+   53, 60, 61, 54, 47, 55, 62, 63,
+   // let corrupt input sample past end
+   63, 63, 63, 63, 63, 63, 63, 63,
+   63, 63, 63, 63, 63, 63, 63
+};
+
+// decode one 64-entry block of coefficients (DC term, then zigzag-ordered AC terms)
+static int decode_block(jpeg *j, short data[64], huffman *hdc, huffman *hac, int b)
+{
+   int diff,dc,k;
+   int t = decode(j, hdc);
+   if (t < 0) return e("bad huffman code","Corrupt JPEG");
+
+   // 0 all the ac values now so we can do it 32-bits at a time
+   memset(data,0,64*sizeof(data[0]));
+
+   diff = t ? extend_receive(j, t) : 0;
+   dc = j->img_comp[b].dc_pred + diff;
+   j->img_comp[b].dc_pred = dc;
+   data[0] = (short) dc;
+
+   // decode AC components, see JPEG spec
+   k = 1;
+   do {
+      int r,s;
+      int rs = decode(j, hac);
+      if (rs < 0) return e("bad huffman code","Corrupt JPEG");
+      s = rs & 15;
+      r = rs >> 4;
+      if (s == 0) {
+         if (rs != 0xf0) break; // end block
+         k += 16;
+      } else {
+         k += r;
+         // decode into unzigzag'd location
+         data[dezigzag[k++]] = (short) extend_receive(j,s);
+      }
+   } while (k < 64);
+   return 1;
+}
+
+// take a -128..127 value and clamp it and convert to 0..255
+__forceinline static uint8 clamp(int x)
+{
+   x += 128;
+   // trick to use a single test to catch both cases
+   if ((unsigned int) x > 255) {
+      if (x < 0) return 0;
+      if (x > 255) return 255;
+   }
+   return (uint8) x;
+}
+
+#define f2f(x)  (int) (((x) * 4096 + 0.5))
+#define fsh(x)  ((x) << 12)
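+
+// Illustrative note on the fixed-point helpers above: f2f() converts a
+// floating constant to 12-bit fixed point with rounding, and fsh() scales an
+// integer to match. For example:
+//    f2f(0.5411961f) = (int) (0.5411961*4096 + 0.5) = (int) 2217.24 = 2217
+//    fsh(3)          = 3 << 12                      = 12288
+// Products such as p1 = (p2+p3) * f2f(0.5411961f) therefore carry a factor
+// of 4096, which idct_block removes with its right shifts.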
+
+// derived from jidctint -- DCT_ISLOW
+#define IDCT_1D(s0,s1,s2,s3,s4,s5,s6,s7)       \
+   int t0,t1,t2,t3,p1,p2,p3,p4,p5,x0,x1,x2,x3; \
+   p2 = s2;                                    \
+   p3 = s6;                                    \
+   p1 = (p2+p3) * f2f(0.5411961f);             \
+   t2 = p1 + p3*f2f(-1.847759065f);            \
+   t3 = p1 + p2*f2f( 0.765366865f);            \
+   p2 = s0;                                    \
+   p3 = s4;                                    \
+   t0 = fsh(p2+p3);                            \
+   t1 = fsh(p2-p3);                            \
+   x0 = t0+t3;                                 \
+   x3 = t0-t3;                                 \
+   x1 = t1+t2;                                 \
+   x2 = t1-t2;                                 \
+   t0 = s7;                                    \
+   t1 = s5;                                    \
+   t2 = s3;                                    \
+   t3 = s1;                                    \
+   p3 = t0+t2;                                 \
+   p4 = t1+t3;                                 \
+   p1 = t0+t3;                                 \
+   p2 = t1+t2;                                 \
+   p5 = (p3+p4)*f2f( 1.175875602f);            \
+   t0 = t0*f2f( 0.298631336f);                 \
+   t1 = t1*f2f( 2.053119869f);                 \
+   t2 = t2*f2f( 3.072711026f);                 \
+   t3 = t3*f2f( 1.501321110f);                 \
+   p1 = p5 + p1*f2f(-0.899976223f);            \
+   p2 = p5 + p2*f2f(-2.562915447f);            \
+   p3 = p3*f2f(-1.961570560f);                 \
+   p4 = p4*f2f(-0.390180644f);                 \
+   t3 += p1+p4;                                \
+   t2 += p2+p3;                                \
+   t1 += p2+p4;                                \
+   t0 += p1+p3;
+
+#if !STBI_SIMD
+// .344 seconds on 3*anemones.jpg
+static void idct_block(uint8 *out, int out_stride, short data[64], uint8 *dequantize)
+{
+   int i,val[64],*v=val;
+   uint8 *o,*dq = dequantize;
+   short *d = data;
+
+   // columns
+   for (i=0; i < 8; ++i,++d,++dq, ++v) {
+      // if all zeroes, shortcut -- this avoids dequantizing 0s and IDCTing
+      if (d[ 8]==0 && d[16]==0 && d[24]==0 && d[32]==0
+           && d[40]==0 && d[48]==0 && d[56]==0) {
+         //    no shortcut                 0     seconds
+         //    (1|2|3|4|5|6|7)==0          0     seconds
+         //    all separate               -0.047 seconds
+         //    1 && 2|3 && 4|5 && 6|7:    -0.047 seconds
+         int dcterm = d[0] * dq[0] << 2;
+         v[0] = v[8] = v[16] = v[24] = v[32] = v[40] = v[48] = v[56] = dcterm;
+      } else {
+         IDCT_1D(d[ 0]*dq[ 0],d[ 8]*dq[ 8],d[16]*dq[16],d[24]*dq[24],
+                 d[32]*dq[32],d[40]*dq[40],d[48]*dq[48],d[56]*dq[56])
+         // constants scaled things up by 1<<12; let's bring them back
+         // down, but keep 2 extra bits of precision
+         x0 += 512; x1 += 512; x2 += 512; x3 += 512;
+         v[ 0] = (x0+t3) >> 10;
+         v[56] = (x0-t3) >> 10;
+         v[ 8] = (x1+t2) >> 10;
+         v[48] = (x1-t2) >> 10;
+         v[16] = (x2+t1) >> 10;
+         v[40] = (x2-t1) >> 10;
+         v[24] = (x3+t0) >> 10;
+         v[32] = (x3-t0) >> 10;
+      }
+   }
+
+   for (i=0, v=val, o=out; i < 8; ++i,v+=8,o+=out_stride) {
+      // no fast case since the first 1D IDCT spread components out
+      IDCT_1D(v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7])
+      // constants scaled things up by 1<<12, plus we had 1<<2 from first
+      // loop, plus horizontal and vertical each scale by sqrt(8) so together
+      // we've got an extra 1<<3, so 1<<17 total we need to remove.
+      x0 += 65536; x1 += 65536; x2 += 65536; x3 += 65536;
+      o[0] = clamp((x0+t3) >> 17);
+      o[7] = clamp((x0-t3) >> 17);
+      o[1] = clamp((x1+t2) >> 17);
+      o[6] = clamp((x1-t2) >> 17);
+      o[2] = clamp((x2+t1) >> 17);
+      o[5] = clamp((x2-t1) >> 17);
+      o[3] = clamp((x3+t0) >> 17);
+      o[4] = clamp((x3-t0) >> 17);
+   }
+}
+#else
+static void idct_block(uint8 *out, int out_stride, short data[64], unsigned short *dequantize)
+{
+   int i,val[64],*v=val;
+   uint8 *o;
+   unsigned short *dq = dequantize;
+   short *d = data;
+
+   // columns
+   for (i=0; i < 8; ++i,++d,++dq, ++v) {
+      // if all zeroes, shortcut -- this avoids dequantizing 0s and IDCTing
+      if (d[ 8]==0 && d[16]==0 && d[24]==0 && d[32]==0
+           && d[40]==0 && d[48]==0 && d[56]==0) {
+         //    no shortcut                 0     seconds
+         //    (1|2|3|4|5|6|7)==0          0     seconds
+         //    all separate               -0.047 seconds
+         //    1 && 2|3 && 4|5 && 6|7:    -0.047 seconds
+         int dcterm = d[0] * dq[0] << 2;
+         v[0] = v[8] = v[16] = v[24] = v[32] = v[40] = v[48] = v[56] = dcterm;
+      } else {
+         IDCT_1D(d[ 0]*dq[ 0],d[ 8]*dq[ 8],d[16]*dq[16],d[24]*dq[24],
+                 d[32]*dq[32],d[40]*dq[40],d[48]*dq[48],d[56]*dq[56])
+         // constants scaled things up by 1<<12; let's bring them back
+         // down, but keep 2 extra bits of precision
+         x0 += 512; x1 += 512; x2 += 512; x3 += 512;
+         v[ 0] = (x0+t3) >> 10;
+         v[56] = (x0-t3) >> 10;
+         v[ 8] = (x1+t2) >> 10;
+         v[48] = (x1-t2) >> 10;
+         v[16] = (x2+t1) >> 10;
+         v[40] = (x2-t1) >> 10;
+         v[24] = (x3+t0) >> 10;
+         v[32] = (x3-t0) >> 10;
+      }
+   }
+
+   for (i=0, v=val, o=out; i < 8; ++i,v+=8,o+=out_stride) {
+      // no fast case since the first 1D IDCT spread components out
+      IDCT_1D(v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7])
+      // constants scaled things up by 1<<12, plus we had 1<<2 from first
+      // loop, plus horizontal and vertical each scale by sqrt(8) so together
+      // we've got an extra 1<<3, so 1<<17 total we need to remove.
+      x0 += 65536; x1 += 65536; x2 += 65536; x3 += 65536;
+      o[0] = clamp((x0+t3) >> 17);
+      o[7] = clamp((x0-t3) >> 17);
+      o[1] = clamp((x1+t2) >> 17);
+      o[6] = clamp((x1-t2) >> 17);
+      o[2] = clamp((x2+t1) >> 17);
+      o[5] = clamp((x2-t1) >> 17);
+      o[3] = clamp((x3+t0) >> 17);
+      o[4] = clamp((x3-t0) >> 17);
+   }
+}
+static stbi_idct_8x8 stbi_idct_installed = idct_block;
+
+extern void stbi_install_idct(stbi_idct_8x8 func)
+{
+   stbi_idct_installed = func;
+}
+#endif
+
+#define MARKER_none  0xff
+// if there's a pending marker from the entropy stream, return that
+// otherwise, fetch from the stream and get a marker. if there's no
+// marker, return 0xff, which is never a valid marker value
+static uint8 get_marker(jpeg *j)
+{
+   uint8 x;
+   if (j->marker != MARKER_none) { x = j->marker; j->marker = MARKER_none; return x; }
+   x = get8u(&j->s);
+   if (x != 0xff) return MARKER_none;
+   while (x == 0xff)
+      x = get8u(&j->s);
+   return x;
+}
+
+// in each scan, we'll have scan_n components, and the order
+// of the components is specified by order[]
+#define RESTART(x)     ((x) >= 0xd0 && (x) <= 0xd7)
+
+// after a restart interval, reset the entropy decoder and
+// the dc prediction
+static void reset(jpeg *j)
+{
+   j->code_bits = 0;
+   j->code_buffer = 0;
+   j->nomore = 0;
+   j->img_comp[0].dc_pred = j->img_comp[1].dc_pred = j->img_comp[2].dc_pred = 0;
+   j->marker = MARKER_none;
+   j->todo = j->restart_interval ? j->restart_interval : 0x7fffffff;
+   // no more than 1<<31 MCUs if no restart_interval? that's plenty safe,
+   // since we don't even allow 1<<30 pixels
+}
+
+static int parse_entropy_coded_data(jpeg *z)
+{
+   reset(z);
+   if (z->scan_n == 1) {
+      int i,j;
+      #if STBI_SIMD
+      __declspec(align(16))
+      #endif
+      short data[64];
+      int n = z->order[0];
+      // non-interleaved data, we just need to process one block at a time,
+      // in trivial scanline order
+      // number of blocks to do just depends on how many actual "pixels" this
+      // component has, independent of interleaved MCU blocking and such
+      int w = (z->img_comp[n].x+7) >> 3;
+      int h = (z->img_comp[n].y+7) >> 3;
+      for (j=0; j < h; ++j) {
+         for (i=0; i < w; ++i) {
+            if (!decode_block(z, data, z->huff_dc+z->img_comp[n].hd, z->huff_ac+z->img_comp[n].ha, n)) return 0;
+            #if STBI_SIMD
+            stbi_idct_installed(z->img_comp[n].data+z->img_comp[n].w2*j*8+i*8, z->img_comp[n].w2, data, z->dequant2[z->img_comp[n].tq]);
+            #else
+            idct_block(z->img_comp[n].data+z->img_comp[n].w2*j*8+i*8, z->img_comp[n].w2, data, z->dequant[z->img_comp[n].tq]);
+            #endif
+            // every data block is an MCU, so count down the restart interval
+            if (--z->todo <= 0) {
+               if (z->code_bits < 24) grow_buffer_unsafe(z);
+               // if it's NOT a restart, then just bail, so we get corrupt data
+               // rather than no data
+               if (!RESTART(z->marker)) return 1;
+               reset(z);
+            }
+         }
+      }
+   } else { // interleaved!
+      int i,j,k,x,y;
+      short data[64];
+      for (j=0; j < z->img_mcu_y; ++j) {
+         for (i=0; i < z->img_mcu_x; ++i) {
+            // scan an interleaved mcu... process scan_n components in order
+            for (k=0; k < z->scan_n; ++k) {
+               int n = z->order[k];
+               // scan out an mcu's worth of this component; that's just determined
+               // by the basic H and V specified for the component
+               for (y=0; y < z->img_comp[n].v; ++y) {
+                  for (x=0; x < z->img_comp[n].h; ++x) {
+                     int x2 = (i*z->img_comp[n].h + x)*8;
+                     int y2 = (j*z->img_comp[n].v + y)*8;
+                     if (!decode_block(z, data, z->huff_dc+z->img_comp[n].hd, z->huff_ac+z->img_comp[n].ha, n)) return 0;
+                     #if STBI_SIMD
+                     stbi_idct_installed(z->img_comp[n].data+z->img_comp[n].w2*y2+x2, z->img_comp[n].w2, data, z->dequant2[z->img_comp[n].tq]);
+                     #else
+                     idct_block(z->img_comp[n].data+z->img_comp[n].w2*y2+x2, z->img_comp[n].w2, data, z->dequant[z->img_comp[n].tq]);
+                     #endif
+                  }
+               }
+            }
+            // after all interleaved components, that's an interleaved MCU,
+            // so now count down the restart interval
+            if (--z->todo <= 0) {
+               if (z->code_bits < 24) grow_buffer_unsafe(z);
+               // if it's NOT a restart, then just bail, so we get corrupt data
+               // rather than no data
+               if (!RESTART(z->marker)) return 1;
+               reset(z);
+            }
+         }
+      }
+   }
+   return 1;
+}
+
+static int process_marker(jpeg *z, int m)
+{
+   int L;
+   switch (m) {
+      case MARKER_none: // no marker found
+         return e("expected marker","Corrupt JPEG");
+
+      case 0xC2: // SOF - progressive
+         return e("progressive jpeg","JPEG format not supported (progressive)");
+
+      case 0xDD: // DRI - specify restart interval
+         if (get16(&z->s) != 4) return e("bad DRI len","Corrupt JPEG");
+         z->restart_interval = get16(&z->s);
+         return 1;
+
+      case 0xDB: // DQT - define quantization table
+         L = get16(&z->s)-2;
+         while (L > 0) {
+            int q = get8(&z->s);
+            int p = q >> 4;
+            int t = q & 15,i;
+            if (p != 0) return e("bad DQT type","Corrupt JPEG");
+            if (t > 3) return e("bad DQT table","Corrupt JPEG");
+            for (i=0; i < 64; ++i)
+               z->dequant[t][dezigzag[i]] = get8u(&z->s);
+            #if STBI_SIMD
+            for (i=0; i < 64; ++i)
+               z->dequant2[t][i] = z->dequant[t][i];
+            #endif
+            L -= 65;
+         }
+         return L==0;
+
+      case 0xC4: // DHT - define huffman table
+         L = get16(&z->s)-2;
+         while (L > 0) {
+            uint8 *v;
+            int sizes[16],i,m=0;
+            int q = get8(&z->s);
+            int tc = q >> 4;
+            int th = q & 15;
+            if (tc > 1 || th > 3) return e("bad DHT header","Corrupt JPEG");
+            for (i=0; i < 16; ++i) {
+               sizes[i] = get8(&z->s);
+               m += sizes[i];
+            }
+            L -= 17;
+            if (tc == 0) {
+               if (!build_huffman(z->huff_dc+th, sizes)) return 0;
+               v = z->huff_dc[th].values;
+            } else {
+               if (!build_huffman(z->huff_ac+th, sizes)) return 0;
+               v = z->huff_ac[th].values;
+            }
+            for (i=0; i < m; ++i)
+               v[i] = get8u(&z->s);
+            L -= m;
+         }
+         return L==0;
+   }
+   // check for comment block or APP blocks
+   if ((m >= 0xE0 && m <= 0xEF) || m == 0xFE) {
+      skip(&z->s, get16(&z->s)-2);
+      return 1;
+   }
+   return 0;
+}
+
+// after we see SOS
+static int process_scan_header(jpeg *z)
+{
+   int i;
+   int Ls = get16(&z->s);
+   z->scan_n = get8(&z->s);
+   if (z->scan_n < 1 || z->scan_n > 4 || z->scan_n > (int) z->s.img_n) return e("bad SOS component count","Corrupt JPEG");
+   if (Ls != 6+2*z->scan_n) return e("bad SOS len","Corrupt JPEG");
+   for (i=0; i < z->scan_n; ++i) {
+      int id = get8(&z->s), which;
+      int q = get8(&z->s);
+      for (which = 0; which < z->s.img_n; ++which)
+         if (z->img_comp[which].id == id)
+            break;
+      if (which == z->s.img_n) return 0;
+      z->img_comp[which].hd = q >> 4;   if (z->img_comp[which].hd > 3) return e("bad DC huff","Corrupt JPEG");
+      z->img_comp[which].ha = q & 15;   if (z->img_comp[which].ha > 3) return e("bad AC huff","Corrupt JPEG");
+      z->order[i] = which;
+   }
+   if (get8(&z->s) != 0) return e("bad SOS","Corrupt JPEG");
+   get8(&z->s); // should be 63, but might be 0
+   if (get8(&z->s) != 0) return e("bad SOS","Corrupt JPEG");
+
+   return 1;
+}
+
+static int process_frame_header(jpeg *z, int scan)
+{
+   stbi *s = &z->s;
+   int Lf,p,i,q, h_max=1,v_max=1,c;
+   Lf = get16(s);         if (Lf < 11) return e("bad SOF len","Corrupt JPEG"); // JPEG
+   p  = get8(s);          if (p != 8) return e("only 8-bit","JPEG format not supported: 8-bit only"); // JPEG baseline
+   s->img_y = get16(s);   if (s->img_y == 0) return e("no header height", "JPEG format not supported: delayed height"); // Legal, but we don't handle it--but neither does IJG
+   s->img_x = get16(s);   if (s->img_x == 0) return e("0 width","Corrupt JPEG"); // JPEG requires
+   c = get8(s);
+   if (c != 3 && c != 1) return e("bad component count","Corrupt JPEG");    // JFIF requires
+   s->img_n = c;
+   for (i=0; i < c; ++i) {
+      z->img_comp[i].data = NULL;
+      z->img_comp[i].linebuf = NULL;
+   }
+
+   if (Lf != 8+3*s->img_n) return e("bad SOF len","Corrupt JPEG");
+
+   for (i=0; i < s->img_n; ++i) {
+      z->img_comp[i].id = get8(s);
+      if (z->img_comp[i].id != i+1)   // JFIF requires
+         if (z->img_comp[i].id != i)  // some version of jpegtran outputs non-JFIF-compliant files!
+            return e("bad component ID","Corrupt JPEG");
+      q = get8(s);
+      z->img_comp[i].h = (q >> 4);  if (!z->img_comp[i].h || z->img_comp[i].h > 4) return e("bad H","Corrupt JPEG");
+      z->img_comp[i].v = q & 15;    if (!z->img_comp[i].v || z->img_comp[i].v > 4) return e("bad V","Corrupt JPEG");
+      z->img_comp[i].tq = get8(s);  if (z->img_comp[i].tq > 3) return e("bad TQ","Corrupt JPEG");
+   }
+
+   if (scan != SCAN_load) return 1;
+
+   if ((1 << 30) / s->img_x / s->img_n < s->img_y) return e("too large", "Image too large to decode");
+
+   for (i=0; i < s->img_n; ++i) {
+      if (z->img_comp[i].h > h_max) h_max = z->img_comp[i].h;
+      if (z->img_comp[i].v > v_max) v_max = z->img_comp[i].v;
+   }
+
+   // compute interleaved mcu info
+   z->img_h_max = h_max;
+   z->img_v_max = v_max;
+   z->img_mcu_w = h_max * 8;
+   z->img_mcu_h = v_max * 8;
+   z->img_mcu_x = (s->img_x + z->img_mcu_w-1) / z->img_mcu_w;
+   z->img_mcu_y = (s->img_y + z->img_mcu_h-1) / z->img_mcu_h;
+
+   for (i=0; i < s->img_n; ++i) {
+      // number of effective pixels (e.g. for non-interleaved MCU)
+      z->img_comp[i].x = (s->img_x * z->img_comp[i].h + h_max-1) / h_max;
+      z->img_comp[i].y = (s->img_y * z->img_comp[i].v + v_max-1) / v_max;
+      // to simplify generation, we'll allocate enough memory to decode
+      // the bogus oversized data from using interleaved MCUs and their
+      // big blocks (e.g. a 16x16 iMCU on an image of width 33); we won't
+      // discard the extra data until colorspace conversion
+      z->img_comp[i].w2 = z->img_mcu_x * z->img_comp[i].h * 8;
+      z->img_comp[i].h2 = z->img_mcu_y * z->img_comp[i].v * 8;
+      z->img_comp[i].raw_data = malloc(z->img_comp[i].w2 * z->img_comp[i].h2+15);
+      if (z->img_comp[i].raw_data == NULL) {
+         for(--i; i >= 0; --i) {
+            free(z->img_comp[i].raw_data);
+            z->img_comp[i].data = NULL;
+         }
+         return e("outofmem", "Out of memory");
+      }
+      // align blocks for installable-idct using mmx/sse
+      z->img_comp[i].data = (uint8*) (((size_t) z->img_comp[i].raw_data + 15) & ~15);
+      z->img_comp[i].linebuf = NULL;
+   }
+
+   return 1;
+}
+
+// use comparisons since in some cases we handle more than one case (e.g. SOF)
+#define DNL(x)         ((x) == 0xdc)
+#define SOI(x)         ((x) == 0xd8)
+#define EOI(x)         ((x) == 0xd9)
+#define SOF(x)         ((x) == 0xc0 || (x) == 0xc1)
+#define SOS(x)         ((x) == 0xda)
+
+static int decode_jpeg_header(jpeg *z, int scan)
+{
+   int m;
+   z->marker = MARKER_none; // initialize cached marker to empty
+   m = get_marker(z);
+   if (!SOI(m)) return e("no SOI","Corrupt JPEG");
+   if (scan == SCAN_type) return 1;
+   m = get_marker(z);
+   while (!SOF(m)) {
+      if (!process_marker(z,m)) return 0;
+      m = get_marker(z);
+      while (m == MARKER_none) {
+         // some files have extra padding after their blocks, so ok, we'll scan
+         if (at_eof(&z->s)) return e("no SOF", "Corrupt JPEG");
+         m = get_marker(z);
+      }
+   }
+   if (!process_frame_header(z, scan)) return 0;
+   return 1;
+}
+
+static int decode_jpeg_image(jpeg *j)
+{
+   int m;
+   j->restart_interval = 0;
+   if (!decode_jpeg_header(j, SCAN_load)) return 0;
+   m = get_marker(j);
+   while (!EOI(m)) {
+      if (SOS(m)) {
+         if (!process_scan_header(j)) return 0;
+         if (!parse_entropy_coded_data(j)) return 0;
+      } else {
+         if (!process_marker(j, m)) return 0;
+      }
+      m = get_marker(j);
+   }
+   return 1;
+}
+
+// static jfif-centered resampling (across block boundaries)
+
+typedef uint8 *(*resample_row_func)(uint8 *out, uint8 *in0, uint8 *in1,
+                                    int w, int hs);
+
+#define div4(x) ((uint8) ((x) >> 2))
+
+static uint8 *resample_row_1(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   return in_near;
+}
+
+static uint8* resample_row_v_2(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // need to generate two samples vertically for every one in input
+   int i;
+   for (i=0; i < w; ++i)
+      out[i] = div4(3*in_near[i] + in_far[i] + 2);
+   return out;
+}
+
+static uint8*  resample_row_h_2(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // need to generate two samples horizontally for every one in input
+   int i;
+   uint8 *input = in_near;
+   if (w == 1) {
+      // if only one sample, can't do any interpolation
+      out[0] = out[1] = input[0];
+      return out;
+   }
+
+   out[0] = input[0];
+   out[1] = div4(input[0]*3 + input[1] + 2);
+   for (i=1; i < w-1; ++i) {
+      int n = 3*input[i]+2;
+      out[i*2+0] = div4(n+input[i-1]);
+      out[i*2+1] = div4(n+input[i+1]);
+   }
+   out[i*2+0] = div4(input[w-2]*3 + input[w-1] + 2);
+   out[i*2+1] = input[w-1];
+   return out;
+}
+
+#define div16(x) ((uint8) ((x) >> 4))
+
+static uint8 *resample_row_hv_2(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // need to generate 2x2 samples for every one in input
+   int i,t0,t1;
+   if (w == 1) {
+      out[0] = out[1] = div4(3*in_near[0] + in_far[0] + 2);
+      return out;
+   }
+
+   t1 = 3*in_near[0] + in_far[0];
+   out[0] = div4(t1+2);
+   for (i=1; i < w; ++i) {
+      t0 = t1;
+      t1 = 3*in_near[i]+in_far[i];
+      out[i*2-1] = div16(3*t0 + t1 + 8);
+      out[i*2  ] = div16(3*t1 + t0 + 8);
+   }
+   out[w*2-1] = div4(t1+2);
+   return out;
+}
+
+static uint8 *resample_row_generic(uint8 *out, uint8 *in_near, uint8 *in_far, int w, int hs)
+{
+   // resample with nearest-neighbor
+   int i,j;
+   for (i=0; i < w; ++i)
+      for (j=0; j < hs; ++j)
+         out[i*hs+j] = in_near[i];
+   return out;
+}
+
+#define float2fixed(x)  ((int) ((x) * 65536 + 0.5))
+
+// 0.38 seconds on 3*anemones.jpg   (0.25 with processor = Pro)
+// VC6 without processor=Pro is generating multiple LEAs per multiply!
+static void YCbCr_to_RGB_row(uint8 *out, uint8 *y, uint8 *pcb, uint8 *pcr, int count, int step)
+{
+   int i;
+   for (i=0; i < count; ++i) {
+      int y_fixed = (y[i] << 16) + 32768; // rounding
+      int r,g,b;
+      int cr = pcr[i] - 128;
+      int cb = pcb[i] - 128;
+      r = y_fixed + cr*float2fixed(1.40200f);
+      g = y_fixed - cr*float2fixed(0.71414f) - cb*float2fixed(0.34414f);
+      b = y_fixed                            + cb*float2fixed(1.77200f);
+      r >>= 16;
+      g >>= 16;
+      b >>= 16;
+      if ((unsigned) r > 255) { if (r < 0) r = 0; else r = 255; }
+      if ((unsigned) g > 255) { if (g < 0) g = 0; else g = 255; }
+      if ((unsigned) b > 255) { if (b < 0) b = 0; else b = 255; }
+      out[0] = (uint8)r;
+      out[1] = (uint8)g;
+      out[2] = (uint8)b;
+      out[3] = 255;
+      out += step;
+   }
+}
+
+#if STBI_SIMD
+static stbi_YCbCr_to_RGB_run stbi_YCbCr_installed = YCbCr_to_RGB_row;
+
+void stbi_install_YCbCr_to_RGB(stbi_YCbCr_to_RGB_run func)
+{
+   stbi_YCbCr_installed = func;
+}
+#endif
+
+
+// clean up the temporary component buffers
+static void cleanup_jpeg(jpeg *j)
+{
+   int i;
+   for (i=0; i < j->s.img_n; ++i) {
+      if (j->img_comp[i].data) {
+         free(j->img_comp[i].raw_data);
+         j->img_comp[i].data = NULL;
+      }
+      if (j->img_comp[i].linebuf) {
+         free(j->img_comp[i].linebuf);
+         j->img_comp[i].linebuf = NULL;
+      }
+   }
+}
+
+typedef struct
+{
+   resample_row_func resample;
+   uint8 *line0,*line1;
+   int hs,vs;   // expansion factor in each axis
+   int w_lores; // horizontal pixels pre-expansion
+   int ystep;   // how far through vertical expansion we are
+   int ypos;    // which pre-expansion row we're on
+} stbi_resample;
+
+static uint8 *load_jpeg_image(jpeg *z, int *out_x, int *out_y, int *comp, int req_comp)
+{
+   int n, decode_n;
+   // validate req_comp
+   if (req_comp < 0 || req_comp > 4) return epuc("bad req_comp", "Internal error");
+   z->s.img_n = 0;
+
+   // load a jpeg image from whichever source
+   if (!decode_jpeg_image(z)) { cleanup_jpeg(z); return NULL; }
+
+   // determine actual number of components to generate
+   n = req_comp ? req_comp : z->s.img_n;
+
+   if (z->s.img_n == 3 && n < 3)
+      decode_n = 1;
+   else
+      decode_n = z->s.img_n;
+
+   // resample and color-convert
+   {
+      int k;
+      uint i,j;
+      uint8 *output;
+      uint8 *coutput[4];
+
+      stbi_resample res_comp[4];
+
+      for (k=0; k < decode_n; ++k) {
+         stbi_resample *r = &res_comp[k];
+
+         // allocate line buffer big enough for upsampling off the edges
+         // with upsample factor of 4
+         z->img_comp[k].linebuf = (uint8 *) malloc(z->s.img_x + 3);
+         if (!z->img_comp[k].linebuf) { cleanup_jpeg(z); return epuc("outofmem", "Out of memory"); }
+
+         r->hs      = z->img_h_max / z->img_comp[k].h;
+         r->vs      = z->img_v_max / z->img_comp[k].v;
+         r->ystep   = r->vs >> 1;
+         r->w_lores = (z->s.img_x + r->hs-1) / r->hs;
+         r->ypos    = 0;
+         r->line0   = r->line1 = z->img_comp[k].data;
+
+         if      (r->hs == 1 && r->vs == 1) r->resample = resample_row_1;
+         else if (r->hs == 1 && r->vs == 2) r->resample = resample_row_v_2;
+         else if (r->hs == 2 && r->vs == 1) r->resample = resample_row_h_2;
+         else if (r->hs == 2 && r->vs == 2) r->resample = resample_row_hv_2;
+         else                               r->resample = resample_row_generic;
+      }
+
+      // can't error after this, so this is safe
+      output = (uint8 *) malloc(n * z->s.img_x * z->s.img_y + 1);
+      if (!output) { cleanup_jpeg(z); return epuc("outofmem", "Out of memory"); }
+
+      // now go ahead and resample
+      for (j=0; j < z->s.img_y; ++j) {
+         uint8 *out = output + n * z->s.img_x * j;
+         for (k=0; k < decode_n; ++k) {
+            stbi_resample *r = &res_comp[k];
+            int y_bot = r->ystep >= (r->vs >> 1);
+            coutput[k] = r->resample(z->img_comp[k].linebuf,
+                                     y_bot ? r->line1 : r->line0,
+                                     y_bot ? r->line0 : r->line1,
+                                     r->w_lores, r->hs);
+            if (++r->ystep >= r->vs) {
+               r->ystep = 0;
+               r->line0 = r->line1;
+               if (++r->ypos < z->img_comp[k].y)
+                  r->line1 += z->img_comp[k].w2;
+            }
+         }
+         if (n >= 3) {
+            uint8 *y = coutput[0];
+            if (z->s.img_n == 3) {
+               #if STBI_SIMD
+               stbi_YCbCr_installed(out, y, coutput[1], coutput[2], z->s.img_x, n);
+               #else
+               YCbCr_to_RGB_row(out, y, coutput[1], coutput[2], z->s.img_x, n);
+               #endif
+            } else
+               for (i=0; i < z->s.img_x; ++i) {
+                  out[0] = out[1] = out[2] = y[i];
+                  out[3] = 255; // not used if n==3
+                  out += n;
+               }
+         } else {
+            uint8 *y = coutput[0];
+            if (n == 1)
+               for (i=0; i < z->s.img_x; ++i) out[i] = y[i];
+            else
+               for (i=0; i < z->s.img_x; ++i) *out++ = y[i], *out++ = 255;
+         }
+      }
+      cleanup_jpeg(z);
+      *out_x = z->s.img_x;
+      *out_y = z->s.img_y;
+      if (comp) *comp  = z->s.img_n; // report original components, not output
+      return output;
+   }
+}
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_jpeg_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   jpeg j;
+   start_file(&j.s, f);
+   return load_jpeg_image(&j, x,y,comp,req_comp);
+}
+
+unsigned char *stbi_jpeg_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_jpeg_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+#endif
+
+unsigned char *stbi_jpeg_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   jpeg j;
+   start_mem(&j.s, buffer,len);
+   return load_jpeg_image(&j, x,y,comp,req_comp);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_jpeg_test_file(FILE *f)
+{
+   int n,r;
+   jpeg j;
+   n = ftell(f);
+   start_file(&j.s, f);
+   r = decode_jpeg_header(&j, SCAN_type);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_jpeg_test_memory(stbi_uc const *buffer, int len)
+{
+   jpeg j;
+   start_mem(&j.s, buffer,len);
+   return decode_jpeg_header(&j, SCAN_type);
+}
+
+// @TODO:
+#ifndef STBI_NO_STDIO
+extern int      stbi_jpeg_info            (char const *filename,           int *x, int *y, int *comp);
+extern int      stbi_jpeg_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_jpeg_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+// public domain zlib decode    v0.2  Sean Barrett 2006-11-18
+//    simple implementation
+//      - all input must be provided in an upfront buffer
+//      - all output is written to a single output buffer (can malloc/realloc)
+//    performance
+//      - fast huffman
+
+// fast-way is faster to check than jpeg huffman, but slow way is slower
+#define ZFAST_BITS  9 // accelerate all cases in default tables
+#define ZFAST_MASK  ((1 << ZFAST_BITS) - 1)
+
+// zlib-style huffman encoding
+// (jpeg packs from left, zlib from right, so can't share code)
+typedef struct
+{
+   uint16 fast[1 << ZFAST_BITS];
+   uint16 firstcode[16];
+   int maxcode[17];
+   uint16 firstsymbol[16];
+   uint8  size[288];
+   uint16 value[288];
+} zhuffman;
+
+__forceinline static int bitreverse16(int n)
+{
+  n = ((n & 0xAAAA) >>  1) | ((n & 0x5555) << 1);
+  n = ((n & 0xCCCC) >>  2) | ((n & 0x3333) << 2);
+  n = ((n & 0xF0F0) >>  4) | ((n & 0x0F0F) << 4);
+  n = ((n & 0xFF00) >>  8) | ((n & 0x00FF) << 8);
+  return n;
+}
+
+__forceinline static int bit_reverse(int v, int bits)
+{
+   assert(bits <= 16);
+   // to bit reverse n bits, reverse 16 and shift
+   // e.g. 11 bits, bit reverse and shift away 5
+   return bitreverse16(v) >> (16-bits);
+}
+
+static int zbuild_huffman(zhuffman *z, uint8 *sizelist, int num)
+{
+   int i,k=0;
+   int code, next_code[16], sizes[17];
+
+   // DEFLATE spec for generating codes
+   memset(sizes, 0, sizeof(sizes));
+   memset(z->fast, 255, sizeof(z->fast));
+   for (i=0; i < num; ++i)
+      ++sizes[sizelist[i]];
+   sizes[0] = 0;
+   for (i=1; i < 16; ++i)
+      assert(sizes[i] <= (1 << i));
+   code = 0;
+   for (i=1; i < 16; ++i) {
+      next_code[i] = code;
+      z->firstcode[i] = (uint16) code;
+      z->firstsymbol[i] = (uint16) k;
+      code = (code + sizes[i]);
+      if (sizes[i])
+         if (code-1 >= (1 << i)) return e("bad codelengths","Corrupt PNG");
+      z->maxcode[i] = code << (16-i); // preshift for inner loop
+      code <<= 1;
+      k += sizes[i];
+   }
+   z->maxcode[16] = 0x10000; // sentinel
+   for (i=0; i < num; ++i) {
+      int s = sizelist[i];
+      if (s) {
+         int c = next_code[s] - z->firstcode[s] + z->firstsymbol[s];
+         z->size[c] = (uint8)s;
+         z->value[c] = (uint16)i;
+         if (s <= ZFAST_BITS) {
+            int k = bit_reverse(next_code[s],s);
+            while (k < (1 << ZFAST_BITS)) {
+               z->fast[k] = (uint16) c;
+               k += (1 << s);
+            }
+         }
+         ++next_code[s];
+      }
+   }
+   return 1;
+}
+
+// zlib-from-memory implementation for PNG reading
+//    because PNG allows splitting the zlib stream arbitrarily,
+//    and it's annoying structurally to have PNG call ZLIB call PNG,
+//    we require PNG read all the IDATs and combine them into a single
+//    memory buffer
+
+typedef struct
+{
+   uint8 *zbuffer, *zbuffer_end;
+   int num_bits;
+   uint32 code_buffer;
+
+   char *zout;
+   char *zout_start;
+   char *zout_end;
+   int   z_expandable;
+
+   zhuffman z_length, z_distance;
+} zbuf;
+
+__forceinline static int zget8(zbuf *z)
+{
+   if (z->zbuffer >= z->zbuffer_end) return 0;
+   return *z->zbuffer++;
+}
+
+static void fill_bits(zbuf *z)
+{
+   do {
+      assert(z->code_buffer < (1U << z->num_bits));
+      z->code_buffer |= (uint32) zget8(z) << z->num_bits; // cast avoids shifting into the sign bit of an int
+      z->num_bits += 8;
+   } while (z->num_bits <= 24);
+}
+
+__forceinline static unsigned int zreceive(zbuf *z, int n)
+{
+   unsigned int k;
+   if (z->num_bits < n) fill_bits(z);
+   k = z->code_buffer & ((1 << n) - 1);
+   z->code_buffer >>= n;
+   z->num_bits -= n;
+   return k;
+}
+
+__forceinline static int zhuffman_decode(zbuf *a, zhuffman *z)
+{
+   int b,s,k;
+   if (a->num_bits < 16) fill_bits(a);
+   b = z->fast[a->code_buffer & ZFAST_MASK];
+   if (b < 0xffff) {
+      s = z->size[b];
+      a->code_buffer >>= s;
+      a->num_bits -= s;
+      return z->value[b];
+   }
+
+   // not resolved by fast table, so compute it the slow way
+   // use jpeg approach, which requires MSbits at top
+   k = bit_reverse(a->code_buffer, 16);
+   for (s=ZFAST_BITS+1; ; ++s)
+      if (k < z->maxcode[s])
+         break;
+   if (s == 16) return -1; // invalid code!
+   // code size is s, so:
+   b = (k >> (16-s)) - z->firstcode[s] + z->firstsymbol[s];
+   assert(z->size[b] == s);
+   a->code_buffer >>= s;
+   a->num_bits -= s;
+   return z->value[b];
+}
+
+static int expand(zbuf *z, int n)  // need to make room for n bytes
+{
+   char *q;
+   int cur, limit;
+   if (!z->z_expandable) return e("output buffer limit","Corrupt PNG");
+   cur   = (int) (z->zout     - z->zout_start);
+   limit = (int) (z->zout_end - z->zout_start);
+   while (cur + n > limit)
+      limit *= 2;
+   q = (char *) realloc(z->zout_start, limit);
+   if (q == NULL) return e("outofmem", "Out of memory");
+   z->zout_start = q;
+   z->zout       = q + cur;
+   z->zout_end   = q + limit;
+   return 1;
+}
+
+static int length_base[31] = {
+   3,4,5,6,7,8,9,10,11,13,
+   15,17,19,23,27,31,35,43,51,59,
+   67,83,99,115,131,163,195,227,258,0,0 };
+
+static int length_extra[31]=
+{ 0,0,0,0,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,0,0,0 };
+
+static int dist_base[32] = { 1,2,3,4,5,7,9,13,17,25,33,49,65,97,129,193,
+257,385,513,769,1025,1537,2049,3073,4097,6145,8193,12289,16385,24577,0,0};
+
+static int dist_extra[32] =
+{ 0,0,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13};
+
+static int parse_huffman_block(zbuf *a)
+{
+   for(;;) {
+      int z = zhuffman_decode(a, &a->z_length);
+      if (z < 256) {
+         if (z < 0) return e("bad huffman code","Corrupt PNG"); // error in huffman codes
+         if (a->zout >= a->zout_end) if (!expand(a, 1)) return 0;
+         *a->zout++ = (char) z;
+      } else {
+         uint8 *p;
+         int len,dist;
+         if (z == 256) return 1;
+         z -= 257;
+         len = length_base[z];
+         if (length_extra[z]) len += zreceive(a, length_extra[z]);
+         z = zhuffman_decode(a, &a->z_distance);
+         if (z < 0) return e("bad huffman code","Corrupt PNG");
+         dist = dist_base[z];
+         if (dist_extra[z]) dist += zreceive(a, dist_extra[z]);
+         if (a->zout - a->zout_start < dist) return e("bad dist","Corrupt PNG");
+         if (a->zout + len > a->zout_end) if (!expand(a, len)) return 0;
+         p = (uint8 *) (a->zout - dist);
+         while (len--)
+            *a->zout++ = *p++;
+      }
+   }
+}
+
+static int compute_huffman_codes(zbuf *a)
+{
+   static uint8 length_dezigzag[19] = { 16,17,18,0,8,7,9,6,10,5,11,4,12,3,13,2,14,1,15 };
+   static zhuffman z_codelength; // static just to save stack space
+   uint8 lencodes[286+32+137];//padding for maximum single op
+   uint8 codelength_sizes[19];
+   int i,n;
+
+   int hlit  = zreceive(a,5) + 257;
+   int hdist = zreceive(a,5) + 1;
+   int hclen = zreceive(a,4) + 4;
+
+   memset(codelength_sizes, 0, sizeof(codelength_sizes));
+   for (i=0; i < hclen; ++i) {
+      int s = zreceive(a,3);
+      codelength_sizes[length_dezigzag[i]] = (uint8) s;
+   }
+   if (!zbuild_huffman(&z_codelength, codelength_sizes, 19)) return 0;
+
+   n = 0;
+   while (n < hlit + hdist) {
+      int c = zhuffman_decode(a, &z_codelength);
+      if (c < 0 || c >= 19) return e("bad codelengths","Corrupt PNG"); // reject bad codes even when assert is compiled out
+      if (c < 16)
+         lencodes[n++] = (uint8) c;
+      else if (c == 16) {
+         if (n == 0) return e("bad codelengths","Corrupt PNG"); // no previous length to repeat
+         c = zreceive(a,2)+3;
+         memset(lencodes+n, lencodes[n-1], c);
+         n += c;
+      } else if (c == 17) {
+         c = zreceive(a,3)+3;
+         memset(lencodes+n, 0, c);
+         n += c;
+      } else {
+         assert(c == 18);
+         c = zreceive(a,7)+11;
+         memset(lencodes+n, 0, c);
+         n += c;
+      }
+   }
+   if (n != hlit+hdist) return e("bad codelengths","Corrupt PNG");
+   if (!zbuild_huffman(&a->z_length, lencodes, hlit)) return 0;
+   if (!zbuild_huffman(&a->z_distance, lencodes+hlit, hdist)) return 0;
+   return 1;
+}
+
+static int parse_uncompressed_block(zbuf *a)
+{
+   uint8 header[4];
+   int len,nlen,k;
+   if (a->num_bits & 7)
+      zreceive(a, a->num_bits & 7); // discard
+   // drain the bit-packed data into header
+   k = 0;
+   while (a->num_bits > 0) {
+      header[k++] = (uint8) (a->code_buffer & 255); // wtf this warns?
+      a->code_buffer >>= 8;
+      a->num_bits -= 8;
+   }
+   assert(a->num_bits == 0);
+   // now fill header the normal way
+   while (k < 4)
+      header[k++] = (uint8) zget8(a);
+   len  = header[1] * 256 + header[0];
+   nlen = header[3] * 256 + header[2];
+   if (nlen != (len ^ 0xffff)) return e("zlib corrupt","Corrupt PNG");
+   if (a->zbuffer + len > a->zbuffer_end) return e("read past buffer","Corrupt PNG");
+   if (a->zout + len > a->zout_end)
+      if (!expand(a, len)) return 0;
+   memcpy(a->zout, a->zbuffer, len);
+   a->zbuffer += len;
+   a->zout += len;
+   return 1;
+}
+
+static int parse_zlib_header(zbuf *a)
+{
+   int cmf   = zget8(a);
+   int cm    = cmf & 15;
+   /* int cinfo = cmf >> 4; */
+   int flg   = zget8(a);
+   if ((cmf*256+flg) % 31 != 0) return e("bad zlib header","Corrupt PNG"); // zlib spec
+   if (flg & 32) return e("no preset dict","Corrupt PNG"); // preset dictionary not allowed in png
+   if (cm != 8) return e("bad compression","Corrupt PNG"); // DEFLATE required for png
+   // window = 1 << (8 + cinfo)... but who cares, we fully buffer output
+   return 1;
+}
+
+// @TODO: should statically initialize these for optimal thread safety
+static uint8 default_length[288], default_distance[32];
+static void init_defaults(void)
+{
+   int i;   // use <= to match clearly with spec
+   for (i=0; i <= 143; ++i)     default_length[i]   = 8;
+   for (   ; i <= 255; ++i)     default_length[i]   = 9;
+   for (   ; i <= 279; ++i)     default_length[i]   = 7;
+   for (   ; i <= 287; ++i)     default_length[i]   = 8;
+
+   for (i=0; i <=  31; ++i)     default_distance[i] = 5;
+}
+
+static int parse_zlib(zbuf *a, int parse_header)
+{
+   int final, type;
+   if (parse_header)
+      if (!parse_zlib_header(a)) return 0;
+   a->num_bits = 0;
+   a->code_buffer = 0;
+   do {
+      final = zreceive(a,1);
+      type = zreceive(a,2);
+      if (type == 0) {
+         if (!parse_uncompressed_block(a)) return 0;
+      } else if (type == 3) {
+         return 0;
+      } else {
+         if (type == 1) {
+            // use fixed code lengths
+            if (!default_distance[31]) init_defaults();
+            if (!zbuild_huffman(&a->z_length  , default_length  , 288)) return 0;
+            if (!zbuild_huffman(&a->z_distance, default_distance,  32)) return 0;
+         } else {
+            if (!compute_huffman_codes(a)) return 0;
+         }
+         if (!parse_huffman_block(a)) return 0;
+      }
+   } while (!final);
+   return 1;
+}
+
+static int do_zlib(zbuf *a, char *obuf, int olen, int exp, int parse_header)
+{
+   a->zout_start = obuf;
+   a->zout       = obuf;
+   a->zout_end   = obuf + olen;
+   a->z_expandable = exp;
+
+   return parse_zlib(a, parse_header);
+}
+
+char *stbi_zlib_decode_malloc_guesssize(const char *buffer, int len, int initial_size, int *outlen)
+{
+   zbuf a;
+   char *p = (char *) malloc(initial_size);
+   if (p == NULL) return NULL;
+   a.zbuffer = (uint8 *) buffer;
+   a.zbuffer_end = (uint8 *) buffer + len;
+   if (do_zlib(&a, p, initial_size, 1, 1)) {
+      if (outlen) *outlen = (int) (a.zout - a.zout_start);
+      return a.zout_start;
+   } else {
+      free(a.zout_start);
+      return NULL;
+   }
+}
+
+char *stbi_zlib_decode_malloc(char const *buffer, int len, int *outlen)
+{
+   return stbi_zlib_decode_malloc_guesssize(buffer, len, 16384, outlen);
+}
+
+int stbi_zlib_decode_buffer(char *obuffer, int olen, char const *ibuffer, int ilen)
+{
+   zbuf a;
+   a.zbuffer = (uint8 *) ibuffer;
+   a.zbuffer_end = (uint8 *) ibuffer + ilen;
+   if (do_zlib(&a, obuffer, olen, 0, 1))
+      return (int) (a.zout - a.zout_start);
+   else
+      return -1;
+}
+
+char *stbi_zlib_decode_noheader_malloc(char const *buffer, int len, int *outlen)
+{
+   zbuf a;
+   char *p = (char *) malloc(16384);
+   if (p == NULL) return NULL;
+   a.zbuffer = (uint8 *) buffer;
+   a.zbuffer_end = (uint8 *) buffer+len;
+   if (do_zlib(&a, p, 16384, 1, 0)) {
+      if (outlen) *outlen = (int) (a.zout - a.zout_start);
+      return a.zout_start;
+   } else {
+      free(a.zout_start);
+      return NULL;
+   }
+}
+
+int stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, const char *ibuffer, int ilen)
+{
+   zbuf a;
+   a.zbuffer = (uint8 *) ibuffer;
+   a.zbuffer_end = (uint8 *) ibuffer + ilen;
+   if (do_zlib(&a, obuffer, olen, 0, 0))
+      return (int) (a.zout - a.zout_start);
+   else
+      return -1;
+}
+
+// public domain "baseline" PNG decoder   v0.10  Sean Barrett 2006-11-18
+//    simple implementation
+//      - only 8-bit samples
+//      - no CRC checking
+//      - allocates lots of intermediate memory
+//        - avoids problem of streaming data between subsystems
+//        - avoids explicit window management
+//    performance
+//      - uses stb_zlib, a PD zlib implementation with fast huffman decoding
+
+
+typedef struct
+{
+   uint32 length;
+   uint32 type;
+} chunk;
+
+#define PNG_TYPE(a,b,c,d)  (((a) << 24) + ((b) << 16) + ((c) << 8) + (d))
+
+static chunk get_chunk_header(stbi *s)
+{
+   chunk c;
+   c.length = get32(s);
+   c.type   = get32(s);
+   return c;
+}
+
+static int check_png_header(stbi *s)
+{
+   static uint8 png_sig[8] = { 137,80,78,71,13,10,26,10 };
+   int i;
+   for (i=0; i < 8; ++i)
+      if (get8(s) != png_sig[i]) return e("bad png sig","Not a PNG");
+   return 1;
+}
+
+typedef struct
+{
+   stbi s;
+   uint8 *idata, *expanded, *out;
+} png;
+
+
+enum {
+   F_none=0, F_sub=1, F_up=2, F_avg=3, F_paeth=4,
+   F_avg_first, F_paeth_first,
+};
+
+static uint8 first_row_filter[5] =
+{
+   F_none, F_sub, F_none, F_avg_first, F_paeth_first
+};
+
+static int paeth(int a, int b, int c)
+{
+   int p = a + b - c;
+   int pa = abs(p-a);
+   int pb = abs(p-b);
+   int pc = abs(p-c);
+   if (pa <= pb && pa <= pc) return a;
+   if (pb <= pc) return b;
+   return c;
+}
+
+// create the png image from the decompressed (inflated) data
+static int create_png_image(png *a, uint8 *raw, uint32 raw_len, int out_n)
+{
+   stbi *s = &a->s;
+   uint32 i,j,stride = s->img_x*out_n;
+   int k;
+   int img_n = s->img_n; // copy it into a local for later
+   assert(out_n == s->img_n || out_n == s->img_n+1);
+   a->out = (uint8 *) malloc(s->img_x * s->img_y * out_n);
+   if (!a->out) return e("outofmem", "Out of memory");
+   if (raw_len != (img_n * s->img_x + 1) * s->img_y) return e("not enough pixels","Corrupt PNG");
+   for (j=0; j < s->img_y; ++j) {
+      uint8 *cur = a->out + stride*j;
+      uint8 *prior = cur - stride;
+      int filter = *raw++;
+      if (filter > 4) return e("invalid filter","Corrupt PNG");
+      // if first row, use special filter that doesn't sample previous row
+      if (j == 0) filter = first_row_filter[filter];
+      // handle first pixel explicitly
+      for (k=0; k < img_n; ++k) {
+         switch(filter) {
+            case F_none       : cur[k] = raw[k]; break;
+            case F_sub        : cur[k] = raw[k]; break;
+            case F_up         : cur[k] = raw[k] + prior[k]; break;
+            case F_avg        : cur[k] = raw[k] + (prior[k]>>1); break;
+            case F_paeth      : cur[k] = (uint8) (raw[k] + paeth(0,prior[k],0)); break;
+            case F_avg_first  : cur[k] = raw[k]; break;
+            case F_paeth_first: cur[k] = raw[k]; break;
+         }
+      }
+      if (img_n != out_n) cur[img_n] = 255;
+      raw += img_n;
+      cur += out_n;
+      prior += out_n;
+      // this is a little gross, so that we don't switch per-pixel or per-component
+      if (img_n == out_n) {
+         #define CASE(f) \
+             case f:     \
+                for (i=s->img_x-1; i >= 1; --i, raw+=img_n,cur+=img_n,prior+=img_n) \
+                   for (k=0; k < img_n; ++k)
+         switch(filter) {
+            CASE(F_none)  cur[k] = raw[k]; break;
+            CASE(F_sub)   cur[k] = raw[k] + cur[k-img_n]; break;
+            CASE(F_up)    cur[k] = raw[k] + prior[k]; break;
+            CASE(F_avg)   cur[k] = raw[k] + ((prior[k] + cur[k-img_n])>>1); break;
+            CASE(F_paeth)  cur[k] = (uint8) (raw[k] + paeth(cur[k-img_n],prior[k],prior[k-img_n])); break;
+            CASE(F_avg_first)    cur[k] = raw[k] + (cur[k-img_n] >> 1); break;
+            CASE(F_paeth_first)  cur[k] = (uint8) (raw[k] + paeth(cur[k-img_n],0,0)); break;
+         }
+         #undef CASE
+      } else {
+         assert(img_n+1 == out_n);
+         #define CASE(f) \
+             case f:     \
+                for (i=s->img_x-1; i >= 1; --i, cur[img_n]=255,raw+=img_n,cur+=out_n,prior+=out_n) \
+                   for (k=0; k < img_n; ++k)
+         switch(filter) {
+            CASE(F_none)  cur[k] = raw[k]; break;
+            CASE(F_sub)   cur[k] = raw[k] + cur[k-out_n]; break;
+            CASE(F_up)    cur[k] = raw[k] + prior[k]; break;
+            CASE(F_avg)   cur[k] = raw[k] + ((prior[k] + cur[k-out_n])>>1); break;
+            CASE(F_paeth)  cur[k] = (uint8) (raw[k] + paeth(cur[k-out_n],prior[k],prior[k-out_n])); break;
+            CASE(F_avg_first)    cur[k] = raw[k] + (cur[k-out_n] >> 1); break;
+            CASE(F_paeth_first)  cur[k] = (uint8) (raw[k] + paeth(cur[k-out_n],0,0)); break;
+         }
+         #undef CASE
+      }
+   }
+   return 1;
+}
+
+static int compute_transparency(png *z, uint8 tc[3], int out_n)
+{
+   stbi *s = &z->s;
+   uint32 i, pixel_count = s->img_x * s->img_y;
+   uint8 *p = z->out;
+
+   // compute color-based transparency, assuming we've
+   // already got 255 as the alpha value in the output
+   assert(out_n == 2 || out_n == 4);
+
+   if (out_n == 2) {
+      for (i=0; i < pixel_count; ++i) {
+         p[1] = (p[0] == tc[0] ? 0 : 255);
+         p += 2;
+      }
+   } else {
+      for (i=0; i < pixel_count; ++i) {
+         if (p[0] == tc[0] && p[1] == tc[1] && p[2] == tc[2])
+            p[3] = 0;
+         p += 4;
+      }
+   }
+   return 1;
+}
+
+static int expand_palette(png *a, uint8 *palette, int len, int pal_img_n)
+{
+   uint32 i, pixel_count = a->s.img_x * a->s.img_y;
+   uint8 *p, *temp_out, *orig = a->out;
+
+   p = (uint8 *) malloc(pixel_count * pal_img_n);
+   if (p == NULL) return e("outofmem", "Out of memory");
+
+   // between here and free(a->out) below, exiting would leak
+   temp_out = p;
+
+   if (pal_img_n == 3) {
+      for (i=0; i < pixel_count; ++i) {
+         int n = orig[i]*4;
+         p[0] = palette[n  ];
+         p[1] = palette[n+1];
+         p[2] = palette[n+2];
+         p += 3;
+      }
+   } else {
+      for (i=0; i < pixel_count; ++i) {
+         int n = orig[i]*4;
+         p[0] = palette[n  ];
+         p[1] = palette[n+1];
+         p[2] = palette[n+2];
+         p[3] = palette[n+3];
+         p += 4;
+      }
+   }
+   free(a->out);
+   a->out = temp_out;
+   return 1;
+}
+
+static int parse_png_file(png *z, int scan, int req_comp)
+{
+   uint8 palette[1024], pal_img_n=0;
+   uint8 has_trans=0, tc[3];
+   uint32 ioff=0, idata_limit=0, i, pal_len=0;
+   int first=1,k;
+   stbi *s = &z->s;
+
+   if (!check_png_header(s)) return 0;
+
+   if (scan == SCAN_type) return 1;
+
+   for(;;first=0) {
+      chunk c = get_chunk_header(s);
+      if (first && c.type != PNG_TYPE('I','H','D','R'))
+         return e("first not IHDR","Corrupt PNG");
+      switch (c.type) {
+         case PNG_TYPE('I','H','D','R'): {
+            int depth,color,interlace,comp,filter;
+            if (!first) return e("multiple IHDR","Corrupt PNG");
+            if (c.length != 13) return e("bad IHDR len","Corrupt PNG");
+            s->img_x = get32(s); if (s->img_x > (1 << 24)) return e("too large","Very large image (corrupt?)");
+            s->img_y = get32(s); if (s->img_y > (1 << 24)) return e("too large","Very large image (corrupt?)");
+            depth = get8(s);  if (depth != 8)        return e("8bit only","PNG not supported: 8-bit only");
+            color = get8(s);  if (color > 6)         return e("bad ctype","Corrupt PNG");
+            if (color == 3) pal_img_n = 3; else if (color & 1) return e("bad ctype","Corrupt PNG");
+            comp  = get8(s);  if (comp) return e("bad comp method","Corrupt PNG");
+            filter= get8(s);  if (filter) return e("bad filter method","Corrupt PNG");
+            interlace = get8(s); if (interlace) return e("interlaced","PNG not supported: interlaced mode");
+            if (!s->img_x || !s->img_y) return e("0-pixel image","Corrupt PNG");
+            if (!pal_img_n) {
+               s->img_n = (color & 2 ? 3 : 1) + (color & 4 ? 1 : 0);
+               if ((1 << 30) / s->img_x / s->img_n < s->img_y) return e("too large", "Image too large to decode");
+               if (scan == SCAN_header) return 1;
+            } else {
+               // if paletted, then pal_img_n is our final components, and
+               // img_n is # components to decompress/filter.
+               s->img_n = 1;
+               if ((1 << 30) / s->img_x / 4 < s->img_y) return e("too large","Corrupt PNG");
+               // if SCAN_header, have to scan to see if we have a tRNS
+            }
+            break;
+         }
+
+         case PNG_TYPE('P','L','T','E'):  {
+            if (c.length > 256*3) return e("invalid PLTE","Corrupt PNG");
+            pal_len = c.length / 3;
+            if (pal_len * 3 != c.length) return e("invalid PLTE","Corrupt PNG");
+            for (i=0; i < pal_len; ++i) {
+               palette[i*4+0] = get8u(s);
+               palette[i*4+1] = get8u(s);
+               palette[i*4+2] = get8u(s);
+               palette[i*4+3] = 255;
+            }
+            break;
+         }
+
+         case PNG_TYPE('t','R','N','S'): {
+            if (z->idata) return e("tRNS after IDAT","Corrupt PNG");
+            if (pal_img_n) {
+               if (scan == SCAN_header) { s->img_n = 4; return 1; }
+               if (pal_len == 0) return e("tRNS before PLTE","Corrupt PNG");
+               if (c.length > pal_len) return e("bad tRNS len","Corrupt PNG");
+               pal_img_n = 4;
+               for (i=0; i < c.length; ++i)
+                  palette[i*4+3] = get8u(s);
+            } else {
+               if (!(s->img_n & 1)) return e("tRNS with alpha","Corrupt PNG");
+               if (c.length != (uint32) s->img_n*2) return e("bad tRNS len","Corrupt PNG");
+               has_trans = 1;
+               for (k=0; k < s->img_n; ++k)
+                  tc[k] = (uint8) get16(s); // non 8-bit images will be larger
+            }
+            break;
+         }
+
+         case PNG_TYPE('I','D','A','T'): {
+            if (pal_img_n && !pal_len) return e("no PLTE","Corrupt PNG");
+            if (scan == SCAN_header) { s->img_n = pal_img_n; return 1; }
+            if (ioff + c.length > idata_limit) {
+               uint8 *p;
+               if (idata_limit == 0) idata_limit = c.length > 4096 ? c.length : 4096;
+               while (ioff + c.length > idata_limit)
+                  idata_limit *= 2;
+               p = (uint8 *) realloc(z->idata, idata_limit); if (p == NULL) return e("outofmem", "Out of memory");
+               z->idata = p;
+            }
+            #ifndef STBI_NO_STDIO
+            if (s->img_file)
+            {
+               if (fread(z->idata+ioff,1,c.length,s->img_file) != c.length) return e("outofdata","Corrupt PNG");
+            }
+            else
+            #endif
+            {
+               memcpy(z->idata+ioff, s->img_buffer, c.length);
+               s->img_buffer += c.length;
+            }
+            ioff += c.length;
+            break;
+         }
+
+         case PNG_TYPE('I','E','N','D'): {
+            uint32 raw_len;
+            if (scan != SCAN_load) return 1;
+            if (z->idata == NULL) return e("no IDAT","Corrupt PNG");
+            z->expanded = (uint8 *) stbi_zlib_decode_malloc((char *) z->idata, ioff, (int *) &raw_len);
+            if (z->expanded == NULL) return 0; // zlib should set error
+            free(z->idata); z->idata = NULL;
+            if ((req_comp == s->img_n+1 && req_comp != 3 && !pal_img_n) || has_trans)
+               s->img_out_n = s->img_n+1;
+            else
+               s->img_out_n = s->img_n;
+            if (!create_png_image(z, z->expanded, raw_len, s->img_out_n)) return 0;
+            if (has_trans)
+               if (!compute_transparency(z, tc, s->img_out_n)) return 0;
+            if (pal_img_n) {
+               // pal_img_n == 3 or 4
+               s->img_n = pal_img_n; // record the actual colors we had
+               s->img_out_n = pal_img_n;
+               if (req_comp >= 3) s->img_out_n = req_comp;
+               if (!expand_palette(z, palette, pal_len, s->img_out_n))
+                  return 0;
+            }
+            free(z->expanded); z->expanded = NULL;
+            return 1;
+         }
+
+         default:
+            // if critical, fail
+            if ((c.type & (1 << 29)) == 0) {
+               #ifndef STBI_NO_FAILURE_STRINGS
+               // not threadsafe
+               static char invalid_chunk[] = "XXXX chunk not known";
+               invalid_chunk[0] = (uint8) (c.type >> 24);
+               invalid_chunk[1] = (uint8) (c.type >> 16);
+               invalid_chunk[2] = (uint8) (c.type >>  8);
+               invalid_chunk[3] = (uint8) (c.type >>  0);
+               #endif
+               return e(invalid_chunk, "PNG not supported: unknown chunk type");
+            }
+            skip(s, c.length);
+            break;
+      }
+      // end of chunk, read and skip CRC
+      get32(s);
+   }
+}
+
+static unsigned char *do_png(png *p, int *x, int *y, int *n, int req_comp)
+{
+   unsigned char *result=NULL;
+   p->expanded = NULL;
+   p->idata = NULL;
+   p->out = NULL;
+   if (req_comp < 0 || req_comp > 4) return epuc("bad req_comp", "Internal error");
+   if (parse_png_file(p, SCAN_load, req_comp)) {
+      result = p->out;
+      p->out = NULL;
+      if (req_comp && req_comp != p->s.img_out_n) {
+         result = convert_format(result, p->s.img_out_n, req_comp, p->s.img_x, p->s.img_y);
+         p->s.img_out_n = req_comp;
+         if (result == NULL) return result;
+      }
+      *x = p->s.img_x;
+      *y = p->s.img_y;
+      if (n) *n = p->s.img_n;
+   }
+   free(p->out);      p->out      = NULL;
+   free(p->expanded); p->expanded = NULL;
+   free(p->idata);    p->idata    = NULL;
+
+   return result;
+}
+
+#ifndef STBI_NO_STDIO
+unsigned char *stbi_png_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   png p;
+   start_file(&p.s, f);
+   return do_png(&p, x,y,comp,req_comp);
+}
+
+unsigned char *stbi_png_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   unsigned char *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_png_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+#endif
+
+unsigned char *stbi_png_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   png p;
+   start_mem(&p.s, buffer,len);
+   return do_png(&p, x,y,comp,req_comp);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_png_test_file(FILE *f)
+{
+   png p;
+   int n,r;
+   n = ftell(f);
+   start_file(&p.s, f);
+   r = parse_png_file(&p, SCAN_type,STBI_default);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_png_test_memory(stbi_uc const *buffer, int len)
+{
+   png p;
+   start_mem(&p.s, buffer, len);
+   return parse_png_file(&p, SCAN_type,STBI_default);
+}
+
+// TODO: load header from png
+#ifndef STBI_NO_STDIO
+extern int      stbi_png_info             (char const *filename,           int *x, int *y, int *comp);
+extern int      stbi_png_info_from_file   (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern int      stbi_png_info_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+// Microsoft/Windows BMP image
+
+static int bmp_test(stbi *s)
+{
+   int sz;
+   if (get8(s) != 'B') return 0;
+   if (get8(s) != 'M') return 0;
+   get32le(s); // discard filesize
+   get16le(s); // discard reserved
+   get16le(s); // discard reserved
+   get32le(s); // discard data offset
+   sz = get32le(s);
+   if (sz == 12 || sz == 40 || sz == 56 || sz == 108) return 1;
+   return 0;
+}
+
+#ifndef STBI_NO_STDIO
+int      stbi_bmp_test_file        (FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s,f);
+   r = bmp_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int      stbi_bmp_test_memory      (stbi_uc const *buffer, int len)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return bmp_test(&s);
+}
+
+// returns 0..31 for the highest set bit
+static int high_bit(unsigned int z)
+{
+   int n=0;
+   if (z == 0) return -1;
+   if (z >= 0x10000) n += 16, z >>= 16;
+   if (z >= 0x00100) n +=  8, z >>=  8;
+   if (z >= 0x00010) n +=  4, z >>=  4;
+   if (z >= 0x00004) n +=  2, z >>=  2;
+   if (z >= 0x00002) n +=  1, z >>=  1;
+   return n;
+}
+
+static int bitcount(unsigned int a)
+{
+   a = (a & 0x55555555) + ((a >>  1) & 0x55555555); // max 2
+   a = (a & 0x33333333) + ((a >>  2) & 0x33333333); // max 4
+   a = (a + (a >> 4)) & 0x0f0f0f0f; // max 8 per 4, now 8 bits
+   a = (a + (a >> 8)); // max 16 per 8 bits
+   a = (a + (a >> 16)); // max 32 per 8 bits
+   return a & 0xff;
+}
+
+static int shiftsigned(int v, int shift, int bits)
+{
+   int result;
+   int z=0;
+
+   if (shift < 0) v <<= -shift;
+   else v >>= shift;
+   result = v;
+
+   z = bits;
+   while (z < 8) {
+      result += v >> z;
+      z += bits;
+   }
+   return result;
+}
+
+static stbi_uc *bmp_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+   uint8 *out;
+   unsigned int mr=0,mg=0,mb=0,ma=0;
+   stbi_uc pal[256][4];
+   int psize=0,i,j,compress=0,width;
+   int bpp, flip_vertically, pad, target, offset, hsz;
+   if (get8(s) != 'B' || get8(s) != 'M') return epuc("not BMP", "Corrupt BMP");
+   get32le(s); // discard filesize
+   get16le(s); // discard reserved
+   get16le(s); // discard reserved
+   offset = get32le(s);
+   hsz = get32le(s);
+   if (hsz != 12 && hsz != 40 && hsz != 56 && hsz != 108) return epuc("unknown BMP", "BMP type not supported: unknown");
+   failure_reason = "bad BMP";
+   if (hsz == 12) {
+      s->img_x = get16le(s);
+      s->img_y = get16le(s);
+   } else {
+      s->img_x = get32le(s);
+      s->img_y = get32le(s);
+   }
+   if (get16le(s) != 1) return 0;
+   bpp = get16le(s);
+   if (bpp == 1) return epuc("monochrome", "BMP type not supported: 1-bit");
+   flip_vertically = ((int) s->img_y) > 0;
+   s->img_y = abs((int) s->img_y);
+   if (hsz == 12) {
+      if (bpp < 24)
+         psize = (offset - 14 - 24) / 3;
+   } else {
+      compress = get32le(s);
+      if (compress == 1 || compress == 2) return epuc("BMP RLE", "BMP type not supported: RLE");
+      get32le(s); // discard sizeof
+      get32le(s); // discard hres
+      get32le(s); // discard vres
+      get32le(s); // discard colorsused
+      get32le(s); // discard max important
+      if (hsz == 40 || hsz == 56) {
+         if (hsz == 56) {
+            get32le(s);
+            get32le(s);
+            get32le(s);
+            get32le(s);
+         }
+         if (bpp == 16 || bpp == 32) {
+            mr = mg = mb = 0;
+            if (compress == 0) {
+               if (bpp == 32) {
+                  mr = 0xff << 16;
+                  mg = 0xff <<  8;
+                  mb = 0xff <<  0;
+               } else {
+                  mr = 31 << 10;
+                  mg = 31 <<  5;
+                  mb = 31 <<  0;
+               }
+            } else if (compress == 3) {
+               mr = get32le(s);
+               mg = get32le(s);
+               mb = get32le(s);
+               // not documented, but generated by photoshop and handled by mspaint
+               if (mr == mg && mg == mb) {
+                  // ?!?!?
+                  return NULL;
+               }
+            } else
+               return NULL;
+         }
+      } else {
+         assert(hsz == 108);
+         mr = get32le(s);
+         mg = get32le(s);
+         mb = get32le(s);
+         ma = get32le(s);
+         get32le(s); // discard color space
+         for (i=0; i < 12; ++i)
+            get32le(s); // discard color space parameters
+      }
+      if (bpp < 16)
+         psize = (offset - 14 - hsz) >> 2;
+   }
+   s->img_n = ma ? 4 : 3;
+   if (req_comp && req_comp >= 3) // we can directly decode 3 or 4
+      target = req_comp;
+   else
+      target = s->img_n; // if they want monochrome, we'll post-convert
+   out = (stbi_uc *) malloc(target * s->img_x * s->img_y);
+   if (!out) return epuc("outofmem", "Out of memory");
+   if (bpp < 16) {
+      int z=0;
+      if (psize == 0 || psize > 256) { free(out); return epuc("invalid", "Corrupt BMP"); }
+      for (i=0; i < psize; ++i) {
+         pal[i][2] = get8(s);
+         pal[i][1] = get8(s);
+         pal[i][0] = get8(s);
+         if (hsz != 12) get8(s);
+         pal[i][3] = 255;
+      }
+      skip(s, offset - 14 - hsz - psize * (hsz == 12 ? 3 : 4));
+      if (bpp == 4) width = (s->img_x + 1) >> 1;
+      else if (bpp == 8) width = s->img_x;
+      else { free(out); return epuc("bad bpp", "Corrupt BMP"); }
+      pad = (-width)&3;
+      for (j=0; j < (int) s->img_y; ++j) {
+         for (i=0; i < (int) s->img_x; i += 2) {
+            int v=get8(s),v2=0;
+            if (bpp == 4) {
+               v2 = v & 15;
+               v >>= 4;
+            }
+            out[z++] = pal[v][0];
+            out[z++] = pal[v][1];
+            out[z++] = pal[v][2];
+            if (target == 4) out[z++] = 255;
+            if (i+1 == (int) s->img_x) break;
+            v = (bpp == 8) ? get8(s) : v2;
+            out[z++] = pal[v][0];
+            out[z++] = pal[v][1];
+            out[z++] = pal[v][2];
+            if (target == 4) out[z++] = 255;
+         }
+         skip(s, pad);
+      }
+   } else {
+      int rshift=0,gshift=0,bshift=0,ashift=0,rcount=0,gcount=0,bcount=0,acount=0;
+      int z = 0;
+      int easy=0;
+      skip(s, offset - 14 - hsz);
+      if (bpp == 24) width = 3 * s->img_x;
+      else if (bpp == 16) width = 2*s->img_x;
+      else /* bpp = 32 and pad = 0 */ width=0;
+      pad = (-width) & 3;
+      if (bpp == 24) {
+         easy = 1;
+      } else if (bpp == 32) {
+         if (mb == 0xff && mg == 0xff00 && mr == 0x00ff0000 && ma == 0xff000000)
+            easy = 2;
+      }
+      if (!easy) {
+         if (!mr || !mg || !mb) { free(out); return epuc("bad masks", "Corrupt BMP"); }
+         // right shift amt to put high bit in position #7
+         rshift = high_bit(mr)-7; rcount = bitcount(mr);
+         gshift = high_bit(mg)-7; gcount = bitcount(mg);
+         bshift = high_bit(mb)-7; bcount = bitcount(mb);
+         ashift = high_bit(ma)-7; acount = bitcount(ma);
+      }
+      for (j=0; j < (int) s->img_y; ++j) {
+         if (easy) {
+            for (i=0; i < (int) s->img_x; ++i) {
+               int a;
+               out[z+2] = get8(s);
+               out[z+1] = get8(s);
+               out[z+0] = get8(s);
+               z += 3;
+               a = (easy == 2 ? get8(s) : 255);
+               if (target == 4) out[z++] = a;
+            }
+         } else {
+            for (i=0; i < (int) s->img_x; ++i) {
+               uint32 v = (bpp == 16 ? get16le(s) : get32le(s));
+               int a;
+               out[z++] = shiftsigned(v & mr, rshift, rcount);
+               out[z++] = shiftsigned(v & mg, gshift, gcount);
+               out[z++] = shiftsigned(v & mb, bshift, bcount);
+               a = (ma ? shiftsigned(v & ma, ashift, acount) : 255);
+               if (target == 4) out[z++] = a;
+            }
+         }
+         skip(s, pad);
+      }
+   }
+   if (flip_vertically) {
+      stbi_uc t;
+      for (j=0; j < (int) s->img_y>>1; ++j) {
+         stbi_uc *p1 = out +      j     *s->img_x*target;
+         stbi_uc *p2 = out + (s->img_y-1-j)*s->img_x*target;
+         for (i=0; i < (int) s->img_x*target; ++i) {
+            t = p1[i], p1[i] = p2[i], p2[i] = t;
+         }
+      }
+   }
+
+   if (req_comp && req_comp != target) {
+      out = convert_format(out, target, req_comp, s->img_x, s->img_y);
+      if (out == NULL) return out; // convert_format frees input on failure
+   }
+
+   *x = s->img_x;
+   *y = s->img_y;
+   if (comp) *comp = s->img_n; // report the file's own component count, as the PNG loader does
+   return out;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_bmp_load             (char const *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_bmp_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_bmp_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s, f);
+   return bmp_load(&s, x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_bmp_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return bmp_load(&s, x,y,comp,req_comp);
+}
+
+// Targa Truevision - TGA
+// by Jonathan Dummer
+
+static int tga_test(stbi *s)
+{
+	int sz;
+	get8u(s);		//	discard Offset
+	sz = get8u(s);	//	color type
+	if( sz > 1 ) return 0;	//	only RGB or indexed allowed
+	sz = get8u(s);	//	image type
+	if( (sz != 1) && (sz != 2) && (sz != 3) && (sz != 9) && (sz != 10) && (sz != 11) ) return 0;	//	only RGB or grey allowed, +/- RLE
+	get16(s);		//	discard palette start
+	get16(s);		//	discard palette length
+	get8(s);			//	discard bits per palette color entry
+	get16(s);		//	discard x origin
+	get16(s);		//	discard y origin
+	if( get16(s) < 1 ) return 0;		//	test width
+	if( get16(s) < 1 ) return 0;		//	test height
+	sz = get8(s);	//	bits per pixel
+	if( (sz != 8) && (sz != 16) && (sz != 24) && (sz != 32) ) return 0;	//	only RGB or RGBA or grey allowed
+	return 1;		//	seems to have passed everything
+}
+
+#ifndef STBI_NO_STDIO
+int      stbi_tga_test_file        (FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s, f);
+   r = tga_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int      stbi_tga_test_memory      (stbi_uc const *buffer, int len)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return tga_test(&s);
+}
+
+static stbi_uc *tga_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+	//	read in the TGA header stuff
+	int tga_offset = get8u(s);
+	int tga_indexed = get8u(s);
+	int tga_image_type = get8u(s);
+	int tga_is_RLE = 0;
+	int tga_palette_start = get16le(s);
+	int tga_palette_len = get16le(s);
+	int tga_palette_bits = get8u(s);
+	int tga_x_origin = get16le(s);
+	int tga_y_origin = get16le(s);
+	int tga_width = get16le(s);
+	int tga_height = get16le(s);
+	int tga_bits_per_pixel = get8u(s);
+	int tga_inverted = get8u(s);
+	//	image data
+	unsigned char *tga_data;
+	unsigned char *tga_palette = NULL;
+	int i, j;
+	unsigned char raw_data[4];
+	unsigned char trans_data[] = { 0,0,0,0 };
+	int RLE_count = 0;
+	int RLE_repeating = 0;
+	int read_next_pixel = 1;
+	//	do a tiny bit of preprocessing
+	if( tga_image_type >= 8 )
+	{
+		tga_image_type -= 8;
+		tga_is_RLE = 1;
+	}
+	/* int tga_alpha_bits = tga_inverted & 15; */
+	tga_inverted = 1 - ((tga_inverted >> 5) & 1);
+
+	//	error check
+	if( //(tga_indexed) ||
+		(tga_width < 1) || (tga_height < 1) ||
+		(tga_image_type < 1) || (tga_image_type > 3) ||
+		((tga_bits_per_pixel != 8) && (tga_bits_per_pixel != 16) &&
+		(tga_bits_per_pixel != 24) && (tga_bits_per_pixel != 32))
+		)
+	{
+		return NULL;
+	}
+
+	//	If I'm paletted, then I'll use the number of bits from the palette
+	if( tga_indexed )
+	{
+		tga_bits_per_pixel = tga_palette_bits;
+	}
+
+	//	tga info
+	*x = tga_width;
+	*y = tga_height;
+	if( (req_comp < 1) || (req_comp > 4) )
+	{
+		//	just use whatever the file was
+		req_comp = tga_bits_per_pixel / 8;
+		if( comp ) *comp = req_comp;
+	} else
+	{
+		//	force a new number of components
+		if( comp ) *comp = tga_bits_per_pixel/8;
+	}
+	tga_data = (unsigned char*)malloc( tga_width * tga_height * req_comp ); if( !tga_data ) return epuc("outofmem", "Out of memory");
+
+	//	skip to the data's starting position (offset usually = 0)
+	skip(s, tga_offset );
+	//	do I need to load a palette?
+	if( tga_indexed )
+	{
+		//	any data to skip? (offset usually = 0)
+		skip(s, tga_palette_start );
+		//	load the palette
+		tga_palette = (unsigned char*)malloc( tga_palette_len * tga_palette_bits / 8 );
+		getn(s, tga_palette, tga_palette_len * tga_palette_bits / 8 );
+	}
+	//	load the data
+	for( i = 0; i < tga_width * tga_height; ++i )
+	{
+		//	if I'm in RLE mode, do I need to get a RLE chunk?
+		if( tga_is_RLE )
+		{
+			if( RLE_count == 0 )
+			{
+				//	yep, get the next byte as a RLE command
+				int RLE_cmd = get8u(s);
+				RLE_count = 1 + (RLE_cmd & 127);
+				RLE_repeating = RLE_cmd >> 7;
+				read_next_pixel = 1;
+			} else if( !RLE_repeating )
+			{
+				read_next_pixel = 1;
+			}
+		} else
+		{
+			read_next_pixel = 1;
+		}
+		//	OK, if I need to read a pixel, do it now
+		if( read_next_pixel )
+		{
+			//	load however much data we did have
+			if( tga_indexed )
+			{
+				//	read in 1 byte, then perform the lookup
+				int pal_idx = get8u(s);
+				if( pal_idx >= tga_palette_len )
+				{
+					//	invalid index
+					pal_idx = 0;
+				}
+				pal_idx *= tga_bits_per_pixel / 8;
+				for( j = 0; j*8 < tga_bits_per_pixel; ++j )
+				{
+					raw_data[j] = tga_palette[pal_idx+j];
+				}
+			} else
+			{
+				//	read in the data raw
+				for( j = 0; j*8 < tga_bits_per_pixel; ++j )
+				{
+					raw_data[j] = get8u(s);
+				}
+			}
+			//	convert raw to the intermediate format
+			switch( tga_bits_per_pixel )
+			{
+			case 8:
+				//	Luminance => RGBA
+				trans_data[0] = raw_data[0];
+				trans_data[1] = raw_data[0];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = 255;
+				break;
+			case 16:
+				//	Luminance,Alpha => RGBA
+				trans_data[0] = raw_data[0];
+				trans_data[1] = raw_data[0];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = raw_data[1];
+				break;
+			case 24:
+				//	BGR => RGBA
+				trans_data[0] = raw_data[2];
+				trans_data[1] = raw_data[1];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = 255;
+				break;
+			case 32:
+				//	BGRA => RGBA
+				trans_data[0] = raw_data[2];
+				trans_data[1] = raw_data[1];
+				trans_data[2] = raw_data[0];
+				trans_data[3] = raw_data[3];
+				break;
+			}
+			//	clear the reading flag for the next pixel
+			read_next_pixel = 0;
+		} // end of reading a pixel
+		//	convert to final format
+		switch( req_comp )
+		{
+		case 1:
+			//	RGBA => Luminance
+			tga_data[i*req_comp+0] = compute_y(trans_data[0],trans_data[1],trans_data[2]);
+			break;
+		case 2:
+			//	RGBA => Luminance,Alpha
+			tga_data[i*req_comp+0] = compute_y(trans_data[0],trans_data[1],trans_data[2]);
+			tga_data[i*req_comp+1] = trans_data[3];
+			break;
+		case 3:
+			//	RGBA => RGB
+			tga_data[i*req_comp+0] = trans_data[0];
+			tga_data[i*req_comp+1] = trans_data[1];
+			tga_data[i*req_comp+2] = trans_data[2];
+			break;
+		case 4:
+			//	RGBA => RGBA
+			tga_data[i*req_comp+0] = trans_data[0];
+			tga_data[i*req_comp+1] = trans_data[1];
+			tga_data[i*req_comp+2] = trans_data[2];
+			tga_data[i*req_comp+3] = trans_data[3];
+			break;
+		}
+		//	in case we're in RLE mode, keep counting down
+		--RLE_count;
+	}
+	//	do I need to invert the image?
+	if( tga_inverted )
+	{
+		for( j = 0; j*2 < tga_height; ++j )
+		{
+			int index1 = j * tga_width * req_comp;
+			int index2 = (tga_height - 1 - j) * tga_width * req_comp;
+			for( i = tga_width * req_comp; i > 0; --i )
+			{
+				unsigned char temp = tga_data[index1];
+				tga_data[index1] = tga_data[index2];
+				tga_data[index2] = temp;
+				++index1;
+				++index2;
+			}
+		}
+	}
+	//	clear my palette, if I had one
+	if( tga_palette != NULL )
+	{
+		free( tga_palette );
+	}
+	//	the things I do to get rid of an error message, and yet keep
+	//	Microsoft's C compilers happy... [8^(
+	tga_palette_start = tga_palette_len = tga_palette_bits =
+			tga_x_origin = tga_y_origin = 0;
+	//	OK, done
+	return tga_data;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_tga_load             (char const *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_tga_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_tga_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s, f);
+   return tga_load(&s, x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_tga_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return tga_load(&s, x,y,comp,req_comp);
+}
+
+
+// *************************************************************************************************
+// Photoshop PSD loader -- PD by Thatcher Ulrich, integration by Nicholas Schulz, tweaked by STB
+
+static int psd_test(stbi *s)
+{
+	if (get32(s) != 0x38425053) return 0;	// "8BPS"
+	else return 1;
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_psd_test_file(FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s, f);
+   r = psd_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int stbi_psd_test_memory(stbi_uc const *buffer, int len)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return psd_test(&s);
+}
+
+static stbi_uc *psd_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+	int	pixelCount;
+	int channelCount, compression;
+	int channel, i, count, len;
+   int w,h;
+   uint8 *out;
+
+	// Check identifier
+	if (get32(s) != 0x38425053)	// "8BPS"
+		return epuc("not PSD", "Corrupt PSD image");
+
+	// Check file type version.
+	if (get16(s) != 1)
+		return epuc("wrong version", "Unsupported version of PSD image");
+
+	// Skip 6 reserved bytes.
+	skip(s, 6 );
+
+	// Read the number of channels (R, G, B, A, etc).
+	channelCount = get16(s);
+	if (channelCount < 0 || channelCount > 16)
+		return epuc("wrong channel count", "Unsupported number of channels in PSD image");
+
+	// Read the rows and columns of the image.
+   h = get32(s);
+   w = get32(s);
+
+	// Make sure the depth is 8 bits.
+	if (get16(s) != 8)
+		return epuc("unsupported bit depth", "PSD bit depth is not 8 bit");
+
+	// Make sure the color mode is RGB.
+	// Valid options are:
+	//   0: Bitmap
+	//   1: Grayscale
+	//   2: Indexed color
+	//   3: RGB color
+	//   4: CMYK color
+	//   7: Multichannel
+	//   8: Duotone
+	//   9: Lab color
+	if (get16(s) != 3)
+		return epuc("wrong color format", "PSD is not in RGB color format");
+
+	// Skip the Mode Data.  (It's the palette for indexed color; other info for other modes.)
+	skip(s,get32(s) );
+
+	// Skip the image resources.  (resolution, pen tool paths, etc)
+	skip(s, get32(s) );
+
+	// Skip the reserved data.
+	skip(s, get32(s) );
+
+	// Find out if the data is compressed.
+	// Known values:
+	//   0: no compression
+	//   1: RLE compressed
+	compression = get16(s);
+	if (compression > 1)
+		return epuc("bad compression", "PSD has an unknown compression format");
+
+	// Create the destination image.
+	out = (stbi_uc *) malloc(4 * w*h);
+	if (!out) return epuc("outofmem", "Out of memory");
+   pixelCount = w*h;
+
+	// Initialize the data to zero.
+	//memset( out, 0, pixelCount * 4 );
+
+	// Finally, the image data.
+	if (compression) {
+		// RLE as used by .PSD and .TIFF
+		// Loop until you get the number of unpacked bytes you are expecting:
+		//     Read the next source byte into n.
+		//     If n is between 0 and 127 inclusive, copy the next n+1 bytes literally.
+		//     Else if n is between -127 and -1 inclusive, copy the next byte -n+1 times.
+		//     Else if n is -128 (byte value 128), noop.
+		// Endloop
+
+		// The RLE-compressed data is preceded by a 2-byte data count for each row in the data,
+		// which we're going to just skip.
+		skip(s, h * channelCount * 2 );
+
+		// Read the RLE data by channel.
+		for (channel = 0; channel < 4; channel++) {
+			uint8 *p;
+
+         p = out+channel;
+			if (channel >= channelCount) {
+				// Fill this channel with default data.
+				for (i = 0; i < pixelCount; i++) *p = (channel == 3 ? 255 : 0), p += 4;
+			} else {
+				// Read the RLE data.
+				count = 0;
+				while (count < pixelCount) {
+					len = get8(s);
+					if (len == 128) {
+						// No-op.
+					} else if (len < 128) {
+						// Copy next len+1 bytes literally.
+						len++;
+						count += len;
+						while (len) {
+							*p = get8(s);
+                     p += 4;
+							len--;
+						}
+					} else if (len > 128) {
+						uint32	val;
+						// Next -len+1 bytes in the dest are replicated from next source byte.
+						// (Interpret len as a negative 8-bit int.)
+						len ^= 0x0FF;
+						len += 2;
+                  val = get8(s);
+						count += len;
+						while (len) {
+							*p = val;
+                     p += 4;
+							len--;
+						}
+					}
+				}
+			}
+		}
+
+	} else {
+		// We're at the raw image data.  It's each channel in order (Red, Green, Blue, Alpha, ...)
+		// where each channel consists of an 8-bit value for each pixel in the image.
+
+		// Read the data by channel.
+		for (channel = 0; channel < 4; channel++) {
+			uint8 *p;
+
+         p = out + channel;
+			if (channel >= channelCount) {
+				// Fill this channel with default data.
+				for (i = 0; i < pixelCount; i++) *p = channel == 3 ? 255 : 0, p += 4;
+			} else {
+				// Read the data.
+				count = 0;
+				for (i = 0; i < pixelCount; i++)
+					*p = get8(s), p += 4;
+			}
+		}
+	}
+
+	if (req_comp && req_comp != 4) {
+		out = convert_format(out, 4, req_comp, w, h);
+		if (out == NULL) return out; // convert_format frees input on failure
+	}
+
+	if (comp) *comp = channelCount;
+	*y = h;
+	*x = w;
+
+	return out;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_psd_load(char const *filename, int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_psd_load_from_file(f, x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+
+stbi_uc *stbi_psd_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s, f);
+   return psd_load(&s, x,y,comp,req_comp);
+}
+#endif
+
+stbi_uc *stbi_psd_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s, buffer, len);
+   return psd_load(&s, x,y,comp,req_comp);
+}
+
+
+// *************************************************************************************************
+// Radiance RGBE HDR loader
+// originally by Nicolas Schulz
+#ifndef STBI_NO_HDR
+static int hdr_test(stbi *s)
+{
+   const char *signature = "#?RADIANCE\n";
+   int i;
+   for (i=0; signature[i]; ++i)
+      if (get8(s) != signature[i])
+         return 0;
+	return 1;
+}
+
+int stbi_hdr_test_memory(stbi_uc const *buffer, int len)
+{
+   stbi s;
+	start_mem(&s, buffer, len);
+	return hdr_test(&s);
+}
+
+#ifndef STBI_NO_STDIO
+int stbi_hdr_test_file(FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s, f);
+   r = hdr_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+#define HDR_BUFLEN  1024
+static char *hdr_gettoken(stbi *z, char *buffer)
+{
+   int len=0;
+	//char *s = buffer,
+	char c = '\0';
+
+   c = get8(z);
+
+	while (!at_eof(z) && c != '\n') {
+		buffer[len++] = c;
+      if (len == HDR_BUFLEN-1) {
+         // flush to end of line
+         while (!at_eof(z) && get8(z) != '\n')
+            ;
+         break;
+      }
+      c = get8(z);
+	}
+
+   buffer[len] = 0;
+	return buffer;
+}
+
+static void hdr_convert(float *output, stbi_uc *input, int req_comp)
+{
+	if( input[3] != 0 ) {
+      float f1;
+		// Exponent
+		f1 = (float) ldexp(1.0f, input[3] - (int)(128 + 8));
+      if (req_comp <= 2)
+         output[0] = (input[0] + input[1] + input[2]) * f1 / 3;
+      else {
+         output[0] = input[0] * f1;
+         output[1] = input[1] * f1;
+         output[2] = input[2] * f1;
+      }
+      if (req_comp == 2) output[1] = 1;
+      if (req_comp == 4) output[3] = 1;
+	} else {
+      switch (req_comp) {
+         case 4: output[3] = 1; /* fallthrough */
+         case 3: output[0] = output[1] = output[2] = 0;
+                 break;
+         case 2: output[1] = 1; /* fallthrough */
+         case 1: output[0] = 0;
+                 break;
+      }
+	}
+}
+
+
+static float *hdr_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+   char buffer[HDR_BUFLEN];
+	char *token;
+	int valid = 0;
+	int width, height;
+   stbi_uc *scanline;
+	float *hdr_data;
+	int len;
+	unsigned char count, value;
+	int i, j, k, c1,c2, z;
+
+
+	// Check identifier
+	if (strcmp(hdr_gettoken(s,buffer), "#?RADIANCE") != 0)
+		return epf("not HDR", "Corrupt HDR image");
+
+	// Parse header
+	while(1) {
+		token = hdr_gettoken(s,buffer);
+      if (token[0] == 0) break;
+		if (strcmp(token, "FORMAT=32-bit_rle_rgbe") == 0) valid = 1;
+   }
+
+	if (!valid)    return epf("unsupported format", "Unsupported HDR format");
+
+   // Parse width and height
+   // can't use sscanf() if we're not using stdio!
+   token = hdr_gettoken(s,buffer);
+   if (strncmp(token, "-Y ", 3))  return epf("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   height = strtol(token, &token, 10);
+   while (*token == ' ') ++token;
+   if (strncmp(token, "+X ", 3))  return epf("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   width = strtol(token, NULL, 10);
+
+	*x = width;
+	*y = height;
+
+   if (comp) *comp = 3;
+	if (req_comp == 0) req_comp = 3;
+
+	// Read data
+	hdr_data = (float *) malloc(height * width * req_comp * sizeof(float)); if (hdr_data == NULL) return epf("outofmem", "Out of memory");
+
+	// Load image data
+   // image data is stored as some number of scan lines
+	if( width < 8 || width >= 32768) {
+		// Read flat data
+      for (j=0; j < height; ++j) {
+         for (i=0; i < width; ++i) {
+            stbi_uc rgbe[4];
+           main_decode_loop:
+            getn(s, rgbe, 4);
+            hdr_convert(hdr_data + j * width * req_comp + i * req_comp, rgbe, req_comp);
+         }
+      }
+	} else {
+		// Read RLE-encoded data
+		scanline = NULL;
+
+		for (j = 0; j < height; ++j) {
+         c1 = get8(s);
+         c2 = get8(s);
+         len = get8(s);
+         if (c1 != 2 || c2 != 2 || (len & 0x80)) {
+            // not run-length encoded, so we have to actually use THIS data as a decoded
+            // pixel (note this can't be a valid pixel--one of RGB must be >= 128)
+            stbi_uc rgbe[4] = { c1,c2,len, get8(s) };
+            hdr_convert(hdr_data, rgbe, req_comp);
+            i = 1;
+            j = 0;
+            free(scanline);
+            goto main_decode_loop; // yes, this is insane; blame the insane format
+         }
+         len <<= 8;
+         len |= get8(s);
+         if (len != width) { free(hdr_data); free(scanline); return epf("invalid decoded scanline length", "corrupt HDR"); }
+         if (scanline == NULL) scanline = (stbi_uc *) malloc(width * 4);
+
+			for (k = 0; k < 4; ++k) {
+				i = 0;
+				while (i < width) {
+					count = get8(s);
+					if (count > 128) {
+						// Run
+						value = get8(s);
+                  count -= 128;
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = value;
+					} else {
+						// Dump
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = get8(s);
+					}
+				}
+			}
+         for (i=0; i < width; ++i)
+            hdr_convert(hdr_data+(j*width + i)*req_comp, scanline + i*4, req_comp);
+		}
+      free(scanline);
+	}
+
+   return hdr_data;
+}
+
+static stbi_uc *hdr_load_rgbe(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+   char buffer[HDR_BUFLEN];
+	char *token;
+	int valid = 0;
+	int width, height;
+   stbi_uc *scanline;
+	stbi_uc *rgbe_data;
+	int len;
+	unsigned char count, value;
+	int i, j, k, c1,c2, z;
+
+
+	// Check identifier
+	if (strcmp(hdr_gettoken(s,buffer), "#?RADIANCE") != 0)
+		return epuc("not HDR", "Corrupt HDR image");
+
+	// Parse header
+	while(1) {
+		token = hdr_gettoken(s,buffer);
+      if (token[0] == 0) break;
+		if (strcmp(token, "FORMAT=32-bit_rle_rgbe") == 0) valid = 1;
+   }
+
+	if (!valid)    return epuc("unsupported format", "Unsupported HDR format");
+
+   // Parse width and height
+   // can't use sscanf() if we're not using stdio!
+   token = hdr_gettoken(s,buffer);
+   if (strncmp(token, "-Y ", 3))  return epuc("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   height = strtol(token, &token, 10);
+   while (*token == ' ') ++token;
+   if (strncmp(token, "+X ", 3))  return epuc("unsupported data layout", "Unsupported HDR format");
+   token += 3;
+   width = strtol(token, NULL, 10);
+
+	*x = width;
+	*y = height;
+
+	// RGBE _MUST_ come out as 4 components
+   if (comp) *comp = 4;
+	req_comp = 4;
+
+	// Read data
+	rgbe_data = (stbi_uc *) malloc(height * width * req_comp * sizeof(stbi_uc)); if (rgbe_data == NULL) return epuc("outofmem", "Out of memory");
+	//	point to the beginning
+	scanline = rgbe_data;
+
+	// Load image data
+   // image data is stored as some number of scan lines
+	if( width < 8 || width >= 32768) {
+		// Read flat data
+      for (j=0; j < height; ++j) {
+         for (i=0; i < width; ++i) {
+           main_decode_loop:
+            //getn(rgbe, 4);
+            getn(s,scanline, 4);
+			scanline += 4;
+         }
+      }
+	} else {
+		// Read RLE-encoded data
+		for (j = 0; j < height; ++j) {
+         c1 = get8(s);
+         c2 = get8(s);
+         len = get8(s);
+         if (c1 != 2 || c2 != 2 || (len & 0x80)) {
+            // not run-length encoded, so we have to actually use THIS data as a decoded
+            // pixel (note this can't be a valid pixel--one of RGB must be >= 128)
+            scanline[0] = c1;
+            scanline[1] = c2;
+            scanline[2] = len;
+            scanline[3] = get8(s);
+            scanline += 4;
+            i = 1;
+            j = 0;
+            goto main_decode_loop; // yes, this is insane; blame the insane format
+         }
+         len <<= 8;
+         len |= get8(s);
+         if (len != width) { free(rgbe_data); return epuc("invalid decoded scanline length", "corrupt HDR"); }
+			for (k = 0; k < 4; ++k) {
+				i = 0;
+				while (i < width) {
+					count = get8(s);
+					if (count > 128) {
+						// Run
+						value = get8(s);
+                  count -= 128;
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = value;
+					} else {
+						// Dump
+						for (z = 0; z < count; ++z)
+							scanline[i++ * 4 + k] = get8(s);
+					}
+				}
+			}
+			//	move the scanline on
+			scanline += 4 * width;
+		}
+	}
+
+   return rgbe_data;
+}
+
+#ifndef STBI_NO_STDIO
+float *stbi_hdr_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s,f);
+   return hdr_load(&s,x,y,comp,req_comp);
+}
+
+stbi_uc *stbi_hdr_load_rgbe_file(FILE *f, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_file(&s,f);
+   return hdr_load_rgbe(&s,x,y,comp,req_comp);
+}
+
+stbi_uc *stbi_hdr_load_rgbe        (char const *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   FILE *f = fopen(filename, "rb");
+   unsigned char *result;
+   if (!f) return epuc("can't fopen", "Unable to open file");
+   result = stbi_hdr_load_rgbe_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return result;
+}
+#endif
+
+float *stbi_hdr_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s,buffer, len);
+   return hdr_load(&s,x,y,comp,req_comp);
+}
+
+stbi_uc *stbi_hdr_load_rgbe_memory(stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+   stbi s;
+   start_mem(&s,buffer, len);
+   return hdr_load_rgbe(&s,x,y,comp,req_comp);
+}
+
+#endif // STBI_NO_HDR
+
+/////////////////////// write image ///////////////////////
+
+#ifndef STBI_NO_WRITE
+
+static void write8(FILE *f, int x) { uint8 z = (uint8) x; fwrite(&z,1,1,f); }
+
+static void writefv(FILE *f, char *fmt, va_list v)
+{
+   while (*fmt) {
+      switch (*fmt++) {
+         case ' ': break;
+         case '1': { uint8 x = va_arg(v, int); write8(f,x); break; }
+         case '2': { int16 x = va_arg(v, int); write8(f,x); write8(f,x>>8); break; }
+         case '4': { int32 x = va_arg(v, int); write8(f,x); write8(f,x>>8); write8(f,x>>16); write8(f,x>>24); break; }
+         default:
+            assert(0);
+            // no va_end here: the caller that ran va_start owns v
+            return;
+      }
+   }
+}
+
+static void writef(FILE *f, char *fmt, ...)
+{
+   va_list v;
+   va_start(v, fmt);
+   writefv(f,fmt,v);
+   va_end(v);
+}
+
+static void write_pixels(FILE *f, int rgb_dir, int vdir, int x, int y, int comp, void *data, int write_alpha, int scanline_pad)
+{
+   uint8 bg[3] = { 255, 0, 255}, px[3];
+   uint32 zero = 0;
+   int i,j,k, j_end;
+
+   if (vdir < 0)
+      j_end = -1, j = y-1;
+   else
+      j_end =  y, j = 0;
+
+   for (; j != j_end; j += vdir) {
+      for (i=0; i < x; ++i) {
+         uint8 *d = (uint8 *) data + (j*x+i)*comp;
+         if (write_alpha < 0)
+            fwrite(&d[comp-1], 1, 1, f);
+         switch (comp) {
+            case 1:
+            case 2: writef(f, "111", d[0],d[0],d[0]);
+                    break;
+            case 4:
+               if (!write_alpha) {
+                  for (k=0; k < 3; ++k)
+                     px[k] = bg[k] + ((d[k] - bg[k]) * d[3])/255;
+                  writef(f, "111", px[1-rgb_dir],px[1],px[1+rgb_dir]);
+                  break;
+               }
+               /* FALLTHROUGH */
+            case 3:
+               writef(f, "111", d[1-rgb_dir],d[1],d[1+rgb_dir]);
+               break;
+         }
+         if (write_alpha > 0)
+            fwrite(&d[comp-1], 1, 1, f);
+      }
+      fwrite(&zero,scanline_pad,1,f);
+   }
+}
+
+static int outfile(char const *filename, int rgb_dir, int vdir, int x, int y, int comp, void *data, int alpha, int pad, char *fmt, ...)
+{
+   FILE *f = fopen(filename, "wb");
+   if (f) {
+      va_list v;
+      va_start(v, fmt);
+      writefv(f, fmt, v);
+      va_end(v);
+      write_pixels(f,rgb_dir,vdir,x,y,comp,data,alpha,pad);
+      fclose(f);
+   }
+   return f != NULL;
+}
+
+int stbi_write_bmp(char const *filename, int x, int y, int comp, void *data)
+{
+   int pad = (-x*3) & 3;
+   return outfile(filename,-1,-1,x,y,comp,data,0,pad,
+           "11 4 22 4" "4 44 22 444444",
+           'B', 'M', 14+40+(x*3+pad)*y, 0,0, 14+40,  // file header
+            40, x,y, 1,24, 0,0,0,0,0,0);             // bitmap header
+}
+
+int stbi_write_tga(char const *filename, int x, int y, int comp, void *data)
+{
+   int has_alpha = !(comp & 1);
+   return outfile(filename, -1,-1, x, y, comp, data, has_alpha, 0,
+                  "111 221 2222 11", 0,0,2, 0,0,0, 0,0,x,y, 24+8*has_alpha, 8*has_alpha);
+}
+
+// any other image formats that do interleaved rgb data?
+//    PNG: requires adler32,crc32 -- significant amount of code
+//    PSD: no, channels output separately
+//    TIFF: no, stripwise-interleaved... i think
+
+#endif // STBI_NO_WRITE
+
+//	add in my DDS loading support
+#ifndef STBI_NO_DDS
+#include "stbi_DDS_aug_c.h"
+#endif
diff --git a/external/include/SOIL/stb_image_aug.h b/external/include/SOIL/stb_image_aug.h
new file mode 100644
index 0000000..e59f2eb
--- /dev/null
+++ b/external/include/SOIL/stb_image_aug.h
@@ -0,0 +1,354 @@
+/* stbi-1.16 - public domain JPEG/PNG reader - http://nothings.org/stb_image.c
+                      when you control the images you're loading
+
+   QUICK NOTES:
+      Primarily of interest to game developers and other people who can
+          avoid problematic images and only need the trivial interface
+
+      JPEG baseline (no JPEG progressive, no oddball channel decimations)
+      PNG non-interlaced
+      BMP non-1bpp, non-RLE
+      TGA (not sure what subset, if a subset)
+      PSD (composited view only, no extra channels)
+      HDR (radiance rgbE format)
+      writes BMP,TGA (define STBI_NO_WRITE to remove code)
+      decoded from memory or through stdio FILE (define STBI_NO_STDIO to remove code)
+      supports installable dequantizing-IDCT, YCbCr-to-RGB conversion (define STBI_SIMD)
+        
+   TODO:
+      stbi_info_*
+  
+   history:
+      1.16   major bugfix - convert_format converted one too many pixels
+      1.15   initialize some fields for thread safety
+      1.14   fix threadsafe conversion bug; header-file-only version (#define STBI_HEADER_FILE_ONLY before including)
+      1.13   threadsafe
+      1.12   const qualifiers in the API
+      1.11   Support installable IDCT, colorspace conversion routines
+      1.10   Fixes for 64-bit (don't use "unsigned long")
+             optimized upsampling by Fabian "ryg" Giesen
+      1.09   Fix format-conversion for PSD code (bad global variables!)
+      1.08   Thatcher Ulrich's PSD code integrated by Nicolas Schulz
+      1.07   attempt to fix C++ warning/errors again
+      1.06   attempt to fix C++ warning/errors again
+      1.05   fix TGA loading to return correct *comp and use good luminance calc
+      1.04   default float alpha is 1, not 255; use 'void *' for stbi_image_free
+      1.03   bugfixes to STBI_NO_STDIO, STBI_NO_HDR
+      1.02   support for (subset of) HDR files, float interface for preferred access to them
+      1.01   fix bug: possible bug in handling right-side up bmps... not sure
+             fix bug: the stbi_bmp_load() and stbi_tga_load() functions didn't work at all
+      1.00   interface to zlib that skips zlib header
+      0.99   correct handling of alpha in palette
+      0.98   TGA loader by lonesock; dynamically add loaders (untested)
+      0.97   jpeg errors on too large a file; also catch another malloc failure
+      0.96   fix detection of invalid v value - particleman@mollyrocket forum
+      0.95   during header scan, seek to markers in case of padding
+      0.94   STBI_NO_STDIO to disable stdio usage; rename all #defines the same
+      0.93   handle jpegtran output; verbose errors
+      0.92   read 4,8,16,24,32-bit BMP files of several formats
+      0.91   output 24-bit Windows 3.0 BMP files
+      0.90   fix a few more warnings; bump version number to approach 1.0
+      0.61   bugfixes due to Marc LeBlanc, Christopher Lloyd
+      0.60   fix compiling as c++
+      0.59   fix warnings: merge Dave Moore's -Wall fixes
+      0.58   fix bug: zlib uncompressed mode len/nlen was wrong endian
+      0.57   fix bug: jpg last huffman symbol before marker was >9 bits but less
+                      than 16 available
+      0.56   fix bug: zlib uncompressed mode len vs. nlen
+      0.55   fix bug: restart_interval not initialized to 0
+      0.54   allow NULL for 'int *comp'
+      0.53   fix bug in png 3->4; speedup png decoding
+      0.52   png handles req_comp=3,4 directly; minor cleanup; jpeg comments
+      0.51   obey req_comp requests, 1-component jpegs return as 1-component,
+             on 'test' only check type, not whether we support this variant
+*/
+
+#ifndef HEADER_STB_IMAGE_AUGMENTED
+#define HEADER_STB_IMAGE_AUGMENTED
+
+////   begin header file  ////////////////////////////////////////////////////
+//
+// Limitations:
+//    - no progressive/interlaced support (jpeg, png)
+//    - 8-bit samples only (jpeg, png)
+//    - not threadsafe
+//    - channel subsampling of at most 2 in each dimension (jpeg)
+//    - no delayed line count (jpeg) -- IJG doesn't support either
+//
+// Basic usage (see HDR discussion below):
+//    int x,y,n;
+//    unsigned char *data = stbi_load(filename, &x, &y, &n, 0);
+//    // ... process data if not NULL ... 
+//    // ... x = width, y = height, n = # 8-bit components per pixel ...
+//    // ... replace '0' with '1'..'4' to force that many components per pixel
+//    stbi_image_free(data);
+//
+// Standard parameters:
+//    int *x       -- outputs image width in pixels
+//    int *y       -- outputs image height in pixels
+//    int *comp    -- outputs # of image components in image file
+//    int req_comp -- if non-zero, # of image components requested in result
+//
+// The return value from an image loader is an 'unsigned char *' which points
+// to the pixel data. The pixel data consists of *y scanlines of *x pixels,
+// with each pixel consisting of N interleaved 8-bit components; the first
+// pixel pointed to is top-left-most in the image. There is no padding between
+// image scanlines or between pixels, regardless of format. The number of
+// components N is 'req_comp' if req_comp is non-zero, or *comp otherwise.
+// If req_comp is non-zero, *comp has the number of components that _would_
+// have been output otherwise. E.g. if you set req_comp to 4, you will always
+// get RGBA output, but you can check *comp to easily see if it's opaque.
+//
+// An output image with N components has the following components interleaved
+// in this order in each pixel:
+//
+//     N=#comp     components
+//       1           grey
+//       2           grey, alpha
+//       3           red, green, blue
+//       4           red, green, blue, alpha
+//
+// If image loading fails for any reason, the return value will be NULL,
+// and *x, *y, *comp will be unchanged. The function stbi_failure_reason()
+// can be queried for an extremely brief, end-user unfriendly explanation
+// of why the load failed. Define STBI_NO_FAILURE_STRINGS to avoid
+// compiling these strings at all, and STBI_FAILURE_USERMSG to get slightly
+// more user-friendly ones.
+//
+// Paletted PNG and BMP images are automatically depalettized.
+//
+//
+// ===========================================================================
+//
+// HDR image support   (disable by defining STBI_NO_HDR)
+//
+// stb_image now supports loading HDR images in general, and currently
+// the Radiance .HDR file format, although the support is provided
+// generically. You can still load any file through the existing interface;
+// if you attempt to load an HDR file, it will be automatically remapped to
+// LDR, assuming gamma 2.2 and an arbitrary scale factor defaulting to 1;
+// both of these constants can be reconfigured through this interface:
+//
+//     stbi_hdr_to_ldr_gamma(2.2f);
+//     stbi_hdr_to_ldr_scale(1.0f);
+//
+// (note, do not use _inverse_ constants; stb_image will invert them
+// appropriately).
+//
+// Additionally, there is a new, parallel interface for loading files as
+// (linear) floats to preserve the full dynamic range:
+//
+//    float *data = stbi_loadf(filename, &x, &y, &n, 0);
+// 
+// If you load LDR images through this interface, those images will
+// be promoted to floating point values, run through the inverse of
+// constants corresponding to the above:
+//
+//     stbi_ldr_to_hdr_scale(1.0f);
+//     stbi_ldr_to_hdr_gamma(2.2f);
+//
+// Finally, given a filename (or an open file or memory block--see header
+// file for details) containing image data, you can query for the "most
+// appropriate" interface to use (that is, whether the image is HDR or
+// not), using:
+//
+//     stbi_is_hdr(char const *filename);
+
+#ifndef STBI_NO_STDIO
+#include <stdio.h>
+#endif
+
+#define STBI_VERSION 1
+
+enum
+{
+   STBI_default = 0, // only used for req_comp
+
+   STBI_grey       = 1,
+   STBI_grey_alpha = 2,
+   STBI_rgb        = 3,
+   STBI_rgb_alpha  = 4,
+};
+
+typedef unsigned char stbi_uc;
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// WRITING API
+
+#if !defined(STBI_NO_WRITE) && !defined(STBI_NO_STDIO)
+// write a BMP/TGA file given tightly packed 'comp' channels (no padding, nor bmp-stride-padding)
+// (you must include the appropriate extension in the filename).
+// returns nonzero if the file was opened, 0 if it couldn't be; write errors are not reported
+extern int      stbi_write_bmp       (char const *filename,     int x, int y, int comp, void *data);
+extern int      stbi_write_tga       (char const *filename,     int x, int y, int comp, void *data);
+#endif
+
+// PRIMARY API - works on images of any type
+
+// load image by filename, open file, or memory buffer
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_load            (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_load_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+extern int      stbi_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+extern stbi_uc *stbi_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+// for stbi_load_from_file, file pointer is left pointing immediately after image
+
+#ifndef STBI_NO_HDR
+#ifndef STBI_NO_STDIO
+extern float *stbi_loadf            (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern float *stbi_loadf_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+extern float *stbi_loadf_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+
+extern void   stbi_hdr_to_ldr_gamma(float gamma);
+extern void   stbi_hdr_to_ldr_scale(float scale);
+
+extern void   stbi_ldr_to_hdr_gamma(float gamma);
+extern void   stbi_ldr_to_hdr_scale(float scale);
+
+#endif // STBI_NO_HDR
+
+// get a VERY brief reason for failure
+// NOT THREADSAFE
+extern char    *stbi_failure_reason  (void); 
+
+// free the loaded image -- this is just free()
+extern void     stbi_image_free      (void *retval_from_stbi_load);
+
+// get image dimensions & components without fully decoding
+extern int      stbi_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+extern int      stbi_is_hdr_from_memory(stbi_uc const *buffer, int len);
+#ifndef STBI_NO_STDIO
+extern int      stbi_info            (char const *filename,     int *x, int *y, int *comp);
+extern int      stbi_is_hdr          (char const *filename);
+extern int      stbi_is_hdr_from_file(FILE *f);
+#endif
+
+// ZLIB client - used by PNG, available for other purposes
+
+extern char *stbi_zlib_decode_malloc_guesssize(const char *buffer, int len, int initial_size, int *outlen);
+extern char *stbi_zlib_decode_malloc(const char *buffer, int len, int *outlen);
+extern int   stbi_zlib_decode_buffer(char *obuffer, int olen, const char *ibuffer, int ilen);
+
+extern char *stbi_zlib_decode_noheader_malloc(const char *buffer, int len, int *outlen);
+extern int   stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, const char *ibuffer, int ilen);
+
+// TYPE-SPECIFIC ACCESS
+
+// is it a jpeg?
+extern int      stbi_jpeg_test_memory     (stbi_uc const *buffer, int len);
+extern stbi_uc *stbi_jpeg_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+extern int      stbi_jpeg_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_jpeg_load            (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern int      stbi_jpeg_test_file       (FILE *f);
+extern stbi_uc *stbi_jpeg_load_from_file  (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+
+extern int      stbi_jpeg_info            (char const *filename,     int *x, int *y, int *comp);
+extern int      stbi_jpeg_info_from_file  (FILE *f,                  int *x, int *y, int *comp);
+#endif
+
+// is it a png?
+extern int      stbi_png_test_memory      (stbi_uc const *buffer, int len);
+extern stbi_uc *stbi_png_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp);
+
+#ifndef STBI_NO_STDIO
+extern stbi_uc *stbi_png_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info             (char const *filename,     int *x, int *y, int *comp);
+extern int      stbi_png_test_file        (FILE *f);
+extern stbi_uc *stbi_png_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+extern int      stbi_png_info_from_file   (FILE *f,                  int *x, int *y, int *comp);
+#endif
+
+// is it a bmp?
+extern int      stbi_bmp_test_memory      (stbi_uc const *buffer, int len);
+
+extern stbi_uc *stbi_bmp_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_bmp_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_bmp_test_file        (FILE *f);
+extern stbi_uc *stbi_bmp_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it a tga?
+extern int      stbi_tga_test_memory      (stbi_uc const *buffer, int len);
+
+extern stbi_uc *stbi_tga_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_tga_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_tga_test_file        (FILE *f);
+extern stbi_uc *stbi_tga_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it a psd?
+extern int      stbi_psd_test_memory      (stbi_uc const *buffer, int len);
+
+extern stbi_uc *stbi_psd_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_psd_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_psd_test_file        (FILE *f);
+extern stbi_uc *stbi_psd_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// is it an hdr?
+extern int      stbi_hdr_test_memory      (stbi_uc const *buffer, int len);
+
+extern float *  stbi_hdr_load             (char const *filename,     int *x, int *y, int *comp, int req_comp);
+extern float *  stbi_hdr_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_hdr_load_rgbe        (char const *filename,           int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_hdr_load_rgbe_memory (stbi_uc *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_hdr_test_file        (FILE *f);
+extern float *  stbi_hdr_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_hdr_load_rgbe_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+// define new loaders
+typedef struct
+{
+   int       (*test_memory)(stbi_uc const *buffer, int len);
+   stbi_uc * (*load_from_memory)(stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+   #ifndef STBI_NO_STDIO
+   int       (*test_file)(FILE *f);
+   stbi_uc * (*load_from_file)(FILE *f, int *x, int *y, int *comp, int req_comp);
+   #endif
+} stbi_loader;
+
+// register a loader by filling out the above structure (you must define ALL functions)
+// returns 1 if added or already added, 0 if not added (too many loaders)
+// NOT THREADSAFE
+extern int stbi_register_loader(stbi_loader *loader);
+
+// define faster low-level operations (typically SIMD support)
+#if STBI_SIMD
+typedef void (*stbi_idct_8x8)(uint8 *out, int out_stride, short data[64], unsigned short *dequantize);
+// compute an integer IDCT on "input"
+//     input[x] = data[x] * dequantize[x]
+//     write results to 'out': 64 samples, each run of 8 spaced by 'out_stride'
+//                             CLAMP results to 0..255
+typedef void (*stbi_YCbCr_to_RGB_run)(uint8 *output, uint8 const *y, uint8 const *cb, uint8 const *cr, int count, int step);
+// compute a conversion from YCbCr to RGB
+//     'count' pixels
+//     write pixels to 'output'; each pixel is 'step' bytes (either 3 or 4; if 4, write '255' as 4th), order R,G,B
+//     y: Y input channel
+//     cb: Cb input channel; scale/biased to be 0..255
+//     cr: Cr input channel; scale/biased to be 0..255
+
+extern void stbi_install_idct(stbi_idct_8x8 func);
+extern void stbi_install_YCbCr_to_RGB(stbi_YCbCr_to_RGB_run func);
+#endif // STBI_SIMD
+
+#ifdef __cplusplus
+}
+#endif
+
+//
+//
+////   end header file   /////////////////////////////////////////////////////
+#endif // HEADER_STB_IMAGE_AUGMENTED
diff --git a/external/include/SOIL/stbi_DDS_aug.h b/external/include/SOIL/stbi_DDS_aug.h
new file mode 100644
index 0000000..c7da9f7
--- /dev/null
+++ b/external/include/SOIL/stbi_DDS_aug.h
@@ -0,0 +1,21 @@
+/*
+	adding DDS loading support to stbi
+*/
+
+#ifndef HEADER_STB_IMAGE_DDS_AUGMENTATION
+#define HEADER_STB_IMAGE_DDS_AUGMENTATION
+
+//	is it a DDS file?
+extern int      stbi_dds_test_memory      (stbi_uc const *buffer, int len);
+
+extern stbi_uc *stbi_dds_load             (char *filename,           int *x, int *y, int *comp, int req_comp);
+extern stbi_uc *stbi_dds_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp);
+#ifndef STBI_NO_STDIO
+extern int      stbi_dds_test_file        (FILE *f);
+extern stbi_uc *stbi_dds_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp);
+#endif
+
+//
+//
+////   end header file   /////////////////////////////////////////////////////
+#endif // HEADER_STB_IMAGE_DDS_AUGMENTATION
diff --git a/external/include/SOIL/stbi_DDS_aug_c.h b/external/include/SOIL/stbi_DDS_aug_c.h
new file mode 100644
index 0000000..f49407a
--- /dev/null
+++ b/external/include/SOIL/stbi_DDS_aug_c.h
@@ -0,0 +1,511 @@
+
+///	DDS file support, does decoding, _not_ direct uploading
+///	(use SOIL for that ;-)
+
+///	A bunch of DirectDraw Surface structures and flags
+typedef struct {
+    unsigned int    dwMagic;
+    unsigned int    dwSize;
+    unsigned int    dwFlags;
+    unsigned int    dwHeight;
+    unsigned int    dwWidth;
+    unsigned int    dwPitchOrLinearSize;
+    unsigned int    dwDepth;
+    unsigned int    dwMipMapCount;
+    unsigned int    dwReserved1[ 11 ];
+
+    //  DDPIXELFORMAT
+    struct {
+      unsigned int    dwSize;
+      unsigned int    dwFlags;
+      unsigned int    dwFourCC;
+      unsigned int    dwRGBBitCount;
+      unsigned int    dwRBitMask;
+      unsigned int    dwGBitMask;
+      unsigned int    dwBBitMask;
+      unsigned int    dwAlphaBitMask;
+    }               sPixelFormat;
+
+    //  DDCAPS2
+    struct {
+      unsigned int    dwCaps1;
+      unsigned int    dwCaps2;
+      unsigned int    dwDDSX;
+      unsigned int    dwReserved;
+    }               sCaps;
+    unsigned int    dwReserved2;
+} DDS_header ;
+
+//	the following constants were copied directly off the MSDN website
+
+//	The dwFlags member of the original DDSURFACEDESC2 structure
+//	can be set to one or more of the following values.
+#define DDSD_CAPS	0x00000001
+#define DDSD_HEIGHT	0x00000002
+#define DDSD_WIDTH	0x00000004
+#define DDSD_PITCH	0x00000008
+#define DDSD_PIXELFORMAT	0x00001000
+#define DDSD_MIPMAPCOUNT	0x00020000
+#define DDSD_LINEARSIZE	0x00080000
+#define DDSD_DEPTH	0x00800000
+
+//	DirectDraw Pixel Format
+#define DDPF_ALPHAPIXELS	0x00000001
+#define DDPF_FOURCC	0x00000004
+#define DDPF_RGB	0x00000040
+
+//	The dwCaps1 member of the DDSCAPS2 structure can be
+//	set to one or more of the following values.
+#define DDSCAPS_COMPLEX	0x00000008
+#define DDSCAPS_TEXTURE	0x00001000
+#define DDSCAPS_MIPMAP	0x00400000
+
+//	The dwCaps2 member of the DDSCAPS2 structure can be
+//	set to one or more of the following values.
+#define DDSCAPS2_CUBEMAP	0x00000200
+#define DDSCAPS2_CUBEMAP_POSITIVEX	0x00000400
+#define DDSCAPS2_CUBEMAP_NEGATIVEX	0x00000800
+#define DDSCAPS2_CUBEMAP_POSITIVEY	0x00001000
+#define DDSCAPS2_CUBEMAP_NEGATIVEY	0x00002000
+#define DDSCAPS2_CUBEMAP_POSITIVEZ	0x00004000
+#define DDSCAPS2_CUBEMAP_NEGATIVEZ	0x00008000
+#define DDSCAPS2_VOLUME	0x00200000
+
+static int dds_test(stbi *s)
+{
+	//	check the magic number
+	if (get8(s) != 'D') return 0;
+	if (get8(s) != 'D') return 0;
+	if (get8(s) != 'S') return 0;
+	if (get8(s) != ' ') return 0;
+	//	check header size
+	if (get32le(s) != 124) return 0;
+	return 1;
+}
+#ifndef STBI_NO_STDIO
+int      stbi_dds_test_file        (FILE *f)
+{
+   stbi s;
+   int r,n = ftell(f);
+   start_file(&s,f);
+   r = dds_test(&s);
+   fseek(f,n,SEEK_SET);
+   return r;
+}
+#endif
+
+int      stbi_dds_test_memory      (stbi_uc const *buffer, int len)
+{
+   stbi s;
+   start_mem(&s,buffer, len);
+   return dds_test(&s);
+}
+
+//	helper functions
+int stbi_convert_bit_range( int c, int from_bits, int to_bits )
+{
+	int b = (1 << (from_bits - 1)) + c * ((1 << to_bits) - 1);
+	return (b + (b >> from_bits)) >> from_bits;
+}
+void stbi_rgb_888_from_565( unsigned int c, int *r, int *g, int *b )
+{
+	*r = stbi_convert_bit_range( (c >> 11) & 31, 5, 8 );
+	*g = stbi_convert_bit_range( (c >> 05) & 63, 6, 8 );
+	*b = stbi_convert_bit_range( (c >> 00) & 31, 5, 8 );
+}
+void stbi_decode_DXT1_block(
+			unsigned char uncompressed[16*4],
+			unsigned char compressed[8] )
+{
+	int next_bit = 4*8;
+	int i, r, g, b;
+	int c0, c1;
+	unsigned char decode_colors[4*4];
+	//	find the 2 primary colors
+	c0 = compressed[0] + (compressed[1] << 8);
+	c1 = compressed[2] + (compressed[3] << 8);
+	stbi_rgb_888_from_565( c0, &r, &g, &b );
+	decode_colors[0] = r;
+	decode_colors[1] = g;
+	decode_colors[2] = b;
+	decode_colors[3] = 255;
+	stbi_rgb_888_from_565( c1, &r, &g, &b );
+	decode_colors[4] = r;
+	decode_colors[5] = g;
+	decode_colors[6] = b;
+	decode_colors[7] = 255;
+	if( c0 > c1 )
+	{
+		//	no alpha, 2 interpolated colors
+		decode_colors[8] = (2*decode_colors[0] + decode_colors[4]) / 3;
+		decode_colors[9] = (2*decode_colors[1] + decode_colors[5]) / 3;
+		decode_colors[10] = (2*decode_colors[2] + decode_colors[6]) / 3;
+		decode_colors[11] = 255;
+		decode_colors[12] = (decode_colors[0] + 2*decode_colors[4]) / 3;
+		decode_colors[13] = (decode_colors[1] + 2*decode_colors[5]) / 3;
+		decode_colors[14] = (decode_colors[2] + 2*decode_colors[6]) / 3;
+		decode_colors[15] = 255;
+	} else
+	{
+		//	1 interpolated color, alpha
+		decode_colors[8] = (decode_colors[0] + decode_colors[4]) / 2;
+		decode_colors[9] = (decode_colors[1] + decode_colors[5]) / 2;
+		decode_colors[10] = (decode_colors[2] + decode_colors[6]) / 2;
+		decode_colors[11] = 255;
+		decode_colors[12] = 0;
+		decode_colors[13] = 0;
+		decode_colors[14] = 0;
+		decode_colors[15] = 0;
+	}
+	//	decode the block
+	for( i = 0; i < 16*4; i += 4 )
+	{
+		int idx = ((compressed[next_bit>>3] >> (next_bit & 7)) & 3) * 4;
+		next_bit += 2;
+		uncompressed[i+0] = decode_colors[idx+0];
+		uncompressed[i+1] = decode_colors[idx+1];
+		uncompressed[i+2] = decode_colors[idx+2];
+		uncompressed[i+3] = decode_colors[idx+3];
+	}
+	//	done
+}
+void stbi_decode_DXT23_alpha_block(
+			unsigned char uncompressed[16*4],
+			unsigned char compressed[8] )
+{
+	int i, next_bit = 0;
+	//	each alpha value gets 4 bits
+	for( i = 3; i < 16*4; i += 4 )
+	{
+		uncompressed[i] = stbi_convert_bit_range(
+				(compressed[next_bit>>3] >> (next_bit&7)) & 15,
+				4, 8 );
+		next_bit += 4;
+	}
+}
+void stbi_decode_DXT45_alpha_block(
+			unsigned char uncompressed[16*4],
+			unsigned char compressed[8] )
+{
+	int i, next_bit = 8*2;
+	unsigned char decode_alpha[8];
+	//	each alpha value gets 3 bits, and the 1st 2 bytes are the range
+	decode_alpha[0] = compressed[0];
+	decode_alpha[1] = compressed[1];
+	if( decode_alpha[0] > decode_alpha[1] )
+	{
+		//	6 step intermediate
+		decode_alpha[2] = (6*decode_alpha[0] + 1*decode_alpha[1]) / 7;
+		decode_alpha[3] = (5*decode_alpha[0] + 2*decode_alpha[1]) / 7;
+		decode_alpha[4] = (4*decode_alpha[0] + 3*decode_alpha[1]) / 7;
+		decode_alpha[5] = (3*decode_alpha[0] + 4*decode_alpha[1]) / 7;
+		decode_alpha[6] = (2*decode_alpha[0] + 5*decode_alpha[1]) / 7;
+		decode_alpha[7] = (1*decode_alpha[0] + 6*decode_alpha[1]) / 7;
+	} else
+	{
+		//	4 step intermediate, plus full and none
+		decode_alpha[2] = (4*decode_alpha[0] + 1*decode_alpha[1]) / 5;
+		decode_alpha[3] = (3*decode_alpha[0] + 2*decode_alpha[1]) / 5;
+		decode_alpha[4] = (2*decode_alpha[0] + 3*decode_alpha[1]) / 5;
+		decode_alpha[5] = (1*decode_alpha[0] + 4*decode_alpha[1]) / 5;
+		decode_alpha[6] = 0;
+		decode_alpha[7] = 255;
+	}
+	for( i = 3; i < 16*4; i += 4 )
+	{
+		int idx = 0, bit;
+		bit = (compressed[next_bit>>3] >> (next_bit&7)) & 1;
+		idx += bit << 0;
+		++next_bit;
+		bit = (compressed[next_bit>>3] >> (next_bit&7)) & 1;
+		idx += bit << 1;
+		++next_bit;
+		bit = (compressed[next_bit>>3] >> (next_bit&7)) & 1;
+		idx += bit << 2;
+		++next_bit;
+		uncompressed[i] = decode_alpha[idx & 7];
+	}
+	//	done
+}
+void stbi_decode_DXT_color_block(
+			unsigned char uncompressed[16*4],
+			unsigned char compressed[8] )
+{
+	int next_bit = 4*8;
+	int i, r, g, b;
+	int c0, c1;
+	unsigned char decode_colors[4*3];
+	//	find the 2 primary colors
+	c0 = compressed[0] + (compressed[1] << 8);
+	c1 = compressed[2] + (compressed[3] << 8);
+	stbi_rgb_888_from_565( c0, &r, &g, &b );
+	decode_colors[0] = r;
+	decode_colors[1] = g;
+	decode_colors[2] = b;
+	stbi_rgb_888_from_565( c1, &r, &g, &b );
+	decode_colors[3] = r;
+	decode_colors[4] = g;
+	decode_colors[5] = b;
+	//	Like DXT1, but no choices:
+	//	no alpha, 2 interpolated colors
+	decode_colors[6] = (2*decode_colors[0] + decode_colors[3]) / 3;
+	decode_colors[7] = (2*decode_colors[1] + decode_colors[4]) / 3;
+	decode_colors[8] = (2*decode_colors[2] + decode_colors[5]) / 3;
+	decode_colors[9] = (decode_colors[0] + 2*decode_colors[3]) / 3;
+	decode_colors[10] = (decode_colors[1] + 2*decode_colors[4]) / 3;
+	decode_colors[11] = (decode_colors[2] + 2*decode_colors[5]) / 3;
+	//	decode the block
+	for( i = 0; i < 16*4; i += 4 )
+	{
+		int idx = ((compressed[next_bit>>3] >> (next_bit & 7)) & 3) * 3;
+		next_bit += 2;
+		uncompressed[i+0] = decode_colors[idx+0];
+		uncompressed[i+1] = decode_colors[idx+1];
+		uncompressed[i+2] = decode_colors[idx+2];
+	}
+	//	done
+}
+static stbi_uc *dds_load(stbi *s, int *x, int *y, int *comp, int req_comp)
+{
+	//	all variables go up front
+	stbi_uc *dds_data = NULL;
+	stbi_uc block[16*4];
+	stbi_uc compressed[8];
+	int flags, DXT_family;
+	int has_alpha, has_mipmap;
+	int is_compressed, cubemap_faces;
+	int block_pitch, num_blocks;
+	DDS_header header;
+	int i, sz, cf;
+	//	load the header
+	if( sizeof( DDS_header ) != 128 )
+	{
+		return NULL;
+	}
+	getn( s, (stbi_uc*)(&header), 128 );
+	//	and do some checking
+	if( header.dwMagic != (('D' << 0) | ('D' << 8) | ('S' << 16) | (' ' << 24)) ) return NULL;
+	if( header.dwSize != 124 ) return NULL;
+	flags = DDSD_CAPS | DDSD_HEIGHT | DDSD_WIDTH | DDSD_PIXELFORMAT;
+	if( (header.dwFlags & flags) != flags ) return NULL;
+	/*	According to the MSDN spec, the dwFlags should contain
+		DDSD_LINEARSIZE if it's compressed, or DDSD_PITCH if
+		uncompressed.  Some DDS writers do not conform to the
+		spec, so I need to make my reader more tolerant	*/
+	if( header.sPixelFormat.dwSize != 32 ) return NULL;
+	flags = DDPF_FOURCC | DDPF_RGB;
+	if( (header.sPixelFormat.dwFlags & flags) == 0 ) return NULL;
+	if( (header.sCaps.dwCaps1 & DDSCAPS_TEXTURE) == 0 ) return NULL;
+	//	get the image data
+	s->img_x = header.dwWidth;
+	s->img_y = header.dwHeight;
+	s->img_n = 4;
+	is_compressed = (header.sPixelFormat.dwFlags & DDPF_FOURCC) / DDPF_FOURCC;
+	has_alpha = (header.sPixelFormat.dwFlags & DDPF_ALPHAPIXELS) / DDPF_ALPHAPIXELS;
+	has_mipmap = (header.sCaps.dwCaps1 & DDSCAPS_MIPMAP) && (header.dwMipMapCount > 1);
+	cubemap_faces = (header.sCaps.dwCaps2 & DDSCAPS2_CUBEMAP) / DDSCAPS2_CUBEMAP;
+	/*	I need cubemaps to have square faces	*/
+	cubemap_faces &= (s->img_x == s->img_y);
+	cubemap_faces *= 5;
+	cubemap_faces += 1;
+	block_pitch = (s->img_x+3) >> 2;
+	num_blocks = block_pitch * ((s->img_y+3) >> 2);
+	/*	let the user know what's going on	*/
+	*x = s->img_x;
+	*y = s->img_y;
+	*comp = s->img_n;
+	/*	is this compressed?	*/
+	if( is_compressed )
+	{
+		/*	compressed	*/
+		//	note: header.sPixelFormat.dwFourCC is something like (('D'<<0)|('X'<<8)|('T'<<16)|('1'<<24))
+		DXT_family = 1 + (header.sPixelFormat.dwFourCC >> 24) - '1';
+		if( (DXT_family < 1) || (DXT_family > 5) ) return NULL;
+		/*	check the expected size...oops, nevermind...
+			those non-compliant writers leave
+			dwPitchOrLinearSize == 0	*/
+		//	passed all the tests, get the RAM for decoding
+		sz = (s->img_x)*(s->img_y)*4*cubemap_faces;
+		dds_data = (unsigned char*)malloc( sz );
+		/*	do this once for each face	*/
+		for( cf = 0; cf < cubemap_faces; ++ cf )
+		{
+			//	now read and decode all the blocks
+			for( i = 0; i < num_blocks; ++i )
+			{
+				//	where are we?
+				int bx, by, bw=4, bh=4;
+				int ref_x = 4 * (i % block_pitch);
+				int ref_y = 4 * (i / block_pitch);
+				//	get the next block's worth of compressed data, and decompress it
+				if( DXT_family == 1 )
+				{
+					//	DXT1
+					getn( s, compressed, 8 );
+					stbi_decode_DXT1_block( block, compressed );
+				} else if( DXT_family < 4 )
+				{
+					//	DXT2/3
+					getn( s, compressed, 8 );
+					stbi_decode_DXT23_alpha_block ( block, compressed );
+					getn( s, compressed, 8 );
+					stbi_decode_DXT_color_block ( block, compressed );
+				} else
+				{
+					//	DXT4/5
+					getn( s, compressed, 8 );
+					stbi_decode_DXT45_alpha_block ( block, compressed );
+					getn( s, compressed, 8 );
+					stbi_decode_DXT_color_block ( block, compressed );
+				}
+				//	is this a partial block?
+				if( ref_x + 4 > s->img_x )
+				{
+					bw = s->img_x - ref_x;
+				}
+				if( ref_y + 4 > s->img_y )
+				{
+					bh = s->img_y - ref_y;
+				}
+				//	now drop our decompressed data into the buffer
+				for( by = 0; by < bh; ++by )
+				{
+					int idx = 4*((ref_y+by+cf*s->img_x)*s->img_x + ref_x);
+					for( bx = 0; bx < bw*4; ++bx )
+					{
+
+						dds_data[idx+bx] = block[by*16+bx];
+					}
+				}
+			}
+			/*	done reading and decoding the main image...
+				skip MIPmaps if present	*/
+			if( has_mipmap )
+			{
+				int block_size = 16;
+				if( DXT_family == 1 )
+				{
+					block_size = 8;
+				}
+				for( i = 1; i < header.dwMipMapCount; ++i )
+				{
+					int mx = s->img_x >> (i + 2);
+					int my = s->img_y >> (i + 2);
+					if( mx < 1 )
+					{
+						mx = 1;
+					}
+					if( my < 1 )
+					{
+						my = 1;
+					}
+					skip( s, mx*my*block_size );
+				}
+			}
+		}/* per cubemap face */
+	} else
+	{
+		/*	uncompressed	*/
+		DXT_family = 0;
+		s->img_n = 3;
+		if( has_alpha )
+		{
+			s->img_n = 4;
+		}
+		*comp = s->img_n;
+		sz = s->img_x*s->img_y*s->img_n*cubemap_faces;
+		dds_data = (unsigned char*)malloc( sz );
+		/*	do this once for each face	*/
+		for( cf = 0; cf < cubemap_faces; ++ cf )
+		{
+			/*	read the main image for this face	*/
+			getn( s, &dds_data[cf*s->img_x*s->img_y*s->img_n], s->img_x*s->img_y*s->img_n );
+			/*	done reading and decoding the main image...
+				skip MIPmaps if present	*/
+			if( has_mipmap )
+			{
+				for( i = 1; i < header.dwMipMapCount; ++i )
+				{
+					int mx = s->img_x >> i;
+					int my = s->img_y >> i;
+					if( mx < 1 )
+					{
+						mx = 1;
+					}
+					if( my < 1 )
+					{
+						my = 1;
+					}
+					skip( s, mx*my*s->img_n );
+				}
+			}
+		}
+		/*	data was BGR, I need it RGB	*/
+		for( i = 0; i < sz; i += s->img_n )
+		{
+			unsigned char temp = dds_data[i];
+			dds_data[i] = dds_data[i+2];
+			dds_data[i+2] = temp;
+		}
+	}
+	/*	finished decompressing into RGBA,
+		adjust the y size if we have a cubemap
+		note: sz is already up to date	*/
+	s->img_y *= cubemap_faces;
+	*y = s->img_y;
+	//	see if any alpha value is below 255 (i.e. real transparency),
+	//	then honor the component count the user asked for, if any
+	has_alpha = 0;
+	if( s->img_n == 4)
+	{
+		for( i = 3; (i < sz) && (has_alpha == 0); i += 4 )
+		{
+			has_alpha |= (dds_data[i] < 255);
+		}
+	}
+	if( (req_comp <= 4) && (req_comp >= 1) )
+	{
+		//	user has some requirements, meet them
+		if( req_comp != s->img_n )
+		{
+			dds_data = convert_format( dds_data, s->img_n, req_comp, s->img_x, s->img_y );
+			*comp = s->img_n;
+		}
+	} else
+	{
+		//	user had no requirements, only drop to RGB if no alpha
+		if( (has_alpha == 0) && (s->img_n == 4) )
+		{
+			dds_data = convert_format( dds_data, 4, 3, s->img_x, s->img_y );
+			*comp = 3;
+		}
+	}
+	//	OK, done
+	return dds_data;
+}
+
+#ifndef STBI_NO_STDIO
+stbi_uc *stbi_dds_load_from_file   (FILE *f,                  int *x, int *y, int *comp, int req_comp)
+{
+	stbi s;
+   start_file(&s,f);
+   return dds_load(&s,x,y,comp,req_comp);
+}
+
+stbi_uc *stbi_dds_load             (char *filename,           int *x, int *y, int *comp, int req_comp)
+{
+   stbi_uc *data;
+   FILE *f = fopen(filename, "rb");
+   if (!f) return NULL;
+   data = stbi_dds_load_from_file(f,x,y,comp,req_comp);
+   fclose(f);
+   return data;
+}
+#endif
+
+stbi_uc *stbi_dds_load_from_memory (stbi_uc const *buffer, int len, int *x, int *y, int *comp, int req_comp)
+{
+	stbi s;
+   start_mem(&s,buffer, len);
+   return dds_load(&s,x,y,comp,req_comp);
+}
diff --git a/external/include/SOIL/test_SOIL.cpp b/external/include/SOIL/test_SOIL.cpp
new file mode 100644
index 0000000..44775c5
--- /dev/null
+++ b/external/include/SOIL/test_SOIL.cpp
@@ -0,0 +1,379 @@
+#include <string>
+#include <iostream>
+
+#include <windows.h>
+#include <shellapi.h>
+#include <gl/gl.h>
+#include <gl/glext.h>
+
+#include "SOIL.h"
+
+LRESULT CALLBACK WindowProc(HWND, UINT, WPARAM, LPARAM);
+void EnableOpenGL(HWND hwnd, HDC*, HGLRC*);
+void DisableOpenGL(HWND, HDC, HGLRC);
+
+int WINAPI WinMain(HINSTANCE hInstance,
+                   HINSTANCE hPrevInstance,
+                   LPSTR lpCmdLine,
+                   int nCmdShow)
+{
+    WNDCLASSEX wcex;
+    HWND hwnd;
+    HDC hDC;
+    HGLRC hRC;
+    MSG msg;
+    BOOL bQuit = FALSE;
+    float theta = 0.0f;
+
+    // register window class
+    wcex.cbSize = sizeof(WNDCLASSEX);
+    wcex.style = CS_OWNDC;
+    wcex.lpfnWndProc = WindowProc;
+    wcex.cbClsExtra = 0;
+    wcex.cbWndExtra = 0;
+    wcex.hInstance = hInstance;
+    wcex.hIcon = LoadIcon(NULL, IDI_APPLICATION);
+    wcex.hCursor = LoadCursor(NULL, IDC_ARROW);
+    wcex.hbrBackground = (HBRUSH)GetStockObject(BLACK_BRUSH);
+    wcex.lpszMenuName = NULL;
+    wcex.lpszClassName = "GLSample";
+    wcex.hIconSm = LoadIcon(NULL, IDI_APPLICATION);
+
+
+    if (!RegisterClassEx(&wcex))
+        return 0;
+
+    // create main window
+    hwnd = CreateWindowEx(0,
+                          "GLSample",
+                          "SOIL Sample",
+                          WS_OVERLAPPEDWINDOW,
+                          CW_USEDEFAULT,
+                          CW_USEDEFAULT,
+                          512,
+                          512,
+                          NULL,
+                          NULL,
+                          hInstance,
+                          NULL);
+
+    ShowWindow(hwnd, nCmdShow);
+
+    //	check my error handling
+    /*
+    SOIL_load_OGL_texture( "img_test.png", SOIL_LOAD_AUTO, SOIL_CREATE_NEW_ID, 0 );
+    std::cout << "'" << SOIL_last_result() << "'" << std::endl;
+    */
+
+
+    // enable OpenGL for the window
+    EnableOpenGL(hwnd, &hDC, &hRC);
+
+    glEnable( GL_BLEND );
+    //glDisable( GL_BLEND );
+    //	straight alpha
+    glBlendFunc( GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA );
+    //	premultiplied alpha (remember to do the same in glColor!!)
+    //glBlendFunc( GL_ONE, GL_ONE_MINUS_SRC_ALPHA );
+
+    //	do I want alpha thresholding?
+    glEnable( GL_ALPHA_TEST );
+    glAlphaFunc( GL_GREATER, 0.5f );
+
+    //	log what the user is asking us to load
+    std::string load_me = lpCmdLine;
+    if( load_me.length() > 2 )
+    {
+		//load_me = load_me.substr( 1, load_me.length() - 2 );
+		load_me = load_me.substr( 0, load_me.length() - 0 );
+    } else
+    {
+    	//load_me = "img_test_uncompressed.dds";
+    	//load_me = "img_test_indexed.tga";
+    	//load_me = "img_test.dds";
+    	load_me = "img_test.png";
+    	//load_me = "odd_size.jpg";
+    	//load_me = "img_cheryl.jpg";
+    	//load_me = "oak_odd.png";
+    	//load_me = "field_128_cube.dds";
+    	//load_me = "field_128_cube_nomip.dds";
+    	//load_me = "field_128_cube_uc.dds";
+    	//load_me = "field_128_cube_uc_nomip.dds";
+    	//load_me = "Goblin.dds";
+    	//load_me = "parquet.dds";
+    	//load_me = "stpeters_probe.hdr";
+    	//load_me = "VeraMoBI_sdf.png";
+
+    	//	for testing the texture rectangle code
+    	//load_me = "test_rect.png";
+    }
+	std::cout << "'" << load_me << "'" << std::endl;
+
+	//	1st try to load it as a single-image-cubemap
+	//	(note, need DDS ordered faces: "EWUDNS")
+	GLuint tex_ID;
+    int time_me;
+
+    std::cout << "Attempting to load as a cubemap" << std::endl;
+    time_me = clock();
+	tex_ID = SOIL_load_OGL_single_cubemap(
+			load_me.c_str(),
+			SOIL_DDS_CUBEMAP_FACE_ORDER,
+			SOIL_LOAD_AUTO,
+			SOIL_CREATE_NEW_ID,
+			SOIL_FLAG_POWER_OF_TWO
+			| SOIL_FLAG_MIPMAPS
+			//| SOIL_FLAG_COMPRESS_TO_DXT
+			//| SOIL_FLAG_TEXTURE_REPEATS
+			//| SOIL_FLAG_INVERT_Y
+			| SOIL_FLAG_DDS_LOAD_DIRECT
+			);
+	time_me = clock() - time_me;
+	std::cout << "the load time was " << 0.001f * time_me << " seconds (warning: low resolution timer)" << std::endl;
+    if( tex_ID > 0 )
+    {
+    	glEnable( GL_TEXTURE_CUBE_MAP );
+		glEnable( GL_TEXTURE_GEN_S );
+		glEnable( GL_TEXTURE_GEN_T );
+		glEnable( GL_TEXTURE_GEN_R );
+		glTexGeni( GL_S, GL_TEXTURE_GEN_MODE, GL_REFLECTION_MAP );
+		glTexGeni( GL_T, GL_TEXTURE_GEN_MODE, GL_REFLECTION_MAP );
+		glTexGeni( GL_R, GL_TEXTURE_GEN_MODE, GL_REFLECTION_MAP );
+		glBindTexture( GL_TEXTURE_CUBE_MAP, tex_ID );
+		//	report
+		std::cout << "the loaded single cube map ID was " << tex_ID << std::endl;
+		//std::cout << "the load time was " << 0.001f * time_me << " seconds (warning: low resolution timer)" << std::endl;
+    } else
+    {
+    	std::cout << "Attempting to load as a HDR texture" << std::endl;
+		time_me = clock();
+		tex_ID = SOIL_load_OGL_HDR_texture(
+				load_me.c_str(),
+				//SOIL_HDR_RGBE,
+				//SOIL_HDR_RGBdivA,
+				SOIL_HDR_RGBdivA2,
+				0,
+				SOIL_CREATE_NEW_ID,
+				SOIL_FLAG_POWER_OF_TWO
+				| SOIL_FLAG_MIPMAPS
+				//| SOIL_FLAG_COMPRESS_TO_DXT
+				);
+		time_me = clock() - time_me;
+		std::cout << "the load time was " << 0.001f * time_me << " seconds (warning: low resolution timer)" << std::endl;
+
+		//	did I fail?
+		if( tex_ID < 1 )
+		{
+			//	loading as a cubemap and as an HDR texture both failed, try it as a simple 2D texture
+			std::cout << "Attempting to load as a simple 2D texture" << std::endl;
+			//	load the texture, if specified
+			time_me = clock();
+			tex_ID = SOIL_load_OGL_texture(
+					load_me.c_str(),
+					SOIL_LOAD_AUTO,
+					SOIL_CREATE_NEW_ID,
+					SOIL_FLAG_POWER_OF_TWO
+					| SOIL_FLAG_MIPMAPS
+					//| SOIL_FLAG_MULTIPLY_ALPHA
+					//| SOIL_FLAG_COMPRESS_TO_DXT
+					| SOIL_FLAG_DDS_LOAD_DIRECT
+					//| SOIL_FLAG_NTSC_SAFE_RGB
+					//| SOIL_FLAG_CoCg_Y
+					//| SOIL_FLAG_TEXTURE_RECTANGLE
+					);
+			time_me = clock() - time_me;
+			std::cout << "the load time was " << 0.001f * time_me << " seconds (warning: low resolution timer)" << std::endl;
+		}
+
+		if( tex_ID > 0 )
+		{
+			//	enable texturing
+			glEnable( GL_TEXTURE_2D );
+			//glEnable( 0x84F5 );// enables texture rectangle
+			//  bind an OpenGL texture ID
+			glBindTexture( GL_TEXTURE_2D, tex_ID );
+			//	report
+			std::cout << "the loaded texture ID was " << tex_ID << std::endl;
+			//std::cout << "the load time was " << 0.001f * time_me << " seconds (warning: low resolution timer)" << std::endl;
+		} else
+		{
+			//	loading of the texture failed...why?
+			glDisable( GL_TEXTURE_2D );
+			std::cout << "Texture loading failed: '" << SOIL_last_result() << "'" << std::endl;
+		}
+    }
+
+    // program main loop
+    const float ref_mag = 0.1f;
+    while (!bQuit)
+    {
+        // check for messages
+        if (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
+        {
+            // handle or dispatch messages
+            if (msg.message == WM_QUIT)
+            {
+                bQuit = TRUE;
+            }
+            else
+            {
+                TranslateMessage(&msg);
+                DispatchMessage(&msg);
+            }
+        }
+        else
+        {
+            // OpenGL animation code goes here
+            theta = clock() * 0.1;
+
+            float tex_u_max = 1.0f;//0.2f;
+            float tex_v_max = 1.0f;//0.2f;
+
+            glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
+            glClear(GL_COLOR_BUFFER_BIT);
+
+            glPushMatrix();
+            glScalef( 0.8f, 0.8f, 0.8f );
+            //glRotatef(-0.314159f*theta, 0.0f, 0.0f, 1.0f);
+			glColor4f( 1.0f, 1.0f, 1.0f, 1.0f );
+			glNormal3f( 0.0f, 0.0f, 1.0f );
+            glBegin(GL_QUADS);
+				glNormal3f( -ref_mag, -ref_mag, 1.0f );
+                glTexCoord2f( 0.0f, tex_v_max );
+                glVertex3f( -1.0f, -1.0f, -0.1f );
+
+                glNormal3f( ref_mag, -ref_mag, 1.0f );
+                glTexCoord2f( tex_u_max, tex_v_max );
+                glVertex3f( 1.0f, -1.0f, -0.1f );
+
+                glNormal3f( ref_mag, ref_mag, 1.0f );
+                glTexCoord2f( tex_u_max, 0.0f );
+                glVertex3f( 1.0f, 1.0f, -0.1f );
+
+                glNormal3f( -ref_mag, ref_mag, 1.0f );
+                glTexCoord2f( 0.0f, 0.0f );
+                glVertex3f( -1.0f, 1.0f, -0.1f );
+            glEnd();
+            glPopMatrix();
+
+			tex_u_max = 1.0f;
+            tex_v_max = 1.0f;
+            glPushMatrix();
+            glScalef( 0.8f, 0.8f, 0.8f );
+            glRotatef(theta, 0.0f, 0.0f, 1.0f);
+			glColor4f( 1.0f, 1.0f, 1.0f, 1.0f );
+			glNormal3f( 0.0f, 0.0f, 1.0f );
+            glBegin(GL_QUADS);
+                glTexCoord2f( 0.0f, tex_v_max );		glVertex3f( 0.0f, 0.0f, 0.1f );
+                glTexCoord2f( tex_u_max, tex_v_max );		glVertex3f( 1.0f, 0.0f, 0.1f );
+                glTexCoord2f( tex_u_max, 0.0f );		glVertex3f( 1.0f, 1.0f, 0.1f );
+                glTexCoord2f( 0.0f, 0.0f );		glVertex3f( 0.0f, 1.0f, 0.1f );
+            glEnd();
+            glPopMatrix();
+
+            {
+				/*	check for errors	*/
+				GLenum err_code = glGetError();
+				while( GL_NO_ERROR != err_code )
+				{
+					printf( "OpenGL Error @ %s: %i\n", "drawing loop", err_code );
+					err_code = glGetError();
+				}
+			}
+
+            SwapBuffers(hDC);
+
+            Sleep (1);
+        }
+    }
+
+    //	and show off the screenshot capability
+    /*
+    load_me += "-screenshot.tga";
+    SOIL_save_screenshot( load_me.c_str(), SOIL_SAVE_TYPE_TGA, 0, 0, 512, 512 );
+    //*/
+    //*
+    load_me += "-screenshot.bmp";
+    SOIL_save_screenshot( load_me.c_str(), SOIL_SAVE_TYPE_BMP, 0, 0, 512, 512 );
+    //*/
+    /*
+    load_me += "-screenshot.dds";
+    SOIL_save_screenshot( load_me.c_str(), SOIL_SAVE_TYPE_DDS, 0, 0, 512, 512 );
+    //*/
+
+    // shutdown OpenGL
+    DisableOpenGL(hwnd, hDC, hRC);
+
+    // destroy the window explicitly
+    DestroyWindow(hwnd);
+
+    return msg.wParam;
+}
+
+LRESULT CALLBACK WindowProc(HWND hwnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
+{
+    switch (uMsg)
+    {
+        case WM_CLOSE:
+            PostQuitMessage(0);
+        break;
+
+        case WM_DESTROY:
+            return 0;
+
+        case WM_KEYDOWN:
+        {
+            switch (wParam)
+            {
+                case VK_ESCAPE:
+                    PostQuitMessage(0);
+                break;
+            }
+        }
+        break;
+
+        default:
+            return DefWindowProc(hwnd, uMsg, wParam, lParam);
+    }
+
+    return 0;
+}
+
+void EnableOpenGL(HWND hwnd, HDC* hDC, HGLRC* hRC)
+{
+    PIXELFORMATDESCRIPTOR pfd;
+
+    int iFormat;
+
+    /* get the device context (DC) */
+    *hDC = GetDC(hwnd);
+
+    /* set the pixel format for the DC */
+    ZeroMemory(&pfd, sizeof(pfd));
+
+    pfd.nSize = sizeof(pfd);
+    pfd.nVersion = 1;
+    pfd.dwFlags = PFD_DRAW_TO_WINDOW |
+                  PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
+    pfd.iPixelType = PFD_TYPE_RGBA;
+    pfd.cColorBits = 24;
+    pfd.cDepthBits = 16;
+    pfd.iLayerType = PFD_MAIN_PLANE;
+
+    iFormat = ChoosePixelFormat(*hDC, &pfd);
+
+    SetPixelFormat(*hDC, iFormat, &pfd);
+
+    /* create and enable the render context (RC) */
+    *hRC = wglCreateContext(*hDC);
+
+    wglMakeCurrent(*hDC, *hRC);
+}
+
+void DisableOpenGL (HWND hwnd, HDC hDC, HGLRC hRC)
+{
+    wglMakeCurrent(NULL, NULL);
+    wglDeleteContext(hRC);
+    ReleaseDC(hwnd, hDC);
+}
+
diff --git a/external/include/objloader/tiny_obj_loader.cc b/external/include/objloader/tiny_obj_loader.cc
new file mode 100644
index 0000000..75f0dca
--- /dev/null
+++ b/external/include/objloader/tiny_obj_loader.cc
@@ -0,0 +1,725 @@
+//
+// Copyright 2012-2013, Syoyo Fujita.
+// 
+// Licensed under 2-clause BSD license.
+//
+
+//
+// version 0.9.7: Support multi-materials(per-face material ID) per object/group.
+// version 0.9.6: Support Ni(index of refraction) mtl parameter.
+//                Parse transmittance material parameter correctly.
+// version 0.9.5: Parse multiple group name.
+//                Add support of specifying the base path to load material file.
+// version 0.9.4: Initial support of group tag(g)
+// version 0.9.3: Fix parsing triple 'x/y/z'
+// version 0.9.2: Add more .mtl load support
+// version 0.9.1: Add initial .mtl load support
+// version 0.9.0: Initial
+//
+
+
+#include <cstdlib>
+#include <cstring>
+#include <cassert>
+
+#include <string>
+#include <vector>
+#include <map>
+#include <fstream>
+#include <sstream>
+
+#include "tiny_obj_loader.h"
+
+namespace tinyobj {
+
+struct vertex_index {
+  int v_idx, vt_idx, vn_idx;
+  vertex_index() {};
+  vertex_index(int idx) : v_idx(idx), vt_idx(idx), vn_idx(idx) {};
+  vertex_index(int vidx, int vtidx, int vnidx) : v_idx(vidx), vt_idx(vtidx), vn_idx(vnidx) {};
+
+};
+// for std::map
+static inline bool operator<(const vertex_index& a, const vertex_index& b)
+{
+  if (a.v_idx != b.v_idx) return (a.v_idx < b.v_idx);
+  if (a.vn_idx != b.vn_idx) return (a.vn_idx < b.vn_idx);
+  if (a.vt_idx != b.vt_idx) return (a.vt_idx < b.vt_idx);
+
+  return false;
+}
+
+struct obj_shape {
+  std::vector<float> v;
+  std::vector<float> vn;
+  std::vector<float> vt;
+};
+
+static inline bool isSpace(const char c) {
+  return (c == ' ') || (c == '\t');
+}
+
+static inline bool isNewLine(const char c) {
+  return (c == '\r') || (c == '\n') || (c == '\0');
+}
+
+// Make index zero-base, and also support relative index. 
+static inline int fixIndex(int idx, int n)
+{
+  int i;
+
+  if (idx > 0) {
+    i = idx - 1;
+  } else if (idx == 0) {
+    i = 0;
+  } else { // negative value = relative
+    i = n + idx;
+  }
+  return i;
+}
+
+static inline std::string parseString(const char*& token)
+{
+  std::string s;
+  token += strspn(token, " \t");        // skip leading whitespace first
+  size_t e = strcspn(token, " \t\r");   // token length, measured after the skip
+  s = std::string(token, &token[e]);
+
+  token += e;
+  return s;
+}
+
+static inline int parseInt(const char*& token)
+{
+  token += strspn(token, " \t");
+  int i = atoi(token);
+  token += strcspn(token, " \t\r");
+  return i;
+}
+
+static inline float parseFloat(const char*& token)
+{
+  token += strspn(token, " \t");
+  float f = (float)atof(token);
+  token += strcspn(token, " \t\r");
+  return f;
+}
+
+static inline void parseFloat2(
+  float& x, float& y,
+  const char*& token)
+{
+  x = parseFloat(token);
+  y = parseFloat(token);
+}
+
+static inline void parseFloat3(
+  float& x, float& y, float& z,
+  const char*& token)
+{
+  x = parseFloat(token);
+  y = parseFloat(token);
+  z = parseFloat(token);
+}
+
+
+// Parse triples: i, i/j/k, i//k, i/j
+static vertex_index parseTriple(
+  const char* &token,
+  int vsize,
+  int vnsize,
+  int vtsize)
+{
+    vertex_index vi(-1);
+
+    vi.v_idx = fixIndex(atoi(token), vsize);
+    token += strcspn(token, "/ \t\r");
+    if (token[0] != '/') {
+      return vi;
+    }
+    token++;
+
+    // i//k
+    if (token[0] == '/') {
+      token++;
+      vi.vn_idx = fixIndex(atoi(token), vnsize);
+      token += strcspn(token, "/ \t\r");
+      return vi;
+    }
+    
+    // i/j/k or i/j
+    vi.vt_idx = fixIndex(atoi(token), vtsize);
+    token += strcspn(token, "/ \t\r");
+    if (token[0] != '/') {
+      return vi;
+    }
+
+    // i/j/k
+    token++;  // skip '/'
+    vi.vn_idx = fixIndex(atoi(token), vnsize);
+    token += strcspn(token, "/ \t\r");
+    return vi; 
+}
+
+static unsigned int
+updateVertex(
+  std::map<vertex_index, unsigned int>& vertexCache,
+  std::vector<float>& positions,
+  std::vector<float>& normals,
+  std::vector<float>& texcoords,
+  const std::vector<float>& in_positions,
+  const std::vector<float>& in_normals,
+  const std::vector<float>& in_texcoords,
+  const vertex_index& i)
+{
+  const std::map<vertex_index, unsigned int>::iterator it = vertexCache.find(i);
+
+  if (it != vertexCache.end()) {
+    // found cache
+    return it->second;
+  }
+
+  assert(in_positions.size() > (unsigned int) (3*i.v_idx+2));
+
+  positions.push_back(in_positions[3*i.v_idx+0]);
+  positions.push_back(in_positions[3*i.v_idx+1]);
+  positions.push_back(in_positions[3*i.v_idx+2]);
+
+  if (i.vn_idx >= 0) {
+    normals.push_back(in_normals[3*i.vn_idx+0]);
+    normals.push_back(in_normals[3*i.vn_idx+1]);
+    normals.push_back(in_normals[3*i.vn_idx+2]);
+  }
+
+  if (i.vt_idx >= 0) {
+    texcoords.push_back(in_texcoords[2*i.vt_idx+0]);
+    texcoords.push_back(in_texcoords[2*i.vt_idx+1]);
+  }
+
+  unsigned int idx = positions.size() / 3 - 1;
+  vertexCache[i] = idx;
+
+  return idx;
+}
+
+void InitMaterial(material_t& material) {
+  material.name = "";
+  material.ambient_texname = "";
+  material.diffuse_texname = "";
+  material.specular_texname = "";
+  material.normal_texname = "";
+  for (int i = 0; i < 3; i ++) {
+    material.ambient[i] = 0.f;
+    material.diffuse[i] = 0.f;
+    material.specular[i] = 0.f;
+    material.transmittance[i] = 0.f;
+    material.emission[i] = 0.f;
+  }
+  material.illum = 0;
+  material.dissolve = 1.f;
+  material.shininess = 1.f;
+  material.ior = 1.f;
+  material.unknown_parameter.clear();
+}
+
+static bool
+exportFaceGroupToShape(
+  shape_t& shape,
+  std::map<vertex_index, unsigned int> vertexCache,
+  const std::vector<float> &in_positions,
+  const std::vector<float> &in_normals,
+  const std::vector<float> &in_texcoords,
+  const std::vector<std::vector<vertex_index> >& faceGroup,
+  const int material_id,
+  const std::string &name,
+  bool clearCache)
+{
+  if (faceGroup.empty()) {
+    return false;
+  }
+
+  size_t offset;
+
+  offset = shape.mesh.indices.size();
+
+  // Flatten vertices and indices
+  for (size_t i = 0; i < faceGroup.size(); i++) {
+    const std::vector<vertex_index>& face = faceGroup[i];
+
+    vertex_index i0 = face[0];
+    vertex_index i1(-1);
+    vertex_index i2 = face[1];
+
+    size_t npolys = face.size();
+
+    // Polygon -> triangle fan conversion
+    for (size_t k = 2; k < npolys; k++) {
+      i1 = i2;
+      i2 = face[k];
+
+      unsigned int v0 = updateVertex(vertexCache, shape.mesh.positions, shape.mesh.normals, shape.mesh.texcoords, in_positions, in_normals, in_texcoords, i0);
+      unsigned int v1 = updateVertex(vertexCache, shape.mesh.positions, shape.mesh.normals, shape.mesh.texcoords, in_positions, in_normals, in_texcoords, i1);
+      unsigned int v2 = updateVertex(vertexCache, shape.mesh.positions, shape.mesh.normals, shape.mesh.texcoords, in_positions, in_normals, in_texcoords, i2);
+
+      shape.mesh.indices.push_back(v0);
+      shape.mesh.indices.push_back(v1);
+      shape.mesh.indices.push_back(v2);
+
+      shape.mesh.material_ids.push_back(material_id);
+    }
+
+  }
+
+  shape.name = name;
+
+  if (clearCache)
+      vertexCache.clear();
+
+  return true;
+
+}
+
+std::string LoadMtl (
+  std::map<std::string, int>& material_map,
+  std::vector<material_t>& materials,
+  std::istream& inStream)
+{
+  material_map.clear();
+  std::stringstream err;
+
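+  material_t material;
+  InitMaterial(material);  // defined defaults: the last material is flushed even if no 'newmtl' was read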
+  int maxchars = 8192;  // Alloc enough size.
+  std::vector<char> buf(maxchars);  // Alloc enough size.
+  while (inStream.peek() != -1) {
+    inStream.getline(&buf[0], maxchars);
+
+    std::string linebuf(&buf[0]);
+
+    // Trim newline '\r\n' or '\n'
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\n') linebuf.erase(linebuf.size()-1);
+    }
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\r') linebuf.erase(linebuf.size()-1);
+    }
+
+    // Skip if empty line.
+    if (linebuf.empty()) {
+      continue;
+    }
+
+    // Skip leading space.
+    const char* token = linebuf.c_str();
+    token += strspn(token, " \t");
+
+    assert(token);
+    if (token[0] == '\0') continue; // empty line
+    
+    if (token[0] == '#') continue;  // comment line
+    
+    // new mtl
+    if ((0 == strncmp(token, "newmtl", 6)) && isSpace((token[6]))) {
+      // flush previous material.
+      if (!material.name.empty())
+      {
+          material_map.insert(std::pair<std::string, int>(material.name, materials.size()));
+          materials.push_back(material);
+      }
+
+      // initial temporary material
+      InitMaterial(material);
+
+      // set new mtl name
+      char namebuf[4096] = "";  // stays defined if sscanf matches nothing
+      token += 7;
+      sscanf(token, "%4095s", namebuf);  // bound the read to the buffer size
+      material.name = namebuf;
+      continue;
+    }
+    
+    // ambient
+    if (token[0] == 'K' && token[1] == 'a' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.ambient[0] = r;
+      material.ambient[1] = g;
+      material.ambient[2] = b;
+      continue;
+    }
+    
+    // diffuse
+    if (token[0] == 'K' && token[1] == 'd' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.diffuse[0] = r;
+      material.diffuse[1] = g;
+      material.diffuse[2] = b;
+      continue;
+    }
+    
+    // specular
+    if (token[0] == 'K' && token[1] == 's' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.specular[0] = r;
+      material.specular[1] = g;
+      material.specular[2] = b;
+      continue;
+    }
+    
+    // transmittance
+    if (token[0] == 'K' && token[1] == 't' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.transmittance[0] = r;
+      material.transmittance[1] = g;
+      material.transmittance[2] = b;
+      continue;
+    }
+
+    // ior(index of refraction)
+    if (token[0] == 'N' && token[1] == 'i' && isSpace((token[2]))) {
+      token += 2;
+      material.ior = parseFloat(token);
+      continue;
+    }
+
+    // emission
+    if(token[0] == 'K' && token[1] == 'e' && isSpace(token[2])) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.emission[0] = r;
+      material.emission[1] = g;
+      material.emission[2] = b;
+      continue;
+    }
+
+    // shininess
+    if(token[0] == 'N' && token[1] == 's' && isSpace(token[2])) {
+      token += 2;
+      material.shininess = parseFloat(token);
+      continue;
+    }
+
+    // illum model
+    if (0 == strncmp(token, "illum", 5) && isSpace(token[5])) {
+      token += 6;
+      material.illum = parseInt(token);
+      continue;
+    }
+
+    // dissolve
+    if ((token[0] == 'd' && isSpace(token[1]))) {
+      token += 1;
+      material.dissolve = parseFloat(token);
+      continue;
+    }
+    if (token[0] == 'T' && token[1] == 'r' && isSpace(token[2])) {
+      token += 2;
+      material.dissolve = parseFloat(token);
+      continue;
+    }
+
+    // ambient texture
+    if ((0 == strncmp(token, "map_Ka", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.ambient_texname = token;
+      continue;
+    }
+
+    // diffuse texture
+    if ((0 == strncmp(token, "map_Kd", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.diffuse_texname = token;
+      continue;
+    }
+
+    // specular texture
+    if ((0 == strncmp(token, "map_Ks", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.specular_texname = token;
+      continue;
+    }
+
+    // normal texture
+    if ((0 == strncmp(token, "map_Ns", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.normal_texname = token;
+      continue;
+    }
+
+    // unknown parameter
+    const char* _space = strchr(token, ' ');
+    if(!_space) {
+      _space = strchr(token, '\t');
+    }
+    if(_space) {
+      int len = _space - token;
+      std::string key(token, len);
+      std::string value = _space + 1;
+      material.unknown_parameter.insert(std::pair<std::string, std::string>(key, value));
+    }
+  }
+  // flush last material.
+  material_map.insert(std::pair<std::string, int>(material.name, materials.size()));
+  materials.push_back(material);
+
+  return err.str();
+}
+
+std::string MaterialFileReader::operator() (
+    const std::string& matId,
+    std::vector<material_t>& materials,
+    std::map<std::string, int>& matMap)
+{
+  std::string filepath;
+
+  if (!m_mtlBasePath.empty()) {
+    filepath = std::string(m_mtlBasePath) + matId;
+  } else {
+    filepath = matId;
+  }
+
+  std::ifstream matIStream(filepath.c_str());
+  return LoadMtl(matMap, materials, matIStream);
+}
+
+std::string
+LoadObj(
+  std::vector<shape_t>& shapes,
+  std::vector<material_t>& materials,   // [output]
+  const char* filename,
+  const char* mtl_basepath)
+{
+
+  shapes.clear();
+
+  std::stringstream err;
+
+  std::ifstream ifs(filename);
+  if (!ifs) {
+    err << "Cannot open file [" << filename << "]" << std::endl;
+    return err.str();
+  }
+
+  std::string basePath;
+  if (mtl_basepath) {
+    basePath = mtl_basepath;
+  }
+  MaterialFileReader matFileReader( basePath );
+  
+  return LoadObj(shapes, materials, ifs, matFileReader);
+}
+
+std::string LoadObj(
+  std::vector<shape_t>& shapes,
+  std::vector<material_t>& materials,   // [output]
+  std::istream& inStream,
+  MaterialReader& readMatFn)
+{
+  std::stringstream err;
+
+  std::vector<float> v;
+  std::vector<float> vn;
+  std::vector<float> vt;
+  std::vector<std::vector<vertex_index> > faceGroup;
+  std::string name;
+
+  // material
+  std::map<std::string, int> material_map;
+  std::map<vertex_index, unsigned int> vertexCache;
+  int  material = -1;
+
+  shape_t shape;
+
+  int maxchars = 8192;  // Alloc enough size.
+  std::vector<char> buf(maxchars);  // Alloc enough size.
+  while (inStream.peek() != -1) {
+    inStream.getline(&buf[0], maxchars);
+
+    std::string linebuf(&buf[0]);
+
+    // Trim newline '\r\n' or '\n'
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\n') linebuf.erase(linebuf.size()-1);
+    }
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\r') linebuf.erase(linebuf.size()-1);
+    }
+
+    // Skip if empty line.
+    if (linebuf.empty()) {
+      continue;
+    }
+
+    // Skip leading space.
+    const char* token = linebuf.c_str();
+    token += strspn(token, " \t");
+
+    assert(token);
+    if (token[0] == '\0') continue; // empty line
+    
+    if (token[0] == '#') continue;  // comment line
+
+    // vertex
+    if (token[0] == 'v' && isSpace((token[1]))) {
+      token += 2;
+      float x, y, z;
+      parseFloat3(x, y, z, token);
+      v.push_back(x);
+      v.push_back(y);
+      v.push_back(z);
+      continue;
+    }
+
+    // normal
+    if (token[0] == 'v' && token[1] == 'n' && isSpace((token[2]))) {
+      token += 3;
+      float x, y, z;
+      parseFloat3(x, y, z, token);
+      vn.push_back(x);
+      vn.push_back(y);
+      vn.push_back(z);
+      continue;
+    }
+
+    // texcoord
+    if (token[0] == 'v' && token[1] == 't' && isSpace((token[2]))) {
+      token += 3;
+      float x, y;
+      parseFloat2(x, y, token);
+      vt.push_back(x);
+      vt.push_back(y);
+      continue;
+    }
+
+    // face
+    if (token[0] == 'f' && isSpace((token[1]))) {
+      token += 2;
+      token += strspn(token, " \t");
+
+      std::vector<vertex_index> face;
+      while (!isNewLine(token[0])) {
+        vertex_index vi = parseTriple(token, v.size() / 3, vn.size() / 3, vt.size() / 2);
+        face.push_back(vi);
+        int n = strspn(token, " \t\r");
+        token += n;
+      }
+
+      faceGroup.push_back(face);
+      
+      continue;
+    }
+
+    // use mtl
+    if ((0 == strncmp(token, "usemtl", 6)) && isSpace((token[6]))) {
+
+      char namebuf[4096] = "";  // stays defined if sscanf matches nothing
+      token += 7;
+      sscanf(token, "%4095s", namebuf);  // bound the read to the buffer size
+
+      bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, false);
+      faceGroup.clear();
+
+      if (material_map.find(namebuf) != material_map.end()) {
+        material = material_map[namebuf];
+      } else {
+        // { error!! material not found }
+        material = -1;
+      }
+
+      continue;
+
+    }
+
+    // load mtl
+    if ((0 == strncmp(token, "mtllib", 6)) && isSpace((token[6]))) {
+      char namebuf[4096] = "";  // stays defined if sscanf matches nothing
+      token += 7;
+      sscanf(token, "%4095s", namebuf);  // bound the read to the buffer size
+        
+      std::string err_mtl = readMatFn(namebuf, materials, material_map);
+      if (!err_mtl.empty()) {
+        faceGroup.clear();  // for safety
+        return err_mtl;
+      }
+      
+      continue;
+    }
+
+    // group name
+    if (token[0] == 'g' && isSpace((token[1]))) {
+
+      // flush previous face group.
+      bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, true);
+      if (ret) {
+        shapes.push_back(shape);
+      }
+
+      shape = shape_t();
+
+      //material = -1;
+      faceGroup.clear();
+
+      std::vector<std::string> names;
+      while (!isNewLine(token[0])) {
+        std::string str = parseString(token);
+        names.push_back(str);
+        token += strspn(token, " \t\r"); // skip tag
+      }
+
+      assert(names.size() > 0);
+
+      // names[0] must be 'g', so skip the 0th element.
+      if (names.size() > 1) {
+        name = names[1];
+      } else {
+        name = "";
+      }
+
+      continue;
+    }
+
+    // object name
+    if (token[0] == 'o' && isSpace((token[1]))) {
+
+      // flush previous face group.
+      bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, true);
+      if (ret) {
+        shapes.push_back(shape);
+      }
+
+      //material = -1;
+      faceGroup.clear();
+      shape = shape_t();
+
+      // @todo { multiple object name? }
+      char namebuf[4096] = "";  // stays defined if sscanf matches nothing
+      token += 2;
+      sscanf(token, "%4095s", namebuf);  // bound the read to the buffer size
+      name = std::string(namebuf);
+
+
+      continue;
+    }
+
+    // Ignore unknown command.
+  }
+
+  bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, true);
+  if (ret) {
+    shapes.push_back(shape);
+  }
+  faceGroup.clear();  // for safety
+
+  return err.str();
+}
+
+
+}
diff --git a/external/include/objloader/tiny_obj_loader.h b/external/include/objloader/tiny_obj_loader.h
new file mode 100644
index 0000000..a58d7be
--- /dev/null
+++ b/external/include/objloader/tiny_obj_loader.h
@@ -0,0 +1,107 @@
+//
+// Copyright 2012-2013, Syoyo Fujita.
+//
+// Licensed under 2-clause BSD license.
+//
+#ifndef _TINY_OBJ_LOADER_H
+#define _TINY_OBJ_LOADER_H
+
+#include <string>
+#include <vector>
+#include <map>
+
+namespace tinyobj {
+
+typedef struct
+{
+    std::string name;
+
+    float ambient[3];
+    float diffuse[3];
+    float specular[3];
+    float transmittance[3];
+    float emission[3];
+    float shininess;
+    float ior;                // index of refraction
+    float dissolve;           // 1 == opaque; 0 == fully transparent
+    // illumination model (see http://www.fileformat.info/format/material/)
+    int illum;
+
+    std::string ambient_texname;
+    std::string diffuse_texname;
+    std::string specular_texname;
+    std::string normal_texname;
+    std::map<std::string, std::string> unknown_parameter;
+} material_t;
+
+typedef struct
+{
+    std::vector<float>          positions;
+    std::vector<float>          normals;
+    std::vector<float>          texcoords;
+    std::vector<unsigned int>   indices;
+    std::vector<int>            material_ids; // per-face (per-triangle) material ID
+} mesh_t;
+
+typedef struct
+{
+    std::string  name;
+    mesh_t       mesh;
+} shape_t;
+
+class MaterialReader
+{
+public:
+    MaterialReader(){}
+    virtual ~MaterialReader(){}
+
+    virtual std::string operator() (
+        const std::string& matId,
+        std::vector<material_t>& materials,
+        std::map<std::string, int>& matMap) = 0;
+};
+
+class MaterialFileReader:
+  public MaterialReader
+{
+    public:
+        MaterialFileReader(const std::string& mtl_basepath): m_mtlBasePath(mtl_basepath) {}
+        virtual ~MaterialFileReader() {}
+        virtual std::string operator() (
+          const std::string& matId,
+          std::vector<material_t>& materials,
+          std::map<std::string, int>& matMap);
+
+    private:
+        std::string m_mtlBasePath;
+};
+
+/// Loads .obj from a file.
+/// 'shapes' will be filled with parsed shape data
+/// The function returns error string.
+/// Returns an empty string when the .obj loads successfully.
+/// 'mtl_basepath' is optional, and used for base path for .mtl file.
+std::string LoadObj(
+    std::vector<shape_t>& shapes,   // [output]
+    std::vector<material_t>& materials,   // [output]
+    const char* filename,
+    const char* mtl_basepath = NULL);
+
+/// Loads .obj from a std::istream; uses 'readMatFn' to load the
+/// materials referenced by 'mtllib' statements.
+/// Returns an empty string when the .obj loads successfully.
+std::string LoadObj(
+    std::vector<shape_t>& shapes,   // [output]
+    std::vector<material_t>& materials,   // [output]
+    std::istream& inStream,
+    MaterialReader& readMatFn);
+
+/// Loads materials into std::map
+/// Returns an empty string if successful
+std::string LoadMtl (
+  std::map<std::string, int>& material_map,
+  std::vector<material_t>& materials,
+  std::istream& inStream);
+}
+
+#endif  // _TINY_OBJ_LOADER_H
diff --git a/external/lib/win/SOIL/SOIL.lib b/external/lib/win/SOIL/SOIL.lib
new file mode 100644
index 0000000..76710df
Binary files /dev/null and b/external/lib/win/SOIL/SOIL.lib differ
diff --git a/external/syoyo-tinyobjloader-b35f498/.gitignore b/external/syoyo-tinyobjloader-b35f498/.gitignore
new file mode 100644
index 0000000..493e888
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/.gitignore
@@ -0,0 +1,2 @@
+#Common folder for building objects
+build/
diff --git a/external/syoyo-tinyobjloader-b35f498/CMakeLists.txt b/external/syoyo-tinyobjloader-b35f498/CMakeLists.txt
new file mode 100644
index 0000000..94ec330
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/CMakeLists.txt
@@ -0,0 +1,50 @@
+#Tiny Object Loader CMake configuration file.
+#This configures the CMake system with multiple properties, depending
+#on the platform and configuration it is set to build in.
+project(tinyobjloader)
+cmake_minimum_required(VERSION 2.8.6)
+
+#Folder Shortcuts
+set(TINYOBJLOADEREXAMPLES_DIR ${CMAKE_CURRENT_SOURCE_DIR}/examples)
+
+set(tinyobjloader-Source
+	${CMAKE_CURRENT_SOURCE_DIR}/tiny_obj_loader.h
+	${CMAKE_CURRENT_SOURCE_DIR}/tiny_obj_loader.cc
+	)
+
+set(tinyobjloader-Test-Source
+	${CMAKE_CURRENT_SOURCE_DIR}/test.cc
+	)
+
+set(tinyobjloader-examples-objsticher
+	${TINYOBJLOADEREXAMPLES_DIR}/obj_sticher/obj_writer.h
+	${TINYOBJLOADEREXAMPLES_DIR}/obj_sticher/obj_writer.cc
+	${TINYOBJLOADEREXAMPLES_DIR}/obj_sticher/obj_sticher.cc
+	)
+
+add_library(tinyobjloader
+			${tinyobjloader-Source}
+	)
+
+add_executable(test ${tinyobjloader-Test-Source})
+target_link_libraries(test tinyobjloader)
+
+add_executable(obj_sticher ${tinyobjloader-examples-objsticher})
+target_link_libraries(obj_sticher tinyobjloader)
+
+#Installation
+install ( TARGETS
+  obj_sticher
+  DESTINATION
+  bin
+  )
+install ( TARGETS
+  tinyobjloader
+  DESTINATION
+  lib
+  )
+install ( FILES
+  tiny_obj_loader.h
+  DESTINATION
+  include
+  )
diff --git a/external/syoyo-tinyobjloader-b35f498/README.md b/external/syoyo-tinyobjloader-b35f498/README.md
new file mode 100644
index 0000000..033bbe6
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/README.md
@@ -0,0 +1,126 @@
+tinyobjloader
+=============
+
+[![wercker status](https://app.wercker.com/status/495a3bac400212cdacdeb4dd9397bf4f/m "wercker status")](https://app.wercker.com/project/bykey/495a3bac400212cdacdeb4dd9397bf4f)
+
+http://syoyo.github.io/tinyobjloader/
+
+Tiny but powerful single-file Wavefront .obj loader written in C++. No dependencies except the C++ STL. It can parse over 10M polygons with moderate memory and time.
+
+Good for embedding .obj loader to your (global illumination) renderer ;-)
+
+What's new
+----------
+
+* Sep 14, 2014 : Add support for multi-material per object/group. Thanks Mykhailo!
+* Mar 17, 2014 : Fixed trim newline bugs. Thanks ardneran!
+* Apr 29, 2014 : Add API to read .obj from std::istream. Good for reading compressed .obj or connecting to procedural primitive generator. Thanks burnse!
+* Apr 21, 2014 : Define default material if no material definition exists in .obj. Thanks YarmUI!
+* Apr 10, 2014 : Add support for parsing 'illum' and 'd'/'Tr' statements. Thanks mmp!
+* Jan 27, 2014 : Added CMake project. Thanks bradc6!
+* Nov 26, 2013 : Performance optimization by NeuralSandwich. 9% improvement in his project, thanks!
+* Sep 12, 2013 : Added multiple .obj sticher example.
+
+Example
+-------
+
+![Rungholt](https://github.com/syoyo/tinyobjloader/blob/master/images/rungholt.jpg?raw=true)
+
+tinyobjloader can successfully load the 6M-triangle Rungholt scene.
+http://graphics.cs.williams.edu/data/meshes.xml
+
+Use case
+--------
+
+TinyObjLoader is successfully used in ...
+
+* bullet3 https://github.com/erwincoumans/bullet3
+* pbrt-v2 https://github.com/mmp/pbrt-v2
+* OpenGL game engine development http://swarminglogic.com/jotting/2013_10_gamedev01
+* mallie https://lighttransport.github.io/mallie
+* Your project here!
+
+Features
+--------
+
+* Group(parse multiple group name)
+* Vertex
+* Texcoord
+* Normal
+* Material
+  * Unknown material attributes are treated as key-value.
+
+Notes
+-----
+
+Polygons are converted into triangles (as a triangle fan).
+
+TODO
+----
+
+* Support quad polygon and some tags for OpenSubdiv http://graphics.pixar.com/opensubdiv/
+
+License
+-------
+
+Licensed under 2 clause BSD.
+
+Usage
+-----
+
+    std::string inputfile = "cornell_box.obj";
+    std::vector<tinyobj::shape_t> shapes;
+    std::vector<tinyobj::material_t> materials;
+  
+    std::string err = tinyobj::LoadObj(shapes, materials, inputfile.c_str());
+  
+    if (!err.empty()) {
+      std::cerr << err << std::endl;
+      exit(1);
+    }
+
+    std::cout << "# of shapes    : " << shapes.size() << std::endl;
+    std::cout << "# of materials : " << materials.size() << std::endl;
+  
+    for (size_t i = 0; i < shapes.size(); i++) {
+      printf("shape[%ld].name = %s\n", i, shapes[i].name.c_str());
+      printf("Size of shape[%ld].indices: %ld\n", i, shapes[i].mesh.indices.size());
+      printf("Size of shape[%ld].material_ids: %ld\n", i, shapes[i].mesh.material_ids.size());
+      assert((shapes[i].mesh.indices.size() % 3) == 0);
+      for (size_t f = 0; f < shapes[i].mesh.indices.size() / 3; f++) {
+        printf("  idx[%ld] = %d, %d, %d. mat_id = %d\n", f, shapes[i].mesh.indices[3*f+0], shapes[i].mesh.indices[3*f+1], shapes[i].mesh.indices[3*f+2], shapes[i].mesh.material_ids[f]);
+      }
+
+      printf("shape[%ld].vertices: %ld\n", i, shapes[i].mesh.positions.size());
+      assert((shapes[i].mesh.positions.size() % 3) == 0);
+      for (size_t v = 0; v < shapes[i].mesh.positions.size() / 3; v++) {
+        printf("  v[%ld] = (%f, %f, %f)\n", v,
+          shapes[i].mesh.positions[3*v+0],
+          shapes[i].mesh.positions[3*v+1],
+          shapes[i].mesh.positions[3*v+2]);
+      }
+    }
+
+    for (size_t i = 0; i < materials.size(); i++) {
+      printf("material[%ld].name = %s\n", i, materials[i].name.c_str());
+      printf("  material.Ka = (%f, %f ,%f)\n", materials[i].ambient[0], materials[i].ambient[1], materials[i].ambient[2]);
+      printf("  material.Kd = (%f, %f ,%f)\n", materials[i].diffuse[0], materials[i].diffuse[1], materials[i].diffuse[2]);
+      printf("  material.Ks = (%f, %f ,%f)\n", materials[i].specular[0], materials[i].specular[1], materials[i].specular[2]);
+      printf("  material.Tr = (%f, %f ,%f)\n", materials[i].transmittance[0], materials[i].transmittance[1], materials[i].transmittance[2]);
+      printf("  material.Ke = (%f, %f ,%f)\n", materials[i].emission[0], materials[i].emission[1], materials[i].emission[2]);
+      printf("  material.Ns = %f\n", materials[i].shininess);
+      printf("  material.Ni = %f\n", materials[i].ior);
+      printf("  material.dissolve = %f\n", materials[i].dissolve);
+      printf("  material.illum = %d\n", materials[i].illum);
+      printf("  material.map_Ka = %s\n", materials[i].ambient_texname.c_str());
+      printf("  material.map_Kd = %s\n", materials[i].diffuse_texname.c_str());
+      printf("  material.map_Ks = %s\n", materials[i].specular_texname.c_str());
+      printf("  material.map_Ns = %s\n", materials[i].normal_texname.c_str());
+      std::map<std::string, std::string>::const_iterator it(materials[i].unknown_parameter.begin());
+      std::map<std::string, std::string>::const_iterator itEnd(materials[i].unknown_parameter.end());
+      for (; it != itEnd; it++) {
+        printf("  material.%s = %s\n", it->first.c_str(), it->second.c_str());
+      }
+      printf("\n");
+    }
+  
diff --git a/external/syoyo-tinyobjloader-b35f498/cornell_box.mtl b/external/syoyo-tinyobjloader-b35f498/cornell_box.mtl
new file mode 100644
index 0000000..d3a1c7a
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/cornell_box.mtl
@@ -0,0 +1,24 @@
+newmtl white
+Ka 0 0 0
+Kd 1 1 1
+Ks 0 0 0
+
+newmtl red
+Ka 0 0 0
+Kd 1 0 0
+Ks 0 0 0
+
+newmtl green
+Ka 0 0 0
+Kd 0 1 0
+Ks 0 0 0
+
+newmtl blue
+Ka 0 0 0
+Kd 0 0 1
+Ks 0 0 0
+
+newmtl light
+Ka 20 20 20
+Kd 1 1 1
+Ks 0 0 0
diff --git a/external/syoyo-tinyobjloader-b35f498/cube.mtl b/external/syoyo-tinyobjloader-b35f498/cube.mtl
new file mode 100644
index 0000000..d3a1c7a
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/cube.mtl
@@ -0,0 +1,24 @@
+newmtl white
+Ka 0 0 0
+Kd 1 1 1
+Ks 0 0 0
+
+newmtl red
+Ka 0 0 0
+Kd 1 0 0
+Ks 0 0 0
+
+newmtl green
+Ka 0 0 0
+Kd 0 1 0
+Ks 0 0 0
+
+newmtl blue
+Ka 0 0 0
+Kd 0 0 1
+Ks 0 0 0
+
+newmtl light
+Ka 20 20 20
+Kd 1 1 1
+Ks 0 0 0
diff --git a/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_sticher.cc b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_sticher.cc
new file mode 100644
index 0000000..1833216
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_sticher.cc
@@ -0,0 +1,105 @@
+//
+// Stiches multiple .obj files into one .obj. 
+//
+#include "../../tiny_obj_loader.h"
+#include "obj_writer.h"
+
+#include <cassert>
+#include <iostream>
+#include <cstdlib>
+#include <cstdio>
+
+typedef std::vector<tinyobj::shape_t> Shape;
+typedef std::vector<tinyobj::material_t> Material;
+
+void
+StichObjs(
+  std::vector<tinyobj::shape_t>& out_shape,
+  std::vector<tinyobj::material_t>& out_material,
+  const std::vector<Shape>& shapes,
+  const std::vector<Material>& materials)
+{
+  int numShapes = 0;
+  for (size_t i = 0; i < shapes.size(); i++) {
+    numShapes += (int)shapes[i].size();
+  }
+
+  printf("Total # of shapes = %d\n", numShapes);
+  int materialIdOffset = 0;
+
+  size_t face_offset = 0;
+  for (size_t i = 0; i < shapes.size(); i++) {
+
+    for (size_t k = 0; k < shapes[i].size(); k++) {
+
+      std::string new_name = shapes[i][k].name;
+      // Add suffix
+      char buf[1024];
+      sprintf(buf, "_%04d", (int)i);
+      new_name += std::string(buf);
+
+      printf("shape[%ld][%ld].name = %s\n", i, k, shapes[i][k].name.c_str());
+      assert((shapes[i][k].mesh.indices.size() % 3) == 0);
+      assert((shapes[i][k].mesh.positions.size() % 3) == 0);
+
+      tinyobj::shape_t new_shape = shapes[i][k];
+      // Add offset.
+      for (size_t f = 0; f < new_shape.mesh.material_ids.size(); f++) {
+        new_shape.mesh.material_ids[f] += materialIdOffset;
+      }
+
+      new_shape.name = new_name;
+      printf("shape[%ld][%ld].new_name = %s\n", i, k, new_shape.name.c_str());
+
+      out_shape.push_back(new_shape);
+    }
+
+    materialIdOffset += materials[i].size();
+  }
+
+  for (size_t i = 0; i < materials.size(); i++) {
+    for (size_t k = 0; k < materials[i].size(); k++) {
+      out_material.push_back(materials[i][k]);
+    }
+  }
+
+}
+
+int
+main(
+  int argc,
+  char **argv)
+{
+  if (argc < 3) {
+    printf("Usage: obj_sticher input0.obj input1.obj ... output.obj\n");
+    exit(1);
+  }
+
+  int num_objfiles = argc - 2;
+  std::string out_filename = std::string(argv[argc-1]); // last element
+
+  std::vector<Shape> shapes;
+  std::vector<Material> materials(num_objfiles);  // one slot per input file; materials[i] is written below
+  shapes.resize(num_objfiles);
+
+  for (int i = 0; i < num_objfiles; i++) {
+    std::cout << "Loading " << argv[i+1] << " ... " << std::flush;
+    
+    std::string err = tinyobj::LoadObj(shapes[i], materials[i], argv[i+1]);
+    if (!err.empty()) {
+      std::cerr << err << std::endl;
+      exit(1);
+    }
+
+    std::cout << "DONE." << std::endl;
+  }
+
+  std::vector<tinyobj::shape_t> out_shape;
+  std::vector<tinyobj::material_t> out_material;
+  StichObjs(out_shape, out_material, shapes, materials);
+
+  bool ret = WriteObj(out_filename, out_shape, out_material);
+  assert(ret);
+
+  return 0;
+}
diff --git a/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_writer.cc b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_writer.cc
new file mode 100644
index 0000000..bb12457
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_writer.cc
@@ -0,0 +1,158 @@
+//
+// Simple wavefront .obj writer
+//
+#include "obj_writer.h"
+#include <cstdio>
+
+static std::string GetFileBasename(const std::string& FileName)
+{
+    if(FileName.find_last_of(".") != std::string::npos)
+        return FileName.substr(0, FileName.find_last_of("."));
+    return "";
+}
+
+bool WriteMat(const std::string& filename, const std::vector<tinyobj::material_t>& materials) {
+  FILE* fp = fopen(filename.c_str(), "w");
+  if (!fp) {
+    fprintf(stderr, "Failed to open file [ %s ] for write.\n", filename.c_str());
+    return false;
+  }
+
+  for (size_t i = 0; i < materials.size(); i++) {
+
+    tinyobj::material_t mat = materials[i];
+
+    fprintf(fp, "newmtl %s\n", mat.name.c_str());
+    fprintf(fp, "Ka %f %f %f\n", mat.ambient[0], mat.ambient[1], mat.ambient[2]);
+    fprintf(fp, "Kd %f %f %f\n", mat.diffuse[0], mat.diffuse[1], mat.diffuse[2]);
+    fprintf(fp, "Ks %f %f %f\n", mat.specular[0], mat.specular[1], mat.specular[2]);
+    fprintf(fp, "Kt %f %f %f\n", mat.transmittance[0], mat.transmittance[1], mat.transmittance[2]);
+    fprintf(fp, "Ke %f %f %f\n", mat.emission[0], mat.emission[1], mat.emission[2]);
+    fprintf(fp, "Ns %f\n", mat.shininess);
+    fprintf(fp, "Ni %f\n", mat.ior);
+    // @todo { texture }
+  }
+  
+  fclose(fp);
+
+  return true;
+}
+
+bool WriteObj(const std::string& filename, const std::vector<tinyobj::shape_t>& shapes, const std::vector<tinyobj::material_t>& materials) {
+  FILE* fp = fopen(filename.c_str(), "w");
+  if (!fp) {
+    fprintf(stderr, "Failed to open file [ %s ] for write.\n", filename.c_str());
+    return false;
+  }
+
+  std::string basename = GetFileBasename(filename);
+  std::string material_filename = basename + ".mtl";
+
+  int v_offset = 0;
+  int vn_offset = 0;
+  int vt_offset = 0;
+  int prev_material_id = -1;
+
+  fprintf(fp, "mtllib %s\n", material_filename.c_str());
+
+  for (size_t i = 0; i < shapes.size(); i++) {
+
+    bool has_vn = false;
+    bool has_vt = false;
+
+    if (shapes[i].name.empty()) {
+      fprintf(fp, "g Unknown\n");
+    } else {
+      fprintf(fp, "g %s\n", shapes[i].name.c_str());
+    }
+
+    //if (!shapes[i].material.name.empty()) {
+    //  fprintf(fp, "usemtl %s\n", shapes[i].material.name.c_str());
+    //}
+
+    // facevarying vtx
+    for (size_t k = 0; k < shapes[i].mesh.indices.size() / 3; k++) {
+      for (int j = 0; j < 3; j++) {
+        int idx = shapes[i].mesh.indices[3*k+j];
+        fprintf(fp, "v %f %f %f\n",
+          shapes[i].mesh.positions[3*idx+0],
+          shapes[i].mesh.positions[3*idx+1],
+          shapes[i].mesh.positions[3*idx+2]);
+      }
+    }
+
+    // facevarying normal
+    if (shapes[i].mesh.normals.size() > 0) {
+      for (size_t k = 0; k < shapes[i].mesh.indices.size() / 3; k++) {
+        for (int j = 0; j < 3; j++) {
+          int idx = shapes[i].mesh.indices[3*k+j];
+          fprintf(fp, "vn %f %f %f\n",
+            shapes[i].mesh.normals[3*idx+0],
+            shapes[i].mesh.normals[3*idx+1],
+            shapes[i].mesh.normals[3*idx+2]);
+        }
+      }
+    }
+    if (shapes[i].mesh.normals.size() > 0) has_vn = true;
+
+    // facevarying texcoord
+    if (shapes[i].mesh.texcoords.size() > 0) {
+      for (size_t k = 0; k < shapes[i].mesh.indices.size() / 3; k++) {
+        for (int j = 0; j < 3; j++) {
+          int idx = shapes[i].mesh.indices[3*k+j];
+          fprintf(fp, "vt %f %f\n",
+            shapes[i].mesh.texcoords[2*idx+0],
+            shapes[i].mesh.texcoords[2*idx+1]);
+        }
+      }
+    }
+    if (shapes[i].mesh.texcoords.size() > 0) has_vt = true;
+
+    // face
+    for (size_t k = 0; k < shapes[i].mesh.indices.size() / 3; k++) {
+  
+      // Face index is 1-base.
+      //int v0 = shapes[i].mesh.indices[3*k+0] + 1 + v_offset;
+      //int v1 = shapes[i].mesh.indices[3*k+1] + 1 + v_offset;
+      //int v2 = shapes[i].mesh.indices[3*k+2] + 1 + v_offset;
+      int v0 = (3*k + 0) + 1 + v_offset;
+      int v1 = (3*k + 1) + 1 + v_offset;
+      int v2 = (3*k + 2) + 1 + v_offset;
+
+      int material_id = shapes[i].mesh.material_ids[k];
+      if (material_id != prev_material_id && material_id >= 0 && material_id < (int)materials.size()) {
+        std::string material_name = materials[material_id].name;
+        fprintf(fp, "usemtl %s\n", material_name.c_str());
+        prev_material_id = material_id;
+      }
+
+      if (has_vn && has_vt) {
+        fprintf(fp, "f %d/%d/%d %d/%d/%d %d/%d/%d\n",
+          v0, v0, v0, v1, v1, v1, v2, v2, v2);
+      } else if (has_vn && !has_vt) {
+        fprintf(fp, "f %d//%d %d//%d %d//%d\n", v0, v0, v1, v1, v2, v2);
+      } else if (!has_vn && has_vt) {
+        fprintf(fp, "f %d/%d %d/%d %d/%d\n", v0, v0, v1, v1, v2, v2);
+      } else {
+        fprintf(fp, "f %d %d %d\n", v0, v1, v2);
+      }
+      
+    }
+
+    v_offset  += shapes[i].mesh.indices.size();
+    //vn_offset += shapes[i].mesh.normals.size() / 3;
+    //vt_offset += shapes[i].mesh.texcoords.size() / 2;
+
+  }
+
+  fclose(fp);
+
+  //
+  // Write material file
+  //
+  bool ret = WriteMat(material_filename, materials);
+
+  return ret;
+}
+
+
diff --git a/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_writer.h b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_writer.h
new file mode 100644
index 0000000..00cd792
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/obj_writer.h
@@ -0,0 +1,9 @@
+#ifndef __OBJ_WRITER_H__
+#define __OBJ_WRITER_H__
+
+#include "../../tiny_obj_loader.h"
+
+extern bool WriteObj(const std::string& filename, const std::vector<tinyobj::shape_t>& shapes, const std::vector<tinyobj::material_t>& materials);
+
+
+#endif // __OBJ_WRITER_H__
diff --git a/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/premake4.lua b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/premake4.lua
new file mode 100644
index 0000000..9c2deb6
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/examples/obj_sticher/premake4.lua
@@ -0,0 +1,38 @@
+lib_sources = {
+   "../../tiny_obj_loader.cc"
+}
+
+sources = {
+   "obj_sticher.cc",
+   "obj_writer.cc",
+   }
+
+-- premake4.lua
+solution "ObjStickerSolution"
+   configurations { "Release", "Debug" }
+
+   if (os.is("windows")) then
+      platforms { "x32", "x64" }
+   else
+      platforms { "native", "x32", "x64" }
+   end
+
+   includedirs {
+      "../../"
+   }
+
+   -- A project defines one build target
+   project "obj_sticher"
+      kind "ConsoleApp"
+      language "C++"
+      files { lib_sources, sources }
+
+      configuration "Debug"
+         defines { "DEBUG" } -- -DDEBUG
+         flags { "Symbols" }
+         targetname "obj_sticher_debug"
+
+      configuration "Release"
+         -- defines { "NDEBUG" } -- -NDEBUG
+         flags { "Symbols", "Optimize" }
+         targetname "obj_sticher"
diff --git a/external/syoyo-tinyobjloader-b35f498/images/rungholt.jpg b/external/syoyo-tinyobjloader-b35f498/images/rungholt.jpg
new file mode 100644
index 0000000..17718eb
Binary files /dev/null and b/external/syoyo-tinyobjloader-b35f498/images/rungholt.jpg differ
diff --git a/external/syoyo-tinyobjloader-b35f498/premake4.lua b/external/syoyo-tinyobjloader-b35f498/premake4.lua
new file mode 100644
index 0000000..ad020a6
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/premake4.lua
@@ -0,0 +1,33 @@
+lib_sources = {
+   "tiny_obj_loader.cc"
+}
+
+sources = {
+   "test.cc",
+   }
+
+-- premake4.lua
+solution "TinyObjLoaderSolution"
+   configurations { "Release", "Debug" }
+
+   if (os.is("windows")) then
+      platforms { "x32", "x64" }
+   else
+      platforms { "native", "x32", "x64" }
+   end
+
+   -- A project defines one build target
+   project "tinyobjloader"
+      kind "ConsoleApp"
+      language "C++"
+      files { lib_sources, sources }
+
+      configuration "Debug"
+         defines { "DEBUG" } -- -DDEBUG
+         flags { "Symbols" }
+         targetname "test_tinyobjloader_debug"
+
+      configuration "Release"
+         -- defines { "NDEBUG" } -- -NDEBUG
+         flags { "Symbols", "Optimize" }
+         targetname "test_tinyobjloader"
diff --git a/external/syoyo-tinyobjloader-b35f498/test.cc b/external/syoyo-tinyobjloader-b35f498/test.cc
new file mode 100644
index 0000000..1ad6d8c
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/test.cc
@@ -0,0 +1,198 @@
+#include "tiny_obj_loader.h"
+
+#include <cstdio>
+#include <cstdlib>
+#include <cassert>
+#include <iostream>
+#include <sstream>
+#include <fstream>
+
+static void PrintInfo(const std::vector<tinyobj::shape_t>& shapes, const std::vector<tinyobj::material_t>& materials)
+{
+  std::cout << "# of shapes    : " << shapes.size() << std::endl;
+  std::cout << "# of materials : " << materials.size() << std::endl;
+
+  for (size_t i = 0; i < shapes.size(); i++) {
+    printf("shape[%ld].name = %s\n", i, shapes[i].name.c_str());
+    printf("Size of shape[%ld].indices: %ld\n", i, shapes[i].mesh.indices.size());
+    printf("Size of shape[%ld].material_ids: %ld\n", i, shapes[i].mesh.material_ids.size());
+    assert((shapes[i].mesh.indices.size() % 3) == 0);
+    for (size_t f = 0; f < shapes[i].mesh.indices.size() / 3; f++) {
+      printf("  idx[%ld] = %d, %d, %d. mat_id = %d\n", f, shapes[i].mesh.indices[3*f+0], shapes[i].mesh.indices[3*f+1], shapes[i].mesh.indices[3*f+2], shapes[i].mesh.material_ids[f]);
+    }
+
+    printf("shape[%ld].vertices: %ld\n", i, shapes[i].mesh.positions.size());
+    assert((shapes[i].mesh.positions.size() % 3) == 0);
+    for (size_t v = 0; v < shapes[i].mesh.positions.size() / 3; v++) {
+      printf("  v[%ld] = (%f, %f, %f)\n", v,
+        shapes[i].mesh.positions[3*v+0],
+        shapes[i].mesh.positions[3*v+1],
+        shapes[i].mesh.positions[3*v+2]);
+    }
+  }
+
+  for (size_t i = 0; i < materials.size(); i++) {
+    printf("material[%ld].name = %s\n", i, materials[i].name.c_str());
+    printf("  material.Ka = (%f, %f ,%f)\n", materials[i].ambient[0], materials[i].ambient[1], materials[i].ambient[2]);
+    printf("  material.Kd = (%f, %f ,%f)\n", materials[i].diffuse[0], materials[i].diffuse[1], materials[i].diffuse[2]);
+    printf("  material.Ks = (%f, %f ,%f)\n", materials[i].specular[0], materials[i].specular[1], materials[i].specular[2]);
+    printf("  material.Tr = (%f, %f ,%f)\n", materials[i].transmittance[0], materials[i].transmittance[1], materials[i].transmittance[2]);
+    printf("  material.Ke = (%f, %f ,%f)\n", materials[i].emission[0], materials[i].emission[1], materials[i].emission[2]);
+    printf("  material.Ns = %f\n", materials[i].shininess);
+    printf("  material.Ni = %f\n", materials[i].ior);
+    printf("  material.dissolve = %f\n", materials[i].dissolve);
+    printf("  material.illum = %d\n", materials[i].illum);
+    printf("  material.map_Ka = %s\n", materials[i].ambient_texname.c_str());
+    printf("  material.map_Kd = %s\n", materials[i].diffuse_texname.c_str());
+    printf("  material.map_Ks = %s\n", materials[i].specular_texname.c_str());
+    printf("  material.map_Ns = %s\n", materials[i].normal_texname.c_str());
+    std::map<std::string, std::string>::const_iterator it(materials[i].unknown_parameter.begin());
+    std::map<std::string, std::string>::const_iterator itEnd(materials[i].unknown_parameter.end());
+    for (; it != itEnd; it++) {
+      printf("  material.%s = %s\n", it->first.c_str(), it->second.c_str());
+    }
+    printf("\n");
+  }
+}
+
+static bool
+TestLoadObj(
+  const char* filename,
+  const char* basepath = NULL)
+{
+  std::cout << "Loading " << filename << std::endl;
+
+  std::vector<tinyobj::shape_t> shapes;
+  std::vector<tinyobj::material_t> materials;
+  std::string err = tinyobj::LoadObj(shapes, materials, filename, basepath);
+
+  if (!err.empty()) {
+    std::cerr << err << std::endl;
+    return false;
+  }
+
+  PrintInfo(shapes, materials);
+
+  return true;
+}
+
+
+static bool
+TestStreamLoadObj()
+{
+  std::cout << "Stream Loading " << std::endl;
+
+  std::stringstream objStream;
+  objStream 
+    << "mtllib cube.mtl\n"
+    "\n"
+    "v 0.000000 2.000000 2.000000\n"
+    "v 0.000000 0.000000 2.000000\n"
+    "v 2.000000 0.000000 2.000000\n"
+    "v 2.000000 2.000000 2.000000\n"
+    "v 0.000000 2.000000 0.000000\n"
+    "v 0.000000 0.000000 0.000000\n"
+    "v 2.000000 0.000000 0.000000\n"
+    "v 2.000000 2.000000 0.000000\n"
+    "# 8 vertices\n"
+    "\n"
+    "g front cube\n"
+    "usemtl white\n"
+    "f 1 2 3 4\n"
+    "g back cube\n"
+    "# expects white material\n"
+    "f 8 7 6 5\n"
+    "g right cube\n"
+    "usemtl red\n"
+    "f 4 3 7 8\n"
+    "g top cube\n"
+    "usemtl white\n"
+    "f 5 1 4 8\n"
+    "g left cube\n"
+    "usemtl green\n"
+    "f 5 6 2 1\n"
+    "g bottom cube\n"
+    "usemtl white\n"
+    "f 2 6 7 3\n"
+    "# 6 elements";
+
+std::string matStream( 
+    "newmtl white\n"
+    "Ka 0 0 0\n"
+    "Kd 1 1 1\n"
+    "Ks 0 0 0\n"
+    "\n"
+    "newmtl red\n"
+    "Ka 0 0 0\n"
+    "Kd 1 0 0\n"
+    "Ks 0 0 0\n"
+    "\n"
+    "newmtl green\n"
+    "Ka 0 0 0\n"
+    "Kd 0 1 0\n"
+    "Ks 0 0 0\n"
+    "\n"
+    "newmtl blue\n"
+    "Ka 0 0 0\n"
+    "Kd 0 0 1\n"
+    "Ks 0 0 0\n"
+    "\n"
+    "newmtl light\n"
+    "Ka 20 20 20\n"
+    "Kd 1 1 1\n"
+    "Ks 0 0 0");
+
+    using namespace tinyobj;
+    class MaterialStringStreamReader:
+        public MaterialReader
+    {
+        public:
+            MaterialStringStreamReader(const std::string& matSStream): m_matSStream(matSStream) {}
+            virtual ~MaterialStringStreamReader() {}
+            virtual std::string operator() (
+              const std::string& matId,
+              std::vector<material_t>& materials,
+              std::map<std::string, int>& matMap)
+            {
+                return LoadMtl(matMap, materials, m_matSStream);
+            }
+
+        private:
+            std::stringstream m_matSStream;
+    };  
+
+  MaterialStringStreamReader matSSReader(matStream);
+  std::vector<tinyobj::shape_t> shapes;
+  std::vector<tinyobj::material_t> materials;
+  std::string err = tinyobj::LoadObj(shapes, materials, objStream, matSSReader);    
+  
+  if (!err.empty()) {
+    std::cerr << err << std::endl;
+    return false;
+  }
+
+  PrintInfo(shapes, materials);
+    
+  return true;
+}
+
+int
+main(
+  int argc,
+  char **argv)
+{
+
+  if (argc > 1) {
+    const char* basepath = NULL;
+    if (argc > 2) {
+      basepath = argv[2];
+    }
+    assert(true == TestLoadObj(argv[1], basepath));
+  } else {
+    //assert(true == TestLoadObj("cornell_box.obj"));
+    //assert(true == TestLoadObj("cube.obj"));
+    assert(true == TestStreamLoadObj());
+  }
+  
+  return 0;
+}
diff --git a/external/syoyo-tinyobjloader-b35f498/wercker.yml b/external/syoyo-tinyobjloader-b35f498/wercker.yml
new file mode 100644
index 0000000..3c1583c
--- /dev/null
+++ b/external/syoyo-tinyobjloader-b35f498/wercker.yml
@@ -0,0 +1,12 @@
+box: rioki/gcc-cpp@0.0.1
+build:
+    steps:
+        # Execute a custom script step.
+        - script:
+            name: build
+            code: |
+                  git clone https://github.com/syoyo/orebuildenv.git
+                  chmod +x ./orebuildenv/build/linux/bin/premake4
+                  ./orebuildenv/build/linux/bin/premake4 gmake
+                  make
+                  ./test_tinyobjloader
diff --git a/obj_loader.jpg b/obj_loader.jpg
new file mode 100644
index 0000000..44254f0
Binary files /dev/null and b/obj_loader.jpg differ
diff --git a/smoothing_filter.bmp b/smoothing_filter.bmp
new file mode 100644
index 0000000..1fa44d1
Binary files /dev/null and b/smoothing_filter.bmp differ
diff --git a/src/backup/raytraceKernel.cu b/src/backup/raytraceKernel.cu
new file mode 100644
index 0000000..57bf1b4
--- /dev/null
+++ b/src/backup/raytraceKernel.cu
@@ -0,0 +1,258 @@
+// CIS565 CUDA Raytracer: A parallel raytracer for Patrick Cozzi's CIS565: GPU Computing at the University of Pennsylvania
+// Written by Yining Karl Li, Copyright (c) 2012 University of Pennsylvania
+// This file includes code from:
+//       Rob Farber for CUDA-GL interop, from CUDA Supercomputing For The Masses: http://www.drdobbs.com/architecture-and-design/cuda-supercomputing-for-the-masses-part/222600097
+//       Peter Kutz and Yining Karl Li's GPU Pathtracer: http://gpupathtracer.blogspot.com/
+//       Yining Karl Li's TAKUA Render, a massively parallel pathtracing renderer: http://www.yiningkarlli.com
+
+#include <stdio.h>
+#include <cuda.h>
+#include <cmath>
+
+#include "sceneStructs.h"
+#include "glm/glm.hpp"
+#include "utilities.h"
+#include "raytraceKernel.h"
+#include "intersections.h"
+#include "interactions.h"
+#include "../src/cuPrintf.cu"  
+
+#define len(x) sqrtf(x[0]*x[0] + x[1]*x[1] + x[2]*x[2])
+
+void checkCUDAError(const char *msg) {
+  cudaError_t err = cudaGetLastError();
+  if( cudaSuccess != err) {
+    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) ); 
+    exit(EXIT_FAILURE); 
+  }
+} 
+
+// LOOK: This function demonstrates how to use thrust for random number generation on the GPU!
+// Function that generates static.
+__host__ __device__ glm::vec3 generateRandomNumberFromThread(glm::vec2 resolution, float time, int x, int y){
+  int index = x + (y * resolution.x);
+   
+  thrust::default_random_engine rng(hash(index*time));
+  thrust::uniform_real_distribution<float> u01(0,1);
+
+  return glm::vec3((float) u01(rng), (float) u01(rng), (float) u01(rng));
+}
+
+// TODO: IMPLEMENT THIS FUNCTION
+// Function that does the initial raycast from the camera
+glm::vec3 norm(glm::vec3 in)
+{
+	 glm::vec3 ret = in / (float)len(in);
+	 return ret;
+}
+__host__ __device__ ray raycastFromCameraKernel(glm::vec2 resolution, int x, int y, glm::vec3 eye, glm::vec3 view, glm::vec3 up, glm::vec2 fov){
+
+  glm::vec3 A = glm::cross(view, up);
+  glm::vec3 B = glm::cross(A,view);
+  glm::vec3 M = view + eye;
+  glm::vec3 V = B * glm::length(view) * tanf(float(fov.y * PI / 180.0)) / glm::length(B); // glm::length() is the magnitude; vec3::length() is the component count
+  glm::vec3 H = A * glm::length(view) * tanf(float(fov.x * PI / 180.0)) / glm::length(A);
+  glm::vec3 P = M + (float)((2.0*x)/(resolution.x-1.0)-1.0) * H +  (float)(2.0*(resolution.y - y - 1.0)/(resolution.y-1.0)-1.0) * V;
+   
+  ray r;
+  r.origin = P;
+  r.direction = glm::normalize(P-eye);
+ 
+  return r;
+}
+
+//Kernel that blacks out a given image buffer
+__global__ void clearImage(glm::vec2 resolution, glm::vec3* image){
+    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+    int index = x + (y * resolution.x);
+    if(x<resolution.x && y<resolution.y){
+      image[index] = glm::vec3(0,0,0);
+    }
+}
+
+//Kernel that writes the image to the OpenGL PBO directly.
+__global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3* image){
+  
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * resolution.x);
+  
+  if(x<resolution.x && y<resolution.y){
+
+      glm::vec3 color;
+      color.x = image[index].x*255.0;
+      color.y = image[index].y*255.0;
+      color.z = image[index].z*255.0;
+
+      if(color.x>255){
+        color.x = 255;
+      }
+
+      if(color.y>255){
+        color.y = 255;
+      }
+
+      if(color.z>255){
+        color.z = 255;
+      }
+      
+      // Each thread writes one pixel location in the texture (texel)
+      PBOpos[index].w = 0;
+      PBOpos[index].x = color.x;
+      PBOpos[index].y = color.y;
+      PBOpos[index].z = color.z;
+  }
+}
+
+__host__ __device__ int checkIntersections(ray r, staticGeom* geoms, int numberOfGeoms, glm::vec3& intersectionPoint, glm::vec3& normal)
+{
+	int closestGeo = -1;
+	float t = 99999;
+
+	for(int i = 0; i < numberOfGeoms; ++i)
+	{
+		float tmp = -1; // stays -1 (miss) for geometry types that are not handled below
+		if(geoms[i].type == SPHERE)
+			tmp = sphereIntersectionTest(geoms[i], r, intersectionPoint, normal);
+		else if(geoms[i].type == CUBE)
+			tmp = boxIntersectionTest(geoms[i], r, intersectionPoint, normal);
+
+		if(tmp != -1 && tmp < t)
+		{
+			t = tmp;
+			closestGeo = i;
+		}
+	}
+
+	if( closestGeo != -1)
+	{
+		if(geoms[closestGeo].type == SPHERE)
+			sphereIntersectionTest(geoms[closestGeo], r, intersectionPoint, normal);
+		else if(geoms[closestGeo].type == CUBE)
+			boxIntersectionTest(geoms[closestGeo], r, intersectionPoint, normal);
+		return closestGeo;
+	}
+	else
+		return -1;
+
+}
+__host__ __device__ void iterativeRayTrace(ray r, int rayDepth, float time, staticGeom* geoms, material* materials,
+	                                       glm::vec3& color, int x, int y)
+{
+	if(rayDepth > 2)
+		return;
+
+	
+}
+__global__ void genCameraRayBatch(glm::vec2 resolution, cameraData cam,  ray * rays)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * resolution.x);
+	if(x<resolution.x && y<resolution.y)
+	{
+		rays[index] = raycastFromCameraKernel(resolution, x, y, cam.position, cam.view, cam.up, cam.fov);
+
+	}
+}
+// TODO: IMPLEMENT THIS FUNCTION
+// Core raytracer kernel
+__global__ void raytraceRay(glm::vec2 resolution, float time, cameraData cam, int rayDepth, glm::vec3* colors,
+                            staticGeom* geoms, int numberOfGeoms, material * cudaMat, ray * rays){
+
+  int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+  int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+  int index = x + (y * resolution.x);
+  ray r;
+  r = raycastFromCameraKernel(resolution, x, y, cam.position, cam.view, cam.up, cam.fov);
+  //cuPrintf("Ray Postion: %f %f %f  Direction: %f %f %f\n", r.origin.x , r.origin.y,r.origin.z,r.direction.x,r.direction.y,r.direction.z );
+  
+  if((x<resolution.x && y<resolution.y))
+  {
+	 glm::vec3 intersectionPoint, normal;
+	 int geoIndex = checkIntersections(r, geoms, numberOfGeoms, intersectionPoint, normal);
+	 // geoIndex is -1 on a miss, so only index geoms after checking for a hit
+	 if(geoIndex!=-1) // hit something, shoot ray again
+	 {
+		 colors[index] = cudaMat[geoms[geoIndex].materialid].color;
+	 }
+   } 
+    //colors[index] = generateRandomNumberFromThread(resolution, time, x, y); 
+  
+}
+
+// TODO: FINISH THIS FUNCTION
+// Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
+void cudaRaytraceCore(uchar4* PBOpos, camera* renderCam, int frame, int iterations, material* materials, int numberOfMaterials, geom* geoms, int numberOfGeoms){
+  
+	int traceDepth = 2; //determines how many bounces the raytracer traces
+
+	// set up crucial magic
+	int tileSize = 8;
+	dim3 threadsPerBlock(tileSize, tileSize);
+	dim3 fullBlocksPerGrid((int)ceil(float(renderCam->resolution.x)/float(tileSize)), (int)ceil(float(renderCam->resolution.y)/float(tileSize)));
+  
+	// send image to GPU
+	glm::vec3* cudaimage = NULL;
+	cudaMalloc((void**)&cudaimage, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3));
+	cudaMemcpy( cudaimage, renderCam->image, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3), cudaMemcpyHostToDevice);
+  
+	// package geometry and materials and send to GPU
+	staticGeom* geomList = new staticGeom[numberOfGeoms];
+	for(int i=0; i<numberOfGeoms; i++){
+	staticGeom newStaticGeom;
+	newStaticGeom.type = geoms[i].type;
+	newStaticGeom.materialid = geoms[i].materialid;
+	newStaticGeom.translation = geoms[i].translations[frame];
+	newStaticGeom.rotation = geoms[i].rotations[frame];
+	newStaticGeom.scale = geoms[i].scales[frame];
+	newStaticGeom.transform = geoms[i].transforms[frame];
+	newStaticGeom.inverseTransform = geoms[i].inverseTransforms[frame];
+	geomList[i] = newStaticGeom;
+	}
+  
+	staticGeom* cudageoms = NULL;
+	cudaMalloc((void**)&cudageoms, numberOfGeoms*sizeof(staticGeom));
+	cudaMemcpy( cudageoms, geomList, numberOfGeoms*sizeof(staticGeom), cudaMemcpyHostToDevice);
+  
+	material* cudaMat = NULL;
+	cudaMalloc((void**)&cudaMat, numberOfMaterials*sizeof(material));
+	cudaMemcpy( cudaMat, materials, numberOfMaterials*sizeof(material), cudaMemcpyHostToDevice);
+
+	// package camera
+	cameraData cam;
+	cam.resolution = renderCam->resolution;
+	cam.position = renderCam->positions[frame];
+	cam.view = renderCam->views[frame];
+	cam.up = renderCam->ups[frame];
+	cam.fov = renderCam->fov;
+
+	// package light
+
+
+	// kernel launches
+	//cudaPrintfInit();
+	ray * rays;
+	cudaMalloc((void**)&rays, (int)cam.resolution.x * (int)cam.resolution.y * sizeof(ray));
+	genCameraRayBatch<<<fullBlocksPerGrid, threadsPerBlock>>>(cam.resolution, cam, rays);
+	for( int i = 0; i < traceDepth; ++i)
+		raytraceRay<<<fullBlocksPerGrid, threadsPerBlock>>>(renderCam->resolution, (float)iterations, cam, traceDepth, cudaimage, cudageoms, numberOfGeoms, cudaMat, rays);
+	//cudaPrintfDisplay(stdout, false);
+
+	sendImageToPBO<<<fullBlocksPerGrid, threadsPerBlock>>>(PBOpos, renderCam->resolution, cudaimage);
+
+	// retrieve image from GPU
+	cudaMemcpy( renderCam->image, cudaimage, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3), cudaMemcpyDeviceToHost);
+
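+	// free up stuff, or else we'll leak memory like a madman
+	//cudaPrintfEnd();
+	cudaFree( cudaimage );
+	cudaFree( cudageoms );
+	cudaFree( cudaMat ); cudaFree( rays ); // allocated above but previously never released
+	delete[] geomList; // array form, since geomList was allocated with new[]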
+	// make certain the kernel has completed
+	cudaDeviceSynchronize(); // replaces the deprecated cudaThreadSynchronize()
+
+	checkCUDAError("Kernel failed!");
+}
diff --git a/src/cuPrintf.cu b/src/cuPrintf.cu
new file mode 100644
index 0000000..bd46ff5
--- /dev/null
+++ b/src/cuPrintf.cu
@@ -0,0 +1,879 @@
+/*
+	Copyright 2009 NVIDIA Corporation.  All rights reserved.
+
+	NOTICE TO LICENSEE:   
+
+	This source code and/or documentation ("Licensed Deliverables") are subject 
+	to NVIDIA intellectual property rights under U.S. and international Copyright 
+	laws.  
+
+	These Licensed Deliverables contained herein is PROPRIETARY and CONFIDENTIAL 
+	to NVIDIA and is being provided under the terms and conditions of a form of 
+	NVIDIA software license agreement by and between NVIDIA and Licensee ("License 
+	Agreement") or electronically accepted by Licensee.  Notwithstanding any terms 
+	or conditions to the contrary in the License Agreement, reproduction or 
+	disclosure of the Licensed Deliverables to any third party without the express 
+	written consent of NVIDIA is prohibited.     
+
+	NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE LICENSE AGREEMENT, 
+	NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THESE LICENSED 
+	DELIVERABLES FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED 
+	WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE 
+	LICENSED DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY, 
+	NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.   NOTWITHSTANDING ANY 
+	TERMS OR CONDITIONS TO THE CONTRARY IN THE LICENSE AGREEMENT, IN NO EVENT SHALL 
+	NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, 
+	OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,	WHETHER 
+	IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,  ARISING OUT OF 
+	OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THESE LICENSED DELIVERABLES.  
+
+	U.S. Government End Users. These Licensed Deliverables are a "commercial item" 
+	as that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of 
+	"commercial computer  software"  and "commercial computer software documentation" 
+	as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995) and is provided to the 
+	U.S. Government only as a commercial end item.  Consistent with 48 C.F.R.12.212 
+	and 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all U.S. Government 
+	End Users acquire the Licensed Deliverables with only those rights set forth 
+	herein. 
+
+	Any use of the Licensed Deliverables in individual and commercial software must 
+	include, in the user documentation and internal comments to the code, the above 
+	Disclaimer and U.S. Government End Users Notice.
+ */
+
+/*
+ *	cuPrintf.cu
+ *
+ *	This is a printf command callable from within a kernel. It is set
+ *	up so that output is sent to a memory buffer, which is emptied from
+ *	the host side - but only after a cudaThreadSynchronize() on the host.
+ *
+ *	Currently, there is a limitation of around 200 characters of output
+ *	and no more than 10 arguments to a single cuPrintf() call. Issue
+ *	multiple calls if longer format strings are required.
+ *
+ *	It requires minimal setup, and is *NOT* optimised for performance.
+ *	For example, writes are not coalesced - this is because there is an
+ *	assumption that people will not want to printf from every single one
+ *	of thousands of threads, but only from individual threads at a time.
+ *
+ *	Using this is simple - it requires one host-side call to initialise
+ *	everything, and then kernels can call cuPrintf at will. Sample code
+ *	is the easiest way to demonstrate:
+ *
+	#include "cuPrintf.cu"
+ 	
+	__global__ void testKernel(int val)
+	{
+		cuPrintf("Value is: %d\n", val);
+	}
+
+	int main()
+	{
+		cudaPrintfInit();
+		testKernel<<< 2, 3 >>>(10);
+		cudaPrintfDisplay(stdout, true);
+		cudaPrintfEnd();
+        return 0;
+	}
+ *
+ *	See the header file, "cuPrintf.cuh" for more info, especially
+ *	arguments to cudaPrintfInit() and cudaPrintfDisplay();
+ */
+
+#ifndef CUPRINTF_CU
+#define CUPRINTF_CU
+
+#include "cuPrintf.cuh"
+#if __CUDA_ARCH__ > 100      // Atomics only used with > sm_10 architecture
+#include <sm_11_atomic_functions.h>
+#endif
+
+// This is the smallest amount of memory, per-thread, which is allowed.
+// It is also the largest amount of space a single printf() can take up
+const static int CUPRINTF_MAX_LEN = 256;
+
+// This structure is used internally to track block/thread output restrictions.
+typedef struct __align__(8) {
+	int threadid;				// CUPRINTF_UNRESTRICTED for unrestricted
+	int blockid;				// CUPRINTF_UNRESTRICTED for unrestricted
+} cuPrintfRestriction;
+
+// The main storage is in a global print buffer, which has a known
+// start/end/length. These are atomically updated so it works as a
+// circular buffer.
+// Since the only control primitive that can be used is atomicAdd(),
+// we cannot wrap the pointer as such. The actual address must be
+// calculated from printfBufferPtr by mod-ing with printfBufferLength.
+// For sm_10 architecture, we must subdivide the buffer per-thread
+// since we do not even have an atomic primitive.
+__constant__ static char *globalPrintfBuffer = NULL;         // Start of circular buffer (set up by host)
+__constant__ static int printfBufferLength = 0;              // Size of circular buffer (set up by host)
+__device__ static cuPrintfRestriction restrictRules;         // Output restrictions
+__device__ volatile static char *printfBufferPtr = NULL;     // Current atomically-incremented non-wrapped offset
+
+// This is the header preceding all printf entries.
+// NOTE: It *must* be size-aligned to the maximum entity size (size_t)
+typedef struct __align__(8) {
+    unsigned short magic;                   // Magic number says we're valid
+    unsigned short fmtoffset;               // Offset of fmt string into buffer
+    unsigned short blockid;                 // Block ID of author
+    unsigned short threadid;                // Thread ID of author
+} cuPrintfHeader;
+
+// Special header for sm_10 architecture
+#define CUPRINTF_SM10_MAGIC   0xC810        // Not a valid ascii character
+typedef struct __align__(16) {
+    unsigned short magic;                   // sm_10 specific magic number
+    unsigned short unused;
+    unsigned int thread_index;              // thread ID for this buffer
+    unsigned int thread_buf_len;            // per-thread buffer length
+    unsigned int offset;                    // most recent printf's offset
+} cuPrintfHeaderSM10;
+
+
+// Because we can't write an element which is not aligned to its bit-size,
+// we have to align all sizes and variables on maximum-size boundaries.
+// That means sizeof(double) in this case, but we'll use (long long) for
+// better arch<1.3 support
+#define CUPRINTF_ALIGN_SIZE      sizeof(long long)
+
+// All our headers are prefixed with a magic number so we know they're ready
+#define CUPRINTF_SM11_MAGIC  (unsigned short)0xC811        // Not a valid ascii character
+
+
+//
+//  getNextPrintfBufPtr
+//
+//  Grabs a block of space in the general circular buffer, using an
+//  atomic function to ensure that it's ours. We handle wrapping
+//  around the circular buffer and return a pointer to a place which
+//  can be written to.
+//
+//  Important notes:
+//      1. We always grab CUPRINTF_MAX_LEN bytes
+//      2. Because of 1, we never worry about wrapping around the end
+//      3. Because of 1, printfBufferLength *must* be a multiple of CUPRINTF_MAX_LEN
+//
+//  This returns a pointer to the place where we own.
+//
+__device__ static char *getNextPrintfBufPtr()
+{
+    // Initialisation check
+    if(!printfBufferPtr)
+        return NULL;
+
+	// Thread/block restriction check
+	if((restrictRules.blockid != CUPRINTF_UNRESTRICTED) && (restrictRules.blockid != (blockIdx.x + gridDim.x*blockIdx.y)))
+		return NULL;
+	if((restrictRules.threadid != CUPRINTF_UNRESTRICTED) && (restrictRules.threadid != (threadIdx.x + blockDim.x*threadIdx.y + blockDim.x*blockDim.y*threadIdx.z)))
+		return NULL;
+
+	// Conditional section, dependent on architecture
+#if __CUDA_ARCH__ == 100
+    // For sm_10 architectures, we have no atomic add - this means we must split the
+    // entire available buffer into per-thread blocks. Inefficient, but what can you do.
+    int thread_count = (gridDim.x * gridDim.y) * (blockDim.x * blockDim.y * blockDim.z);
+    int thread_index = threadIdx.x + blockDim.x*threadIdx.y + blockDim.x*blockDim.y*threadIdx.z +
+                       (blockIdx.x + gridDim.x*blockIdx.y) * (blockDim.x * blockDim.y * blockDim.z);
+    
+    // Find our own block of data and go to it. Make sure the per-thread length
+	// is a precise multiple of CUPRINTF_MAX_LEN, otherwise we risk size and
+	// alignment issues! We must round down, of course.
+    unsigned int thread_buf_len = printfBufferLength / thread_count;
+	thread_buf_len &= ~(CUPRINTF_MAX_LEN-1);
+
+	// We *must* have a thread buffer length able to fit at least two printfs (one header, one real)
+	if(thread_buf_len < (CUPRINTF_MAX_LEN * 2))
+		return NULL;
+
+	// Now address our section of the buffer. The first item is a header.
+    char *myPrintfBuffer = globalPrintfBuffer + (thread_buf_len * thread_index);
+    cuPrintfHeaderSM10 hdr = *(cuPrintfHeaderSM10 *)(void *)myPrintfBuffer;
+    if(hdr.magic != CUPRINTF_SM10_MAGIC)
+    {
+        // If our header is not set up, initialise it
+        hdr.magic = CUPRINTF_SM10_MAGIC;
+        hdr.thread_index = thread_index;
+        hdr.thread_buf_len = thread_buf_len;
+        hdr.offset = 0;         // Note we start at 0! We pre-increment below.
+        *(cuPrintfHeaderSM10 *)(void *)myPrintfBuffer = hdr;       // Write back the header
+
+        // For initial setup purposes, we might need to init thread0's header too
+        // (so that cudaPrintfDisplay() below will work). This is only run once.
+        cuPrintfHeaderSM10 *tophdr = (cuPrintfHeaderSM10 *)(void *)globalPrintfBuffer;
+        tophdr->thread_buf_len = thread_buf_len;
+    }
+
+    // Adjust the offset by the right amount, and wrap it if need be
+    unsigned int offset = hdr.offset + CUPRINTF_MAX_LEN;
+    if(offset >= hdr.thread_buf_len)
+        offset = CUPRINTF_MAX_LEN;
+
+    // Write back the new offset for next time and return a pointer to it
+    ((cuPrintfHeaderSM10 *)(void *)myPrintfBuffer)->offset = offset;
+    return myPrintfBuffer + offset;
+#else
+    // Much easier with an atomic operation!
+    size_t offset = atomicAdd((unsigned int *)&printfBufferPtr, CUPRINTF_MAX_LEN) - (size_t)globalPrintfBuffer;
+    offset %= printfBufferLength;
+    return globalPrintfBuffer + offset;
+#endif
+}
+
+
+//
+//  writePrintfHeader
+//
+//  Inserts the header for containing our UID, fmt position and
+//  block/thread number. We generate it dynamically to avoid
+//	issues arising from requiring pre-initialisation.
+//
+__device__ static void writePrintfHeader(char *ptr, char *fmtptr)
+{
+    if(ptr)
+    {
+        cuPrintfHeader header;
+        header.magic = CUPRINTF_SM11_MAGIC;
+        header.fmtoffset = (unsigned short)(fmtptr - ptr);
+        header.blockid = blockIdx.x + gridDim.x*blockIdx.y;
+        header.threadid = threadIdx.x + blockDim.x*threadIdx.y + blockDim.x*blockDim.y*threadIdx.z;
+        *(cuPrintfHeader *)(void *)ptr = header;
+    }
+}
+
+
+//
+//  cuPrintfStrncpy
+//
+//  This special strncpy outputs an aligned length value, followed by the
+//  string. It then zero-pads the rest of the string until a 64-aligned
+//  boundary. The length *includes* the padding. A pointer to the byte
+//  just after the \0 is returned.
+//
+//  This function could overflow CUPRINTF_MAX_LEN characters in our buffer.
+//  To avoid it, we must count as we output and truncate where necessary.
+//
+__device__ static char *cuPrintfStrncpy(char *dest, const char *src, int n, char *end)
+{
+    // Initialisation and overflow check
+    if(!dest || !src || (dest >= end))
+        return NULL;
+
+    // Prepare to write the length specifier. We're guaranteed to have
+    // at least "CUPRINTF_ALIGN_SIZE" bytes left because we only write out in
+    // chunks that size, and CUPRINTF_MAX_LEN is aligned with CUPRINTF_ALIGN_SIZE.
+    int *lenptr = (int *)(void *)dest;
+    int len = 0;
+    dest += CUPRINTF_ALIGN_SIZE;
+
+    // Now copy the string
+    while(n--)
+    {
+        if(dest >= end)     // Overflow check
+            break;
+
+        len++;
+        *dest++ = *src;
+        if(*src++ == '\0')
+            break;
+    }
+
+    // Now write out the padding bytes, and we have our length.
+    while((dest < end) && (((long)dest & (CUPRINTF_ALIGN_SIZE-1)) != 0))
+    {
+        len++;
+        *dest++ = 0;
+    }
+    *lenptr = len;
+    return (dest < end) ? dest : NULL;        // Overflow means return NULL
+}
+
+
+//
+//  copyArg
+//
+//  This copies a length specifier and then the argument out to the
+//  data buffer. Templates let the compiler figure all this out at
+//  compile-time, making life much simpler from the programming
+//  point of view. I'm assuming all (const char *) is a string, and
+//  everything else is the variable it points at. I'd love to see
+//  a better way of doing it, but aside from parsing the format
+//  string I can't think of one.
+//
+//  The length of the data type is inserted at the beginning (so that
+//  the display can distinguish between float and double), and the
+//  pointer to the end of the entry is returned.
+//
+__device__ static char *copyArg(char *ptr, const char *arg, char *end)
+{
+    // Initialisation check
+    if(!ptr || !arg)
+        return NULL;
+
+    // strncpy does all our work. We just terminate.
+    if((ptr = cuPrintfStrncpy(ptr, arg, CUPRINTF_MAX_LEN, end)) != NULL)
+        *ptr = 0;
+
+    return ptr;
+}
+
+template <typename T>
+__device__ static char *copyArg(char *ptr, T &arg, char *end)
+{
+    // Initialisation and overflow check. Alignment rules mean that
+    // we're at least CUPRINTF_ALIGN_SIZE away from "end", so we only need
+    // to check that one offset.
+    if(!ptr || ((ptr+CUPRINTF_ALIGN_SIZE) >= end))
+        return NULL;
+
+    // Write the length and argument
+    *(int *)(void *)ptr = sizeof(arg);
+    ptr += CUPRINTF_ALIGN_SIZE;
+    *(T *)(void *)ptr = arg;
+    ptr += CUPRINTF_ALIGN_SIZE;
+    *ptr = 0;
+
+    return ptr;
+}
+
+
+//
+//  cuPrintf
+//
+//  Templated printf functions to handle multiple arguments.
+//  Note we return the total amount of data copied, not the number
+//  of characters output. But then again, who ever looks at the
+//  return from printf() anyway?
+//
+//  The format is to grab a block of circular buffer space, the
+//  start of which will hold a header and a pointer to the format
+//  string. We then write in all the arguments, and finally the
+//  format string itself. This is to make it easy to prevent
+//  overflow of our buffer (we support up to 10 arguments, each of
+//  which can be 12 bytes in length - that means that only the
+//  format string (or a %s) can actually overflow; so the overflow
+//  check need only be in the strcpy function).
+//
+//  The header is written at the very last because that's what
+//  makes it look like we're done.
+//
+//  Errors, which are basically lack-of-initialisation, are ignored
+//  in the called functions because NULL pointers are passed around
+//
+
+// All printf variants basically do the same thing, setting up the
+// buffer, writing all arguments, then finalising the header. For
+// clarity, we'll pack the code into some big macros.
+#define CUPRINTF_PREAMBLE \
+    char *start, *end, *bufptr, *fmtstart; \
+    if((start = getNextPrintfBufPtr()) == NULL) return 0; \
+    end = start + CUPRINTF_MAX_LEN; \
+    bufptr = start + sizeof(cuPrintfHeader);
+
+// Posting an argument is easy
+#define CUPRINTF_ARG(argname) \
+	bufptr = copyArg(bufptr, argname, end);
+
+// After args are done, record start-of-fmt and write the fmt and header
+#define CUPRINTF_POSTAMBLE \
+    fmtstart = bufptr; \
+    end = cuPrintfStrncpy(bufptr, fmt, CUPRINTF_MAX_LEN, end); \
+    writePrintfHeader(start, end ? fmtstart : NULL); \
+    return end ? (int)(end - start) : 0;
+
+__device__ int cuPrintf(const char *fmt)
+{
+	CUPRINTF_PREAMBLE;
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1> __device__ int cuPrintf(const char *fmt, T1 arg1)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3, typename T4> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+	CUPRINTF_ARG(arg4);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3, typename T4, typename T5> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+	CUPRINTF_ARG(arg4);
+	CUPRINTF_ARG(arg5);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+	CUPRINTF_ARG(arg4);
+	CUPRINTF_ARG(arg5);
+	CUPRINTF_ARG(arg6);
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+	CUPRINTF_ARG(arg4);
+	CUPRINTF_ARG(arg5);
+	CUPRINTF_ARG(arg6);
+	CUPRINTF_ARG(arg7);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7, T8 arg8)
+{
+	CUPRINTF_PREAMBLE;
+
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+	CUPRINTF_ARG(arg4);
+	CUPRINTF_ARG(arg5);
+	CUPRINTF_ARG(arg6);
+	CUPRINTF_ARG(arg7);
+	CUPRINTF_ARG(arg8);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename T9> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7, T8 arg8, T9 arg9)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+	CUPRINTF_ARG(arg4);
+	CUPRINTF_ARG(arg5);
+	CUPRINTF_ARG(arg6);
+	CUPRINTF_ARG(arg7);
+	CUPRINTF_ARG(arg8);
+	CUPRINTF_ARG(arg9);
+
+	CUPRINTF_POSTAMBLE;
+}
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename T9, typename T10> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7, T8 arg8, T9 arg9, T10 arg10)
+{
+	CUPRINTF_PREAMBLE;
+	    
+	CUPRINTF_ARG(arg1);
+	CUPRINTF_ARG(arg2);
+	CUPRINTF_ARG(arg3);
+	CUPRINTF_ARG(arg4);
+	CUPRINTF_ARG(arg5);
+	CUPRINTF_ARG(arg6);
+	CUPRINTF_ARG(arg7);
+	CUPRINTF_ARG(arg8);
+	CUPRINTF_ARG(arg9);
+	CUPRINTF_ARG(arg10);
+
+	CUPRINTF_POSTAMBLE;
+}
+#undef CUPRINTF_PREAMBLE
+#undef CUPRINTF_ARG
+#undef CUPRINTF_POSTAMBLE
+
+
+//
+//	cuPrintfRestrict
+//
+//	Called to restrict output to a given thread/block.
+//	We store the info in "restrictRules", which is set up at
+//	init time by the host. It's not the cleanest way to do this
+//	because it means restrictions will last between
+//	invocations, but given the output-pointer continuity,
+//	I feel this is reasonable.
+//
+__device__ void cuPrintfRestrict(int threadid, int blockid)
+{
+    int thread_count = blockDim.x * blockDim.y * blockDim.z;
+	if(((threadid < thread_count) && (threadid >= 0)) || (threadid == CUPRINTF_UNRESTRICTED))
+		restrictRules.threadid = threadid;
+
+	int block_count = gridDim.x * gridDim.y;
+	if(((blockid < block_count) && (blockid >= 0)) || (blockid == CUPRINTF_UNRESTRICTED))
+		restrictRules.blockid = blockid;
+}
+
+
+///////////////////////////////////////////////////////////////////////////////
+// HOST SIDE
+
+#include <stdio.h>
+static FILE *printf_fp;
+
+static char *printfbuf_start=NULL;
+static char *printfbuf_device=NULL;
+static int printfbuf_len=0;
+
+
+//
+//  outputPrintfData
+//
+//  Our own internal function, which takes a pointer to a data buffer
+//  and passes it through libc's printf for output.
+//
+//  We receive the format string and a pointer to where the data is
+//  held. We then run through and print it out.
+//
+//  Returns 0 on failure, 1 on success
+//
+static int outputPrintfData(char *fmt, char *data)
+{
+    // Format string is prefixed by a length that we don't need
+    fmt += CUPRINTF_ALIGN_SIZE;
+
+    // Now run through it, printing everything we can. We must
+    // run to every % character, extract only that, and use printf
+    // to format it.
+    char *p = strchr(fmt, '%');
+    while(p != NULL)
+    {
+        // Print up to the % character
+        *p = '\0';
+        fputs(fmt, printf_fp);
+        *p = '%';           // Put back the %
+
+        // Now handle the format specifier
+        char *format = p++;         // Points to the '%'
+        p += strcspn(p, "%cdiouxXeEfgGaAnps");
+        if(*p == '\0')              // If no format specifier, print the whole thing
+        {
+            fmt = format;
+            break;
+        }
+
+        // Cut out the format bit and use printf to print it. It's prefixed
+        // by its length.
+        int arglen = *(int *)data;
+        if(arglen > CUPRINTF_MAX_LEN)
+        {
+            fputs("Corrupt printf buffer data - aborting\n", printf_fp);
+            return 0;
+        }
+
+        data += CUPRINTF_ALIGN_SIZE;
+        
+        char specifier = *p++;
+        char c = *p;        // Store for later
+        *p = '\0';
+        switch(specifier)
+        {
+            // These all take integer arguments
+            case 'c':
+            case 'd':
+            case 'i':
+            case 'o':
+            case 'u':
+            case 'x':
+            case 'X':
+            case 'p':
+                fprintf(printf_fp, format, *((int *)data));
+                break;
+
+            // These all take double arguments
+            case 'e':
+            case 'E':
+            case 'f':
+            case 'g':
+            case 'G':
+            case 'a':
+            case 'A':
+                if(arglen == 4)     // Float vs. Double thing
+                    fprintf(printf_fp, format, *((float *)data));
+                else
+                    fprintf(printf_fp, format, *((double *)data));
+                break;
+
+            // Strings are handled in a special way
+            case 's':
+                fprintf(printf_fp, format, (char *)data);
+                break;
+
+            // % is special
+            case '%':
+                fprintf(printf_fp, "%%");
+                break;
+
+            // Everything else is just printed out as-is
+            default:
+                fprintf(printf_fp, format);
+                break;
+        }
+        data += CUPRINTF_ALIGN_SIZE;         // Move on to next argument
+        *p = c;                     // Restore what we removed
+        fmt = p;                    // Adjust fmt string to be past the specifier
+        p = strchr(fmt, '%');       // and get the next specifier
+    }
+
+    // Print out the last of the string
+    fputs(fmt, printf_fp);
+    return 1;
+}
+
+
+//
+//  doPrintfDisplay
+//
+//  This runs through the blocks of CUPRINTF_MAX_LEN-sized data, calling the
+//  print function above to display them. We've got this separate from
+//  cudaPrintfDisplay() below so we can handle the SM_10 architecture
+//  partitioning.
+//
+static int doPrintfDisplay(int headings, int clear, char *bufstart, char *bufend, char *bufptr, char *endptr)
+{
+    // Grab, piece-by-piece, each output element until we catch
+    // up with the circular buffer end pointer
+    int printf_count=0;
+    char printfbuf_local[CUPRINTF_MAX_LEN+1];
+    printfbuf_local[CUPRINTF_MAX_LEN] = '\0';
+
+    while(bufptr != endptr)
+    {
+        // Wrap ourselves at the end-of-buffer
+        if(bufptr == bufend)
+            bufptr = bufstart;
+
+        // Adjust our start pointer to within the circular buffer and copy a block.
+        cudaMemcpy(printfbuf_local, bufptr, CUPRINTF_MAX_LEN, cudaMemcpyDeviceToHost);
+
+        // If the magic number isn't valid, then this write hasn't gone through
+        // yet and we'll wait until it does (or we're past the end for non-async printfs).
+        cuPrintfHeader *hdr = (cuPrintfHeader *)printfbuf_local;
+        if((hdr->magic != CUPRINTF_SM11_MAGIC) || (hdr->fmtoffset >= CUPRINTF_MAX_LEN))
+        {
+            //fprintf(printf_fp, "Bad magic number in printf header\n");
+            break;
+        }
+
+        // Extract all the info and get this printf done
+        if(headings)
+            fprintf(printf_fp, "[%d, %d]: ", hdr->blockid, hdr->threadid);
+        if(hdr->fmtoffset == 0)
+            fprintf(printf_fp, "printf buffer overflow\n");
+        else if(!outputPrintfData(printfbuf_local+hdr->fmtoffset, printfbuf_local+sizeof(cuPrintfHeader)))
+            break;
+        printf_count++;
+
+        // Clear if asked
+        if(clear)
+            cudaMemset(bufptr, 0, CUPRINTF_MAX_LEN);
+
+        // Now advance our start location, because we're done, and keep copying
+        bufptr += CUPRINTF_MAX_LEN;
+    }
+
+    return printf_count;
+}
+
+
+//
+//  cudaPrintfInit
+//
+//  Takes a buffer length to allocate, creates the memory on the device and
+//  returns a pointer to it for when a kernel is called. It's up to the caller
+//  to free it.
+//
+extern "C" cudaError_t cudaPrintfInit(size_t bufferLen)
+{
+    // Fix up bufferlen to be a multiple of CUPRINTF_MAX_LEN
+    bufferLen = (bufferLen < CUPRINTF_MAX_LEN) ? CUPRINTF_MAX_LEN : bufferLen;
+    if((bufferLen % CUPRINTF_MAX_LEN) > 0)
+        bufferLen += (CUPRINTF_MAX_LEN - (bufferLen % CUPRINTF_MAX_LEN));
+    printfbuf_len = (int)bufferLen;
+
+    // Allocate a print buffer on the device and zero it
+    if(cudaMalloc((void **)&printfbuf_device, printfbuf_len) != cudaSuccess)
+		return cudaErrorInitializationError;
+    cudaMemset(printfbuf_device, 0, printfbuf_len);
+    printfbuf_start = printfbuf_device;         // Where we start reading from
+
+	// No restrictions to begin with
+	cuPrintfRestriction restrict;
+	restrict.threadid = restrict.blockid = CUPRINTF_UNRESTRICTED;
+	cudaMemcpyToSymbol(restrictRules, &restrict, sizeof(restrict));
+
+    // Initialise the buffer and the respective lengths/pointers.
+    cudaMemcpyToSymbol(globalPrintfBuffer, &printfbuf_device, sizeof(char *));
+    cudaMemcpyToSymbol(printfBufferPtr, &printfbuf_device, sizeof(char *));
+    cudaMemcpyToSymbol(printfBufferLength, &printfbuf_len, sizeof(printfbuf_len));
+
+    return cudaSuccess;
+}
+
+
+//
+//  cudaPrintfEnd
+//
+//  Frees up the memory which we allocated
+//
+extern "C" void cudaPrintfEnd()
+{
+    if(!printfbuf_start || !printfbuf_device)
+        return;
+
+    cudaFree(printfbuf_device);
+    printfbuf_start = printfbuf_device = NULL;
+}
+
+
+//
+//  cudaPrintfDisplay
+//
+//  Each call to this function dumps the entire current contents
+//	of the printf buffer to the pre-specified FILE pointer. The
+//	circular "start" pointer is advanced so that subsequent calls
+//	dump only new output.
+//
+//  In the case of async memory access (via streams), call this
+//  repeatedly to keep trying to empty the buffer. If it's a sync
+//  access, then the whole buffer should empty in one go.
+//
+//	Arguments:
+//		outputFP     - FILE pointer to output to (NULL => stdout)
+//		showThreadID - If true, prints [block,thread] before each line
+//
+extern "C" cudaError_t cudaPrintfDisplay(void *outputFP, bool showThreadID)
+{
+	printf_fp = (FILE *)((outputFP == NULL) ? stdout : outputFP);
+
+    // For now, we force "synchronous" mode which means we're not concurrent
+	// with kernel execution. This also means we don't need clearOnPrint.
+	// If you're patching it for async operation, here's where you want it.
+    bool sync_printfs = true;
+	bool clearOnPrint = false;
+
+    // Initialisation check
+    if(!printfbuf_start || !printfbuf_device || !printf_fp)
+        return cudaErrorMissingConfiguration;
+
+    // To determine which architecture we're using, we read the
+    // first short from the buffer - it'll be the magic number
+    // relating to the version.
+    unsigned short magic;
+    cudaMemcpy(&magic, printfbuf_device, sizeof(unsigned short), cudaMemcpyDeviceToHost);
+
+    // For SM_10 architecture, we've split our buffer into one-per-thread.
+    // That means we must do each thread block separately. It'll require
+    // extra reading. We also, for now, don't support async printfs because
+    // that requires tracking one start pointer per thread.
+    if(magic == CUPRINTF_SM10_MAGIC)
+    {
+        sync_printfs = true;
+	    clearOnPrint = false;
+        int blocklen = 0;
+        char *blockptr = printfbuf_device;
+        while(blockptr < (printfbuf_device + printfbuf_len))
+        {
+            cuPrintfHeaderSM10 hdr;
+            cudaMemcpy(&hdr, blockptr, sizeof(hdr), cudaMemcpyDeviceToHost);
+
+            // We get our block-size-step from the very first header
+            if(hdr.thread_buf_len != 0)
+                blocklen = hdr.thread_buf_len;
+
+            // No magic number means no printfs from this thread
+            if(hdr.magic != CUPRINTF_SM10_MAGIC)
+            {
+                if(blocklen == 0)
+                {
+                    fprintf(printf_fp, "No printf headers found at all!\n");
+                    break;                              // No valid headers!
+                }
+                blockptr += blocklen;
+                continue;
+            }
+
+            // If "offset" is non-zero, we can print the block contents
+            if(hdr.offset > 0)
+            {
+                // For synchronous printfs, we must print from endptr->bufend, then from start->end
+                if(sync_printfs)
+                    doPrintfDisplay(showThreadID, clearOnPrint, blockptr+CUPRINTF_MAX_LEN, blockptr+hdr.thread_buf_len, blockptr+hdr.offset+CUPRINTF_MAX_LEN, blockptr+hdr.thread_buf_len);
+                doPrintfDisplay(showThreadID, clearOnPrint, blockptr+CUPRINTF_MAX_LEN, blockptr+hdr.thread_buf_len, blockptr+CUPRINTF_MAX_LEN, blockptr+hdr.offset+CUPRINTF_MAX_LEN);
+            }
+
+            // Move on to the next block and loop again
+            blockptr += hdr.thread_buf_len;
+        }
+    }
+    // For SM_11 and up, everything is a single buffer and it's simple
+    else if(magic == CUPRINTF_SM11_MAGIC)
+    {
+	    // Grab the current "end of circular buffer" pointer.
+        char *printfbuf_end = NULL;
+        cudaMemcpyFromSymbol(&printfbuf_end, printfBufferPtr, sizeof(char *));
+
+        // Adjust our starting and ending pointers to within the block
+        char *bufptr = ((printfbuf_start - printfbuf_device) % printfbuf_len) + printfbuf_device;
+        char *endptr = ((printfbuf_end - printfbuf_device) % printfbuf_len) + printfbuf_device;
+
+        // For synchronous (i.e. after-kernel-exit) printf display, we have to handle circular
+        // buffer wrap carefully because we could miss those past "end".
+        if(sync_printfs)
+            doPrintfDisplay(showThreadID, clearOnPrint, printfbuf_device, printfbuf_device+printfbuf_len, endptr, printfbuf_device+printfbuf_len);
+        doPrintfDisplay(showThreadID, clearOnPrint, printfbuf_device, printfbuf_device+printfbuf_len, bufptr, endptr);
+
+        printfbuf_start = printfbuf_end;
+    }
+    else
+        ;//printf("Bad magic number in cuPrintf buffer header\n");
+
+    // If we were synchronous, then we must ensure that the memory is cleared on exit
+    // otherwise another kernel launch with a different grid size could conflict.
+    if(sync_printfs)
+        cudaMemset(printfbuf_device, 0, printfbuf_len);
+
+    return cudaSuccess;
+}
+
+// Cleanup
+#undef CUPRINTF_MAX_LEN
+#undef CUPRINTF_ALIGN_SIZE
+#undef CUPRINTF_SM10_MAGIC
+#undef CUPRINTF_SM11_MAGIC
+
+#endif
\ No newline at end of file
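The record layout written by cuPrintfStrncpy above (a length word, the string bytes, then zero padding up to an aligned boundary) can be sketched on the host as follows. This is a minimal illustration, not the library's code: `packString` and `kAlign` are hypothetical names, the value 8 for the alignment is an assumption standing in for CUPRINTF_ALIGN_SIZE, and the `n` character limit of the original is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed alignment for illustration only; stands in for CUPRINTF_ALIGN_SIZE.
constexpr std::size_t kAlign = 8;

// Writes [int length][string bytes incl. '\0'][zero padding] starting at dest.
// Returns the first byte after the record, or nullptr if the record reached
// `end` (the same "overflow means NULL" convention as the device code).
char *packString(char *dest, const char *src, char *end)
{
    if (!dest || !src || dest >= end)
        return nullptr;
    assert(reinterpret_cast<std::uintptr_t>(dest) % kAlign == 0);

    char *lenSlot = dest;   // the length is only known after copying
    int len = 0;
    dest += kAlign;

    // Copy up to and including the terminator, stopping at the slot end.
    while (dest < end) {
        ++len;
        *dest++ = *src;
        if (*src++ == '\0')
            break;
    }

    // Zero-pad to the next aligned boundary; the padding counts toward len.
    while (dest < end && reinterpret_cast<std::uintptr_t>(dest) % kAlign != 0) {
        ++len;
        *dest++ = 0;
    }

    std::memcpy(lenSlot, &len, sizeof len);
    return dest < end ? dest : nullptr;
}
```

Under these assumptions, packing "hi" into an 8-byte-aligned buffer produces a 16-byte record whose stored length is 8: three string bytes plus five bytes of padding. The length word itself is not counted.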
diff --git a/src/cuPrintf.cuh b/src/cuPrintf.cuh
new file mode 100644
index 0000000..7635b81
--- /dev/null
+++ b/src/cuPrintf.cuh
@@ -0,0 +1,162 @@
+/*
+	Copyright 2009 NVIDIA Corporation.  All rights reserved.
+
+	NOTICE TO LICENSEE:   
+
+	This source code and/or documentation ("Licensed Deliverables") are subject 
+	to NVIDIA intellectual property rights under U.S. and international Copyright 
+	laws.  
+
+	These Licensed Deliverables contained herein is PROPRIETARY and CONFIDENTIAL 
+	to NVIDIA and is being provided under the terms and conditions of a form of 
+	NVIDIA software license agreement by and between NVIDIA and Licensee ("License 
+	Agreement") or electronically accepted by Licensee.  Notwithstanding any terms 
+	or conditions to the contrary in the License Agreement, reproduction or 
+	disclosure of the Licensed Deliverables to any third party without the express 
+	written consent of NVIDIA is prohibited.     
+
+	NOTWITHSTANDING ANY TERMS OR CONDITIONS TO THE CONTRARY IN THE LICENSE AGREEMENT, 
+	NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THESE LICENSED 
+	DELIVERABLES FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR IMPLIED 
+	WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH REGARD TO THESE 
+	LICENSED DELIVERABLES, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY, 
+	NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.   NOTWITHSTANDING ANY 
+	TERMS OR CONDITIONS TO THE CONTRARY IN THE LICENSE AGREEMENT, IN NO EVENT SHALL 
+	NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, 
+	OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS,	WHETHER 
+	IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,  ARISING OUT OF 
+	OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THESE LICENSED DELIVERABLES.  
+
+	U.S. Government End Users. These Licensed Deliverables are a "commercial item" 
+	as that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of 
+	"commercial computer  software"  and "commercial computer software documentation" 
+	as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995) and is provided to the 
+	U.S. Government only as a commercial end item.  Consistent with 48 C.F.R.12.212 
+	and 48 C.F.R. 227.7202-1 through 227.7202-4 (JUNE 1995), all U.S. Government 
+	End Users acquire the Licensed Deliverables with only those rights set forth 
+	herein. 
+
+	Any use of the Licensed Deliverables in individual and commercial software must 
+	include, in the user documentation and internal comments to the code, the above 
+	Disclaimer and U.S. Government End Users Notice.
+ */
+
+#ifndef CUPRINTF_H
+#define CUPRINTF_H
+
+/*
+ *	This is the header file supporting cuPrintf.cu and defining both
+ *	the host and device-side interfaces. See that file for some more
+ *	explanation and sample use code. See also below for details of the
+ *	host-side interfaces.
+ *
+ *  Quick sample code:
+ *
+	#include "cuPrintf.cu"
+ 	
+	__global__ void testKernel(int val)
+	{
+		cuPrintf("Value is: %d\n", val);
+	}
+
+	int main()
+	{
+		cudaPrintfInit();
+		testKernel<<< 2, 3 >>>(10);
+		cudaPrintfDisplay(stdout, true);
+		cudaPrintfEnd();
+        return 0;
+	}
+ */
+
+///////////////////////////////////////////////////////////////////////////////
+// DEVICE SIDE
+// External function definitions for device-side code
+
+// Abuse of templates to simulate varargs
+__device__ int cuPrintf(const char *fmt);
+template <typename T1> __device__ int cuPrintf(const char *fmt, T1 arg1);
+template <typename T1, typename T2> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2);
+template <typename T1, typename T2, typename T3> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3);
+template <typename T1, typename T2, typename T3, typename T4> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4);
+template <typename T1, typename T2, typename T3, typename T4, typename T5> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5);
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6);
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7);
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7, T8 arg8);
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename T9> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7, T8 arg8, T9 arg9);
+template <typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename T9, typename T10> __device__ int cuPrintf(const char *fmt, T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5, T6 arg6, T7 arg7, T8 arg8, T9 arg9, T10 arg10);
+
+
+//
+//	cuPrintfRestrict
+//
+//	Called to restrict output to a given thread/block. Pass
+//	the constant CUPRINTF_UNRESTRICTED to unrestrict output
+//	for thread/block IDs. Note you can therefore allow
+//	"all printfs from block 3" or "printfs from thread 2
+//	on all blocks", or "printfs only from block 1, thread 5".
+//
+//	Arguments:
+//		threadid - Thread ID to allow printfs from
+//		blockid - Block ID to allow printfs from
+//
+//	NOTE: Restrictions last between invocations of
+//	kernels unless cudaPrintfInit() is called again.
+//
+#define CUPRINTF_UNRESTRICTED	-1
+__device__ void cuPrintfRestrict(int threadid, int blockid);
+
+
+
+///////////////////////////////////////////////////////////////////////////////
+// HOST SIDE
+// External function definitions for host-side code
+
+//
+//	cudaPrintfInit
+//
+//	Call this once to initialise the printf system. If the output
+//	file or buffer size needs to be changed, call cudaPrintfEnd()
+//	before re-calling cudaPrintfInit().
+//
+//	The default size for the buffer is 1 megabyte. For CUDA
+//	architecture 1.1 and above, the buffer is filled linearly and
+//	is completely used;	however for architecture 1.0, the buffer
+//	is divided into as many segments as there are threads, even
+//	if some threads do not call cuPrintf().
+//
+//	Arguments:
+//		bufferLen - Length, in bytes, of total space to reserve
+//		            (in device global memory) for output.
+//
+//	Returns:
+//		cudaSuccess if all is well.
+//
+extern "C" cudaError_t cudaPrintfInit(size_t bufferLen=1048576);   // 1-meg - that's enough for 4096 printfs by all threads put together
+
+//
+//	cudaPrintfEnd
+//
+//	Frees all memory allocated by cudaPrintfInit().
+//	Call this at exit, or before calling cudaPrintfInit() again.
+//
+extern "C" void cudaPrintfEnd();
+
+//
+//	cudaPrintfDisplay
+//
+//	Dumps the contents of the output buffer to the specified
+//	file pointer. If the output pointer is not specified,
+//	the default "stdout" is used.
+//
+//	Arguments:
+//		outputFP     - A file pointer to an output stream.
+//		showThreadID - If "true", output strings are prefixed
+//		               by "[blockid, threadid] " at output.
+//
+//	Returns:
+//		cudaSuccess if all is well.
+//
+extern "C" cudaError_t cudaPrintfDisplay(void *outputFP=NULL, bool showThreadID=false);
+
+#endif  // CUPRINTF_H
\ No newline at end of file
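The buffer sizing described for cudaPrintfInit() can be checked with a small host-side sketch. The slot size of 256 bytes is an assumption inferred from the "1-meg ... 4096 printfs" comment above (1048576 / 4096); `kSlotLen`, `roundUpToSlots` and `slotOffset` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed per-printf slot size; stands in for CUPRINTF_MAX_LEN.
constexpr std::size_t kSlotLen = 256;

// Mirrors the fix-up in cudaPrintfInit(): at least one slot, and always a
// whole number of slots.
std::size_t roundUpToSlots(std::size_t bufferLen)
{
    if (bufferLen < kSlotLen)
        bufferLen = kSlotLen;
    const std::size_t rem = bufferLen % kSlotLen;
    return rem ? bufferLen + (kSlotLen - rem) : bufferLen;
}

// Byte offset of the n-th record in a ring of bufLen bytes, as in the
// atomicAdd path: the write position advances one slot per call and wraps.
std::size_t slotOffset(std::size_t n, std::size_t bufLen)
{
    assert(bufLen >= kSlotLen && bufLen % kSlotLen == 0);
    return (n * kSlotLen) % bufLen;
}
```

With the default of 1048576 bytes this gives 4096 slots, so the 4097th cuPrintf() call wraps back to offset 0 and overwrites the oldest record. A larger `bufferLen` is the straightforward way to keep more output per kernel launch.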
diff --git a/src/interactions.h b/src/interactions.h
index 7bf6fab..c7829c5 100644
--- a/src/interactions.h
+++ b/src/interactions.h
@@ -55,6 +55,7 @@ __host__ __device__ Fresnel calculateFresnel(glm::vec3 normal, glm::vec3 inciden
 
   fresnel.reflectionCoefficient = 1;
   fresnel.transmissionCoefficient = 0;
+
   return fresnel;
 }
 
@@ -99,8 +100,12 @@ __host__ __device__ glm::vec3 getRandomDirectionInSphere(float xi1, float xi2) {
 __host__ __device__ int calculateBSDF(ray& r, glm::vec3 intersect, glm::vec3 normal, glm::vec3 emittedColor,
                                        AbsorptionAndScatteringProperties& currentAbsorptionAndScattering,
                                        glm::vec3& color, glm::vec3& unabsorbedColor, material m){
+	int type = 1;
+	if(m.specularExponent == 0 && m.hasRefractive==0)
+		type = 0;
+	
 
-  return 1;
+  return type;
 };
 
 #endif
diff --git a/src/intersections.h b/src/intersections.h
index c9eafb6..dac7dae 100644
--- a/src/intersections.h
+++ b/src/intersections.h
@@ -12,13 +12,15 @@
 #include "sceneStructs.h"
 #include "cudaMat4.h"
 #include "utilities.h"
+#include "tiny_obj_loader.h"
+
 
 // Some forward declarations
 __host__ __device__ glm::vec3 getPointOnRay(ray r, float t);
 __host__ __device__ glm::vec3 multiplyMV(cudaMat4 m, glm::vec4 v);
 __host__ __device__ glm::vec3 getSignOfRay(ray r);
 __host__ __device__ glm::vec3 getInverseDirectionOfRay(ray r);
-__host__ __device__ float boxIntersectionTest(staticGeom sphere, ray r, glm::vec3& intersectionPoint, glm::vec3& normal);
+__host__ __device__ float boxIntersectionTest(staticGeom box, ray r, glm::vec3& intersectionPoint, glm::vec3& normal);
 __host__ __device__ float sphereIntersectionTest(staticGeom sphere, ray r, glm::vec3& intersectionPoint, glm::vec3& normal);
 __host__ __device__ glm::vec3 getRandomPointOnCube(staticGeom cube, float randomSeed);
 
@@ -72,8 +74,116 @@ __host__ __device__ glm::vec3 getSignOfRay(ray r){
 // TODO: IMPLEMENT THIS FUNCTION
 // Cube intersection test, return -1 if no intersection, otherwise, distance to intersection
 __host__ __device__ float boxIntersectionTest(staticGeom box, ray r, glm::vec3& intersectionPoint, glm::vec3& normal){
+        
+	glm::vec3 ro = multiplyMV(box.inverseTransform, glm::vec4(r.origin,1.0f));
+	glm::vec3 rd = glm::normalize(multiplyMV(box.inverseTransform, glm::vec4(r.direction,0.0f)));
 
-    return -1;
+	ray rt; rt.origin = ro; rt.direction = rd;
+
+	glm::vec3 ray_dir = rd;
+	glm::vec3 ray_pos = ro;
+
+	float Tnear = -99999;
+	float Tfar =  99999;
+	float t1, t2;
+	
+	if(ray_dir.x == 0)
+	{
+		if(ray_pos.x < -0.5 || ray_pos.x > 0.5 )
+		return -1;
+	}
+	else
+	{
+		t1 = (-0.5 - ray_pos.x)/ ray_dir.x;
+		t2 = (0.5 - ray_pos.x)/ ray_dir.x;
+		if(t1 > t2)
+		{
+			float tmp = t1;
+			t1 = t2;
+			t2 = tmp;
+		}
+		if(t1 > Tnear)
+			Tnear = t1;
+		if(t2 < Tfar)
+			Tfar = t2;
+		if(Tnear > Tfar)
+			return -1;
+		if(Tfar < 0)
+			return -1;
+	}
+
+	if(ray_dir.y == 0)
+	{
+		if(ray_pos.y < -0.5 || ray_pos.y > 0.5 )
+		return -1;
+	}
+	else
+	{
+		t1 = (-0.5 - ray_pos.y)/ ray_dir.y;
+		t2 = (0.5 - ray_pos.y)/ ray_dir.y;
+		if(t1 > t2)
+		{
+			float tmp = t1;
+			t1 = t2;
+			t2 = tmp;
+		}
+		if(t1 > Tnear)
+			Tnear = t1;
+		if(t2 < Tfar)
+			Tfar = t2;
+		if(Tnear > Tfar)
+			return -1;
+		if(Tfar < 0)
+			return -1;
+	}
+
+	if(ray_dir.z == 0)
+	{
+		if(ray_pos.z < -0.5 || ray_pos.z > 0.5 )
+		return -1;
+	}
+	else
+	{
+		t1 = (-0.5 - ray_pos.z)/ ray_dir.z;
+		t2 = (0.5 - ray_pos.z)/ ray_dir.z;
+		if(t1 > t2)
+		{
+			float tmp = t1;
+			t1 = t2;
+			t2 = tmp;
+		}
+		if(t1 > Tnear)
+			Tnear = t1;
+		if(t2 < Tfar)
+			Tfar = t2;
+		if(Tnear > Tfar)
+			return -1;
+		if(Tfar < 0)
+			return -1;
+	}
+	glm::vec3 realNormal;
+
+	glm::vec3 resultP = ray_pos + Tnear * ray_dir;
+	if(resultP.x - 0.5f > -0.0001f)
+		realNormal =multiplyMV(box.transform, glm::vec4(1.0f,0.0f,0.0f,1.0f));
+	else if(resultP.x + 0.5f < 0.0001f)
+		realNormal =multiplyMV(box.transform, glm::vec4(-1.0f,0.0f,0.0f,1.0f));
+	else if(resultP.y - 0.5f > -0.0001f)
+		realNormal =multiplyMV(box.transform, glm::vec4(0.0f,1.0f,0.0f,1.0f));
+	else if(resultP.y + 0.5f < 0.0001f)
+		realNormal =multiplyMV(box.transform, glm::vec4(0.0f,-1.0f,0.0f,1.0f));
+	else if(resultP.z - 0.5f > -0.0001f)
+		realNormal =multiplyMV(box.transform, glm::vec4(0.0f,0.0f,1.0f,1.0f));
+	else if(resultP.z + 0.5f < 0.0001f)
+		realNormal =multiplyMV(box.transform, glm::vec4(0.0f,0.0f,-1.0f,1.0f));
+	
+	 glm::vec3 realIntersectionPoint = multiplyMV(box.transform, glm::vec4(getPointOnRay(rt, Tnear), 1.0));
+     glm::vec3 realOrigin = multiplyMV(box.transform, glm::vec4(0,0,0,1));
+
+     intersectionPoint = realIntersectionPoint;
+	 normal = glm::normalize(realNormal - realOrigin);
+        
+  return glm::length(r.origin - realIntersectionPoint);
 }
 
 // LOOK: Here's an intersection test example from a sphere. Now you just need to figure out cube and, optionally, triangle.
@@ -116,6 +226,61 @@ __host__ __device__ float sphereIntersectionTest(staticGeom sphere, ray r, glm::
   return glm::length(r.origin - realIntersectionPoint);
 }
 
+__host__ __device__ float Test_RayPolyIntersect(glm::vec3 ray_pos, glm::vec3 ray_dir, glm::vec3 n, glm::vec3 p1, glm::vec3 p2, glm::vec3 p3, glm::vec3& intersectionPoint, glm::vec3& normal) 
+{
+	float d = glm::dot(n,p1);
+	float t = (d - glm::dot(n, ray_pos))/ glm::dot(n,ray_dir);
+	if (t <= 0) {
+		return -1;
+	}
+	glm::vec3 x = ray_pos + t * ray_dir;
+
+	float s1 = glm::dot(glm::cross(p2 - p1, x - p1), n);
+	float s2 = glm::dot(glm::cross(p3 - p2, x - p2), n);
+	float s3 = glm::dot(glm::cross(p1 - p3, x - p3), n);
+	if (s1 >= 0 && s2 >= 0 && s3 >= 0) {
+		intersectionPoint = x;
+		return t;
+	}
+	else {
+		return -1;
+	}
+}
+// Mesh intersection test: returns -1 on a miss, otherwise the object-space ray parameter t of the closest triangle hit
+__host__ __device__ float meshIntersectionTest(staticGeom & m, ray r, glm::vec3& intersectionPoint, glm::vec3& normal){
+  
+	glm::vec3 ro = multiplyMV(m.inverseTransform, glm::vec4(r.origin,1.0f));
+	glm::vec3 rd = glm::normalize(multiplyMV(m.inverseTransform, glm::vec4(r.direction,0.0f)));
+
+	ray rt; rt.origin = ro; rt.direction = rd;
+	glm::vec3 p1,p2,p3;
+	glm::vec3 ray_dir = rd;
+	glm::vec3 ray_pos = ro;
+	glm::vec3 tmp_intersection;
+	float minLength = 999999;
+
+	for (int v = 0; v < m.faceNum; ++v) {
+
+		p1 = m.faces[3*v];
+		p2 = m.faces[3*v+1];
+		p3 = m.faces[3*v+2];
+
+		float tmp = Test_RayPolyIntersect(ray_pos, ray_dir,m.normals[v],  p1,  p2,  p3,tmp_intersection,  normal);
+		 if(tmp > 0.00001f && tmp < minLength)
+		 {
+			 minLength = tmp;
+			// glm:: vec3 n(m.);
+			
+			 intersectionPoint = multiplyMV(m.transform, glm::vec4(tmp_intersection, 1.0f));
+			 normal = glm::normalize(multiplyMV(m.transform, glm::vec4(m.normals[v],0.0f))); // w = 0: a direction must not pick up the translation
+		 }
+	}
+	if(minLength > 0.0001 && minLength != 999999)
+		return minLength;
+	else
+		return -1;
+}
+
 // Returns x,y,z half-dimensions of tightest bounding box
 __host__ __device__ glm::vec3 getRadiuses(staticGeom geom){
     glm::vec3 origin = multiplyMV(geom.transform, glm::vec4(0,0,0,1));
diff --git a/src/main.cpp b/src/main.cpp
index b002500..0d4b696 100755
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -8,6 +8,8 @@
 #include "main.h"
 #define GLEW_STATIC
 
+
+
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
@@ -119,7 +121,7 @@ void runCuda(){
       for (int x=0; x < renderCam->resolution.x; x++) {
         for (int y=0; y < renderCam->resolution.y; y++) {
           int index = x + (y * renderCam->resolution.x);
-          outputImage.writePixelRGB(renderCam->resolution.x-1-x,y,renderCam->image[index]);
+          outputImage.writePixelRGB(renderCam->resolution.x-1-x,y,(renderCam->image[index])/(float)iterations);
         }
       }
       
@@ -168,8 +170,8 @@ bool init(int argc, char* argv[]) {
       return false;
   }
 
-  width = 800;
-  height = 800;
+  width = 1000;
+  height = 1000;
   window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
   if (!window){
       glfwTerminate();
@@ -225,11 +227,12 @@ void initCuda(){
 }
 
 void initTextures(){
-    glGenTextures(1, &displayImage);
+	glGenTextures(1, &displayImage);
     glBindTexture(GL_TEXTURE_2D, displayImage);
+	
     glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
     glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
-    glTexImage2D( GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0, GL_BGRA, GL_UNSIGNED_BYTE, NULL);
+	glTexImage2D(GL_TEXTURE_2D, 0,GL_RGB, width, height, 0, GL_BGR, GL_UNSIGNED_BYTE, NULL);
 }
 
 void initVAO(void){
diff --git a/src/raytraceKernel.cu b/src/raytraceKernel.cu
index 9c7bc7d..d2af984 100644
--- a/src/raytraceKernel.cu
+++ b/src/raytraceKernel.cu
@@ -15,6 +15,80 @@
 #include "raytraceKernel.h"
 #include "intersections.h"
 #include "interactions.h"
+#include "../src/cuPrintf.cu"  
+#include "thrust/copy.h"
+#include "SOIL/SOIL.h"
+
+#define THRESHOLD 0.005f
+#define len(x) sqrtf(x[0]*x[0] + x[1]*x[1] + x[2]*x[2])
+#define FRESNEL 1
+#define DEPTH_FIELD_MODE 0
+#define AMBIENT_OCCLUSION_MODE 0
+#define NORMAL_MODE 0
+
+glm::vec3 * cudaTextures;
+
+// A simple BMP loader. NOTE: this does not work correctly yet.
+glm::vec3 * loadBMP(const char * imagepath)
+{
+	// Data read from the header of the BMP file
+	unsigned char header[54]; // Each BMP file begins with a 54-byte header
+	unsigned int dataPos;     // Position in the file where the actual data begins
+	unsigned int width, height;
+	unsigned int imageSize;   // = width*height*3
+	// Actual RGB data
+	unsigned char * data;
+
+	// Open the file
+	FILE * file = fopen(imagepath, "rb");
+	if(!file)
+	{
+		printf("image could not be opened\n");
+		exit(0);
+	}
+
+	if( fread(header,1,54,file)!=54){
+		printf("Incorrect BMP file! \n");
+		exit(0);
+	}
+	
+	if( header[0] != 'B' || header[1] != 'M')
+	{
+		printf("Incorrect BMP file! \n");
+		exit(0);
+	}
+
+	dataPos    = *(int*)&(header[0x0A]);
+	imageSize  = *(int*)&(header[0x22]);
+	width      = *(int*)&(header[0x12]);
+	height     = *(int*)&(header[0x16]);
+	// Some BMP files are misformatted, guess missing information
+	if (imageSize==0)    imageSize=width*height*3; // 3 : one byte for each Red, Green and Blue component
+	if (dataPos==0)      dataPos=54; // Pixel data normally starts right after the 54-byte header
+
+	// Create a buffer
+	//printf("width = %d, height = %d, image size : %d \n", width,height,imageSize);
+	data = new unsigned char [imageSize];
+ 
+	// Read the actual data from the file into the buffer
+	fread(data,1,imageSize,file);
+	//printf("BMP file loaded! \n");
+	fclose(file);
+
+	int i = 0;
+	glm::vec3 * texture = new glm::vec3[height * width];
+	for( int y = 0; y < height; ++y)
+	{
+		for(int x = 0; x < width; ++x)
+		{ 
+			texture[x + y * width] = glm::vec3((float)data[i]/255.0f, (float)data[i+1]/255.0f,(float)data[i+2]/255.0f);
+			i=i+3;
+		}
+ 
+	}
+		//printf("R: %f, G: %f, B: %f \n", (float)data[i],(float)data[i+1],(float)data[i+2]);
+	return texture;
+}
 
 void checkCUDAError(const char *msg) {
   cudaError_t err = cudaGetLastError();
@@ -37,10 +111,41 @@ __host__ __device__ glm::vec3 generateRandomNumberFromThread(glm::vec2 resolutio
 
 // TODO: IMPLEMENT THIS FUNCTION
 // Function that does the initial raycast from the camera
-__host__ __device__ ray raycastFromCameraKernel(glm::vec2 resolution, float time, int x, int y, glm::vec3 eye, glm::vec3 view, glm::vec3 up, glm::vec2 fov){
+glm::vec3 norm(glm::vec3 in)
+{
+	 glm::vec3 ret = in / (float)len(in);
+	 return ret;
+}
+__host__ __device__ ray raycastFromCameraKernel(glm::vec2 resolution, int x, int y, glm::vec3 eye, glm::vec3 view, glm::vec3 up, glm::vec2 fov, 
+	                                           float focl, float aptr, float time){
+
+  glm::vec3 A = glm::cross(view, up);
+  glm::vec3 B = glm::cross(A,view);
+  glm::vec3 M = view + eye;
+  glm::vec3 V = B * glm::length(view) * tanf(float(fov.y * PI / 180.0)) / glm::length(B); // glm::length(), not vec3::length(), which returns the component count
+  glm::vec3 H = A * glm::length(view) * tanf(float(fov.x * PI / 180.0)) / glm::length(A); 
+
+  thrust::default_random_engine rng(hash((time+1)*(x+1)*(y+1)));
+  thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
+
+  float i = x + (float)u01(rng);
+  float j = y + (float)u01(rng);
+  glm::vec3 P = M + (float)((2.0*i)/(resolution.x-1.0)-1.0) * H +  (float)(2.0*(resolution.y - j - 1.0)/(resolution.y-1.0)-1.0) * V;
+   
   ray r;
-  r.origin = glm::vec3(0,0,0);
-  r.direction = glm::vec3(0,0,-1);
+  r.origin = P;
+  r.direction = glm::normalize(P-eye);
+
+  if(DEPTH_FIELD_MODE)
+  {
+	  glm::vec3 focalPoint = eye + r.direction *focl;
+	  		
+	thrust::uniform_real_distribution<float> u02(-aptr/2, aptr/2);
+	r.origin = eye + A * u02(rng) + B * u02(rng);
+	r.direction = glm::normalize(focalPoint - r.origin);
+  }
+
+ 
   return r;
 }
 
@@ -55,7 +160,7 @@ __global__ void clearImage(glm::vec2 resolution, glm::vec3* image){
 }
 
 //Kernel that writes the image to the OpenGL PBO directly.
-__global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3* image){
+__global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3* image, float time){
   
   int x = (blockIdx.x * blockDim.x) + threadIdx.x;
   int y = (blockIdx.y * blockDim.y) + threadIdx.y;
@@ -64,9 +169,9 @@ __global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3*
   if(x<=resolution.x && y<=resolution.y){
 
       glm::vec3 color;
-      color.x = image[index].x*255.0;
-      color.y = image[index].y*255.0;
-      color.z = image[index].z*255.0;
+      color.x = image[index].x*255.0/time;
+      color.y = image[index].y*255.0/time;
+      color.z = image[index].z*255.0/time;
 
       if(color.x>255){
         color.x = 255;
@@ -88,78 +193,337 @@ __global__ void sendImageToPBO(uchar4* PBOpos, glm::vec2 resolution, glm::vec3*
   }
 }
 
+__host__ __device__ int checkIntersections(ray r, staticGeom* geoms, int numberOfGeoms, glm::vec3 & intersectionPoint, glm::vec3 & normal)
+{
+	int closestGeo = -1;
+	float t = 99999;
+	glm::vec3 tmpN, tmpIntersection;
+	for(int i = 0; i < numberOfGeoms; ++i)
+	{
+		float tmp;
+		if(geoms[i].type == SPHERE)
+			tmp = sphereIntersectionTest(geoms[i], r, intersectionPoint, normal);
+		else if(geoms[i].type == CUBE)
+			tmp = boxIntersectionTest(geoms[i], r, intersectionPoint, normal);
+		else if(geoms[i].type == MESH)
+		{
+			  tmp =  meshIntersectionTest(geoms[i], r,  intersectionPoint, normal);
+			  /*tmpN = normal;
+			  tmpIntersection = intersectionPoint;*/
+		}
+
+		if(tmp >= 0 && tmp < t)
+		{
+			t = tmp;
+			closestGeo = i;
+		}
+	}
+
+	if( closestGeo >= 0 )
+	{
+		if(geoms[closestGeo].type == SPHERE)
+			sphereIntersectionTest(geoms[closestGeo], r, intersectionPoint, normal);
+		else if(geoms[closestGeo].type == CUBE)
+			boxIntersectionTest(geoms[closestGeo], r, intersectionPoint, normal);
+		else if(geoms[closestGeo].type == MESH)
+			meshIntersectionTest(geoms[closestGeo], r, intersectionPoint, normal);
+		return closestGeo;
+	}
+	else
+		return -1;
+
+}
+
+__global__ void genCameraRayBatch(glm::vec2 resolution, cameraData cam,  ray * rays, float time)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+	int index = x + (y * resolution.x);
+	if(x<resolution.x && y<resolution.y)
+	{
+		rays[index] = raycastFromCameraKernel(resolution, x, y, cam.position, cam.view, cam.up, cam.fov, cam.focl, cam.aperture, time);
+		rays[index].id = index;
+		rays[index].rayColor = glm::vec3(1.0f, 1.0f, 1.0f);
+	}
+}
+
+__global__ void buildDirectionLightMap(glm::vec2 resolution, cameraData cam,  ray * rays, float time)
+{
+}
+
+// smoothing kernel
+__global__ void averagePixelColor(glm::vec2 resolution,  glm::vec3* colors, float iterations)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+    int index = x + (y * resolution.x);
+	
+    const int n = 5;
+	int avgIndex[n*n*4];
+
+	if((x<resolution.x-n && y<resolution.y-n) && x>n && y >n)
+    {
+			for(int i = -n; i <n; ++i)
+			{
+				for(int j = -n; j <n; ++j)
+					avgIndex[j+n + (i+n)*2 *n]  = (x+i) + ((y+j) * resolution.x);
+			}
+
+			glm::vec3 newColor(0.0f,0.0f,0.0f);
+			for(int i = 0; i < n*n*4; ++i)
+				newColor += colors[avgIndex[i]];
+			colors[index] =  newColor/(float)(n*n*4);
+	}
+}
+
+__host__ __device__ bool isDiffuse(float seed, float diffuseRate)
+{
+	thrust::default_random_engine rng(hash(seed));
+	thrust::uniform_real_distribution<float> u01(0,1);
+	if (u01(rng) <= diffuseRate) 
+		return true;
+	else 
+		return false;
+
+}
+
+__host__ __device__ bool isReflected(float seed, float IOR, float reflectRate, float refractRate, glm::vec3 normal,
+									glm::vec3 idir, glm::vec3 tdir)
+{
+	float R;
+	if(FRESNEL)
+	{
+		float rs = (IOR * glm::dot(normal, idir) - glm::dot(normal, tdir)) / (IOR * glm::dot(normal, idir) + glm::dot(normal, tdir));
+		float rp = (glm::dot(normal, idir) - IOR * glm::dot(normal, tdir)) / (glm::dot(normal, idir) + IOR * glm::dot(normal, tdir));
+		R = 0.5f * (rs * rs + rp * rp);
+	}
+	else
+	{
+		glm::vec2 r(reflectRate,refractRate);
+		r = glm::normalize(r);
+		R = r.x;
+	}
+
+	thrust::default_random_engine rng(hash(seed));
+	thrust::uniform_real_distribution<float> u01(0,1);
+	if (u01(rng) <= R) 
+		return true;
+	else 
+		return false;
+}
+
 // TODO: IMPLEMENT THIS FUNCTION
 // Core raytracer kernel
-__global__ void raytraceRay(glm::vec2 resolution, float time, cameraData cam, int rayDepth, glm::vec3* colors,
-                            staticGeom* geoms, int numberOfGeoms){
+__global__ void raytraceRay(glm::vec2 resolution, float time, glm::vec3* colors, int rayDepth,
+                            staticGeom* geoms, int numberOfGeoms, material * cudaMat, ray * rays, glm::vec3 * cudaTextures){
 
   int x = (blockIdx.x * blockDim.x) + threadIdx.x;
   int y = (blockIdx.y * blockDim.y) + threadIdx.y;
   int index = x + (y * resolution.x);
+  //cuPrintf("number of rays: %d", numRays );
+  
+  if((x<resolution.x && y<resolution.y))
+  {
+	  if(rays[index].id >= 0)
+	  {
+		 glm::vec3 intersectionPoint;
+		 glm::vec3 normal;
+		 int geoIndex = checkIntersections(rays[index], geoms, numberOfGeoms, intersectionPoint, normal);
+		 //colors[index] = cudaMat[geoms[geoIndex].materialid].color;
+		 if(geoIndex < 0 ) 
+		 {
+			 rays[index].id = -1;
+		 }
+		 else 
+		 {
+			 material mat = cudaMat[geoms[geoIndex].materialid];
+			 if(mat.emittance > 0.0001f)
+			 {
+				 colors[index] += rays[index].rayColor * mat.color * mat.emittance;
+				 rays[index].id = -1;
+			 }			 
+			 else 
+			 {
+				 float seed= (time/10.0f+1.0f) * (index+1.0f) * (rayDepth+1.0f) ;
+				 if((mat.hasReflective > 0.0f || mat.hasRefractive > 0.0f) && !isDiffuse(seed, mat.hasScatter))
+				 {
+					 float IOR = mat.indexOfRefraction;
+					 if(glm::dot(normal, rays[index].direction) > 0)
+					 {
+						 normal = -normal;
+						 IOR = 1.0f/(IOR+THRESHOLD);
+					 }
 
-  if((x<=resolution.x && y<=resolution.y)){
+					 glm::vec3 transmittedRay = glm::refract(rays[index].direction, normal, 1.0f/(IOR+THRESHOLD));
+					 if(!isReflected(seed,IOR, mat.hasReflective, mat.hasRefractive, normal,rays[index].direction,transmittedRay) && mat.hasRefractive > 0.0f)
+					 {
+						 if(glm::length(transmittedRay) > THRESHOLD)
+						 {
+							 rays[index].direction = glm::normalize(transmittedRay);
+							 rays[index].origin = intersectionPoint - normal * THRESHOLD;
+							 rays[index].rayColor =  rays[index].rayColor * mat.color;
+						 }	 
+					 }
+					 else
+					 {
+						 glm::vec3 reflectedRay = glm::reflect(rays[index].direction, normal);
+						 rays[index].direction = glm::normalize(reflectedRay);
+						 rays[index].origin = intersectionPoint + normal * THRESHOLD;
+						 rays[index].rayColor =  rays[index].rayColor * mat.color;
+					 }
+					 return;
+				 }
 
-    colors[index] = generateRandomNumberFromThread(resolution, time, x, y);
-   }
+				if(glm::dot(rays[index].direction, normal) > 0)
+				{
+					normal = -normal;
+				}
+				thrust::default_random_engine rng(hash(seed));
+				thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
+				//colors[index] = mat.color;
+				//colors[index] += intersectionPoint;
+				//glm::vec3 toLight = geoms[7].translation - intersectionPoint;
+				rays[index].direction = calculateRandomDirectionInHemisphere(normal,  u01(rng), u01(rng));
+				rays[index].origin = intersectionPoint + normal * THRESHOLD;
+				if(mat.isTextured > 0)
+				{
+					glm::vec3 p = multiplyMV(geoms[geoIndex].inverseTransform, glm::vec4(intersectionPoint,1.0f));
+					float r = geoms[geoIndex].scale.x * 0.5f;
+					int s = 256.0f * fabs(cosf((float)p.z/0.5f)/PI) ;
+					int t = 256.0f * fabs(cosf((float)p.x/(float)(r * sinf(PI * s)))/(2.0f * PI));
+		
+					colors[index] += cudaTextures[s + t * 256];
+					//cuPrintf("[s,t] = [%d, %d] with color = (%f, %f, %f) \n", s,t,cudaTextures[s + t * 256].x,cudaTextures[s + t * 256].y,cudaTextures[s + t * 256].z );
+					rays[index].id = -1;
+				}
+				else
+					rays[index].rayColor =  rays[index].rayColor * mat.color;
+		      }
+		   }
+	   }
+   } 
+    //colors[index] = generateRandomNumberFromThread(resolution, time, x, y); 
+  
 }
 
+__host__ __device__ bool isTerminated(ray r)
+{
+	if(r.id==-1)
+		return true;
+	else 
+		return false;
+}
 // TODO: FINISH THIS FUNCTION
 // Wrapper for the __global__ call that sets up the kernel calls and does a ton of memory management
 void cudaRaytraceCore(uchar4* PBOpos, camera* renderCam, int frame, int iterations, material* materials, int numberOfMaterials, geom* geoms, int numberOfGeoms){
   
-  int traceDepth = 1; //determines how many bounces the raytracer traces
+	int traceDepth = 5; //determines how many bounces the raytracer traces
 
-  // set up crucial magic
-  int tileSize = 8;
-  dim3 threadsPerBlock(tileSize, tileSize);
-  dim3 fullBlocksPerGrid((int)ceil(float(renderCam->resolution.x)/float(tileSize)), (int)ceil(float(renderCam->resolution.y)/float(tileSize)));
+	// set up crucial magic
+	int tileSize = 8;
+	dim3 threadsPerBlock(tileSize, tileSize);
+	dim3 fullBlocksPerGrid((int)ceil(float(renderCam->resolution.x)/float(tileSize)), (int)ceil(float(renderCam->resolution.y)/float(tileSize)));
   
-  // send image to GPU
-  glm::vec3* cudaimage = NULL;
-  cudaMalloc((void**)&cudaimage, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3));
-  cudaMemcpy( cudaimage, renderCam->image, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3), cudaMemcpyHostToDevice);
+	// send image to GPU
+	glm::vec3* cudaimage = NULL;
+	cudaMalloc((void**)&cudaimage, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3));
+	cudaMemcpy( cudaimage, renderCam->image, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3), cudaMemcpyHostToDevice);
   
-  // package geometry and materials and sent to GPU
-  staticGeom* geomList = new staticGeom[numberOfGeoms];
-  for(int i=0; i<numberOfGeoms; i++){
-    staticGeom newStaticGeom;
-    newStaticGeom.type = geoms[i].type;
-    newStaticGeom.materialid = geoms[i].materialid;
-    newStaticGeom.translation = geoms[i].translations[frame];
-    newStaticGeom.rotation = geoms[i].rotations[frame];
-    newStaticGeom.scale = geoms[i].scales[frame];
-    newStaticGeom.transform = geoms[i].transforms[frame];
-    newStaticGeom.inverseTransform = geoms[i].inverseTransforms[frame];
-    geomList[i] = newStaticGeom;
-  }
+	// package geometry and materials and send to GPU
+	staticGeom* geomList = new staticGeom[numberOfGeoms];
+	for(int i=0; i<numberOfGeoms; i++){
+	staticGeom newStaticGeom;
+	newStaticGeom.type = geoms[i].type;
+	newStaticGeom.materialid = geoms[i].materialid;
+	newStaticGeom.translation = geoms[i].translations[frame];
+	newStaticGeom.rotation = geoms[i].rotations[frame];
+	newStaticGeom.scale = geoms[i].scales[frame];
+	newStaticGeom.transform = geoms[i].transforms[frame];
+	newStaticGeom.inverseTransform = geoms[i].inverseTransforms[frame];
+
+	if(geoms[i].type ==MESH)
+	{
+		newStaticGeom.faceNum = geoms[i].faceNum;
+
+		for(int k = 0; k < newStaticGeom.faceNum*3; ++k)
+		{
+			newStaticGeom.faces[k] = geoms[i].faces[k];
+			if(k < newStaticGeom.faceNum)
+				newStaticGeom.normals[k] = geoms[i].normals[k];
+		}
+		//std::cout << newStaticGeom.faceNum << std::endl;
+		/*for(int j =0; j<newStaticGeom.faceNum; ++j)
+		{
+			std::cout <<newStaticGeom.faces[3*j].x << " " <<  newStaticGeom.faces[3*j].y << " " << newStaticGeom.faces[3*j].z <<  " | ";
+			std::cout << newStaticGeom.faces[3*j+1].x << " " <<  newStaticGeom.faces[3*j+1].y << " " << newStaticGeom.faces[3*j+1].z << " | ";
+			std::cout << newStaticGeom.faces[3*j+2].x << " " <<  newStaticGeom.faces[3*j+2].y << " " << newStaticGeom.faces[3*j+2].z << std::endl;
+		}*/
+	}
+	geomList[i] = newStaticGeom;
+	}
   
-  staticGeom* cudageoms = NULL;
-  cudaMalloc((void**)&cudageoms, numberOfGeoms*sizeof(staticGeom));
-  cudaMemcpy( cudageoms, geomList, numberOfGeoms*sizeof(staticGeom), cudaMemcpyHostToDevice);
+	staticGeom* cudageoms = NULL;
+	cudaMalloc((void**)&cudageoms, numberOfGeoms*sizeof(staticGeom));
+	cudaMemcpy( cudageoms, geomList, numberOfGeoms*sizeof(staticGeom), cudaMemcpyHostToDevice);
   
-  // package camera
-  cameraData cam;
-  cam.resolution = renderCam->resolution;
-  cam.position = renderCam->positions[frame];
-  cam.view = renderCam->views[frame];
-  cam.up = renderCam->ups[frame];
-  cam.fov = renderCam->fov;
-
-  // kernel launches
-  raytraceRay<<<fullBlocksPerGrid, threadsPerBlock>>>(renderCam->resolution, (float)iterations, cam, traceDepth, cudaimage, cudageoms, numberOfGeoms);
+	material* cudaMat = NULL;
+	cudaMalloc((void**)&cudaMat, numberOfMaterials*sizeof(material));
+	cudaMemcpy( cudaMat, materials, numberOfMaterials*sizeof(material), cudaMemcpyHostToDevice);
+	
+	// package camera
+	cameraData cam;
+	cam.resolution = renderCam->resolution;
+	cam.position = renderCam->positions[frame];
+	cam.view = renderCam->views[frame];
+	cam.up = renderCam->ups[frame];
+	cam.fov = renderCam->fov;
+	cam.focl = renderCam->focl;
+	cam.aperture = renderCam->aperture;
 
-  sendImageToPBO<<<fullBlocksPerGrid, threadsPerBlock>>>(PBOpos, renderCam->resolution, cudaimage);
+	// copy texture
+	glm::vec3 * texture = loadBMP("metal1b.bmp");
+	glm::vec3 texture2[256*256];
+	for(int i = 0; i < 256 * 256; ++i)
+		texture2[i] = texture[i];
+	cudaMalloc((void**) &cudaTextures, 256 * 256 * sizeof(glm::vec3));
+	cudaMemcpy(cudaTextures, texture2, 256 * 256 * sizeof(glm::vec3), cudaMemcpyHostToDevice);
 
-  // retrieve image from GPU
-  cudaMemcpy( renderCam->image, cudaimage, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3), cudaMemcpyDeviceToHost);
 
-  // free up stuff, or else we'll leak memory like a madman
-  cudaFree( cudaimage );
-  cudaFree( cudageoms );
-  delete geomList;
+	// kernel launches
+	
+	ray * cudaRays;
+	cudaMalloc((void**)&cudaRays, (int)renderCam->resolution.x * (int)renderCam->resolution.y * sizeof(ray));
+	
+	
+	genCameraRayBatch<<<fullBlocksPerGrid, threadsPerBlock>>>(cam.resolution, cam,  cudaRays, iterations);
+	//cudaPrintfInit();
+	
+	for( int i = 0; i < traceDepth; ++i)
+	{	
+		raytraceRay<<<fullBlocksPerGrid, threadsPerBlock>>>(renderCam->resolution, (float)iterations, cudaimage, i, cudageoms, numberOfGeoms, cudaMat,cudaRays, cudaTextures);
+	}
+	
+	//cudaPrintfDisplay(stdout, false);
+	//if(iterations == 9999)
+	//	averagePixelColor<<<fullBlocksPerGrid, threadsPerBlock>>>(renderCam->resolution,cudaimage,iterations);
+	
+	sendImageToPBO<<<fullBlocksPerGrid, threadsPerBlock>>>(PBOpos, renderCam->resolution, cudaimage, iterations);
+	//
+	// retrieve image from GPU
+	cudaMemcpy( renderCam->image, cudaimage, (int)renderCam->resolution.x*(int)renderCam->resolution.y*sizeof(glm::vec3), cudaMemcpyDeviceToHost);
 
-  // make certain the kernel has completed
-  cudaThreadSynchronize();
+	// free up stuff, or else we'll leak memory like a madman
+	//cudaPrintfEnd();
+	cudaFree( cudaimage );
+	cudaFree( cudageoms );
+	cudaFree( cudaRays);
+	cudaFree( cudaMat);
+	cudaFree( cudaTextures);
+	delete [] geomList;
+	delete [] texture;
 
-  checkCUDAError("Kernel failed!");
+	// make certain the kernel has completed
+	cudaDeviceSynchronize();
+	
+	checkCUDAError("Kernel failed!");
 }
diff --git a/src/scene.cpp b/src/scene.cpp
index 4cbe216..53831ff 100644
--- a/src/scene.cpp
+++ b/src/scene.cpp
@@ -7,6 +7,7 @@
 #include <iostream>
 #include "scene.h"
 #include <cstring>
+#include "tiny_obj_loader.h"
 
 scene::scene(string filename){
 	cout << "Reading scene from " << filename << " ..." << endl;
@@ -64,6 +65,46 @@ int scene::loadObject(string objectid){
                     cout << "Creating new mesh..." << endl;
                     cout << "Reading mesh from " << line << "... " << endl;
 		    		newObject.type = MESH;
+
+					 std::vector<tinyobj::shape_t> shapes;
+					std::vector<tinyobj::material_t> materials;
+					std::string err =  tinyobj::LoadObj(shapes, materials, "diamond.obj", NULL);
+
+					for (size_t i = 0; i < shapes.size(); i++) {
+						  newObject.faces = new glm::vec3[shapes[i].mesh.indices.size()];
+						  newObject.normals = new glm::vec3[shapes[i].mesh.indices.size()];
+						  newObject.faceNum = shapes[i].mesh.indices.size()/3;
+						  printf("shape[%lu].indices: %lu\n", (unsigned long)i, (unsigned long)shapes[i].mesh.indices.size());
+						  printf(" normal: %lu \n", (unsigned long)shapes[i].mesh.normals.size());
+						  printf(" points: %lu \n", (unsigned long)shapes[i].mesh.positions.size());
+							for (size_t f = 0; f < shapes[i].mesh.indices.size()/3; f++) {
+							  /*printf("  idx[%ld] = %d, %d, %d. mat_id = %d\n", f, shapes[i].mesh.indices[3*f+0], 
+								                                                  shapes[i].mesh.indices[3*f+1], 
+																				  shapes[i].mesh.indices[3*f+2], 
+																				  shapes[i].mesh.material_ids[f]);*/
+							   int index1 = shapes[i].mesh.indices[3*f+0];
+							   int index2 = shapes[i].mesh.indices[3*f+1];
+							   int index3 = shapes[i].mesh.indices[3*f+2];
+							   newObject.faces[3*f + 0] = glm::vec3(shapes[i].mesh.positions[3*index1],
+								                                   shapes[i].mesh.positions[3*index1+1],
+																   shapes[i].mesh.positions[3*index1+2]);
+
+							    newObject.faces[3*f + 1] = glm::vec3(shapes[i].mesh.positions[3*index2],
+								                                   shapes[i].mesh.positions[3*index2+1],
+																   shapes[i].mesh.positions[3*index2+2]);
+
+							    newObject.faces[3*f + 2] = glm::vec3(shapes[i].mesh.positions[3*index3],
+								                                   shapes[i].mesh.positions[3*index3+1],
+																   shapes[i].mesh.positions[3*index3+2]);
+
+								glm::vec3 p1 = newObject.faces[3*f];
+								glm::vec3 p2 = newObject.faces[3*f+1];
+								glm::vec3 p3 = newObject.faces[3*f+2];
+
+								newObject.normals[f] = glm::normalize( glm::cross(p2-p1,p3-p1));
+							}
+					}
+	
                 }else{
                     cout << "ERROR: " << line << " is not a valid object type!" << endl;
                     return -1;
@@ -172,7 +213,7 @@ int scene::loadCamera(){
         }
 	    
 	    //load camera properties
-	    for(int i=0; i<3; i++){
+	    for(int i=0; i<5; i++){
             //glm::vec3 translation; glm::vec3 rotation; glm::vec3 scale;
             utilityCore::safeGetline(fp_in,line);
             tokens = utilityCore::tokenizeString(line);
@@ -182,6 +223,12 @@ int scene::loadCamera(){
                 views.push_back(glm::vec3(atof(tokens[1].c_str()), atof(tokens[2].c_str()), atof(tokens[3].c_str())));
             }else if(strcmp(tokens[0].c_str(), "UP")==0){
                 ups.push_back(glm::vec3(atof(tokens[1].c_str()), atof(tokens[2].c_str()), atof(tokens[3].c_str())));
+            }
+			else if(strcmp(tokens[0].c_str(), "FOCL")==0){
+				newCamera.focl = atof(tokens[1].c_str());
+            }
+			else if(strcmp(tokens[0].c_str(), "APTR")==0){
+				newCamera.aperture = atof(tokens[1].c_str());
             }
 	    }
 	    
@@ -229,7 +276,7 @@ int scene::loadMaterial(string materialid){
 		material newMaterial;
 	
 		//load static properties
-		for(int i=0; i<10; i++){
+		for(int i=0; i<11; i++){
 			string line;
             utilityCore::safeGetline(fp_in,line);
 			vector<string> tokens = utilityCore::tokenizeString(line);
@@ -256,7 +303,8 @@ int scene::loadMaterial(string materialid){
 				newMaterial.reducedScatterCoefficient = atof(tokens[1].c_str());					  
 			}else if(strcmp(tokens[0].c_str(), "EMITTANCE")==0){
 				newMaterial.emittance = atof(tokens[1].c_str());					  
-			
+			}else if(strcmp(tokens[0].c_str(), "TEXTURE")==0){
+				newMaterial.isTextured = atof(tokens[1].c_str());	
 			}
 		}
 		materials.push_back(newMaterial);
diff --git a/src/sceneStructs.h b/src/sceneStructs.h
index 5e0c853..6398aa4 100644
--- a/src/sceneStructs.h
+++ b/src/sceneStructs.h
@@ -10,12 +10,16 @@
 #include "cudaMat4.h"
 #include <cuda_runtime.h>
 #include <string>
+#include <stdlib.h>
+#include "tiny_obj_loader.h"
 
 enum GEOMTYPE{ SPHERE, CUBE, MESH };
 
 struct ray {
 	glm::vec3 origin;
 	glm::vec3 direction;
+	int id;
+	glm::vec3 rayColor;
 };
 
 struct geom {
@@ -27,6 +31,9 @@ struct geom {
 	glm::vec3* scales;
 	cudaMat4* transforms;
 	cudaMat4* inverseTransforms;
+	glm::vec3* faces;
+	glm::vec3* normals;
+	int faceNum;
 };
 
 struct staticGeom {
@@ -37,6 +44,9 @@ struct staticGeom {
 	glm::vec3 scale;
 	cudaMat4 transform;
 	cudaMat4 inverseTransform;
+	glm::vec3 faces[27];
+	glm::vec3 normals[27];
+	int faceNum;
 };
 
 struct cameraData {
@@ -45,6 +55,8 @@ struct cameraData {
 	glm::vec3 view;
 	glm::vec3 up;
 	glm::vec2 fov;
+	float focl;
+	float aperture;
 };
 
 struct camera {
@@ -58,6 +70,8 @@ struct camera {
 	glm::vec3* image;
 	ray* rayList;
 	std::string imageName;
+	float focl;
+	float aperture;
 };
 
 struct material{
@@ -71,6 +85,8 @@ struct material{
 	glm::vec3 absorptionCoefficient;
 	float reducedScatterCoefficient;
 	float emittance;
+	int isTextured;
 };
 
+
 #endif //CUDASTRUCTS_H
diff --git a/src/tiny_obj_loader.cc b/src/tiny_obj_loader.cc
new file mode 100644
index 0000000..75f0dca
--- /dev/null
+++ b/src/tiny_obj_loader.cc
@@ -0,0 +1,725 @@
+//
+// Copyright 2012-2013, Syoyo Fujita.
+// 
+// Licensed under 2-clause BSD license.
+//
+
+//
+// version 0.9.7: Support multi-materials(per-face material ID) per object/group.
+// version 0.9.6: Support Ni(index of refraction) mtl parameter.
+//                Parse transmittance material parameter correctly.
+// version 0.9.5: Parse multiple group name.
+//                Add support of specifying the base path to load material file.
+// version 0.9.4: Initial support of group tag (g)
+// version 0.9.3: Fix parsing triple 'x/y/z'
+// version 0.9.2: Add more .mtl load support
+// version 0.9.1: Add initial .mtl load support
+// version 0.9.0: Initial
+//
+
+
+#include <cstdlib>
+#include <cstring>
+#include <cassert>
+
+#include <string>
+#include <vector>
+#include <map>
+#include <fstream>
+#include <sstream>
+
+#include "tiny_obj_loader.h"
+
+namespace tinyobj {
+
+struct vertex_index {
+  int v_idx, vt_idx, vn_idx;
+  vertex_index() {};
+  vertex_index(int idx) : v_idx(idx), vt_idx(idx), vn_idx(idx) {};
+  vertex_index(int vidx, int vtidx, int vnidx) : v_idx(vidx), vt_idx(vtidx), vn_idx(vnidx) {};
+
+};
+// for std::map
+static inline bool operator<(const vertex_index& a, const vertex_index& b)
+{
+  if (a.v_idx != b.v_idx) return (a.v_idx < b.v_idx);
+  if (a.vn_idx != b.vn_idx) return (a.vn_idx < b.vn_idx);
+  if (a.vt_idx != b.vt_idx) return (a.vt_idx < b.vt_idx);
+
+  return false;
+}
+
+struct obj_shape {
+  std::vector<float> v;
+  std::vector<float> vn;
+  std::vector<float> vt;
+};
+
+static inline bool isSpace(const char c) {
+  return (c == ' ') || (c == '\t');
+}
+
+static inline bool isNewLine(const char c) {
+  return (c == '\r') || (c == '\n') || (c == '\0');
+}
+
+// Make the index zero-based, and also support relative (negative) indices.
+static inline int fixIndex(int idx, int n)
+{
+  int i;
+
+  if (idx > 0) {
+    i = idx - 1;
+  } else if (idx == 0) {
+    i = 0;
+  } else { // negative value = relative
+    i = n + idx;
+  }
+  return i;
+}
+
+static inline std::string parseString(const char*& token)
+{
+  std::string s;
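+  int b = strspn(token, " \t");
+  int e = b + strcspn(token + b, " \t\r");  // measure the word from its first non-space char
+  s = std::string(&token[b], &token[e]);
+
+  token += e;  // advance past the leading whitespace and the word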
+  return s;
+}
+
+static inline int parseInt(const char*& token)
+{
+  token += strspn(token, " \t");
+  int i = atoi(token);
+  token += strcspn(token, " \t\r");
+  return i;
+}
+
+static inline float parseFloat(const char*& token)
+{
+  token += strspn(token, " \t");
+  float f = (float)atof(token);
+  token += strcspn(token, " \t\r");
+  return f;
+}
+
+static inline void parseFloat2(
+  float& x, float& y,
+  const char*& token)
+{
+  x = parseFloat(token);
+  y = parseFloat(token);
+}
+
+static inline void parseFloat3(
+  float& x, float& y, float& z,
+  const char*& token)
+{
+  x = parseFloat(token);
+  y = parseFloat(token);
+  z = parseFloat(token);
+}
+
+
+// Parse triples: i, i/j/k, i//k, i/j
+static vertex_index parseTriple(
+  const char* &token,
+  int vsize,
+  int vnsize,
+  int vtsize)
+{
+    vertex_index vi(-1);
+
+    vi.v_idx = fixIndex(atoi(token), vsize);
+    token += strcspn(token, "/ \t\r");
+    if (token[0] != '/') {
+      return vi;
+    }
+    token++;
+
+    // i//k
+    if (token[0] == '/') {
+      token++;
+      vi.vn_idx = fixIndex(atoi(token), vnsize);
+      token += strcspn(token, "/ \t\r");
+      return vi;
+    }
+    
+    // i/j/k or i/j
+    vi.vt_idx = fixIndex(atoi(token), vtsize);
+    token += strcspn(token, "/ \t\r");
+    if (token[0] != '/') {
+      return vi;
+    }
+
+    // i/j/k
+    token++;  // skip '/'
+    vi.vn_idx = fixIndex(atoi(token), vnsize);
+    token += strcspn(token, "/ \t\r");
+    return vi; 
+}
+
+static unsigned int
+updateVertex(
+  std::map<vertex_index, unsigned int>& vertexCache,
+  std::vector<float>& positions,
+  std::vector<float>& normals,
+  std::vector<float>& texcoords,
+  const std::vector<float>& in_positions,
+  const std::vector<float>& in_normals,
+  const std::vector<float>& in_texcoords,
+  const vertex_index& i)
+{
+  const std::map<vertex_index, unsigned int>::iterator it = vertexCache.find(i);
+
+  if (it != vertexCache.end()) {
+    // found cache
+    return it->second;
+  }
+
+  assert(in_positions.size() > (unsigned int) (3*i.v_idx+2));
+
+  positions.push_back(in_positions[3*i.v_idx+0]);
+  positions.push_back(in_positions[3*i.v_idx+1]);
+  positions.push_back(in_positions[3*i.v_idx+2]);
+
+  if (i.vn_idx >= 0) {
+    normals.push_back(in_normals[3*i.vn_idx+0]);
+    normals.push_back(in_normals[3*i.vn_idx+1]);
+    normals.push_back(in_normals[3*i.vn_idx+2]);
+  }
+
+  if (i.vt_idx >= 0) {
+    texcoords.push_back(in_texcoords[2*i.vt_idx+0]);
+    texcoords.push_back(in_texcoords[2*i.vt_idx+1]);
+  }
+
+  unsigned int idx = positions.size() / 3 - 1;
+  vertexCache[i] = idx;
+
+  return idx;
+}
+
+void InitMaterial(material_t& material) {
+  material.name = "";
+  material.ambient_texname = "";
+  material.diffuse_texname = "";
+  material.specular_texname = "";
+  material.normal_texname = "";
+  for (int i = 0; i < 3; i ++) {
+    material.ambient[i] = 0.f;
+    material.diffuse[i] = 0.f;
+    material.specular[i] = 0.f;
+    material.transmittance[i] = 0.f;
+    material.emission[i] = 0.f;
+  }
+  material.illum = 0;
+  material.dissolve = 1.f;
+  material.shininess = 1.f;
+  material.ior = 1.f;
+  material.unknown_parameter.clear();
+}
+
+static bool
+exportFaceGroupToShape(
+  shape_t& shape,
+  std::map<vertex_index, unsigned int> vertexCache,
+  const std::vector<float> &in_positions,
+  const std::vector<float> &in_normals,
+  const std::vector<float> &in_texcoords,
+  const std::vector<std::vector<vertex_index> >& faceGroup,
+  const int material_id,
+  const std::string &name,
+  bool clearCache)
+{
+  if (faceGroup.empty()) {
+    return false;
+  }
+
+  size_t offset;
+
+  offset = shape.mesh.indices.size();
+
+  // Flatten vertices and indices
+  for (size_t i = 0; i < faceGroup.size(); i++) {
+    const std::vector<vertex_index>& face = faceGroup[i];
+
+    vertex_index i0 = face[0];
+    vertex_index i1(-1);
+    vertex_index i2 = face[1];
+
+    size_t npolys = face.size();
+
+    // Polygon -> triangle fan conversion
+    for (size_t k = 2; k < npolys; k++) {
+      i1 = i2;
+      i2 = face[k];
+
+      unsigned int v0 = updateVertex(vertexCache, shape.mesh.positions, shape.mesh.normals, shape.mesh.texcoords, in_positions, in_normals, in_texcoords, i0);
+      unsigned int v1 = updateVertex(vertexCache, shape.mesh.positions, shape.mesh.normals, shape.mesh.texcoords, in_positions, in_normals, in_texcoords, i1);
+      unsigned int v2 = updateVertex(vertexCache, shape.mesh.positions, shape.mesh.normals, shape.mesh.texcoords, in_positions, in_normals, in_texcoords, i2);
+
+      shape.mesh.indices.push_back(v0);
+      shape.mesh.indices.push_back(v1);
+      shape.mesh.indices.push_back(v2);
+
+      shape.mesh.material_ids.push_back(material_id);
+    }
+
+  }
+
+  shape.name = name;
+
+  if (clearCache)
+      vertexCache.clear();
+
+  return true;
+
+}
+
+std::string LoadMtl (
+  std::map<std::string, int>& material_map,
+  std::vector<material_t>& materials,
+  std::istream& inStream)
+{
+  material_map.clear();
+  std::stringstream err;
+
+  material_t material;
+  
+  int maxchars = 8192;  // Alloc enough size.
+  std::vector<char> buf(maxchars);  // Alloc enough size.
+  while (inStream.peek() != -1) {
+    inStream.getline(&buf[0], maxchars);
+
+    std::string linebuf(&buf[0]);
+
+    // Trim newline '\r\n' or '\n'
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\n') linebuf.erase(linebuf.size()-1);
+    }
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\r') linebuf.erase(linebuf.size()-1);
+    }
+
+    // Skip if empty line.
+    if (linebuf.empty()) {
+      continue;
+    }
+
+    // Skip leading space.
+    const char* token = linebuf.c_str();
+    token += strspn(token, " \t");
+
+    assert(token);
+    if (token[0] == '\0') continue; // empty line
+    
+    if (token[0] == '#') continue;  // comment line
+    
+    // new mtl
+    if ((0 == strncmp(token, "newmtl", 6)) && isSpace((token[6]))) {
+      // flush previous material.
+      if (!material.name.empty())
+      {
+          material_map.insert(std::pair<std::string, int>(material.name, materials.size()));
+          materials.push_back(material);
+      }
+
+      // initialize temporary material
+      InitMaterial(material);
+
+      // set new mtl name
+      char namebuf[4096];
+      token += 7;
+      sscanf(token, "%4095s", namebuf);  // bound the read to namebuf's size
+      material.name = namebuf;
+      continue;
+    }
+    
+    // ambient
+    if (token[0] == 'K' && token[1] == 'a' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.ambient[0] = r;
+      material.ambient[1] = g;
+      material.ambient[2] = b;
+      continue;
+    }
+    
+    // diffuse
+    if (token[0] == 'K' && token[1] == 'd' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.diffuse[0] = r;
+      material.diffuse[1] = g;
+      material.diffuse[2] = b;
+      continue;
+    }
+    
+    // specular
+    if (token[0] == 'K' && token[1] == 's' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.specular[0] = r;
+      material.specular[1] = g;
+      material.specular[2] = b;
+      continue;
+    }
+    
+    // transmittance
+    if (token[0] == 'K' && token[1] == 't' && isSpace((token[2]))) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.transmittance[0] = r;
+      material.transmittance[1] = g;
+      material.transmittance[2] = b;
+      continue;
+    }
+
+    // ior(index of refraction)
+    if (token[0] == 'N' && token[1] == 'i' && isSpace((token[2]))) {
+      token += 2;
+      material.ior = parseFloat(token);
+      continue;
+    }
+
+    // emission
+    if(token[0] == 'K' && token[1] == 'e' && isSpace(token[2])) {
+      token += 2;
+      float r, g, b;
+      parseFloat3(r, g, b, token);
+      material.emission[0] = r;
+      material.emission[1] = g;
+      material.emission[2] = b;
+      continue;
+    }
+
+    // shininess
+    if(token[0] == 'N' && token[1] == 's' && isSpace(token[2])) {
+      token += 2;
+      material.shininess = parseFloat(token);
+      continue;
+    }
+
+    // illum model
+    if (0 == strncmp(token, "illum", 5) && isSpace(token[5])) {
+      token += 6;
+      material.illum = parseInt(token);
+      continue;
+    }
+
+    // dissolve
+    if ((token[0] == 'd' && isSpace(token[1]))) {
+      token += 1;
+      material.dissolve = parseFloat(token);
+      continue;
+    }
+    if (token[0] == 'T' && token[1] == 'r' && isSpace(token[2])) {
+      token += 2;
+      material.dissolve = parseFloat(token);
+      continue;
+    }
+
+    // ambient texture
+    if ((0 == strncmp(token, "map_Ka", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.ambient_texname = token;
+      continue;
+    }
+
+    // diffuse texture
+    if ((0 == strncmp(token, "map_Kd", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.diffuse_texname = token;
+      continue;
+    }
+
+    // specular texture
+    if ((0 == strncmp(token, "map_Ks", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.specular_texname = token;
+      continue;
+    }
+
+    // normal texture
+    if ((0 == strncmp(token, "map_Ns", 6)) && isSpace(token[6])) {
+      token += 7;
+      material.normal_texname = token;
+      continue;
+    }
+
+    // unknown parameter
+    const char* _space = strchr(token, ' ');
+    if(!_space) {
+      _space = strchr(token, '\t');
+    }
+    if(_space) {
+      int len = _space - token;
+      std::string key(token, len);
+      std::string value = _space + 1;
+      material.unknown_parameter.insert(std::pair<std::string, std::string>(key, value));
+    }
+  }
+  // flush last material.
+  material_map.insert(std::pair<std::string, int>(material.name, materials.size()));
+  materials.push_back(material);
+
+  return err.str();
+}
+
+std::string MaterialFileReader::operator() (
+    const std::string& matId,
+    std::vector<material_t>& materials,
+    std::map<std::string, int>& matMap)
+{
+  std::string filepath;
+
+  if (!m_mtlBasePath.empty()) {
+    filepath = std::string(m_mtlBasePath) + matId;
+  } else {
+    filepath = matId;
+  }
+
+  std::ifstream matIStream(filepath.c_str());
+  return LoadMtl(matMap, materials, matIStream);
+}
+
+std::string
+LoadObj(
+  std::vector<shape_t>& shapes,
+  std::vector<material_t>& materials,   // [output]
+  const char* filename,
+  const char* mtl_basepath)
+{
+
+  shapes.clear();
+
+  std::stringstream err;
+
+  std::ifstream ifs(filename);
+  if (!ifs) {
+    err << "Cannot open file [" << filename << "]" << std::endl;
+    return err.str();
+  }
+
+  std::string basePath;
+  if (mtl_basepath) {
+    basePath = mtl_basepath;
+  }
+  MaterialFileReader matFileReader( basePath );
+  
+  return LoadObj(shapes, materials, ifs, matFileReader);
+}
+
+std::string LoadObj(
+  std::vector<shape_t>& shapes,
+  std::vector<material_t>& materials,   // [output]
+  std::istream& inStream,
+  MaterialReader& readMatFn)
+{
+  std::stringstream err;
+
+  std::vector<float> v;
+  std::vector<float> vn;
+  std::vector<float> vt;
+  std::vector<std::vector<vertex_index> > faceGroup;
+  std::string name;
+
+  // material
+  std::map<std::string, int> material_map;
+  std::map<vertex_index, unsigned int> vertexCache;
+  int  material = -1;
+
+  shape_t shape;
+
+  int maxchars = 8192;  // Alloc enough size.
+  std::vector<char> buf(maxchars);  // Alloc enough size.
+  while (inStream.peek() != -1) {
+    inStream.getline(&buf[0], maxchars);
+
+    std::string linebuf(&buf[0]);
+
+    // Trim newline '\r\n' or '\n'
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\n') linebuf.erase(linebuf.size()-1);
+    }
+    if (linebuf.size() > 0) {
+      if (linebuf[linebuf.size()-1] == '\r') linebuf.erase(linebuf.size()-1);
+    }
+
+    // Skip if empty line.
+    if (linebuf.empty()) {
+      continue;
+    }
+
+    // Skip leading space.
+    const char* token = linebuf.c_str();
+    token += strspn(token, " \t");
+
+    assert(token);
+    if (token[0] == '\0') continue; // empty line
+    
+    if (token[0] == '#') continue;  // comment line
+
+    // vertex
+    if (token[0] == 'v' && isSpace((token[1]))) {
+      token += 2;
+      float x, y, z;
+      parseFloat3(x, y, z, token);
+      v.push_back(x);
+      v.push_back(y);
+      v.push_back(z);
+      continue;
+    }
+
+    // normal
+    if (token[0] == 'v' && token[1] == 'n' && isSpace((token[2]))) {
+      token += 3;
+      float x, y, z;
+      parseFloat3(x, y, z, token);
+      vn.push_back(x);
+      vn.push_back(y);
+      vn.push_back(z);
+      continue;
+    }
+
+    // texcoord
+    if (token[0] == 'v' && token[1] == 't' && isSpace((token[2]))) {
+      token += 3;
+      float x, y;
+      parseFloat2(x, y, token);
+      vt.push_back(x);
+      vt.push_back(y);
+      continue;
+    }
+
+    // face
+    if (token[0] == 'f' && isSpace((token[1]))) {
+      token += 2;
+      token += strspn(token, " \t");
+
+      std::vector<vertex_index> face;
+      while (!isNewLine(token[0])) {
+        vertex_index vi = parseTriple(token, v.size() / 3, vn.size() / 3, vt.size() / 2);
+        face.push_back(vi);
+        int n = strspn(token, " \t\r");
+        token += n;
+      }
+
+      faceGroup.push_back(face);
+      
+      continue;
+    }
+
+    // use mtl
+    if ((0 == strncmp(token, "usemtl", 6)) && isSpace((token[6]))) {
+
+      char namebuf[4096];
+      token += 7;
+      sscanf(token, "%4095s", namebuf);  // bound the read to namebuf's size
+
+      bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, false);
+      faceGroup.clear();
+
+      if (material_map.find(namebuf) != material_map.end()) {
+        material = material_map[namebuf];
+      } else {
+        // { error!! material not found }
+        material = -1;
+      }
+
+      continue;
+
+    }
+
+    // load mtl
+    if ((0 == strncmp(token, "mtllib", 6)) && isSpace((token[6]))) {
+      char namebuf[4096];
+      token += 7;
+      sscanf(token, "%4095s", namebuf);  // bound the read to namebuf's size
+        
+      std::string err_mtl = readMatFn(namebuf, materials, material_map);
+      if (!err_mtl.empty()) {
+        faceGroup.clear();  // for safety
+        return err_mtl;
+      }
+      
+      continue;
+    }
+
+    // group name
+    if (token[0] == 'g' && isSpace((token[1]))) {
+
+      // flush previous face group.
+      bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, true);
+      if (ret) {
+        shapes.push_back(shape);
+      }
+
+      shape = shape_t();
+
+      //material = -1;
+      faceGroup.clear();
+
+      std::vector<std::string> names;
+      while (!isNewLine(token[0])) {
+        std::string str = parseString(token);
+        names.push_back(str);
+        token += strspn(token, " \t\r"); // skip tag
+      }
+
+      assert(names.size() > 0);
+
+      // names[0] must be 'g', so skip the 0th element.
+      if (names.size() > 1) {
+        name = names[1];
+      } else {
+        name = "";
+      }
+
+      continue;
+    }
+
+    // object name
+    if (token[0] == 'o' && isSpace((token[1]))) {
+
+      // flush previous face group.
+      bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, true);
+      if (ret) {
+        shapes.push_back(shape);
+      }
+
+      //material = -1;
+      faceGroup.clear();
+      shape = shape_t();
+
+      // @todo { multiple object name? }
+      char namebuf[4096];
+      token += 2;
+      sscanf(token, "%4095s", namebuf);  // bound the read to namebuf's size
+      name = std::string(namebuf);
+
+
+      continue;
+    }
+
+    // Ignore unknown command.
+  }
+
+  bool ret = exportFaceGroupToShape(shape, vertexCache, v, vn, vt, faceGroup, material, name, true);
+  if (ret) {
+    shapes.push_back(shape);
+  }
+  faceGroup.clear();  // for safety
+
+  return err.str();
+}
+
+
+}
diff --git a/src/tiny_obj_loader.h b/src/tiny_obj_loader.h
new file mode 100644
index 0000000..a58d7be
--- /dev/null
+++ b/src/tiny_obj_loader.h
@@ -0,0 +1,107 @@
+//
+// Copyright 2012-2013, Syoyo Fujita.
+//
+// Licensed under 2-clause BSD license.
+//
+#ifndef _TINY_OBJ_LOADER_H
+#define _TINY_OBJ_LOADER_H
+
+#include <string>
+#include <vector>
+#include <map>
+
+namespace tinyobj {
+
+typedef struct
+{
+    std::string name;
+
+    float ambient[3];
+    float diffuse[3];
+    float specular[3];
+    float transmittance[3];
+    float emission[3];
+    float shininess;
+    float ior;                // index of refraction
+    float dissolve;           // 1 == opaque; 0 == fully transparent
+    // illumination model (see http://www.fileformat.info/format/material/)
+    int illum;
+
+    std::string ambient_texname;
+    std::string diffuse_texname;
+    std::string specular_texname;
+    std::string normal_texname;
+    std::map<std::string, std::string> unknown_parameter;
+} material_t;
+
+typedef struct
+{
+    std::vector<float>          positions;
+    std::vector<float>          normals;
+    std::vector<float>          texcoords;
+    std::vector<unsigned int>   indices;
+    std::vector<int>            material_ids; // per-face (per-triangle) material ID
+} mesh_t;
+
+typedef struct
+{
+    std::string  name;
+    mesh_t       mesh;
+} shape_t;
+
+class MaterialReader
+{
+public:
+    MaterialReader(){}
+    virtual ~MaterialReader(){}
+
+    virtual std::string operator() (
+        const std::string& matId,
+        std::vector<material_t>& materials,
+        std::map<std::string, int>& matMap) = 0;
+};
+
+class MaterialFileReader:
+  public MaterialReader
+{
+    public:
+        MaterialFileReader(const std::string& mtl_basepath): m_mtlBasePath(mtl_basepath) {}
+        virtual ~MaterialFileReader() {}
+        virtual std::string operator() (
+          const std::string& matId,
+          std::vector<material_t>& materials,
+          std::map<std::string, int>& matMap);
+
+    private:
+        std::string m_mtlBasePath;
+};
+
+/// Loads an .obj from a file.
+/// 'shapes' will be filled with the parsed shape data.
+/// The function returns an error string;
+/// the string is empty when the .obj loads successfully.
+/// 'mtl_basepath' is optional and is used as the base path for the .mtl file.
+std::string LoadObj(
+    std::vector<shape_t>& shapes,   // [output]
+    std::vector<material_t>& materials,   // [output]
+    const char* filename,
+    const char* mtl_basepath = NULL);
+
+/// Loads an object from a std::istream and uses 'readMatFn' to load
+/// the materials it references.
+/// Returns an empty string when the .obj loads successfully.
+std::string LoadObj(
+    std::vector<shape_t>& shapes,   // [output]
+    std::vector<material_t>& materials,   // [output]
+    std::istream& inStream,
+    MaterialReader& readMatFn);
+
+/// Loads materials into std::map
+/// Returns an empty string if successful
+std::string LoadMtl (
+  std::map<std::string, int>& material_map,
+  std::vector<material_t>& materials,
+  std::istream& inStream);
+}
+
+#endif  // _TINY_OBJ_LOADER_H
diff --git a/src/utilities.cpp b/src/utilities.cpp
index a8e5d90..6f0ae4d 100755
--- a/src/utilities.cpp
+++ b/src/utilities.cpp
@@ -4,7 +4,7 @@
 //  File: utilities.cpp
 //  A collection/kitchen sink of generally useful functions
 
-#define GLM_FORCE_RADIANS
+//#define GLM_FORCE_RADIANS
 
 #include <glm/gtc/matrix_transform.hpp>
 #include <glm/gtc/matrix_inverse.hpp>
diff --git a/test.0.jpg b/test.0.jpg
new file mode 100644
index 0000000..59b664d
Binary files /dev/null and b/test.0.jpg differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/Stone.bmp b/windows/Project3-Pathtracer/Project3-Pathtracer/Stone.bmp
new file mode 100644
index 0000000..d3c1df2
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/Stone.bmp differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/bricks_red.tga b/windows/Project3-Pathtracer/Project3-Pathtracer/bricks_red.tga
new file mode 100644
index 0000000..b5b5183
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/bricks_red.tga differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/bunnyl.mtl b/windows/Project3-Pathtracer/Project3-Pathtracer/bunnyl.mtl
new file mode 100644
index 0000000..c89145d
--- /dev/null
+++ b/windows/Project3-Pathtracer/Project3-Pathtracer/bunnyl.mtl
@@ -0,0 +1,6 @@
+newmtl initialShadingGroup
+illum 4
+Kd 0.50 0.50 0.50
+Ka 0.00 0.00 0.00
+Tf 1.00 1.00 1.00
+Ni 1.00
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/depth_field.jpg b/windows/Project3-Pathtracer/Project3-Pathtracer/depth_field.jpg
new file mode 100644
index 0000000..ac56121
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/depth_field.jpg differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/metal1a.bmp b/windows/Project3-Pathtracer/Project3-Pathtracer/metal1a.bmp
new file mode 100644
index 0000000..1e5ad94
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/metal1a.bmp differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/metal1b.bmp b/windows/Project3-Pathtracer/Project3-Pathtracer/metal1b.bmp
new file mode 100644
index 0000000..29e3706
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/metal1b.bmp differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/obj_loader.jpg b/windows/Project3-Pathtracer/Project3-Pathtracer/obj_loader.jpg
new file mode 100644
index 0000000..44254f0
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/obj_loader.jpg differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/simple_path_tracer.bmp b/windows/Project3-Pathtracer/Project3-Pathtracer/simple_path_tracer.bmp
new file mode 100644
index 0000000..7327c43
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/simple_path_tracer.bmp differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/smoothing_filter.bmp b/windows/Project3-Pathtracer/Project3-Pathtracer/smoothing_filter.bmp
new file mode 100644
index 0000000..1fa44d1
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/smoothing_filter.bmp differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/test.0.jpg b/windows/Project3-Pathtracer/Project3-Pathtracer/test.0.jpg
new file mode 100644
index 0000000..59b664d
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/test.0.jpg differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/test.2.bmp b/windows/Project3-Pathtracer/Project3-Pathtracer/test.2.bmp
new file mode 100644
index 0000000..8905d54
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/test.2.bmp differ
diff --git a/windows/Project3-Pathtracer/Project3-Pathtracer/tex1.bmp b/windows/Project3-Pathtracer/Project3-Pathtracer/tex1.bmp
new file mode 100644
index 0000000..8778f99
Binary files /dev/null and b/windows/Project3-Pathtracer/Project3-Pathtracer/tex1.bmp differ