Add paper visualizations to examples #3020

Merged
merged 35 commits on Aug 29, 2023
Changes from all commits
35 commits
812a705
Add .python-version (from pyenv) to gitignore
roym899 Aug 17, 2023
f590437
Add TAPIR paper walkthrough
roym899 Aug 17, 2023
4af9ab5
Replace playlist id with video id
roym899 Aug 17, 2023
85f3ab2
Try vq=hd720 instead of hd=1
roym899 Aug 17, 2023
4a51c45
Add autoplay to videos
roym899 Aug 17, 2023
7496a90
Add TAPIR images
roym899 Aug 17, 2023
76b7661
Improve texts, revert to hd=1, improve sizing of overview image
roym899 Aug 17, 2023
24f86bd
Center overview image
roym899 Aug 17, 2023
62cb009
Title case for section titles
roym899 Aug 17, 2023
3b88200
Retain standard image resizing behavior
roym899 Aug 17, 2023
ed9bc39
Remove setup tab, and reorder tabs
roym899 Aug 17, 2023
82c4c8e
Revert setup removal
roym899 Aug 17, 2023
6fb32b5
Add empty line before setup
roym899 Aug 17, 2023
9e571a7
Try to remove paper walkthrough tab
roym899 Aug 17, 2023
d2ba497
Add paper walkthroughs tab
roym899 Aug 17, 2023
2321272
Reorder paper walkthroughs
roym899 Aug 17, 2023
14e019b
Add SLAHMR example
roym899 Aug 17, 2023
f121495
Add LIMAP example
roym899 Aug 22, 2023
e839f66
Add Wide Baseline example
roym899 Aug 22, 2023
fd195b3
Add DBW example
roym899 Aug 22, 2023
3e11105
Add Shap-E + Point-E example
roym899 Aug 22, 2023
be097c5
Add SimpleRecon example
roym899 Aug 22, 2023
cbc66b9
Capitalize SLAM
roym899 Aug 22, 2023
1ec67ac
Add MCC example
roym899 Aug 22, 2023
e342ee9
Clean up and reorder examples page
roym899 Aug 22, 2023
67f76bd
Sort walkthroughs by publishing date
roym899 Aug 22, 2023
d76bb2c
Add SAM tag to MCC example
roym899 Aug 22, 2023
6f459c7
Add username to TODOs
roym899 Aug 22, 2023
f3c85b3
Add remaining videos
roym899 Aug 29, 2023
f687435
Separate Setup tab again
roym899 Aug 29, 2023
82cc469
Add SimpleRecon videos
roym899 Aug 29, 2023
27cfcab
Add jax tag
roym899 Aug 29, 2023
64a8304
Rename to paper visualizations
roym899 Aug 29, 2023
9d2d536
Fix enumerations, @rerundotio->Rerun SDK, and remove twitter handles
roym899 Aug 29, 2023
9f03cf7
Keep inline enumeration for (a),(b), (c)
roym899 Aug 29, 2023
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ _deps

# Python virtual environment:
**/venv*
.python-version

# Python build artifacts:
__pycache__
Expand Down
38 changes: 35 additions & 3 deletions examples/manifest.yml
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ root:
individual example sections below.
children:
- name: real-data
title: Examples with real data
title: Examples with Real Data
prelude: |
The following examples illustrate using the Rerun logging SDK with potential real-world (if toy) use cases.
They all require additional data to be downloaded, so an internet connection is needed at least once.
Expand Down Expand Up @@ -97,7 +97,7 @@ root:
python: python/face_tracking

- name: artificial-data
title: Examples with artificial data
title: Examples with Artificial Data
prelude: |
The following examples serve to illustrate various uses of the Rerun logging SDK.
They should not require any additional data downloads, and should run offline.
Expand Down Expand Up @@ -127,10 +127,38 @@ root:
- name: text-logging
python: python/text_logging

- name: paper-walkthrough
title: Paper Visualizations
prelude: |
The following examples use Rerun to create visual walkthroughs of papers. They are typically forks
of the official open-source implementations, with Rerun added as the visualizer.
Check out the respective READMEs for installation instructions.
For the simplest possible examples showing how to use each API,
check out [Loggable Data Types](/docs/reference/data_types).
children:
- name: differentiable_blocks_world
python: python/differentiable_blocks_world
- name: tapir
python: python/tapir
- name: widebaseline
python: python/widebaseline
- name: shape_pointe
python: python/shape_pointe
- name: limap
python: python/limap
- name: simplerecon
python: python/simplerecon
- name: mcc
python: python/mcc
- name: slahmr
python: python/slahmr

- name: setup
title: Setup
prelude: |
Make sure you have the Rerun repository checked out and the latest SDK installed.
### Examples with Real / Artificial Data
To run these examples, make sure you have the Rerun repository checked out
and the latest SDK installed.

```bash
pip install --upgrade rerun-sdk # install the latest Rerun SDK
Expand All @@ -141,3 +169,7 @@ root:
> Note: Make sure your SDK version matches the examples.
For example, if your SDK version is `0.3.1`, check out the matching tag
in the Rerun repository by running `git checkout v0.3.1`.

### Paper Visualizations
To reproduce the paper visualizations, check out the READMEs of the respective
Rerun forks.
39 changes: 39 additions & 0 deletions examples/python/differentiable_blocks_world/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
---
title: "Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives"
python: https://github.com/rerun-io/differentiable-blocksworld
tags: [3D, mesh, pinhole-camera]
thumbnail: https://static.rerun.io/fd44aa668cdebc6a4c14ff038e28f48cfb83c5ee_dbw_480w.png
---

Finding a textured mesh decomposition from a collection of posed images is a very challenging optimization problem. “Differentiable Blocks World” by Tom Monnier et al. shows impressive results using differentiable rendering. I visualized how this optimization works using the Rerun SDK.

https://www.youtube.com/watch?v=Ztwak981Lqg?playlist=Ztwak981Lqg&loop=1&hd=1&rel=0&autoplay=1

In “Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives” the authors describe an optimization of a background icosphere, a ground plane, and multiple superquadrics. The goal is to find the shapes and textures that best explain the observations.

<picture>
<source media="(max-width: 480px)" srcset="https://static.rerun.io/71b822942cb6ce044d6f5f177350c61f0ab31d80_dbw-overview_480w.png">
<source media="(max-width: 768px)" srcset="https://static.rerun.io/9586ea6a3f73d247984f951c07d9cf40dcdf23d2_dbw-overview_768w.png">
<source media="(max-width: 1024px)" srcset="https://static.rerun.io/89bab0c74b2bbff84a606cc3a400f208e1aaadeb_dbw-overview_1024w.png">
<source media="(max-width: 1200px)" srcset="https://static.rerun.io/7c8bec373d0a6c71ea05ffa696acb981137ca579_dbw-overview_1200w.png">
<img src="https://static.rerun.io/a8fea9769b734b2474a1e743259b3e4e68203c0f_dbw-overview_full.png" alt="">
</picture>

The optimization starts from an initial set of superquadrics (”blocks”), a ground plane, and a sphere for the background. From here, the optimization can only reduce the number of blocks, not add additional ones.

https://www.youtube.com/watch?v=bOon26Zdqpc?playlist=bOon26Zdqpc&loop=1&hd=1&rel=0&autoplay=1

A key difference to other differentiable renderers is the handling of transparency. Each mesh has an associated opacity that is optimized. When the opacity drops below a threshold, the mesh is discarded in the visualization. This makes it possible to optimize the number of meshes.
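
A minimal sketch of this opacity-based pruning, not the authors' implementation: each block gets a learnable opacity, and blocks whose opacity falls below a threshold are discarded from further rendering. The `render` and `photometric_loss` functions are hypothetical placeholders.

```python
import torch

# Hypothetical stand-ins for the real differentiable renderer and image loss.
def render(blocks, opacities):
    raise NotImplementedError

def photometric_loss(rendered, observed):
    raise NotImplementedError

def optimize_with_pruning(blocks, observed, steps=1000, threshold=0.1):
    # One learnable opacity (as a logit) per block; sigmoid keeps it in (0, 1).
    # Block shape/texture parameters would be optimized jointly in the real method.
    opacity_logits = torch.zeros(len(blocks), requires_grad=True)
    optimizer = torch.optim.Adam([opacity_logits], lr=1e-2)
    keep = torch.ones(len(blocks), dtype=torch.bool)

    for _ in range(steps):
        opacities = torch.sigmoid(opacity_logits)
        active = [b for b, k in zip(blocks, keep) if k]
        loss = photometric_loss(render(active, opacities[keep]), observed)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Blocks whose opacity drops below the threshold are discarded for good.
        keep &= (torch.sigmoid(opacity_logits) > threshold).detach()
    return [b for b, k in zip(blocks, keep) if k]
```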

https://www.youtube.com/watch?v=d6LkS63eHXo?playlist=d6LkS63eHXo&loop=1&hd=1&rel=0&autoplay=1

To stabilize the optimization and avoid local minima, a 3-stage optimization is employed (sketched in code after the list):
1. the texture resolution is reduced by a factor of 8,
2. the full-resolution texture is optimized, and
3. transparency-based optimization is deactivated, only optimizing the opaque meshes from here.
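
Written as code, such a schedule might look roughly like this; the stage boundaries (40% / 80%) are invented purely for illustration.

```python
def three_stage_schedule(step, total_steps):
    """Toy 3-stage schedule; the stage boundaries are made up for illustration."""
    if step < 0.4 * total_steps:
        return {"texture_downscale": 8, "optimize_opacity": True}   # stage 1: low-res textures
    if step < 0.8 * total_steps:
        return {"texture_downscale": 1, "optimize_opacity": True}   # stage 2: full-res textures
    return {"texture_downscale": 1, "optimize_opacity": False}      # stage 3: opaque meshes only
```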

https://www.youtube.com/watch?v=irxqjUGm34g?playlist=irxqjUGm34g&loop=1&hd=1&rel=0&autoplay=1

Check out the [project page](https://www.tmonnier.com/DBW/), which also contains examples of physical simulation and scene editing enabled by this kind of scene decomposition.

Also make sure to read the [paper](https://arxiv.org/abs/2307.05473) by Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei A. Efros, and Mathieu Aubry. It is an interesting study of how to approach such a difficult optimization problem.
38 changes: 38 additions & 0 deletions examples/python/limap/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
---
title: "3D Line Mapping Revisited"
python: https://github.com/rerun-io/limap
tags: [2D, 3D, structure-from-motion, time-series, line-detection, pinhole-camera]
thumbnail: https://static.rerun.io/1c99ab95ad2a9e673effa0e104f5240912c80850_limap_480w.png
---

Human-made environments contain a lot of straight lines, which are currently not exploited by most mapping approaches. With their recent work "3D Line Mapping Revisited", Shaohui Liu et al. take steps towards changing that.

https://www.youtube.com/watch?v=UdDzfxDo7UQ?playlist=UdDzfxDo7UQ&loop=1&hd=1&rel=0&autoplay=1

The work covers all stages of line-based structure-from-motion: line detection, line matching, line triangulation, track building, and joint optimization. As shown in the figure, detected points and their interactions with lines are also used to aid the reconstruction.

<picture>
<source media="(max-width: 480px)" srcset="https://static.rerun.io/924954fe0cf39a4e02ef51fc48dd5a24bd618cbb_limap-overview_480w.png">
<source media="(max-width: 768px)" srcset="https://static.rerun.io/1c3528db7299ceaf9b7422b5be89c1aad805af7f_limap-overview_768w.png">
<source media="(max-width: 1024px)" srcset="https://static.rerun.io/f6bab491a2fd0ac8215095de65555b66ec932326_limap-overview_1024w.png">
<source media="(max-width: 1200px)" srcset="https://static.rerun.io/8cd2c725f579dbef19c63a187742e16b6b67cf80_limap-overview_1200w.png">
<img src="https://static.rerun.io/8d066d407d2ce1117744555b0e7691c54d7715d4_limap-overview_full.png" alt="">
</picture>

LIMAP matches detected 2D lines between images and computes 3D candidates for each match. These are scored, and only the best candidate is kept (green in the video). To remove duplicates and reduce noise, candidates are grouped together when they likely belong to the same line.
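
A highly simplified sketch of this select-and-group step (LIMAP's actual scoring and track building are more sophisticated; the distance metric, threshold, and greedy merging below are assumptions):

```python
import numpy as np

def endpoint_distance(a, b):
    """Symmetric distance between two 3D segments, each a (2, 3) array of endpoints."""
    return min(
        np.linalg.norm(a[0] - b[0]) + np.linalg.norm(a[1] - b[1]),
        np.linalg.norm(a[0] - b[1]) + np.linalg.norm(a[1] - b[0]),
    )

def best_candidates(candidates_per_match, scores_per_match):
    """Keep only the highest-scoring 3D candidate for every 2D line match."""
    return [cands[int(np.argmax(scores))]
            for cands, scores in zip(candidates_per_match, scores_per_match)]

def group_lines(lines, merge_threshold=0.05):
    """Greedy grouping: candidates closer than the threshold are assumed to be the same line."""
    groups = []
    for line in lines:
        for group in groups:
            if endpoint_distance(line, group[0]) < merge_threshold:
                group.append(line)
                break
        else:
            groups.append([line])
    # One representative per group: the mean of the (consistently ordered) endpoints.
    return [np.mean(np.stack(group), axis=0) for group in groups]
```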

https://www.youtube.com/watch?v=kyrD6IJKxg8?playlist=kyrD6IJKxg8&loop=1&hd=1&rel=0&autoplay=1

Focusing on a single line, LIMAP computes a score for each candidate (the brighter, the higher the cost). These scores are used to decide which line candidates belong to the same line. The final line shown in red is computed based on the candidates that were grouped together.

https://www.youtube.com/watch?v=JTOs_VVOS78?playlist=JTOs_VVOS78&loop=1&hd=1&rel=0&autoplay=1

Once the lines are found, LIMAP further uses point-line associations to jointly optimize lines and points. Often 3D points lie on lines or intersections thereof. Here we highlight the line-point associations in blue.

https://www.youtube.com/watch?v=0xZXPv1o7S0?playlist=0xZXPv1o7S0&loop=1&hd=1&rel=0&autoplay=1

Human-made environments often contain many parallel and orthogonal lines. LIMAP can globally optimize the lines by detecting sets that are likely parallel or orthogonal. Here we visualize these parallel lines; each color is associated with one vanishing point.
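
As an illustration of this kind of visualization, the per-vanishing-point coloring could be logged with the Rerun SDK roughly as follows. The data is a random placeholder, and the snippet uses the current `rr.log`/`rr.LineStrips3D` API, which may differ from the SDK version this example was written against.

```python
import numpy as np
import rerun as rr

rr.init("limap_vanishing_points", spawn=True)

# Hypothetical inputs: N line segments as (N, 2, 3) endpoints and a
# vanishing-point index per segment, as produced by a vanishing point detector.
segments = np.random.rand(20, 2, 3)
vp_ids = np.random.randint(0, 3, size=20)

# One color per vanishing point.
palette = np.array([[228, 26, 28], [55, 126, 184], [77, 175, 74]], dtype=np.uint8)
rr.log("world/parallel_lines", rr.LineStrips3D(segments, colors=palette[vp_ids]))
```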

https://www.youtube.com/watch?v=qyWYq0arb-Y?playlist=qyWYq0arb-Y&loop=1&hd=1&rel=0&autoplay=1

There is a lot more to unpack, so check out the [paper](https://arxiv.org/abs/2303.17504) by Shaohui Liu, Yifan Yu, Rémi Pautrat, Marc Pollefeys, and Viktor Larsson. It also gives an educational overview of the strengths and weaknesses of both line-based and point-based structure-from-motion.
31 changes: 31 additions & 0 deletions examples/python/mcc/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
---
title: "Single Image 3D Reconstruction using MCC, SAM, and ZoeDepth"
python: https://github.com/rerun-io/MCC
tags: [2D, 3D, segmentation, point-cloud, sam]
thumbnail: https://static.rerun.io/e62757c5953407373f2279be37a80748334cb6d7_mcc_480w.png
---

By combining MetaAI's [Segment Anything Model (SAM)](https://github.com/facebookresearch/segment-anything) and [Multiview Compressive Coding (MCC)](https://github.com/facebookresearch/MCC) we can get a 3D object from a single image.

https://www.youtube.com/watch?v=kmgFTWBZhWU?playlist=kmgFTWBZhWU&loop=1&hd=1&rel=0&autoplay=1

The basic idea is to use SAM to create a generic object mask so we can exclude the background.
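
A minimal sketch of generating such a mask with the `segment-anything` package; picking the largest mask as "the object" is a simplifying assumption, and the checkpoint path is a placeholder (see the segment-anything repo for downloads).

```python
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.array(Image.open("input.jpg").convert("RGB"))
masks = mask_generator.generate(image)

# Heuristic: take the largest mask as the object and zero out the background.
object_mask = max(masks, key=lambda m: m["area"])["segmentation"]
foreground = image * object_mask[..., None]
```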

https://www.youtube.com/watch?v=7qosqFbesL0?playlist=7qosqFbesL0&loop=1&hd=1&rel=0&autoplay=1

The next step is to generate a depth image. Here we use the awesome [ZoeDepth](https://github.com/isl-org/ZoeDepth) to get realistic depth from the color image.
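
A minimal sketch of getting metric depth via `torch.hub`, following the usage shown in the ZoeDepth repository (the `ZoeD_N` variant and the file name are assumptions):

```python
import torch
from PIL import Image

# Load ZoeDepth via torch.hub as described in the isl-org/ZoeDepth README.
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
zoe = zoe.to("cuda" if torch.cuda.is_available() else "cpu").eval()

image = Image.open("input.jpg").convert("RGB")
depth = zoe.infer_pil(image)  # (H, W) metric depth as a numpy array
```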

https://www.youtube.com/watch?v=d0u-MoNVR6o?playlist=d0u-MoNVR6o&loop=1&hd=1&rel=0&autoplay=1

With depth, color, and an object mask we have everything needed to create a colored point cloud of the object from a single view.
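
Concretely, this back-projection boils down to the standard pinhole model; a minimal sketch, assuming the camera intrinsics are known or estimated:

```python
import numpy as np

def backproject(depth, image, mask, fx, fy, cx, cy):
    """Back-project masked pixels into a colored 3D point cloud (pinhole model).

    depth: (H, W) metric depth, image: (H, W, 3) RGB, mask: (H, W) bool,
    fx, fy, cx, cy: pinhole intrinsics (assumed known or estimated).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = mask & (depth > 0)
    z = depth[valid]
    x = (u[valid] - cx) / fx * z
    y = (v[valid] - cy) / fy * z
    points = np.stack([x, y, z], axis=-1)  # (N, 3) in the camera frame
    colors = image[valid]                  # (N, 3)
    return points, colors
```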

https://www.youtube.com/watch?v=LI0mE7usguk?playlist=LI0mE7usguk&loop=1&hd=1&rel=0&autoplay=1

MCC encodes the colored points and then creates a reconstruction by sweeping through the volume, querying the network for occupancy and color at each point.
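
The sweep can be sketched as querying a decoder over a dense grid of 3D points and keeping the points predicted as occupied; `query_fn` below is a hypothetical stand-in for MCC's decoder, and the bounds and resolution are made up.

```python
import numpy as np

def sweep_volume(query_fn, bounds=(-1.0, 1.0), resolution=64, batch=65536, threshold=0.5):
    """Query a model over a dense grid and keep the points predicted as occupied.

    query_fn(points) -> (occupancy, rgb) is a hypothetical stand-in for the MCC decoder.
    """
    axis = np.linspace(bounds[0], bounds[1], resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)

    occupied_points, occupied_colors = [], []
    for start in range(0, len(grid), batch):
        pts = grid[start:start + batch]
        occupancy, rgb = query_fn(pts)
        keep = occupancy > threshold
        occupied_points.append(pts[keep])
        occupied_colors.append(rgb[keep])
    return np.concatenate(occupied_points), np.concatenate(occupied_colors)
```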

https://www.youtube.com/watch?v=RuHv9Nx6PvI?playlist=RuHv9Nx6PvI&loop=1&hd=1&rel=0&autoplay=1

This is a really great example of how a lot of cool solutions are built these days: by stringing together more targeted pre-trained models. The details of the three building blocks can be found in the respective papers:
- [Segment Anything](https://arxiv.org/abs/2304.02643) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick
- [Multiview Compressive Coding for 3D Reconstruction](https://arxiv.org/abs/2301.08247) by Chao-Yuan Wu, Justin Johnson, Jitendra Malik, Christoph Feichtenhofer, and Georgia Gkioxari
- [ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth](https://arxiv.org/abs/2302.12288) by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller
44 changes: 44 additions & 0 deletions examples/python/shape_pointe/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
---
title: "Point-E and Shap-E"
python: https://github.com/rerun-io/point-shap-e
tags: [3D, diffusion, point, mesh]
thumbnail: https://static.rerun.io/c17f91298ad12eee6347a911338fca0604178f58_overview_480w.png
---

OpenAI has released two models for text-to-3D generation: Point-E and Shap-E. Both of these methods are fast and interesting, but still low fidelity for now.

https://www.youtube.com/watch?v=f9QWkamyWZI?playlist=f9QWkamyWZI&loop=1&hd=1&rel=0&autoplay=1

First off, how do these two methods differ from each other? Point-E represents its 3D shapes via point clouds. It does so using a 3-step generation process: first, it generates a single synthetic view using a text-to-image diffusion model (in this case GLIDE).

<picture>
<source media="(max-width: 480px)" srcset="https://static.rerun.io/deb21c7f2081826702bb6a23696dc242d5b9a0cc_pointe-overview_480w.png">
<source media="(max-width: 768px)" srcset="https://static.rerun.io/863b4c6de7e5c0450d0bfc368c58e73c126b96e2_pointe-overview_768w.png">
<source media="(max-width: 1024px)" srcset="https://static.rerun.io/9bf5c456ea4e43a120abcbd07f75363d7efb3093_pointe-overview_1024w.png">
<source media="(max-width: 1200px)" srcset="https://static.rerun.io/e9f6f26563bc2a5468e65bc42a9ba2d99e5a04f0_pointe-overview_1200w.png">
<img src="https://static.rerun.io/a65f587a4a4cbcd0972bda09aa63bba35273abc3_pointe-overview_full.png" alt="">
</picture>

It then produces a coarse 3D point cloud using a second diffusion model that conditions on the generated image; third, it generates a fine 3D point cloud using an upsampling network. Finally, another model is used to predict an SDF from the point cloud, and marching cubes turns it into a mesh. As you can tell, the results aren’t very high quality, but they are fast.
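
As a rough illustration of that last step, here is how a mesh can be extracted from an SDF grid with marching cubes using `scikit-image`; the SDF below is a synthetic sphere rather than a real Point-E prediction.

```python
import numpy as np
from skimage import measure

# Synthetic SDF of a sphere on a 64^3 grid, standing in for the SDF predicted
# from Point-E's point cloud.
axis = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5  # negative inside, positive outside

# Extract the zero level set as a triangle mesh.
vertices, faces, normals, _ = measure.marching_cubes(
    sdf, level=0.0, spacing=(axis[1] - axis[0],) * 3
)
print(vertices.shape, faces.shape)
```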

https://www.youtube.com/watch?v=37Rsi7bphQY?playlist=37Rsi7bphQY&loop=1&hd=1&rel=0&autoplay=1

Shap-E improves on this by representing 3D shapes implicitly. This is done in two stages. First, an encoder is trained that takes images or a point cloud as input and outputs the weights of a NeRF.

<picture>
<source media="(max-width: 480px)" srcset="https://static.rerun.io/a2d6e282c48727469277be5597a7a50304a8adf5_shape-overview_480w.png">
<source media="(max-width: 768px)" srcset="https://static.rerun.io/6849fc43a2ee73844a584907be70892b2b1bdc4c_shape-overview_768w.png">
<source media="(max-width: 1024px)" srcset="https://static.rerun.io/93454a3be08778259ed41de29437c06aaec45c76_shape-overview_1024w.png">
<source media="(max-width: 1200px)" srcset="https://static.rerun.io/d4d26996d20a2e0c98d595c8bfd1fd4cd3cca193_shape-overview_1200w.png">
<img src="https://static.rerun.io/44a3498818968c3c8ee27d55c4ba97e5ff907168_shape-overview_full.png" alt="">
</picture>

In the second stage, a diffusion model is trained on a dataset of NeRF weights generated by the previous encoder. This diffusion model is conditioned on either images or text descriptions. The resulting NeRF also outputs SDF values so that meshes can be extracted using marching cubes again. Here we see the prompt "a cheeseburger" turn into a 3D mesh and a set of images.

https://www.youtube.com/watch?v=oTVLrujriiQ?playlist=oTVLrujriiQ&loop=1&hd=1&rel=0&autoplay=1

When compared to Point-E on both image-to-mesh and text-to-mesh generation, Shap-E converges faster and reaches comparable or better sample quality despite modeling a higher-dimensional, multi-representation output space.

https://www.youtube.com/watch?v=DskRD5nioyA?playlist=DskRD5nioyA&loop=1&hd=1&rel=0&autoplay=1

Check out the respective papers to learn more about the details of both methods: "[Shap-E: Generating Conditional 3D Implicit Functions](https://arxiv.org/abs/2305.02463)" by Heewoo Jun and Alex Nichol; "[Point-E: A System for Generating 3D Point Clouds from Complex Prompts](https://arxiv.org/abs/2212.08751)" by Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen.
32 changes: 32 additions & 0 deletions examples/python/simplerecon/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
---
title: "SimpleRecon: 3D Reconstruction Without 3D Convolutions"
python: https://github.com/rerun-io/simplerecon
tags: [3D, depth, time-series, pinhole-camera, mesh]
thumbnail: https://static.rerun.io/394d6544341a45882dcad4f2f5fbaabd74b3d1a3_simplerecon_480w.png
---

SimpleRecon is a back-to-basics approach for 3D scene reconstruction from posed monocular images by Niantic Labs. It offers state-of-the-art depth accuracy and competitive 3D scene reconstruction, which makes it perfect for resource-constrained environments.

https://www.youtube.com/watch?v=TYR9_Ql0w7k?playlist=TYR9_Ql0w7k&loop=1&hd=1&rel=0&autoplay=1

SimpleRecon's key contributions include using a 2D CNN with a cost volume, incorporating metadata via an MLP, and avoiding the computational cost of 3D convolutions. The different frustums in the visualization show each source frame used to compute the cost volume. These source frames have their features extracted and back-projected onto the current frame's depth-plane hypotheses.
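
To make the back-projection step concrete, here is a minimal single-source-frame sketch of a plane-sweep cost volume with nearest-neighbor sampling; the real network warps deep features from several frames and augments them with metadata.

```python
import numpy as np

def plane_sweep_cost_volume(ref_feat, src_feat, K_ref, K_src, R, t, depths):
    """Toy plane-sweep cost volume (nearest-neighbor sampling, single source frame).

    ref_feat, src_feat: (C, H, W) feature maps; K_ref, K_src: (3, 3) intrinsics;
    R, t: pose mapping reference-camera points to source-camera points;
    depths: iterable of depth-plane hypotheses.  Returns (D, H, W) matching scores.
    """
    C, H, W = ref_feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    rays = np.linalg.inv(K_ref) @ pix  # unit-depth rays in the reference frame

    volume = np.zeros((len(depths), H, W), dtype=np.float32)
    for i, d in enumerate(depths):
        pts_src = R @ (rays * d) + t[:, None]  # back-project to depth d, move to source frame
        proj = K_src @ pts_src
        z = np.maximum(proj[2], 1e-6)          # guard against points at or behind the camera
        us = np.round(proj[0] / z).astype(int)
        vs = np.round(proj[1] / z).astype(int)
        valid = (proj[2] > 0) & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
        sampled = np.zeros((C, H * W), dtype=np.float32)
        sampled[:, valid] = src_feat[:, vs[valid], us[valid]]
        # Dot-product similarity between reference and warped source features.
        volume[i] = (ref_feat.reshape(C, -1) * sampled).sum(axis=0).reshape(H, W)
    return volume
```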

https://www.youtube.com/watch?v=g0dzm-k1-K8?playlist=g0dzm-k1-K8&loop=1&hd=1&rel=0&autoplay=1

SimpleRecon only uses camera poses, depths, and surface normals (generated from depth) for supervision, allowing for out-of-distribution inference, e.g. from an ARKit-compatible iPhone.
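
The surface normals used for supervision can be derived from a depth map alone, e.g. by back-projecting pixels and crossing the local tangent vectors; a minimal sketch, assuming known pinhole intrinsics:

```python
import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate per-pixel surface normals from a depth map (pinhole intrinsics assumed known)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1)  # (H, W, 3) camera-space points

    # Finite differences along image rows/columns approximate the tangent vectors.
    dx = np.gradient(points, axis=1)
    dy = np.gradient(points, axis=0)
    normals = np.cross(dx, dy)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8
    return normals  # (H, W, 3) unit normals
```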

https://www.youtube.com/watch?v=OYsErbNdQSs?playlist=OYsErbNdQSs&loop=1&hd=1&rel=0&autoplay=1

The method works well for applications such as robotic navigation, autonomous driving, and AR. It takes input images, their intrinsics, and relative camera poses to predict dense depth maps, combining monocular depth estimation and MVS via plane sweep.

<picture>
<source media="(max-width: 480px)" srcset="https://static.rerun.io/6074c6c7039eccb14796dffda6e158b4d6a09c0e_simplerecon-overview_480w.png">
<source media="(max-width: 768px)" srcset="https://static.rerun.io/ed7ded09ee1d32c9adae4b8df0b539a57e2286f0_simplerecon-overview_768w.png">
<source media="(max-width: 1024px)" srcset="https://static.rerun.io/431dd4d4c6d4245ccf4904a38e24ff143713c97d_simplerecon-overview_1024w.png">
<source media="(max-width: 1200px)" srcset="https://static.rerun.io/59058fb7a7a4a5e3d63116aeb7197fb3f32fe19a_simplerecon-overview_1200w.png">
<img src="https://static.rerun.io/1f2400ba4f3b90f967f9503b855364363f776dbb_simplerecon-overview_full.png" alt="">
</picture>

Metadata incorporated in the cost volume improves depth estimation accuracy and 3D reconstruction quality. The lightweight and interpretable 2D CNN architecture benefits from added metadata for each frame, leading to better performance.

If you want to learn more about the method, check out the [paper](https://arxiv.org/abs/2208.14743) by Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard.