anim_nerf.npz #79

Open
serizawa-04013958 opened this issue Aug 12, 2024 · 27 comments


@serizawa-04013958

serizawa-04013958 commented Aug 12, 2024

Hello,
I'd like to check the output of your preprocessing code.
When I preprocess the PeopleSnapshot dataset, the result looks different from anim_nerf.npz.
The results are shown below. Since I get good results with anim_nerf.npz, I would like to ask what the difference is.
I used InstantAvatar/tree/master/scripts/visualize-SMPL.py for visualization.
anim_nerf ↓
https://github.com/user-attachments/assets/4b337b02-171d-4c24-84df-f2b7579ca363
pose_optimized ↓
https://github.com/user-attachments/assets/058a1e3f-9562-4d91-87a8-81d07b767751

Thank you

@tijiang13
Owner

tijiang13 commented Aug 12, 2024

Hi Serizawa,

Based on the visualization you provided, it seems that the wrong intrinsic parameters may have been loaded. The Anim-NeRF version uses the provided GT intrinsics, which have a larger focal length. However, the preprocessing pipeline we provided uses ROMP, which assumes a much smaller focal length. It’s possible that the camera parameters were accidentally overwritten when you ran the preprocessing. As a result, you likely used the ROMP camera to project the Anim-NeRF poses, leading to an overly small reprojection.

Best, Tianjian
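
To make the focal-length effect concrete, here is a minimal sketch (not code from this repository) that projects the same camera-space points with a large and a small focal length. The image size, depth, body extent and both focal lengths are hypothetical values chosen only for illustration; they are not the actual PeopleSnapshot or ROMP settings.

```python
import numpy as np

def project(points_cam, focal_length, width, height):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates.

    Assumes the principal point sits at the image centre.
    """
    points_cam = np.asarray(points_cam, dtype=np.float64)
    cx, cy = width * 0.5, height * 0.5
    u = focal_length * points_cam[:, 0] / points_cam[:, 2] + cx
    v = focal_length * points_cam[:, 1] / points_cam[:, 2] + cy
    return np.stack([u, v], axis=-1)

# Hypothetical numbers, for illustration only.
W = H = 1080
head = np.array([[0.0, -0.85, 20.0]])  # 0.85 units above the body centre, depth 20
feet = np.array([[0.0, 0.85, 20.0]])   # 0.85 units below the body centre, depth 20

for name, f in [("large focal length", 8000.0), ("small focal length", 500.0)]:
    top = project(head, f, W, H)[0, 1]
    bottom = project(feet, f, W, H)[0, 1]
    print(f"{name}: person spans {bottom - top:.1f} px")
```

With these numbers the person covers 680 px at the large focal length but only 42.5 px at the small one, which is the kind of "overly small reprojection" that appears when poses fitted for one camera are projected with another.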

@tijiang13
Owner

tijiang13 commented Aug 12, 2024

I guess the documentation can be a bit confusing. For the benchmark on PeopleSnapshot, we used the GT camera and poses to isolate the impact of inaccuracies stemming from cameras and poses. This was done to illustrate performance under controlled settings (or, let's say, what will happen with a better pose estimator, as better pose estimator papers come out every year :P). In contrast, the examples on NeuMan data were provided to demonstrate performance in less controlled settings, where we have no prior knowledge of the cameras or poses.

@serizawa-04013958
Author

Thank you for replying so quickly!
Let me confirm: anim_nerf.npz contains the GT camera and pose parameters, right?

So I understand that the ROMP preprocessing is not optimal for the PeopleSnapshot dataset.
However, I'd like to treat the PeopleSnapshot dataset like an in-the-wild dataset (NeuMan).

I am working on a synthetic dataset based on PeopleSnapshot and would like to use the same camera/pose estimator for both, but the current ROMP output cannot match the quality of the GT. Could you help me if possible?

Thank you very much.

@tijiang13
Owner

Hi Serizawa,

Maybe you can give 4DHuman a shot? We have done some internal experiments before and it has better alignment empirically.

Best, Tianjian

@serizawa-04013958
Author

Hello,
So you mean that anim_nerf.npz comes from the ground truth, but you ran some internal experiments with 4DHuman
and its alignment was similar to the GT. Is that correct?

What I want to ask is how to obtain an npz file as accurate as the one below. I would be glad if you could confirm.

image

@serizawa-04013958
Author

serizawa-04013958 commented Aug 23, 2024

@tijiang13
If possible, could you share the code to convert the 4D-Humans output to the ROMP format?
I got the 4D-Humans sample code running, but its output format is different from the ROMP format...

@AlecDusheck

This would be helpful!

@tijiang13
Owner

Hello Serizawa and Alec,

Sorry for the delayed reply -- I have been quite busy during the past two weeks and forgot to check GitHub regularly.

Here is the code I was using:

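import cv2
import numpy as np

# `data` is the list of per-frame 4DHuman outputs; NUM_PERSONS, NUM_FRAMES
# and focal_length are assumed to be defined before this point.

# process the SMPL poses
body_pose = np.zeros((NUM_PERSONS, NUM_FRAMES, 23, 3))
global_orient = np.zeros((NUM_PERSONS, NUM_FRAMES, 3))
transl = np.zeros((NUM_PERSONS, NUM_FRAMES, 3))
betas = np.zeros((NUM_PERSONS, NUM_FRAMES, 10))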
for i, datum in enumerate(data):
    for j, person_id in enumerate(datum["personid"]):
        person_id = int(person_id)
        cx, cy = datum["box_center"][person_id]
        bbox_size = datum["box_size"][person_id]
        img_size = datum["img_size"][person_id]
        W, H = img_size

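        # for cam_t we use pred_cam rather than pred_cam_t;
        # the first component of pred_cam acts as a scale and is held in tz
        # until tz is overwritten with the actual depth below
        cam_t = datum["pred_cam"][person_id]
        tz, tx, ty = cam_t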
        scale = 2 / max(bbox_size * tz, 1e-9)
        tz = focal_length * scale
        tx = tx + scale * (cx - W * 0.5)
        ty = ty + scale * (cy - H * 0.5)
        cam_t = np.array([tx, ty, tz])

        # convert back pose to SMPL format
        body_pose_R = datum["body_pose"][person_id]
        body_pose_ji = np.stack([cv2.Rodrigues(r)[0].squeeze(-1) for r in body_pose_R])
        global_orient_R = datum["global_orient"][person_id]
        global_orient_ji = cv2.Rodrigues(global_orient_R[0])[0].squeeze(-1)

        body_pose[j, i] = body_pose_ji
        global_orient[j, i] = global_orient_ji
        transl[j, i] = cam_t
        betas[j, i] = datum["betas"][person_id]
        
       
# process the camera
img_size = data[0]["img_size"][0]
W, H = img_size
intrinsic = np.array([[focal_length, 0, W * 0.5],
                      [0, focal_length, H * 0.5],
                      [0,            0,       1]])
extrinsic = np.broadcast_to(np.eye(4), (NUM_FRAMES, 4, 4))

After the conversion you will be able to visualize the SMPL meshes using the visualize_SMPL.py script in this repo.

Best, Tianjian
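
The `cv2.Rodrigues` calls in the snippet above turn per-joint rotation matrices into axis-angle vectors, which is the representation SMPL expects. As an illustration of what that step does, here is a NumPy-only sketch of the round trip. The two helper functions are hypothetical stand-ins written for this example, not part of OpenCV or this repository, and the inverse is simplified: it assumes rotation angles well below pi.

```python
import numpy as np

def axis_angle_to_matrix(rotvec):
    """Rodrigues' formula: (3,) axis-angle vector -> (3, 3) rotation matrix."""
    rotvec = np.asarray(rotvec, dtype=np.float64)
    angle = np.linalg.norm(rotvec)
    if angle < 1e-8:
        return np.eye(3)
    k = rotvec / angle
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def matrix_to_axis_angle(R):
    """(3, 3) rotation matrix -> (3,) axis-angle vector.

    Simplified: not numerically safe for rotation angles close to pi.
    """
    R = np.asarray(R, dtype=np.float64)
    cos_angle = np.clip((np.trace(R) - 1.0) * 0.5, -1.0, 1.0)
    angle = np.arccos(cos_angle)
    if angle < 1e-8:
        return np.zeros(3)
    axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * axis / (2.0 * np.sin(angle))

# 23 hypothetical joint rotations stored as matrices, i.e. the same layout the
# snippet above reads from datum["body_pose"][person_id].
rng = np.random.default_rng(0)
rotvecs = rng.uniform(-1.0, 1.0, size=(23, 3))
body_pose_R = np.stack([axis_angle_to_matrix(r) for r in rotvecs])       # (23, 3, 3)
body_pose_aa = np.stack([matrix_to_axis_angle(R) for R in body_pose_R])  # (23, 3)

print(body_pose_R.shape, body_pose_aa.shape)
print(np.allclose(body_pose_aa, rotvecs))
```

In practice `cv2.Rodrigues` handles the edge cases this sketch ignores, so the sketch is only meant to show which quantity ends up in `body_pose` and `global_orient`.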

@tijiang13
Owner

Note: One good thing about 4DHuman is that you can set the focal length to match your GT camera when the intrinsics are available. This is particularly useful when the true focal length differs significantly from a default such as the one in ROMP, which is surprisingly common in human capture (people tend to use very large focal lengths).
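
As a rough illustration of why the focal length can be swapped this way, the sketch below repeats the translation arithmetic from the conversion code for two focal lengths and projects the resulting translation back into the image. The function names and the detection values are hypothetical, chosen only for this example, and the principal point is assumed to be the image centre.

```python
import numpy as np

def full_image_translation(s, tx, ty, cx, cy, box_size, W, H, focal_length):
    """Crop-space camera (s, tx, ty) -> full-image translation (tx, ty, tz).

    Same arithmetic as the conversion code above, wrapped in a function.
    """
    scale = 2.0 / max(box_size * s, 1e-9)
    return np.array([tx + scale * (cx - W * 0.5),
                     ty + scale * (cy - H * 0.5),
                     focal_length * scale])

def project_root(t, focal_length, W, H):
    """Pinhole projection of the translation itself."""
    return np.array([focal_length * t[0] / t[2] + W * 0.5,
                     focal_length * t[1] / t[2] + H * 0.5])

# Hypothetical detection, for illustration only.
det = dict(s=0.8, tx=0.0, ty=0.1, cx=640.0, cy=540.0, box_size=500.0, W=1080.0, H=1080.0)

for f in (5000.0, 2500.0):
    t = full_image_translation(focal_length=f, **det)
    print(f, t, project_root(t, f, det["W"], det["H"]))
```

Only tz changes with the focal length (25.0 versus 12.5 here), and the translation projects to the same pixel, (640, 560), in both cases. Vertices away from that point shift slightly because the amount of perspective changes, which is presumably why a focal length close to the true one still matters.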

@serizawa-04013958
Author

That's great! I appreciate it!

I'll start working on the code in my environment right away.
Thank you so much!
Let me keep this issue open for debugging.

@tijiang13
Owner

You are welcome :D

Best, Tianjian

@serizawa-04013958
Author

serizawa-04013958 commented Sep 4, 2024

Hello, let me ask a question.
How do you calculate focal_length?
Do you use the same equation as in hmr2.py? I mean output['focal_length'].

Here is the 4D-Humans code:

focal_length = self.cfg.EXTRA.FOCAL_LENGTH * torch.ones(batch_size, 2, device=device, dtype=dtype)

@tijiang13
Owner

tijiang13 commented Sep 4, 2024

Hi Serizawa,

You can just run 4DHuman with default hyper-parameters. The code above just illustrates how to change the focal length & SMPL parameters accordingly if you want to set the focal length to a different value.

Best, Tianjian

@serizawa-04013958
Author

serizawa-04013958 commented Sep 5, 2024

Hello. I tried the code above, but the SMPL pose looks a little off...
I used the HMR2 model output from hmr2.py.
My code is below. Could you advise me?
If possible, I would like to know the details of data (datum) and the code that saves it.

Result visualized with visualize_SMPL.py:
image

The 4D-Humans output itself looked fine:
image

pred_smpl_params = out['pred_smpl_params']
body_pose_R = pred_smpl_params["body_pose"].detach().cpu().numpy()[0]
body_pose_ji = np.stack([cv2.Rodrigues(r)[0].squeeze(-1) for r in body_pose_R])
global_orient_R = pred_smpl_params["global_orient"].detach().cpu().numpy()[0]
global_orient_ji = cv2.Rodrigues(global_orient_R[0])[0].squeeze(-1)
body_pose[0, id] = body_pose_ji
global_orient[0, id] = global_orient_ji
transl[0, id] = cam_t
betas[0, id] = pred_smpl_params["betas"].detach().cpu().numpy()[0]

@tijiang13
Owner

I think there’s a difference because 4DHuman only renders the object within the bounding box, while we project it onto the entire image. As for the misalignment, I still run the refinement as before, and this is usually easy to fix I guess.

Best, Tianjian

@serizawa-04013958
Author

I'm very sorry.
Could you share the conversion code for body_pose?
The output of this code has shape (NUM_PERSONS, NUM_FRAMES, 23, 3), but I think ROMP needs (NUM_PERSONS, NUM_FRAMES, 69).

@tijiang13
Owner

Hi Serizawa,

No worries. Regarding the question, you can simply flatten the last two dimensions using output.reshape(NUM_PERSONS, NUM_FRAMES, 69).

Best, Tianjian
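
A quick way to convince yourself that the reshape keeps the joints in order is the small check below. It is only a sketch with hypothetical array sizes and dummy values.

```python
import numpy as np

NUM_PERSONS, NUM_FRAMES = 1, 4  # hypothetical sizes, for illustration only

# Dummy per-joint axis-angle array with distinct values in every slot.
body_pose = np.arange(NUM_PERSONS * NUM_FRAMES * 23 * 3, dtype=np.float64)
body_pose = body_pose.reshape(NUM_PERSONS, NUM_FRAMES, 23, 3)

# Flatten the last two dimensions: (23, 3) -> 69.
flat = body_pose.reshape(NUM_PERSONS, NUM_FRAMES, 69)

# Joint j of the per-joint layout occupies columns 3*j .. 3*j+2 of the flat layout.
j = 5
assert np.array_equal(flat[0, 2, 3 * j:3 * j + 3], body_pose[0, 2, j])
print(flat.shape)  # (1, 4, 69)
```

Because the reshape only merges the last two axes, the three axis-angle components of each joint stay adjacent, so no reordering is needed.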

@serizawa-04013958
Author

serizawa-04013958 commented Sep 18, 2024

Thank you for your comment.
Unfortunately, I ran into an SMPL position misalignment issue and I don't know why.
The problem is that tz, tx, ty are incorrect, so the camera cannot see the SMPL mesh:
tz is too large and the mesh ends up outside the camera's range.
I can modify the scaling parameter to make it visible, but that does not make sense.

I tried to get the necessary information in the demo.py file of the 4D-Humans code:

`    for batch in dataloader:
        batch = recursive_to(batch, device)
        with torch.no_grad():
            out, pred_smpl_params = model(batch)

        pred_cam = out['pred_cam']
        box_center = batch["box_center"].float()
        box_size = batch["box_size"].float()
        img_size = batch["img_size"].float()
        # -----------------------------------
        # get info for ROMP conversion
        # -----------------------------------
        # pred_cam_t = out['pred_cam_t'][0].detach().cpu().numpy()
        W, H = img_cv2.shape[1], img_cv2.shape[0]
        bbox_size = box_size[0].detach().cpu().numpy()
        cam_t = pred_cam[0].detach().cpu().numpy()
        tz, tx, ty = cam_t
        scale = 2 / max(bbox_size * tz, 1e-9)
        focal_length = model_cfg.EXTRA.FOCAL_LENGTH # 5000
        tz = (focal_length * scale)
        
        cx, cy = box_center[0].detach().cpu().numpy()
        tx = tx + scale * (cx - W * 0.5)
        ty = ty + scale * (cy - H * 0.5)
        # print(type(tx), type(ty), type(tz))
        cam_t = np.array([tx, ty, tz])

`

@tijiang13
Owner

Hi Serizawa,

How did you visualise the SMPL? Did you use aitviewer?

Best, Tianjian

@serizawa-04013958
Author

Yes, I used SMPL and aitviewer, the same way as visualize_SMPL.py.

@serizawa-04013958
Author

Here is my code. Sorry, I could not attach a .py file.

demo.txt

@tijiang13
Owner

Can you give this gist a try?

@serizawa-04013958
Author

Thank you so much!!
Normally the 4D-Humans output is only an image and an .obj file, so which output do you mean?
Is it the HMR2 output? Sorry for bothering you. Thanks as always.

@tijiang13
Owner

Hi,

See: https://github.com/shubham-goel/4D-Humans/blob/main/demo.py#L91

Here is the modified demo.py I used:

"""
Adapted based on https://github.com/shubham-goel/4D-Humans/blob/main/demo.py
"""
from pathlib import Path
import torch
import argparse
import os
import cv2
import numpy as np
from tqdm import tqdm
from collections import defaultdict
import joblib

from hmr2.configs import CACHE_DIR_4DHUMANS
from hmr2.models import HMR2, download_models, load_hmr2, DEFAULT_CHECKPOINT
from hmr2.utils import recursive_to
from hmr2.datasets.vitdet_dataset import ViTDetDataset, DEFAULT_MEAN, DEFAULT_STD
from hmr2.utils.renderer import Renderer, cam_crop_to_full


LIGHT_BLUE=(0.65098039,  0.74117647,  0.85882353)

def main():
    import time
    start = time.time()
    parser = argparse.ArgumentParser(description='HMR2 demo code')
    parser.add_argument('--checkpoint', type=str, default=DEFAULT_CHECKPOINT, help='Path to pretrained model checkpoint')
    parser.add_argument('--img_folder', type=str, default='example_data/images', help='Folder with input images')
    parser.add_argument('--out_folder', type=str, default='demo_out', help='Output folder to save rendered results')
    parser.add_argument('--detector', type=str, default='vitdet', choices=['vitdet', 'regnety'], help='Using regnety improves runtime')

    args = parser.parse_args()

    # Download and load checkpoints
    download_models(CACHE_DIR_4DHUMANS)
    model, model_cfg = load_hmr2(args.checkpoint)

    # Setup HMR2.0 model
    device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
    model = model.to(device)
    model.eval()

    # Load detector
    from hmr2.utils.utils_detectron2 import DefaultPredictor_Lazy
    if args.detector == 'vitdet':
        from detectron2.config import LazyConfig
        import hmr2
        cfg_path = Path(hmr2.__file__).parent/'configs'/'cascade_mask_rcnn_vitdet_h_75ep.py'
        detectron2_cfg = LazyConfig.load(str(cfg_path))
        detectron2_cfg.train.init_checkpoint = "https://dl.fbaipublicfiles.com/detectron2/ViTDet/COCO/cascade_mask_rcnn_vitdet_h/f328730692/model_final_f05665.pkl"
        for i in range(3):
            detectron2_cfg.model.roi_heads.box_predictors[i].test_score_thresh = 0.25
        detector = DefaultPredictor_Lazy(detectron2_cfg)
    elif args.detector == 'regnety':
        from detectron2 import model_zoo
        from detectron2.config import get_cfg
        detectron2_cfg = model_zoo.get_config('new_baselines/mask_rcnn_regnety_4gf_dds_FPN_400ep_LSJ.py', trained=True)
        detectron2_cfg.model.roi_heads.box_predictor.test_score_thresh = 0.5
        detectron2_cfg.model.roi_heads.box_predictor.test_nms_thresh   = 0.4
        detector       = DefaultPredictor_Lazy(detectron2_cfg)

    # Setup the renderer
    # renderer = Renderer(model_cfg, faces=model.smpl.faces)

    # Make output directory if it does not exist
    os.makedirs(args.out_folder, exist_ok=True)

    # Iterate over all images in folder
    outputs = []
    for img_path in tqdm(sorted(Path(args.img_folder).glob('*.png'))):
        img_cv2 = cv2.imread(str(img_path))

        # Detect humans in image
        det_out = detector(img_cv2)

        det_instances = det_out['instances']
        valid_idx = (det_instances.pred_classes==0) & (det_instances.scores > 0.5)
        boxes=det_instances.pred_boxes.tensor[valid_idx].cpu().numpy()

        # Run HMR2.0 on all detected humans
        dataset = ViTDetDataset(model_cfg, img_cv2, boxes)
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=8, shuffle=False, num_workers=0)

        temp = defaultdict(list)
        for batch in dataloader:
            batch = recursive_to(batch, device)
            with torch.no_grad():
                out = model(batch)
            keys = ["pred_cam", "pred_cam_t", "focal_length", "pred_keypoints_2d"]
            for k in keys: temp[k].append(out[k].float().cpu().numpy())
            keys = ["box_center", "box_size", "personid", "img_size"]
            for k in keys: temp[k].append(batch[k].float().cpu().numpy())
            for k in out["pred_smpl_params"].keys():
                temp[k].append(out["pred_smpl_params"][k].float().cpu().numpy())
        for k in temp.keys(): temp[k] = np.concatenate(temp[k], axis=0)
        outputs.append(temp)
    output_path = Path(args.out_folder) / 'out_4dhuman.pkl'
    joblib.dump(outputs, output_path) # Save results


if __name__ == '__main__':
    main()
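
The script above saves a list with one dict per image, and each dict holds arrays whose first axis runs over the people detected in that image. The sketch below builds a fake list in that layout to show how `NUM_FRAMES` and `NUM_PERSONS` for the earlier conversion code could be derived. The helper `make_fake_frame` and all array contents are hypothetical, and the shapes are assumptions based on the keys collected in the script rather than values checked against a real run.

```python
import numpy as np

def make_fake_frame(num_people):
    """Hypothetical stand-in for one per-image dict of the saved output."""
    return {
        # stored as float because the script calls .float() before saving
        "personid": np.arange(num_people, dtype=np.float32),
        "pred_cam": np.zeros((num_people, 3), dtype=np.float32),
        "box_center": np.zeros((num_people, 2), dtype=np.float32),
        "box_size": np.full((num_people,), 500.0, dtype=np.float32),
        "img_size": np.tile(np.array([1080.0, 1080.0], dtype=np.float32), (num_people, 1)),
        "body_pose": np.tile(np.eye(3, dtype=np.float32), (num_people, 23, 1, 1)),
        "global_orient": np.tile(np.eye(3, dtype=np.float32), (num_people, 1, 1, 1)),
        "betas": np.zeros((num_people, 10), dtype=np.float32),
    }

# With a real run this would be: data = joblib.load("demo_out/out_4dhuman.pkl")
data = [make_fake_frame(n) for n in (1, 2, 1)]

NUM_FRAMES = len(data)
NUM_PERSONS = max(len(datum["personid"]) for datum in data)

for i, datum in enumerate(data):
    # personid was saved as float, hence the int() cast in the conversion code
    ids = [int(p) for p in datum["personid"]]
    print(i, ids, datum["body_pose"].shape, datum["global_orient"].shape)

print(NUM_FRAMES, NUM_PERSONS)
```

Note that the detector does not track identities across frames, so in multi-person footage the same index may not refer to the same person in every frame.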

@serizawa-04013958
Author

Thank you for sharing!!
Your rendering code worked.

I also understood the difference between the previous visualize_SMPL.py and the one you shared this time.
The original visualize_SMPL.py used
pc = Billboard.from_camera_and_distance(cam, 8, W, H, img_paths,
image_process_fn=draw_func)
but your code changed 8 to 200.

I will try to use the 4D-Humans output for InstantAvatar.

@serizawa-04013958
Author

Hello, with your code I was able to run InstantAvatar training!
Unfortunately, the result was not good.
Have you ever compared the results of ROMP and 4D-Humans?
If so, did you fix the focal length at 5000?

@tijiang13
Owner

Hi Serizawa,

We use GT intrinsics when they are available -- if they are unknown, you will need to adjust the focal length accordingly (from the resolution of the raw images etc.). If the results are not satisfying, I'd suggest checking the estimated poses / key points / masks in your case for troubleshooting.

Best, Tianjian
