🐛 Bug

I have been using Aim to track item-detection experiments. We have a back end running on one of our remote servers, which we use to track our training and evaluation data. The data consists of either float values or image data (mostly `numpy.NDArray[numpy.uint8]`). I have observed massive performance differences between tracking data to a remote Aim server and tracking to a local Aim server running on my laptop.

For instance, tracking a JSON file with 3000 lines (see the attachment in the To reproduce section) takes more than 15 minutes to push to the remote server, while the exact same job takes less than 10 seconds locally(!).

I have tried to debug this by pushing batches of data instead of making one call per metric, but nothing seems to make a difference. To add more unexpected information to the picture, tracking 95 images (each approx. 4 MB) to the exact same server took only one minute. I think this means the delay is not related to the size of the data being tracked (the images total almost 400 MB, while the raw JSON data is 4.6 MB) 🤷‍♂️

I would really appreciate it if someone could shed some light on this: is this difference in performance expected, and are there any tracking or hardware optimizations we could use to speed it up? As it works now, it is really not usable.

To reproduce

(A ~3000-line JSON metrics file was attached to the original report.)

Expected behavior

Pushing the metrics should not take more than 15 minutes.

Environment
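For illustration, here is a minimal sketch of the tracking pattern described above, using Aim's Python client. The server address, experiment name, and JSON field names are placeholders (my assumptions, not taken from the attachment):

```python
import json

from aim import Run

# Remote tracking: replace host/port with your server's address (assumed here).
run = Run(repo="aim://tracking-server:53800", experiment="item-detection")

# One track() call per metric value -- the slow case described above.
with open("metrics.json") as f:  # stand-in for the ~3000-line attachment
    records = json.load(f)

for record in records:
    run.track(
        record["value"],      # hypothetical field names; the real
        name=record["name"],  # attachment's schema is not shown
        step=record["step"],
    )
```

Pointing `repo` at a local path (or omitting it) instead of the `aim://` address is what makes the same loop finish in seconds, per the timings above.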
@diogo-sr thanks for raising this issue. Performance in general, and of the tracking server in particular, is a priority for the team. @mihran113, could you please take a look? Could you please also share the results we got after re-implementing the tracking server?
Slightly related to PR #3203. Copying a tracked sequence of ~1M steps (a few megabytes of data) is really slow; the PR updates Aim so that the remote tree can be updated in chunks.

I am not sure how easy this would be to integrate into direct tracking to a remote repository. For now we are tracking to a local repository and syncing the runs to a remote repository in close-to-real time using custom sync code (a sketch of the pattern follows), which we would be happy to contribute to Aim once the backend supports chunk updates.
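A rough sketch of that local-first pattern: track to a local repo, then periodically copy runs to the remote server. The paths, the server address, and the `aim runs cp` invocation are assumptions about the setup (check whether your Aim version ships that subcommand and those flags); the actual custom sync code is not shown here.

```python
import subprocess

LOCAL_REPO = "/data/experiments"             # directory holding the local .aim repo (assumed)
REMOTE_REPO = "aim://tracking-server:53800"  # remote Aim server (assumed address)

def sync_runs(run_hashes):
    """Push locally tracked runs to the remote repository.

    Relies on the `aim runs cp` subcommand; availability and exact flags
    depend on your Aim version, so treat this as a sketch rather than a
    guaranteed API.
    """
    subprocess.run(
        ["aim", "runs", "--repo", LOCAL_REPO,
         "cp", "--destination", REMOTE_REPO, *run_hashes],
        check=True,
    )

# Example: sync two runs by hash (hypothetical hashes).
sync_runs(["a1b2c3d4e5f6", "f6e5d4c3b2a1"])
```

Since training writes only to local disk, the remote round-trips happen off the critical path, which is what keeps tracking itself fast.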