
Loading speed gets throttled when loading large datasets (timeseries array size 230,400,000 datapoints) #5796

Closed
jesusdpa1 opened this issue Apr 4, 2024 · 7 comments
Labels
🚀 performance Optimization, memory use, etc

Comments

@jesusdpa1

Describe the bug

When loading a large dataset (32 ch, 1 h recording, sampled at 24 kHz) there is a notable reduction in the responsiveness of the viewer, and Rerun crashes depending on how much of the data is loaded (more than 10 min). I have been limited to viewing at most 8 channels of 2 min duration, and I need to reload Rerun to plot different portions of the dataset. It happens with both float and int datatypes. The goal is to be able to use Rerun as an ephys data viewer, since Rerun handles different data types really well (the use case being multimodal research recordings: video/EMG/raster plots, etc.).

I saw the post (https://www.rerun.io/blog/fast-plots) and was wondering if this limitation is due to using Python as the interface to Rerun.

To Reproduce
Steps to reproduce the behavior:

Python Script to reproduce
# %%
from numpy.typing import NDArray
import numpy as np
import rerun as rr


# %%
def gen_color_list(num_colors: int):
    """
    Generates a list of random RGB color values.

    Args:
        num_colors (int): The number of colors to generate.

    Returns:
        list: A list of RGB color values, where each value is a list of three integers between 0 and 255.
    """
    color_list = []
    for _ in range(num_colors):
        r = np.random.randint(0, 256)
        g = np.random.randint(0, 256)
        b = np.random.randint(0, 256)
        color_list.append([r, g, b])
    return color_list


def scale_tsx(tsy: NDArray, ch_count: int) -> NDArray:
    """
    Scales a time series array by adding a constant value to each channel.

    Args:
        tsy (NDArray): Time series array of shape (samples, channels).
        ch_count (int): Number of channels.

    Returns:
        NDArray: Offset-scaled array of shape (channels, samples).
    """
    scale_factor = tsy.max() + (tsy.std() * 2)
    channel_scale_factor = np.arange(ch_count) * scale_factor
    tsx_scaled = tsy.T + channel_scale_factor.reshape(-1, 1)
    return tsx_scaled


# %%
# mock data

# Define sampling parameters
num_channels = 16  # Number of channels
sample_rate = 24000  # Sampling rate in Hz
duration_min = 10  # Duration in minutes

# Calculate total number of samples
total_samples = sample_rate * 60 * duration_min

# Generate random samples for each channel
random_samples = np.random.uniform(-1.0, 1.0, (total_samples, num_channels)).astype(
    np.int16
)
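# NOTE: casting uniform(-1, 1) floats to int16 truncates every sample to zero
# (this is the error corrected in the follow-up comment below).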

# Print the shape of the generated random samples array
print("Shape of random samples array:", random_samples.shape)
# %%
traces_scaled = scale_tsx(random_samples[:200000, :], 16)
ch_colors = gen_color_list(16)
# %%
rr.version()
rr.init("testSubject2", spawn=True)

for ch_id in np.arange(8):
    rr.log(
        f"mockdata/ch{ch_id}",
        rr.SeriesLine(color=ch_colors[ch_id], name=f"ch{ch_id}", width=0.5),
        timeless=True,
    )
# %%
# Log the data on a timeline called "step".
for t in range(0, traces_scaled.shape[1]):
    rr.set_time_sequence("step", t)
    for ch_id in np.arange(8):
        rr.log(f"mockdata/ch{ch_id}", rr.Scalar(traces_scaled[ch_id, t]))
# %%
Expected behavior

Loading large datasets without affecting performance

Screenshots

First 2,000 points, load time in ms

[screenshot]

After 20,000 data points per channel [loading speed gets throttled]

[screenshot]

Real dataset used

[screenshot]

Backtrace

Desktop (please complete the following information):

  • OS: Windows 10

Rerun version
'rerun_py 0.14.1 [rustc 1.74.0 (79e9716c9 2023-11-13), LLVM 17.0.4] x86_64-pc-windows-msvc release-0.14.1 74f1c23, built 2024-02-29T11:07:43Z'
Python 3.12.2

Additional context

@jesusdpa1 jesusdpa1 added 👀 needs triage This issue needs to be triaged by the Rerun team 🪳 bug Something isn't working labels Apr 4, 2024
@emilk emilk added 🚀 performance Optimization, memory use, etc and removed 🪳 bug Something isn't working 👀 needs triage This issue needs to be triaged by the Rerun team labels Apr 5, 2024
@abey79
Member

abey79 commented Apr 5, 2024

This is a lot of scalars, and despite our recent work, we still have work to do to reduce storage and ingestion overhead. In particular, this example calls for bulk temporal logging, which we track in this issue:

I think the apparent "throttling" is due to some UI-specific handling that we recently observed elsewhere as well. This is still under investigation, but we're considering a mode where the UI can be disabled during ingestion to make it faster:

Finally, I haven't been able to reproduce the crash yet. Can you give more details about your exact setup? Also, I ran the provided script, but for some reason it only logs zeros.

@jesusdpa1
Author

Thanks Antoine!

With regard to the temporal batches, is this going to behave like lazy loading?

I had an error in the code (the uniform samples were cast to int16, which truncates them all to zero); here is the corrected version.

Python Script to reproduce
# %%
from numpy.typing import NDArray
import numpy as np
import rerun as rr


# %%
def gen_color_list(num_colors: int):
    """
    Generates a list of random RGB color values.

    Args:
        num_colors (int): The number of colors to generate.

    Returns:
        list: A list of RGB color values, where each value is a list of three integers between 0 and 255.
    """
    color_list = []
    for _ in range(num_colors):
        r = np.random.randint(0, 256)
        g = np.random.randint(0, 256)
        b = np.random.randint(0, 256)
        color_list.append([r, g, b])
    return color_list


def scale_tsx(tsy: NDArray, ch_count: int) -> NDArray:
    """
    Scales a time series array by adding a constant value to each channel.

    Args:
        tsy (NDArray): Time series array of shape (samples, channels).
        ch_count (int): Number of channels.

    Returns:
        NDArray: Offset-scaled array of shape (channels, samples).
    """
    scale_factor = tsy.max() + (tsy.std() * 2)
    channel_scale_factor = np.arange(ch_count) * scale_factor
    tsx_scaled = tsy.T + channel_scale_factor.reshape(-1, 1)
    return tsx_scaled


# %%
# mock data

# Define sampling parameters
num_channels = 16  # Number of channels
sample_rate = 24000  # Sampling rate in Hz
duration_min = 10  # Duration in minutes

# Calculate total number of samples
total_samples = sample_rate * 60 * duration_min

# Generate random samples for each channel
random_samples = np.random.uniform(-1.0, 1.0, (total_samples, num_channels)).astype(
    np.float32
)

# Print the shape of the generated random samples array
print("Shape of random samples array:", random_samples.shape)
# %%
traces_scaled = scale_tsx(random_samples[:200000, :], 16)
ch_colors = gen_color_list(16)
# %%
rr.version()
rr.init("testSubject2", spawn=True)

for ch_id in np.arange(8):
    rr.log(
        f"mockdata/ch{ch_id}",
        rr.SeriesLine(color=ch_colors[ch_id], name=f"ch{ch_id}", width=0.5),
        timeless=True,
    )
# %%
# Log the data on a timeline called "step".
for t in range(0, traces_scaled.shape[1]):
    rr.set_time_sequence("step", t)
    for ch_id in np.arange(8):
        rr.log(f"mockdata/ch{ch_id}", rr.Scalar(traces_scaled[ch_id, t]))
# %%

To get it to crash, I increased the number of data points (channels and duration of the recording), although I think this might be due to available memory.

Python Script to reproduce
# %%
from numpy.typing import NDArray
import numpy as np
import rerun as rr


# %%
def gen_color_list(num_colors: int):
    """
    Generates a list of random RGB color values.

    Args:
        num_colors (int): The number of colors to generate.

    Returns:
        list: A list of RGB color values, where each value is a list of three integers between 0 and 255.
    """
    color_list = []
    for _ in range(num_colors):
        r = np.random.randint(0, 256)
        g = np.random.randint(0, 256)
        b = np.random.randint(0, 256)
        color_list.append([r, g, b])
    return color_list


def scale_tsx(tsy: NDArray, ch_count: int) -> NDArray:
    """
    Scales a time series array by adding a constant value to each channel.

    Args:
        tsy (NDArray): Time series array of shape (samples, channels).
        ch_count (int): Number of channels.

    Returns:
        NDArray: Offset-scaled array of shape (channels, samples).
    """
    scale_factor = tsy.max() + (tsy.std() * 2)
    channel_scale_factor = np.arange(ch_count) * scale_factor
    tsx_scaled = tsy.T + channel_scale_factor.reshape(-1, 1)
    return tsx_scaled


# %%
# mock data

# Define sampling parameters
num_channels = 16  # Number of channels
sample_rate = 24000  # Sampling rate in Hz
duration_min = 30  # Duration in minutes

# Calculate total number of samples
total_samples = sample_rate * 60 * duration_min

# Generate random samples for each channel
random_samples = np.random.uniform(-1.0, 1.0, (total_samples, num_channels)).astype(
    np.float32
)

# Print the shape of the generated random samples array
print("Shape of random samples array:", random_samples.shape)
# %%
traces_scaled = scale_tsx(random_samples[:, :], 16)
ch_colors = gen_color_list(16)
# %%
rr.version()
rr.init("testSubject2", spawn=True)

for ch_id in np.arange(16):
    rr.log(
        f"mockdata/ch{ch_id}",
        rr.SeriesLine(color=ch_colors[ch_id], name=f"ch{ch_id}", width=0.5),
        timeless=True,
    )
# %%
# Log the data on a timeline called "step".
for t in range(0, traces_scaled.shape[1]):
    rr.set_time_sequence("step", t)
    for ch_id in np.arange(16):
        rr.log(f"mockdata/ch{ch_id}", rr.Scalar(traces_scaled[ch_id, t]))
# %%

@abey79
Member

abey79 commented Apr 9, 2024

I cannot reproduce the crash. I ran the viewer with rerun --memory-limit 2GB and used your 30 min script. After reaching the memory limit, I see some further performance drop due to the garbage collector kicking in and deleting early data, but still no crash. Can you tell me more about what happens? Do you have a traceback?

edit: nvm, I actually see memory usage going way past the --memory-limit. Looking into it.
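
(For reference, the setup described above, a viewer started separately with a memory cap and the script connecting to it instead of spawning its own viewer, looks roughly like this; the flag and the connect call are from the 0.14/0.15-era CLI and Python SDK and may differ in later releases.)

# Start the viewer with a memory limit from the command line:
#   rerun --memory-limit 2GB

# Then connect to it from the script instead of spawning a new viewer:
import rerun as rr

rr.init("testSubject2")
rr.connect()  # connect to the already-running viewer on the default port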

@abey79
Member

abey79 commented Apr 11, 2024

This comment from another issue also largely applies here, btw:

@abey79
Member

abey79 commented Apr 11, 2024

Also, while using your script, we did find a weird memory leak that could cause issues. However, that leak did not affect 0.14.1 (only our main branch), and it was fixed before the 0.15 release.

@jesusdpa1
Author

Got it! Just updated to the latest version and will keep an eye on the perf and memory discussion 👍

@teh-cmc
Member

teh-cmc commented Sep 6, 2024

Fixed with 0.18 (send_columns & chunk store)
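
(For reference, with send_columns the per-sample logging loop from the scripts above can be replaced by one columnar call per channel. The sketch below is based on the 0.18 Python API; argument names may differ in other releases.)

import numpy as np
import rerun as rr

rr.init("testSubject2", spawn=True)

num_channels = 16
num_samples = 200_000
traces_scaled = np.random.uniform(-1.0, 1.0, (num_channels, num_samples)).astype(np.float32)

steps = np.arange(num_samples)
for ch_id in range(num_channels):
    # One bulk call per channel instead of num_samples individual rr.log() calls.
    rr.send_columns(
        f"mockdata/ch{ch_id}",
        times=[rr.TimeSequenceColumn("step", steps)],
        components=[rr.components.ScalarBatch(traces_scaled[ch_id])],
    )

This logs all samples for a channel in a single call, which is what the new chunk store ingests efficiently.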

@teh-cmc teh-cmc closed this as completed Sep 6, 2024