Loading speed gets throttled when loading large datasets (timeseries array size 230,400,000 datapoints) #5796
Comments
This is a lot of scalars, and despite our recent work, we still have work to do to reduce storage and ingestion overhead. In particular, this example calls for bulk temporal logging, which we track in a separate issue.

I think the apparent "throttling" is due to some UI-specific handling; we recently observed it elsewhere. This is still under investigation, but we're considering a mode where the UI can be disabled during ingestion to make it faster.

Finally, I haven't been able to reproduce the crash yet. Can you give more details about the exact setup? Also, I ran the provided script, but for some reason the logs contain only zeros.
Thanks Antoine! With regard to the temporal batches, is this going to behave like lazy loading? I had an error in the code; here is the correct one.

Python script to reproduce:

```python
# %%
from numpy.typing import NDArray
import numpy as np
import rerun as rr
# %%
def gen_color_list(num_colors: int):
    """
    Generates a list of random RGB color values.

    Args:
        num_colors (int): The number of colors to generate.

    Returns:
        list: A list of RGB color values, where each value is a list of three integers between 0 and 255.
    """
    color_list = []
    for _ in range(num_colors):
        r = np.random.randint(0, 256)
        g = np.random.randint(0, 256)
        b = np.random.randint(0, 256)
        color_list.append([r, g, b])
    return color_list
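

# (An equivalent vectorized one-liner would be
# np.random.randint(0, 256, size=(num_colors, 3)).tolist().)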
def scale_tsx(tsy: NDArray, ch_count: int) -> NDArray:
    """
    Scales a time series array by adding a constant offset to each channel.

    Args:
        tsy (NDArray): Time series array of shape (samples, channels).
        ch_count (int): Number of channels in the array.

    Returns:
        NDArray: Array of shape (channels, samples), with each channel offset
            by a multiple of the scale factor.
    """
    scale_factor = tsy.max() + (tsy.std() * 2)
    channel_scale_factor = np.arange(ch_count) * scale_factor
    tsx_scaled = tsy.T + channel_scale_factor.reshape(-1, 1)
    return tsx_scaled
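

# For example, with ch_count=3 the channels are offset by 0, s and 2*s
# (s = max + 2*std), stacking the traces vertically so they don't overlap
# in the plot.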
# %%
# mock data
# Define sampling parameters
num_channels = 16 # Number of channels
sample_rate = 24000 # Sampling rate in Hz
duration_min = 10 # Duration in minutes
# Calculate total number of samples
total_samples = sample_rate * 60 * duration_min
# Generate random samples for each channel
random_samples = np.random.uniform(-1.0, 1.0, (total_samples, num_channels)).astype(
    np.float32
)
# Print the shape of the generated random samples array
print("Shape of random samples array:", random_samples.shape)
# %%
traces_scaled = scale_tsx(random_samples[:200000, :], 16)
ch_colors = gen_color_list(16)
# %%
rr.version()
rr.init("testSubject2", spawn=True)
for ch_id in np.arange(8):
    rr.log(
        f"mockdata/ch{ch_id}",
        rr.SeriesLine(color=ch_colors[ch_id], name=f"ch{ch_id}", width=0.5),
        timeless=True,
    )
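# Note: `timeless=True` is the 0.14-era API used here; later Rerun releases
# (0.16+) replace it with `static=True`.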
# %%
# Log the data on a timeline called "step".
for t in range(0, traces_scaled.shape[1]):
    rr.set_time_sequence("step", t)
    for ch_id in np.arange(8):
        rr.log(f"mockdata/ch{ch_id}", rr.Scalar(traces_scaled[ch_id, t]))
# %%
```

To get it to crash, I increase the number of data points (channels and duration of the recording), although I think this might be due to available memory. The script to reproduce is the same as the one above, with only these changes:

```python
duration_min = 30  # Duration in minutes (30 instead of 10)
# ...
traces_scaled = scale_tsx(random_samples[:, :], 16)  # the full recording instead of the first 200,000 samples
# ...
for ch_id in np.arange(16):  # both logging loops cover all 16 channels instead of 8
    ...
```
I cannot reproduce the crash. I ran the viewer with …

Edit: never mind, I actually see memory usage going way past the limit.
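For anyone following along: the viewer can be started with a memory budget so old data gets dropped instead of memory growing without bound. A minimal sketch, assuming the `memory_limit` parameter exposed by `rr.spawn` in this era of the SDK; the `2GB` value is only an example:

```python
import rerun as rr

rr.init("testSubject2")
# Cap the spawned viewer's RAM; once the budget is reached, the oldest
# time points are dropped rather than memory growing unbounded.
rr.spawn(memory_limit="2GB")
```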
This comment from another issue also largely applies here, btw:
Also, while using your script, we did find a weird memory leak that could cause issues. However, that issue did not affect 0.14.1 (only our …).
Got it! Just updated to the latest version and will keep an eye on the perf and memory discussion 👍
Fixed with 0.18 (…).
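For readers arriving later: 0.18 added columnar, bulk ingestion of scalars. A minimal sketch of what the per-sample loops above might become, assuming the 0.18-era `rr.send_columns` API (names changed again in later releases):

```python
import numpy as np
import rerun as rr

rr.init("testSubject2", spawn=True)

steps = np.arange(200_000)
values = np.random.uniform(-1.0, 1.0, len(steps)).astype(np.float32)  # one mock channel

# One call ingests the whole column, instead of 200,000 individual rr.log calls.
rr.send_columns(
    "mockdata/ch0",
    times=[rr.TimeSequenceColumn("step", steps)],
    components=[rr.components.ScalarBatch(values)],
)
```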
Describe the bug
When loading a large dataset (32 channels, 1 h recording, sampled at 24 kHz) there is a notable reduction in the responsiveness of the viewer. Rerun crashes depending on how much of the data is loaded (more than 10 min). I have been limited to viewing at most 8 channels with a duration of 2 min, and I need to restart Rerun to plot different portions of the dataset. It happens with both float and int datatypes. The goal is to be able to use Rerun as an ephys data viewer, since Rerun handles different data types really well (the use case is multimodal research recordings: video, EMG, raster plots, etc.).
I saw the blog post (https://www.rerun.io/blog/fast-plots) and was wondering if this limitation is due to using Python as the interface to Rerun.
To Reproduce
Steps to reproduce the behavior:
Python script to reproduce (see the corrected scripts in the comments above)
Expected behavior
Loading large datasets without affecting performance
Screenshots
First 2000 points, load time ms
After 20,000 data points per channel (loading speed gets throttled)
Real dataset used
Backtrace
Desktop (please complete the following information):
Rerun version
'rerun_py 0.14.1 [rustc 1.74.0 (79e9716c9 2023-11-13), LLVM 17.0.4] x86_64-pc-windows-msvc release-0.14.1 74f1c23, built 2024-02-29T11:07:43Z'
Python 3.12.2
Additional context