
Max Hyperparameters displayed #6965

Open
kyrxanthos opened this issue Dec 18, 2024 · 5 comments

Comments

@kyrxanthos

I was wondering if there is a maximum number of hyperparameters I can track / see in the UI. Currently I am tracking 35 hparams, but in the UI only the first 30 are included (see screenshot). Those extra 5 are also not shown in the HPARAMS tab.

I am using the default tb.add_hparams(hparams, metrics)

[Screenshot: 2024-12-18 at 5:52:05 PM]
@arcra
Member

arcra commented Dec 20, 2024

AFAIK, there is no limit. The app should initially load up to 1000 parameters.

There's also a "default" sample size per plugin (but it is also not 30), which can be overridden by passing a --samples_per_plugin arg.

See this section and this section in our README.
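
For reference, that flag is passed when starting TensorBoard; the plugin keys and sample counts below are only an illustration (and whether the hparams plugin honors this sampling is worth checking against the README sections above):

tensorboard --logdir /logs --samples_per_plugin "scalars=500,images=10"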

I would suggest checking whether you are setting this argument somewhere, or else validating that you are indeed writing all of the data that you expect.
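
As a sketch of one way to validate that, you can read a run's event files directly with TensorBoard's EventAccumulator and inspect the hparams plugin metadata (the run path below is a placeholder, and this assumes the data was written via add_hparams as in your snippet):

from tensorboard.backend.event_processing import event_accumulator
from tensorboard.plugins.hparams import plugin_data_pb2

# Point this at the directory containing the event file that holds the hparams
# summary (with add_hparams this is typically the timestamped subdirectory it
# creates); the path is a placeholder.
ea = event_accumulator.EventAccumulator("/logs/fit/<run>")
ea.Reload()

# The hparams plugin stores serialized HParamsPluginData in the summary
# metadata, keyed under the "hparams" plugin name.
for tag, content in ea.PluginTagToContent("hparams").items():
    data = plugin_data_pb2.HParamsPluginData.FromString(content)
    if data.HasField("session_start_info"):
        print(tag, sorted(data.session_start_info.hparams.keys()))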

@kyrxanthos
Author

Thanks @arcra, I did confirm that my issue is not a hardcoded max of 30 hyperparameters. Instead, it seems TensorBoard only looks at the number of hyperparameters of the first run in the log directory and keeps that number as its maximum. I have created a minimal reproducible example; if you run it, you will only see 13 hparams appear instead of 23. If you change the order of [10, 20] to [20, 10], you see 23 hparams, which is correct.

I dug a bit into the codebase but couldn't find where this happens. It would be helpful to get your insights on that.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter
import random
import string
import datetime

# Define some initial hyperparameters
epochs = 10
batch_size = 32
lr = 0.001

# Create a dictionary for hyperparameters
hyperparams = {
    "epochs": epochs,
    "batch_size": batch_size,
    "lr": lr,
}


for n_hparams in [10, 20]:
    for i in range(n_hparams):
        key = ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
        value = random.random()
        hyperparams[key] = value

    # Create a log directory
    now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    log_dir = f"/logs/fit/{now}"
    writer = SummaryWriter(log_dir)

    writer.add_hparams(hyperparams, {})


    # Create a simple model
    class SimpleModel(nn.Module):
        def __init__(self):
            super(SimpleModel, self).__init__()
            self.fc1 = nn.Linear(10, 64)
            self.fc2 = nn.Linear(64, 1)

        def forward(self, x):
            x = torch.relu(self.fc1(x))
            x = self.fc2(x)
            return x

    model = SimpleModel()

    # Define loss function and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    # Generate toy data
    x_train = torch.randn(1000, 10)
    y_train = torch.randn(1000, 1)
    x_test = torch.randn(200, 10)
    y_test = torch.randn(200, 1)

    # Train the model
    for epoch in range(epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(x_train)
        loss = criterion(outputs, y_train)
        loss.backward()
        optimizer.step()
        
        # Log training loss
        writer.add_scalar('Loss/train', loss.item(), epoch)
        
        # Validate the model
        model.eval()
        with torch.no_grad():
            val_outputs = model(x_test)
            val_loss = criterion(val_outputs, y_test)
            
        # Log validation loss
        writer.add_scalar('Loss/val', val_loss.item(), epoch)


    writer.close()

Then start TensorBoard with:

tensorboard --logdir /logs

@arcra
Member

arcra commented Dec 21, 2024

I was able to reproduce (btw, you don't need to actually use a model and train it; it's enough to open the summary writer, write the hparams, and close it).
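
For reference, a trimmed-down sketch along those lines (no model, no training) that keeps the same log directory layout as the original script:

import datetime
import random
import string
import time

from torch.utils.tensorboard import SummaryWriter

hyperparams = {"epochs": 10, "batch_size": 32, "lr": 0.001}

for n_hparams in [10, 20]:
    # Grow the dict so the second run holds a superset of the first run's hparams.
    for _ in range(n_hparams):
        key = "".join(random.choices(string.ascii_uppercase + string.digits, k=8))
        hyperparams[key] = random.random()

    now = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
    writer = SummaryWriter(f"/logs/fit/{now}")
    writer.add_hparams(hyperparams, {})
    writer.close()

    # Without the training loop, make sure the two runs still land in
    # distinct timestamped directories.
    time.sleep(1.1)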

I tracked down the issue to an assumption in the code here.

As described in pytorch's documentation for add_hparams, if run_name is unspecified, the current timestamp will be used.

In the case of this repro, when the parameters are written in the first loop, there are only 13 entries in the hyperparameters dictionary, and when they're written in the second loop, there are 33. They're written to two separate runs by default, because they're written at slightly different times, apparently.
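
For illustration, recent pytorch versions also accept an explicit run_name in add_hparams, which replaces that timestamp-based sub-run name (this is just a sketch and doesn't change the plugin-side assumption):

# Hypothetical variant of the repro's call: pin the sub-run name instead of
# relying on the current-timestamp default (requires a pytorch version whose
# add_hparams supports run_name).
writer.add_hparams(hyperparams, {}, run_name=f"hparams_{n_hparams}")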

Now, the code for the hparams plugin was written a while ago, so I don't have context on why this assumption was made. It's possible that it holds for the TensorFlow implementation and pytorch is just doing something different. It's also possible that a case like this was never considered, or perhaps the assumption simply no longer holds for some reason.

I'm not sure how we'll want to address this (as I mentioned this was written a while ago, and it might take a while to look into what should be done). Not to mention we have holidays in the next couple of weeks.

Thanks for reporting, though. If you feel like contributing to this, feel free to let me know; we do accept contributions.

Another note: I noticed that when I had the tensorboard-data-server package installed, the data sometimes happened to be returned in an order such that all of it was there, but this was inconsistent. I'm not sure whether the python implementation (which is used if you don't have that package installed) has the same inconsistent behavior, or whether it reads the data the same way every time.

@arcra
Member

arcra commented Dec 21, 2024

As a follow-up, looking at the documentation for the "summary" implementation from TB's plugin here and here, it does seem like this is related to how pytorch's implementation writes the data.

I suppose the case in which different runs have different sets of hparams (names, not values) might be somewhat uncommon. I honestly don't know how common that is.
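
For comparison, a rough sketch of the TB-native workflow from that documentation, where the full set of hparams is declared up front with hparams_config (which is presumably where the consistent-hparams assumption comes from; names and paths here are illustrative):

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_LR = hp.HParam("lr", hp.RealInterval(1e-4, 1e-1))
HP_BATCH_SIZE = hp.HParam("batch_size", hp.Discrete([16, 32, 64]))

# Declare every hyperparameter once, at the experiment level.
with tf.summary.create_file_writer("/logs/hparam_tuning").as_default():
    hp.hparams_config(
        hparams=[HP_LR, HP_BATCH_SIZE],
        metrics=[hp.Metric("loss", display_name="Loss")],
    )

# Each individual run then records only its values for those hparams.
with tf.summary.create_file_writer("/logs/hparam_tuning/run-0").as_default():
    hp.hparams({HP_LR: 0.001, HP_BATCH_SIZE: 32})
    tf.summary.scalar("loss", 0.42, step=1)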

@kyrxanthos
Author

Thanks @arcra for taking a look at this; indeed, the code line you mentioned is the source of the issue. I might spend some time to see if I can find a clean workaround. Otherwise, if no one takes this up soon and other people have the same issue, I would recommend just changing the timestamp of the run that has the superset of the hparams you want to use to an earlier date. I understand this is not ideal, just a temporary solution that worked for me.
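
A concrete sketch of that workaround for the repro above (this assumes "changing the timestamp" amounts to renaming the timestamp-named run directory so the superset run comes first; the directory names are placeholders):

mv /logs/fit/2024-12-18-18-01-00 /logs/fit/2000-01-01-00-00-00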
