InnerDirichletPartitioner all-zero row_sums #4377

Open
admaio opened this issue Oct 25, 2024 · 6 comments

@admaio

admaio commented Oct 25, 2024

Describe the bug

When the row_sums variable contains zeros, the renormalized class_priors contains NaNs, and the subsequent call to self._rng.choice raises ValueError: probabilities contain NaN

This is the likely problematic part of the InnerDirichletPartitioner file

while True:
    # curr_class = np.argmax(np.random.uniform() <= curr_prior)
    curr_class = self._rng.choice(
        list(range(self._num_unique_classes)), p=current_probabilities
    )
    # Redraw class label if there are no samples left to be allocated from
    # that class
    if class_sizes[curr_class] == 0:
        # Class got exhausted, set probabilities to 0
        class_priors[:, curr_class] = 0
        # Renormalize such that the probability sums to 1
        row_sums = class_priors.sum(axis=1, keepdims=True)
        class_priors = class_priors / row_sums
        # Adjust the current_probabilities (it won't sum up to 1 otherwise)
        current_probabilities = class_priors[current_partition_id]
        continue
    class_sizes[curr_class] -= 1
    # Store sample index at the empty array cell
    index = partition_id_to_left_to_allocate[current_partition_id]
    client_indices[current_partition_id][index] = idx_list[curr_class][
        class_sizes[curr_class]
    ]
    break
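
For reference, the failure mode can be reproduced without the dataset. This is a minimal sketch using plain NumPy, not the library's code: the helper name `renormalize_after_exhaustion` and the toy prior matrix are made up for illustration. It mirrors the zero-then-renormalize step from the snippet above and shows that a row whose probability mass sat entirely on the exhausted class ends up as 0/0.

```python
import numpy as np


def renormalize_after_exhaustion(class_priors, exhausted_class):
    """Zero out one class column and renormalize rows, as in the snippet above."""
    priors = class_priors.copy()
    priors[:, exhausted_class] = 0
    row_sums = priors.sum(axis=1, keepdims=True)
    # Silence the RuntimeWarning here; the 0/0 division still produces NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return priors / row_sums


# Toy priors (hypothetical): partition 0 puts all of its mass on class 0.
priors = np.array([[1.0, 0.0], [0.5, 0.5]])
renormalized = renormalize_after_exhaustion(priors, exhausted_class=0)

rng = np.random.default_rng(3)
try:
    rng.choice([0, 1], p=renormalized[0])
    raised = False
except ValueError:
    # Generator.choice rejects probability vectors that contain NaN.
    raised = True

print(renormalized)
print("choice raised ValueError:", raised)
```

Whether this is exactly what happens with UNSW-NB15 is not confirmed here; the sketch only shows that a zero row sum is sufficient to trigger the error.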

Steps/Code to Reproduce

Load UNSW-NB15 dataset in a Pandas DataFrame, then

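from datasets import Dataset
from flwr_datasets.partitioner import InnerDirichletPartitioner

# `data` is the pandas DataFrame holding UNSW-NB15
dataset = Dataset.from_pandas(data)

innerdir_partitioner = InnerDirichletPartitioner(
    partition_sizes=[int(len(dataset)/2)]*2, partition_by="label", alpha=.2, shuffle=True, seed=3
)

innerdir_partitioner.dataset = dataset

partition = innerdir_partitioner.load_partition(partition_id=0)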

Expected Results

A set of partitions.

Actual Results

It crashes with error

File ~/git/netanomaly-fl/.venv/lib/python3.10/site-packages/flwr_datasets/partitioner/inner_dirichlet_partitioner.py:118, in InnerDirichletPartitioner.load_partition(self, partition_id)
116 self._determine_num_unique_classes_if_needed()
117 self._alpha = self._initialize_alpha_if_needed(self._initial_alpha)
--> 118 self._determine_partition_id_to_indices_if_needed()
119 return self.dataset.select(self._partition_id_to_indices[partition_id])

File ~/git/netanomaly-fl/.venv/lib/python3.10/site-packages/flwr_datasets/partitioner/inner_dirichlet_partitioner.py:234, in InnerDirichletPartitioner._determine_partition_id_to_indices_if_needed(self)
231 current_probabilities = class_priors[current_partition_id]
232 while True:
233 # curr_class = np.argmax(np.random.uniform() <= curr_prior)
--> 234 curr_class = self._rng.choice(
235 list(range(self._num_unique_classes)), p=current_probabilities
236 )
237 # Redraw class label if there are no samples left to be allocated from
238 # that class
239 if class_sizes[curr_class] == 0:
240 # Class got exhausted, set probabilities to 0

File numpy/random/_generator.pyx:824, in numpy.random._generator.Generator.choice()

@admaio admaio added the bug Something isn't working label Oct 25, 2024
@jafermarq
Contributor

Hi @admaio, thanks for opening this issue. Could it be that your dataset doesn't have enough instances of all classes to meet the alpha=0.2 requirement? Does it work if you raise the alpha value?
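
One rough way to check this is to compare the number of samples available per class with what the Dirichlet priors would ask for on average. The sketch below is an assumption-laden diagnostic, not part of flwr-datasets: the function name `expected_class_shortfall` is hypothetical, it draws its own symmetric Dirichlet priors (so the numbers will not match the partitioner's internal draws), and it only reports expected demand.

```python
import numpy as np


def expected_class_shortfall(labels, partition_sizes, alpha, seed=0):
    """Expected per-class demand minus available samples (positive = shortfall)."""
    classes, counts = np.unique(np.asarray(labels), return_counts=True)
    rng = np.random.default_rng(seed)
    # One symmetric Dirichlet draw per partition (an assumption of this sketch).
    priors = rng.dirichlet(np.full(len(classes), alpha), size=len(partition_sizes))
    # Expected number of samples requested from each class over all partitions.
    demand = np.asarray(partition_sizes) @ priors
    return dict(zip(classes.tolist(), (demand - counts).tolist()))


# Toy imbalanced labels: 90 samples of class 0, 10 of class 1.
toy_labels = [0] * 90 + [1] * 10
shortfall = expected_class_shortfall(toy_labels, [50, 50], alpha=100)
print(shortfall)
```

On imbalanced labels, a large alpha pushes each partition's prior toward uniform, so the minority class can be over-requested even though alpha is high.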

@jafermarq jafermarq added the part: flwr-dataset Improvement or additions to flwr-dataset label Oct 29, 2024
@admaio
Author

admaio commented Nov 7, 2024

I would imagine that is the cause: a class runs out of samples, and then there is a division by zero.

I also tried alpha=100 and alpha=10000, and it fails with the same error:

.venv/lib/python3.10/site-packages/flwr_datasets/partitioner/inner_dirichlet_partitioner.py:244: RuntimeWarning: invalid value encountered in divide
class_priors = class_priors / row_sums
...
ValueError: probabilities contain NaN

by executing:

try_partitioner = InnerDirichletPartitioner(
    partition_sizes=[int(len(dataset)/2)]*2, partition_by="label", alpha=100, shuffle=True, seed=3
)

For the InnerDirichletPartitioner, I suppose the intended behavior is that each generated local dataset's label proportions are controlled by a sample from a Dirichlet distribution of dimension n_classes. One idea to fix this would be to normalize the local Dirichlet samples across the whole system (i.e., making the local proportions a marginalization of a system-wide joint distribution), or to have more of an "InnerDirichletSampler" that allows sample repetition.
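
To make the "sampler with repetition" idea concrete, here is a small sketch in plain NumPy. It is not an existing flwr-datasets API: the function name `sample_partition_with_replacement` and its parameters are hypothetical. Because indices are drawn with replacement, no class can run out, so no renormalization step is needed; the trade-off is that partitions may contain duplicates and overlap with each other.

```python
import numpy as np


def sample_partition_with_replacement(labels, partition_size, alpha, rng):
    """Draw indices for one partition; class proportions follow a Dirichlet sample."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Label distribution for this partition (symmetric concentration alpha).
    prior = rng.dirichlet(np.full(len(classes), alpha))
    # Pick a class for every slot in the partition.
    drawn = rng.choice(len(classes), size=partition_size, p=prior)
    indices = np.empty(partition_size, dtype=np.int64)
    for position, cls in enumerate(classes):
        pool = np.flatnonzero(labels == cls)
        mask = drawn == position
        # Sampling with replacement: a class pool is never exhausted.
        indices[mask] = rng.choice(pool, size=int(mask.sum()), replace=True)
    return indices


rng = np.random.default_rng(3)
toy_labels = [0, 0, 0, 1, 1]
indices = sample_partition_with_replacement(toy_labels, 50, alpha=0.2, rng=rng)
print(indices[:10])
```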

Thanks for the support

@jafermarq
Contributor

jafermarq commented Nov 30, 2024

> I would imagine that is the cause: a class runs out of samples, and then there is a division by zero

That sounds like a plausible reason. @adam-narozniak, is there something else you'd recommend @admaio test? Is this something we could implement a fix for?
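
One possible shape for such a fix is to guard the renormalization so that a zero row sum never reaches the division. The sketch below is only an illustration under assumptions, not the maintainers' fix: the function name `renormalize_with_fallback` is hypothetical, and the fallback policy (use the proportions of the samples still available) changes the label distribution of the affected partition, which may or may not be acceptable.

```python
import numpy as np


def renormalize_with_fallback(class_priors, class_sizes):
    """Zero exhausted classes and renormalize; rows left with no mass fall back
    to the proportions of the samples that are still available."""
    priors = np.asarray(class_priors, dtype=float).copy()
    sizes = np.asarray(class_sizes, dtype=float)
    if sizes.sum() == 0:
        raise ValueError("No samples left to allocate.")
    priors[:, sizes == 0] = 0.0
    row_sums = priors.sum(axis=1, keepdims=True)
    empty_rows = row_sums[:, 0] == 0
    # Assumed fallback policy: sample proportionally to the remaining class sizes.
    priors[empty_rows] = sizes / sizes.sum()
    row_sums[empty_rows] = 1.0
    return priors / row_sums


# Toy case (hypothetical): class 0 is exhausted and partition 0 only wanted class 0.
fixed = renormalize_with_fallback([[1.0, 0.0], [0.5, 0.5]], class_sizes=[0, 4])
print(fixed)
```

Raising a descriptive error instead of falling back would be an equally reasonable choice if the partitioner should refuse configurations it cannot satisfy.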

@adam-narozniak adam-narozniak self-assigned this Dec 6, 2024
@adam-narozniak
Contributor

I'm going to check it out at the end of next week.

@WilliamLindskog WilliamLindskog added the state: under review Currently reviewing issue/PR label Dec 11, 2024
@adam-narozniak
Contributor

@admaio could you provide the full error? I think the bottom of it didn't get pasted: there's no indication of what the error is, just a stack trace.
Also, did you use the whole UNSW-NB15 dataset? I tried UNSW-NB15_1.csv from link and couldn't reproduce the error.

@WilliamLindskog
Contributor

Hi @admaio,

Thanks for raising this. Are you still experiencing this issue?
