Add deduplication of captions #1104

Merged: 1 commit merged into bghira:main from deduplicate-captions on Oct 27, 2024

Conversation

@mhirki (Contributor) commented Oct 27, 2024

This pull request adds deduplication of captions. I ran a little benchmark comparing two deduplication methods with 9997 captions from the pseudo-camera-10k dataset.

captions = list(dict.fromkeys(captions))
Deduplication complete in 875493 ns.
Deduplication complete in 865751 ns.
Deduplication complete in 880586 ns.

captions = list(set(captions))
Deduplication complete in 1007244 ns.
Deduplication complete in 943887 ns.
Deduplication complete in 936598 ns.

So building a dictionary is the faster of the two methods here. These times are in nanoseconds, meaning that 9997 captions were deduplicated in less than 1 millisecond.
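
For reference, here is a minimal sketch of the kind of timing harness that produces output like the above (my reconstruction for illustration, not the exact code from this PR; the stand-in captions list is hypothetical):

import time

# Stand-in data: 9997 captions containing some duplicates (hypothetical)
captions = [f"caption {i % 9000}" for i in range(9997)]

start_time = time.time_ns()
captions = list(dict.fromkeys(captions))
end_time = time.time_ns()
print(f"Deduplication complete in {end_time - start_time} ns.")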

@bghira (Owner) commented Oct 27, 2024

it doesn't scale the same way beyond the 10k entries it starts out at:

test case

import time
import random
import string

# Generate a list of 700,000 random string entries with some duplicates
entries = [''.join(random.choices(string.ascii_letters + string.digits, k=10)) for _ in range(500000)]
entries += entries[:200000]  # add duplicates to make it exactly 700,000 entries
random.shuffle(entries)

# Benchmark function to time deduplication method
def benchmark_deduplication(method, entries, trials=3):
    times = []
    for _ in range(trials):
        start_time = time.time_ns()
        deduped_entries = method(entries)
        end_time = time.time_ns()
        times.append(end_time - start_time)
    average_time = sum(times) / len(times)
    return average_time, len(deduped_entries)

# Deduplication methods
def deduplicate_dict(entries):
    return list(dict.fromkeys(entries))

def deduplicate_set(entries):
    return list(set(entries))

# Run benchmarks
dict_time, dict_len = benchmark_deduplication(deduplicate_dict, entries)
set_time, set_len = benchmark_deduplication(deduplicate_set, entries)

# Average times (ns) and deduplicated lengths
print(dict_time, set_time, dict_len, set_len)

Results:

Using list(dict.fromkeys(entries)): Average time: 525,460,000 ns
Using list(set(entries)): Average time: 268,889,000 ns

@mhirki mhirki marked this pull request as draft October 27, 2024 14:03
@mhirki (Contributor, Author) commented Oct 27, 2024

Alright, I ran your test case on my end and the results are pretty much the same: the set method is clearly faster at this scale. Good thing you wrote that test case. I'll update my pull request.

>>> dict_time, set_time, dict_len, set_len
(171640868.0, 91814939.66666667, 500000, 500000)

@mhirki mhirki force-pushed the deduplicate-captions branch from 4c3fbc9 to f1b1073 on October 27, 2024 14:09
@bghira (Owner) commented Oct 27, 2024

yeah, for my datasets with multiple millions of images it actually takes several minutes to run the dict method :')

sets are notoriously OP

@mhirki mhirki marked this pull request as ready for review October 27, 2024 14:11
@bghira bghira merged commit 6962afd into bghira:main Oct 27, 2024
1 check passed
@bghira (Owner) commented Oct 30, 2024

i have to revert this as it prevents full caching on multigpu systems. not sure why...

@mhirki (Contributor, Author) commented Oct 30, 2024

Python set iteration order is effectively randomized between processes: string hashing is seeded per interpreter (hash randomization), so each process iterates the same set in a different order. On a multigpu system, each process will therefore end up with its own differently ordered list of captions. That's probably a bad thing.

The dict method is guaranteed to maintain the insertion order of the list (since Python 3.7) but obviously comes at a performance penalty.
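
To see this concretely, here is a small sketch (added for illustration, not code from this PR) that runs the same deduplication under different PYTHONHASHSEED values, each child interpreter standing in for a separate worker process. The set order typically shifts between seeds, while dict.fromkeys always preserves input order:

import os
import subprocess
import sys

# Snippet run in child interpreters; the captions are hypothetical stand-ins.
snippet = (
    "captions = ['a cat', 'a dog', 'a bird', 'a fish', 'a cat']\n"
    "print('set :', list(set(captions)))\n"
    "print('dict:', list(dict.fromkeys(captions)))\n"
)

for seed in ("1", "2", "3"):
    # Each seed simulates a process with its own hash randomization,
    # which is what each rank on a multigpu system effectively gets.
    env = {**os.environ, "PYTHONHASHSEED": seed}
    result = subprocess.run([sys.executable, "-c", snippet],
                            env=env, capture_output=True, text=True)
    print(f"PYTHONHASHSEED={seed}")
    print(result.stdout, end="")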
