Add deduplication of captions #1104

Merged: 1 commit merged into bghira:main from deduplicate-captions on Oct 27, 2024

Conversation

@mhirki (Contributor) commented Oct 27, 2024

This pull request adds deduplication of captions. I ran a little benchmark comparing two deduplication methods with 9997 captions from the pseudo-camera-10k dataset.

captions = list(dict.fromkeys(captions))
Deduplication complete in 875493 ns.
Deduplication complete in 865751 ns.
Deduplication complete in 880586 ns.

captions = list(set(captions))
Deduplication complete in 1007244 ns.
Deduplication complete in 943887 ns.
Deduplication complete in 936598 ns.

So building a dictionary is the faster of the two methods here. These times are in nanoseconds, meaning that 9997 captions were deduplicated in less than 1 millisecond.
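
For reference, here is a minimal sketch of the kind of timing harness that produces output like the above (my reconstruction for illustration, not the exact code from this PR; the stand-in captions list is hypothetical):

import time

# Stand-in data: 9997 captions containing some duplicates (hypothetical)
captions = [f"caption {i % 9000}" for i in range(9997)]

start_time = time.time_ns()
captions = list(dict.fromkeys(captions))
end_time = time.time_ns()
print(f"Deduplication complete in {end_time - start_time} ns.")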

@bghira (Owner) commented Oct 27, 2024

it doesn't scale the same way beyond the 10k entries it starts out at:

test case

import time
import random
import string

# Generate a list of 700,000 random string entries with some duplicates
entries = [''.join(random.choices(string.ascii_letters + string.digits, k=10)) for _ in range(500000)]
entries += entries[:200000]  # add duplicates to make it exactly 700,000 entries
random.shuffle(entries)

# Benchmark function to time deduplication method
def benchmark_deduplication(method, entries, trials=3):
    times = []
    for _ in range(trials):
        start_time = time.time_ns()
        deduped_entries = method(entries)
        end_time = time.time_ns()
        times.append(end_time - start_time)
    average_time = sum(times) / len(times)
    return average_time, len(deduped_entries)

# Deduplication methods
def deduplicate_dict(entries):
    return list(dict.fromkeys(entries))

def deduplicate_set(entries):
    return list(set(entries))

# Run benchmarks
dict_time, dict_len = benchmark_deduplication(deduplicate_dict, entries)
set_time, set_len = benchmark_deduplication(deduplicate_set, entries)

# Average times (ns) and deduplicated lengths
print(dict_time, set_time, dict_len, set_len)

Results:

Using list(dict.fromkeys(entries)): Average time: 525,460,000 ns
Using list(set(entries)): Average time: 268,889,000 ns

@mhirki mhirki marked this pull request as draft October 27, 2024 14:03
@mhirki (Contributor, Author) commented Oct 27, 2024

Alright, I ran your test case on my end and the results are pretty much the same: the set method is clearly faster at this scale. Good thing you wrote that test case. I'll update my pull request.

>>> dict_time, set_time, dict_len, set_len
(171640868.0, 91814939.66666667, 500000, 500000)

@mhirki mhirki force-pushed the deduplicate-captions branch from 4c3fbc9 to f1b1073 on October 27, 2024 14:09
@bghira (Owner) commented Oct 27, 2024

yeah, for my datasets with multiple millions of images it actually takes several minutes to run the dict method :')

sets are notoriously OP

@mhirki mhirki marked this pull request as ready for review October 27, 2024 14:11
@bghira bghira merged commit 6962afd into bghira:main Oct 27, 2024
1 check passed
@bghira (Owner) commented Oct 30, 2024

i have to revert this as it prevents full caching on multigpu systems. not sure why...

@mhirki (Contributor, Author) commented Oct 30, 2024

Python set iteration order is effectively randomized between processes: string hashing is seeded per interpreter (hash randomization), so each process iterates the same set in a different order. On a multigpu system, each process will therefore end up with its own differently ordered list of captions. That's probably a bad thing.

The dict method is guaranteed to maintain the insertion order of the list (since Python 3.7) but obviously comes at a performance penalty.
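
To see this concretely, here is a small sketch (added for illustration, not code from this PR) that runs the same deduplication under different PYTHONHASHSEED values, each child interpreter standing in for a separate worker process. The set order typically shifts between seeds, while dict.fromkeys always preserves input order:

import os
import subprocess
import sys

# Snippet run in child interpreters; the captions are hypothetical stand-ins.
snippet = (
    "captions = ['a cat', 'a dog', 'a bird', 'a fish', 'a cat']\n"
    "print('set :', list(set(captions)))\n"
    "print('dict:', list(dict.fromkeys(captions)))\n"
)

for seed in ("1", "2", "3"):
    # Each seed simulates a process with its own hash randomization,
    # which is what each rank on a multigpu system effectively gets.
    env = {**os.environ, "PYTHONHASHSEED": seed}
    result = subprocess.run([sys.executable, "-c", snippet],
                            env=env, capture_output=True, text=True)
    print(f"PYTHONHASHSEED={seed}")
    print(result.stdout, end="")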
