
Optimize alignments #703

Merged 12 commits into release on Jun 27, 2024

Conversation

@eu9ene eu9ene commented Jun 25, 2024

  • Add Moses tokenization of the corpus to reduce the vocabulary and increase accuracy
  • Remap the calculated alignments back to the whitespace-tokenized ones expected by OpusTrainer (see the sketch after this list)
  • Use a fast C++ tokenizer
  • Use parallelization with multiprocessing to speed things up
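
A minimal sketch of the remapping idea, for illustration only: the helper names below are mine, and they assume Moses tokenization only splits within whitespace-delimited tokens (true once escaping is disabled), so concatenating Moses tokens in order reconstructs each whitespace token.

def token_mapping(ws_tokens, moses_tokens):
    # Map each Moses-token index to the whitespace-token index it came from.
    mapping, ws_i, buf = [], 0, ""
    for tok in moses_tokens:
        mapping.append(ws_i)
        buf += tok
        if buf == ws_tokens[ws_i]:
            ws_i, buf = ws_i + 1, ""
    return mapping


def remap_alignment(aln_line, src_map, trg_map):
    # Rewrite "i-j" pairs from Moses-token space to whitespace-token space,
    # dropping duplicates that collapse onto the same pair.
    out, seen = [], set()
    for pair in aln_line.split():
        i, j = (int(x) for x in pair.split("-"))
        ws_pair = (src_map[i], trg_map[j])
        if ws_pair not in seen:
            seen.add(ws_pair)
            out.append(f"{ws_pair[0]}-{ws_pair[1]}")
    return " ".join(out)


# e.g. ws = ["Hello,", "world!"], moses = ["Hello", ",", "world", "!"]
# token_mapping(ws, moses) -> [0, 0, 1, 1]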

Known issues:

  • I tested it on a 400M student corpus on a 512GB machine without priors and it didn't OOM, but it took ~2 days to complete
  • We can likely make it more scalable by processing the corpus in chunks, especially if we use the priors pre-calculated on the teacher step (see the sketch after this list). However, it won't help with preemptions unless we also split the work into multiple tasks
  • To run it with existing teachers we'll need to separately run target: alignments-original to recalculate the priors with the new tokenization, and then run it again with target: all and existing_tasks: { "alignments-original-src-trg": "<task_id>" }
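
A rough sketch of the chunking idea; align_with_priors is a hypothetical stand-in for running the aligner with fixed priors, not an actual eflomal API:

from itertools import islice

def chunked(lines, n=1_000_000):
    # Yield successive n-line chunks so peak memory is bounded by the
    # chunk size rather than by the full corpus.
    it = iter(lines)
    while chunk := list(islice(it, n)):
        yield chunk

# With priors fixed ahead of time, each chunk can be aligned independently
# and the outputs concatenated in order:
#
#   for chunk in chunked(corpus):
#       alignments.extend(align_with_priors(chunk, priors))  # hypothetical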

fixes #507
fixes #663

@eu9ene eu9ene requested a review from gregtatum June 25, 2024 21:30
@eu9ene eu9ene requested a review from a team as a code owner June 25, 2024 21:30
@eu9ene eu9ene requested a review from jcristau June 25, 2024 21:30

eu9ene commented Jun 25, 2024

The tests will fail due to #689


eu9ene commented Jun 25, 2024

It turned out to be hard to restart only the alignments-original task, so I added an extra task to recalculate the priors for the student alignments (in a separate branch). I'm testing it here for en-uk: https://firefox-ci-tc.services.mozilla.com/tasks/groups/eZKkxqHISTCDwrsylAZZvA. This way we won't need to rerun things multiple times.

@bhearsum bhearsum left a comment

The taskgraph parts look fine to me.

@bhearsum bhearsum removed the request for review from jcristau June 26, 2024 13:34

@gregtatum gregtatum left a comment

Thanks for taking the time to make really clean commits here. It made it really easy to review these changes. This is some nice work.

# send lines to worker processes in chunks
for aln in tqdm(pool.imap(remap_line, lines, chunksize=10000), mininterval=10):
    output.write(aln)
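
For context, a self-contained sketch of this worker-pool pattern; the file names and the trivial remap_line body are placeholders rather than the PR's actual code:

import multiprocessing

from tqdm import tqdm


def remap_line(line):
    # Placeholder: the real function remaps one line of alignments.
    return line


if __name__ == "__main__":
    with open("aln.moses.txt") as lines, open("aln.ws.txt", "w") as output:
        with multiprocessing.Pool() as pool:
            # imap preserves input order so output lines stay aligned with the
            # corpus; chunksize batches lines per worker to cut IPC overhead.
            for aln in tqdm(pool.imap(remap_line, lines, chunksize=10000), mininterval=10):
                output.write(aln)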

@gregtatum commented:

Thought:

Something to measure process memory on tasks with many steps could be interesting.

Something like:

import psutil


def print_memory():
    processes = []

    # Collect process information
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        try:
            pid = proc.info["pid"]
            name = proc.info["name"]
            memory_info = proc.info["memory_info"]
            memory_usage = memory_info.rss
            processes.append((name, pid, memory_usage))
        except Exception:
            print("Failed to get process information")

    # Sort the processes based on memory usage (descending)
    processes.sort(key=lambda x: x[2], reverse=True)

    # Calculate the maximum length of "name (pid)" for alignment
    display_names = [f"{name} ({pid})" for name, pid, _ in processes]
    max_length = max(len(display_name) for display_name in display_names)

    # Print the sorted process information with right-padding
    for proc_info, display_name in zip(processes, display_names):
        _, _, memory_usage = proc_info
        print(f"{display_name.ljust(max_length)} {memory_usage / (1024 * 1024):.2f} MB")


if __name__ == "__main__":
    print_memory()

Example output:

plugin-container (6622)                              1081.06 MB
firefox (6609)                                       963.39 MB
plugin-container (6618)                              845.61 MB
plugin-container (6619)                              734.81 MB
iTerm2 (88484)                                       540.56 MB
Slack Helper (Renderer) (49150)                      515.27 MB
Code Helper (Renderer) (47144)                       450.91 MB
plugin-container (81278)                             438.48 MB
plugin-container (6621)                              434.69 MB
plugin-container (17813)                             336.47 MB
plugin-container (6620)                              333.00 MB

@eu9ene commented:

Agreed, proper measurement would definitely be interesting, but since the main issue is inside eflomal's C++ code, we can't do much about it anyway. We could run such a tool locally on a 10M dataset if you're interested.

@eu9ene eu9ene merged commit 1ba10cf into release Jun 27, 2024
33 checks passed
eu9ene added a commit that referenced this pull request Aug 27, 2024
* Add fast Moses tokenizer

* Tokenize corpus and remap alignments

* Use moses tokenizer in taskcluster

* Add tests for index mapping

* Add package to build fast moses tokenizer

* Fix an issue with LD_LIBRARY_PATH for fast moses tokenizer

* Rename tokenization function

* Rename chunking parameter

* Relock poetry

* Rerun linter
eu9ene added a commit that referenced this pull request Aug 28, 2024
eu9ene added a commit that referenced this pull request Aug 29, 2024
eu9ene added a commit that referenced this pull request Aug 29, 2024