
Optimize alignments #703

Merged 12 commits into release on Jun 27, 2024

Conversation

@eu9ene eu9ene commented Jun 25, 2024

  • Add Moses tokenization of the corpus to reduce the vocabulary and increase accuracy
  • Remap the calculated alignments back to the whitespace-tokenized ones expected by OpusTrainer (see the sketch after this list)
  • Use a fast C++ tokenizer
  • Use parallelization with multiprocessing to speed things up
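
A minimal sketch of the remapping idea, for illustration only: the helper names below are mine, and they assume Moses tokenization only splits within whitespace-delimited tokens (true once escaping is disabled), so concatenating Moses tokens in order reconstructs each whitespace token.

def token_mapping(ws_tokens, moses_tokens):
    # Map each Moses-token index to the whitespace-token index it came from.
    mapping, ws_i, buf = [], 0, ""
    for tok in moses_tokens:
        mapping.append(ws_i)
        buf += tok
        if buf == ws_tokens[ws_i]:
            ws_i, buf = ws_i + 1, ""
    return mapping


def remap_alignment(aln_line, src_map, trg_map):
    # Rewrite "i-j" pairs from Moses-token space to whitespace-token space,
    # dropping duplicates that collapse onto the same pair.
    out, seen = [], set()
    for pair in aln_line.split():
        i, j = (int(x) for x in pair.split("-"))
        ws_pair = (src_map[i], trg_map[j])
        if ws_pair not in seen:
            seen.add(ws_pair)
            out.append(f"{ws_pair[0]}-{ws_pair[1]}")
    return " ".join(out)


# e.g. ws = ["Hello,", "world!"], moses = ["Hello", ",", "world", "!"]
# token_mapping(ws, moses) -> [0, 0, 1, 1]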

Known issues:

  • I tested it on a 400M student corpus on a 512GB machine without priors and it didn't OOM, but it took ~2 days to complete
  • We can likely make it more scalable by processing the corpus in chunks, especially if we use the priors pre-calculated on the teacher step (see the sketch after this list). However, it won't help with preemptions unless we also split the work into multiple tasks
  • To run it with existing teachers we'll need to separately run target: alignments-original to recalculate the priors with the new tokenization, and then run it again with target: all and existing_tasks: { "alignments-original-src-trg": "<task_id>" }
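
A rough sketch of the chunking idea; align_with_priors is a hypothetical stand-in for running the aligner with fixed priors, not an actual eflomal API:

from itertools import islice

def chunked(lines, n=1_000_000):
    # Yield successive n-line chunks so peak memory is bounded by the
    # chunk size rather than by the full corpus.
    it = iter(lines)
    while chunk := list(islice(it, n)):
        yield chunk

# With priors fixed ahead of time, each chunk can be aligned independently
# and the outputs concatenated in order:
#
#   for chunk in chunked(corpus):
#       alignments.extend(align_with_priors(chunk, priors))  # hypothetical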

fixes #507
fixes #663

@eu9ene eu9ene requested a review from gregtatum June 25, 2024 21:30
@eu9ene eu9ene requested a review from a team as a code owner June 25, 2024 21:30
@eu9ene eu9ene requested a review from jcristau June 25, 2024 21:30

eu9ene commented Jun 25, 2024

The tests will fail due to #689


eu9ene commented Jun 25, 2024

It turned out to be hard to restart only the alignments-original task, so I added an extra task to recalculate the priors for the student alignments (in a separate branch). I'm testing it here for en-uk: https://firefox-ci-tc.services.mozilla.com/tasks/groups/eZKkxqHISTCDwrsylAZZvA. This way we won't need to rerun things multiple times.

@bhearsum bhearsum left a comment

The taskgraph parts look fine to me.

@bhearsum bhearsum removed the request for review from jcristau June 26, 2024 13:34

@gregtatum gregtatum left a comment

Thanks for taking the time to make really clean commits here. It made it really easy to review these changes. This is some nice work.

# send lines to worker processes in chunks
for aln in tqdm(pool.imap(remap_line, lines, chunksize=10000), mininterval=10):
    output.write(aln)
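
For context, a self-contained sketch of this worker-pool pattern; the file names and the trivial remap_line body are placeholders rather than the PR's actual code:

import multiprocessing

from tqdm import tqdm


def remap_line(line):
    # Placeholder: the real function remaps one line of alignments.
    return line


if __name__ == "__main__":
    with open("aln.moses.txt") as lines, open("aln.ws.txt", "w") as output:
        with multiprocessing.Pool() as pool:
            # imap preserves input order so output lines stay aligned with the
            # corpus; chunksize batches lines per worker to cut IPC overhead.
            for aln in tqdm(pool.imap(remap_line, lines, chunksize=10000), mininterval=10):
                output.write(aln)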

@gregtatum commented:

Thought:

Something to measure process memory on tasks with many steps could be interesting.

Something like:

import psutil


def print_memory():
    processes = []

    # Collect process information
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        try:
            pid = proc.info["pid"]
            name = proc.info["name"]
            memory_info = proc.info["memory_info"]
            memory_usage = memory_info.rss
            processes.append((name, pid, memory_usage))
        except Exception:
            print("Failed to get process information")

    # Sort the processes based on memory usage (descending)
    processes.sort(key=lambda x: x[2], reverse=True)

    # Calculate the maximum length of "name (pid)" for alignment
    display_names = [f"{name} ({pid})" for name, pid, _ in processes]
    max_length = max(len(display_name) for display_name in display_names)

    # Print the sorted process information with right-padding
    for proc_info, display_name in zip(processes, display_names):
        _, _, memory_usage = proc_info
        print(f"{display_name.ljust(max_length)} {memory_usage / (1024 * 1024):.2f} MB")


if __name__ == "__main__":
    print_memory()

Example output:

plugin-container (6622)                              1081.06 MB
firefox (6609)                                       963.39 MB
plugin-container (6618)                              845.61 MB
plugin-container (6619)                              734.81 MB
iTerm2 (88484)                                       540.56 MB
Slack Helper (Renderer) (49150)                      515.27 MB
Code Helper (Renderer) (47144)                       450.91 MB
plugin-container (81278)                             438.48 MB
plugin-container (6621)                              434.69 MB
plugin-container (17813)                             336.47 MB
plugin-container (6620)                              333.00 MB

@eu9ene commented:

Agreed, proper measurement would definitely be interesting, but since the main issue is inside eflomal's C++ code, we can't do much about it anyway. We could run such a tool locally on a 10M dataset if you're interested.

@eu9ene eu9ene merged commit 1ba10cf into release Jun 27, 2024
33 checks passed
eu9ene added a commit that referenced this pull request Aug 27, 2024
* Add fast Moses tokenizer

* Tokenize corpus and remap alignments

* Use moses tokenizer in taskcluster

* Add tests for index mapping

* Add package to build fast moses tokenizer

* Fix an issue with LD_LIBRARY_PATH for fast moses tokenizer

* Rename tokenization function

* Rename chunking parameter

* Relock poetry

* Rerun linter
eu9ene added a commit that referenced this pull request Aug 28, 2024
eu9ene added a commit that referenced this pull request Aug 29, 2024
eu9ene added a commit that referenced this pull request Aug 29, 2024