Optimize alignments #703
Conversation
The tests will fail due to #689
It turned out to be hard to restart only the alignments-original, so I added an extra task to recalculate the priors for the student alignments (in a separate branch). I'm testing it here for en-uk https://firefox-ci-tc.services.mozilla.com/tasks/groups/eZKkxqHISTCDwrsylAZZvA. This way we won't need to rerun things multiple times.
The taskgraph parts look fine to me.
Thanks for taking the time to make really clean commits here. It made it really easy to review these changes. This is some nice work.
# send lines to worker processes in chunks
for aln in tqdm(pool.imap(remap_line, lines, chunksize=10000), mininterval=10):
    output.write(aln)
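As a rough illustration of the chunked `pool.imap` pattern used above (a minimal, self-contained sketch — `remap_line` here is a stand-in, not the pipeline's real remapping function):

```python
# Sketch of streaming lines through a process pool in chunks.
# imap yields results lazily and in order; chunksize batches the
# inter-process messages, which matters on corpora with millions of lines.
from multiprocessing import Pool


def remap_line(line: str) -> str:
    # stand-in for the real alignment-remapping logic
    return line.upper()


def remap_all(lines, chunksize=10000):
    with Pool() as pool:
        return list(pool.imap(remap_line, lines, chunksize=chunksize))


if __name__ == "__main__":
    print(remap_all(["a-b 0-1", "c-d 1-2"], chunksize=2))
```

A larger `chunksize` cuts IPC overhead per line but holds more pending results in memory at once, so it is a throughput/memory trade-off.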
Thought:
Something to measure process memory on tasks with many steps could be interesting.
Something like:
import psutil


def print_memory():
    processes = []
    # Collect process information
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        try:
            pid = proc.info["pid"]
            name = proc.info["name"]
            memory_usage = proc.info["memory_info"].rss
            processes.append((name, pid, memory_usage))
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            # process exited or is not readable; skip it
            continue
    # Sort the processes based on memory usage (descending)
    processes.sort(key=lambda x: x[2], reverse=True)
    # Calculate the maximum length of "name (pid)" for alignment
    display_names = [f"{name} ({pid})" for name, pid, _ in processes]
    max_length = max(len(display_name) for display_name in display_names)
    # Print the sorted process information with right-padding
    for (_, _, memory_usage), display_name in zip(processes, display_names):
        print(f"{display_name.ljust(max_length)} {memory_usage / (1024 * 1024):.2f} MB")


if __name__ == "__main__":
    print_memory()
plugin-container (6622) 1081.06 MB
firefox (6609) 963.39 MB
plugin-container (6618) 845.61 MB
plugin-container (6619) 734.81 MB
iTerm2 (88484) 540.56 MB
Slack Helper (Renderer) (49150) 515.27 MB
Code Helper (Renderer) (47144) 450.91 MB
plugin-container (81278) 438.48 MB
plugin-container (6621) 434.69 MB
plugin-container (17813) 336.47 MB
plugin-container (6620) 333.00 MB
Agreed, proper measurement would definitely be interesting, but since the main issue is inside the C++ code in eflomal, we can't do much about it anyway. We can run such a tool locally on a 10M dataset if you're interested.
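For a quick local check like the one suggested above, the peak memory of a child command (e.g. an eflomal run) can be read with the stdlib `resource` module — no polling needed. This is a hedged sketch; the command below is illustrative, and on Linux `ru_maxrss` is reported in kilobytes:

```python
# Sketch: peak RSS of a finished child command via resource.getrusage.
# RUSAGE_CHILDREN reports the maximum over all children waited on so far.
import resource
import subprocess


def child_peak_rss_kb(cmd):
    subprocess.run(cmd, check=True)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # illustrative child that allocates ~50 MB
    kb = child_peak_rss_kb(["python3", "-c", "x = bytearray(50_000_000)"])
    print(f"peak child RSS: {kb / 1024:.1f} MB")
```

Unlike sampling with `psutil`, this captures the true peak, but only after the child exits and only for direct children the interpreter waited on.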
* Add fast Moses tokenizer
* Tokenize corpus and remap alignments
* Use moses tokenizer in taskcluster
* Add tests for index mapping
* Add packages to build fast moses tokenizer
* Fix an issue with LD_LIBRARY_PATH for fast moses tokenizer
* Rename tokenization function
* Rename chunking parameter
* Relock poetry
* Rerun linter
Known issues:
Run with `target: alignments-original` to recalculate the priors with the new tokenization, and then run it again with `target: all` and `existing_tasks: { "alignments-original-src-trg": "<task_id>" }`.
fixes #507
fixes #663