Investigate distillation quality gap #231

eu9ene · 2023-10-24T18:30:25Z

After training en-hu we noticed a somewhat larger quality gap of 4 BLEU points between the teacher and student models.

It’s 24.8 BLEU for the quantized and fine-tuned student vs 30.2 BLEU for the teacher ensemble on the flores-test dataset. For comparison, for en-nl we had 27.0 vs 28.3.

The training looks more or less normal, but a little less smooth than usual.

en-hu: (training curve figure)

en-nl: (training curve figure)

We had a larger and probably higher-quality dataset for en-nl.

We should investigate further whether it’s a pipeline issue, a config issue or a data issue.

eu9ene · 2024-07-16T21:31:44Z

The only idea I have at the moment is to try utilizing more monolingual data that the model hasn't seen. For example, we have a pretty small distillation gap for da-en. The training corpus was 161M sentences before filtering, and we added 139M monolingual sentences from more diverse sources than the regular news-crawl that we use for en-xx pairs:

  # The monolingual data contains:
  #   ~139,436,127 sentences
  mono-src:
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_1.txt.zst  # 65,099,327 sentences
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_2.txt.zst # 16,579,852 sentences
  - url_https://storage.googleapis.com/releng-translations-dev/data/mono-nllb/nllb-mono-da.txt.zst # 57,756,948 sentences

For en-lt we had 76M sentences in the original training corpus and the regular ~200M sentences of English news-crawl:

  # The monolingual data contains:
  #   ~195,823,002 sentences
  mono-src:
  - news-crawl_news.2007  #           ~1,557,522 sentences (176M)
  - news-crawl_news.2008 #           ~5,389,380 sentences (609M)
  - news-crawl_news.2009 #           ~6,557,522 sentences (741M)
  - news-crawl_news.2010 #           ~3,247,787 sentences (367M)
  - news-crawl_news.2011 #           ~6,318,584 sentences (714M)
  - news-crawl_news.2012 #           ~6,407,079 sentences (724M)
  - news-crawl_news.2013 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2014 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2015 #          ~10,619,469 sentences (1.2G)
  - news-crawl_news.2016 #           ~7,982,300 sentences (902M)
  - news-crawl_news.2017 #          ~11,504,424 sentences (1.3G)
  - news-crawl_news.2018 #           ~7,920,353 sentences (895M)
  - news-crawl_news.2019 #          ~17,699,115 sentences (2.0G)
  - news-crawl_news.2020 #          ~22,123,893 sentences (2.5G)
  - news-crawl_news.2021 #          ~21,238,938 sentences (2.4G)
  - news-crawl_news.2022 #          ~23,008,849 sentences (2.6G)
  - news-crawl_news.2023 #          ~23,008,849 sentences (2.6G)
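The totals in the header comments of these configs can be re-derived from the per-dataset comments. The following is a minimal sketch, assuming the comment format shown above (`# 65,099,327 sentences` or `# ~1,557,522 sentences (176M)`); the helper name `total_sentences` is hypothetical and not part of the pipeline.

```python
import re

# Matches trailing comments such as "# 65,099,327 sentences"
# or "# ~1,557,522 sentences (176M)".
COUNT_RE = re.compile(r"#\s*~?([\d,]+)\s+sentences")


def total_sentences(config_lines):
    """Sum the sentence counts found in the comments of list items."""
    total = 0
    for line in config_lines:
        stripped = line.strip()
        if not stripped.startswith("-"):
            continue  # skip the header comments and the "mono-src:" key
        match = COUNT_RE.search(stripped)
        if match:
            total += int(match.group(1).replace(",", ""))
    return total


# The da-en monolingual config lines, copied from the comment above.
da_mono = [
    "# The monolingual data contains:",
    "#   ~139,436,127 sentences",
    "mono-src:",
    "- url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_1.txt.zst  # 65,099,327 sentences",
    "- url_https://storage.googleapis.com/releng-translations-dev/data/mono-hplt/08/hplt_filtered_da_2.txt.zst # 16,579,852 sentences",
    "- url_https://storage.googleapis.com/releng-translations-dev/data/mono-nllb/nllb-mono-da.txt.zst # 57,756,948 sentences",
]

print(f"{total_sentences(da_mono):,}")
```

Running the same helper over the news-crawl list gives a quick way to compare the size of the two monolingual sets.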

So we could mine some mono data for English from HPLT and NLLB. Then we can increase and diversify the mono set for distillation.

See the paper "From Research to Production and Back: Ludicrously Fast Neural Machine Translation":

2.1 Knowledge distillation with noisy backward-forward translation
In our experience, student training benefits from forward-translated data that was not seen during teacher training. Since we do not have access to additional monolingual source data, we generate noisy back-translated sentences (Edunov et al., 2018), one set per inverse teacher model.

@marco-c @gregtatum FYI

eu9ene · 2024-07-19T16:43:05Z

@gregtatum let's prepare NLLB and HPLT data for English in the same way we did for other languages. Then I'll rerun distillation with the more diverse dataset to check the hypothesis. The monolingual shards shouldn't be too big, I'd say ideally not bigger than 50M sentences so that we can add as many as needed to have a good mix.
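As a rough illustration of the sharding constraint, here is a minimal sketch that streams sentences into numbered shards capped at 50M lines each. The function name `write_shards`, the `prefix_N.txt` naming, and the plain-text output are assumptions for illustration; the existing shards are `.txt.zst` files, and this is not the pipeline's actual importer.

```python
from pathlib import Path


def write_shards(sentences, out_dir, prefix, max_sentences=50_000_000):
    """Stream sentences into numbered shards of at most max_sentences lines.

    Returns the list of shard paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    count = 0
    try:
        for sentence in sentences:
            # Open a new shard at the start and whenever the current one is full.
            if out is None or count == max_sentences:
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}_{len(paths) + 1}.txt"
                paths.append(path)
                out = path.open("w", encoding="utf-8")
                count = 0
            out.write(sentence.rstrip("\n") + "\n")
            count += 1
    finally:
        if out is not None:
            out.close()
    return paths
```

Because the input is consumed lazily, the sentences can come from a generator that reads a decompressed stream, so no shard has to fit in memory.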

eu9ene · 2024-07-19T16:45:54Z

Another hypothesis is that our on-the-fly data augmentation affects quality more for worse teacher models. The experiment here would be to disable OpusTrainer augmentations and see how the student performs. It's easier to check, but the fix would be very complex, as we would need to move augmentations from training to preparing the corpus for translation.

gregtatum · 2024-07-29T20:37:43Z

I created 3 separate issues for different lines of investigation.

gregtatum · 2024-07-29T21:58:28Z

I did some light analysis of our recent runs, comparing their distillation gap against the sentence counts.

https://docs.google.com/spreadsheets/d/1l459Ui9J7ccdP6UMd1qDy51L8Uar2aZWbYWxOGcQqXA/edit?gid=1859623642#gid=1859623642

Data Source	Correlation
All monolingual data	0.331
Newscrawl	-0.215
HPLT	0.295
NLLB	0.385
HPLT+NLLB	0.421

The amount of data is pretty low, but it's an early signal that HPLT+NLLB contribute more to better distillation, and that there is a relationship between more monolingual data and a lower distillation gap.

gregtatum · 2024-08-07T19:15:35Z

#790 Here's another idea on applying fluency similar to HPLT to our translations.

gregtatum · 2024-08-20T19:59:52Z

Data Type	COMET Teacher	COMET Student
Parallel	0.782	0.433
Backtranslations	0.095	–
Distillation	–	0.581
Total Data	0.645	0.647

I took another look at correlations on our most recent runs, excluding English monolingual data from the analysis: it was the same dataset in every run, so it wouldn't show any correlation.

My interpretation here is that parallel data is still the best, followed by distillation data, and finally back translation size is only weakly affecting COMET quality. I'd be curious if back translations would correlate higher with fluency scores rather than just translation quality.

Here is the raw data:

COMET Teacher	COMET Student	Gap - Teacher to Student	Parallel	Distillation	Backtranslation	Total Data
0.9013	0.8900	-0.0113	77,673,571	122,958,567	380,595,455	581,227,593
0.8946	0.8700	-0.0246	67,190,349	296,625,390	380,607,008	744,422,747
0.8934	0.8700	-0.0234	86,586,079	178,623,209	380,607,008	645,816,296
0.8817	0.8600	-0.0217	50,790,274	109,841,918	380,607,008	541,239,200
0.8791	0.8600	-0.0191	27,638,186	23,251,451	380,607,008	431,496,645
0.8979	0.8500	-0.0479	59,206,140	384,244,370	380,607,008	824,057,518
0.8765	0.8500	-0.0265	35,879,023	71,024,429	380,607,008	487,510,460
0.8763	0.8500	-0.0263	35,295,006	60,127,368	380,607,008	476,029,382
0.8757	0.8500	-0.0257	42,547,739	267,167,662	380,607,008	690,322,409
0.8759	0.8400	-0.0359	18,641,618	39,043,989	380,607,008	438,292,615
0.8667	0.8400	-0.0267	33,800,821	74,692,280	380,607,008	489,100,109
0.8871	0.8300	-0.0571	104,589,182	361,141,678	380,607,008	846,337,868
0.8705	0.8300	-0.0405	32,804,682	179,648,350	380,607,008	593,060,040
0.8652	0.8200	-0.0452	3,930,889	1,179,106	380,607,008	385,717,003
0.9041	0.8900	-0.0141	76,657,035	380,607,008	20,702,561	477,966,604
0.9054	0.8600	-0.0454	26,926,860	380,607,008	15,259,842	422,793,710
0.8932	0.8600	-0.0332	34,686,229	380,607,008	5,015,152	420,308,389
0.8863	0.8500	-0.0363	18,431,326	380,607,008	11,764,550	410,802,884
0.8971	0.8400	-0.0571	33,801,723	380,607,008	12,300,137	426,708,868
0.9070	0.8400	-0.0670	102,545,869	380,607,008	142,727,850	625,880,727
0.8944	0.8300	-0.0644	32,804,682	380,607,008	3,301,596	416,713,286
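For anyone who wants to re-check the numbers, the sketch below recomputes the teacher-to-student gap and a Pearson correlation between parallel corpus size and COMET from the rows above. It uses only the teacher, student, and parallel columns, and the `pearson` helper is written out by hand; the original analysis was done in a spreadsheet, so the results are not guaranteed to match the summary table exactly.

```python
import math

# (COMET teacher, COMET student, parallel sentences), copied from the raw data.
ROWS = [
    (0.9013, 0.8900, 77_673_571),
    (0.8946, 0.8700, 67_190_349),
    (0.8934, 0.8700, 86_586_079),
    (0.8817, 0.8600, 50_790_274),
    (0.8791, 0.8600, 27_638_186),
    (0.8979, 0.8500, 59_206_140),
    (0.8765, 0.8500, 35_879_023),
    (0.8763, 0.8500, 35_295_006),
    (0.8757, 0.8500, 42_547_739),
    (0.8759, 0.8400, 18_641_618),
    (0.8667, 0.8400, 33_800_821),
    (0.8871, 0.8300, 104_589_182),
    (0.8705, 0.8300, 32_804_682),
    (0.8652, 0.8200, 3_930_889),
    (0.9041, 0.8900, 76_657_035),
    (0.9054, 0.8600, 26_926_860),
    (0.8932, 0.8600, 34_686_229),
    (0.8863, 0.8500, 18_431_326),
    (0.8971, 0.8400, 33_801_723),
    (0.9070, 0.8400, 102_545_869),
    (0.8944, 0.8300, 32_804_682),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


teacher = [row[0] for row in ROWS]
student = [row[1] for row in ROWS]
parallel = [row[2] for row in ROWS]

# Gap is student minus teacher, so a more negative value is a larger gap.
gaps = [round(s - t, 4) for t, s in zip(teacher, student)]

print("parallel vs COMET teacher:", round(pearson(parallel, teacher), 3))
print("parallel vs COMET student:", round(pearson(parallel, student), 3))
print("parallel vs gap:", round(pearson(parallel, gaps), 3))
```

The same `pearson` helper can be applied to the distillation and back-translation columns after dropping the rows where the value is the constant English monolingual set.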

* Cleanup API: Refactor request on-complete transition (#80) * Handle empty translation requests Fixes https://github.com/browsermt/bergamot-translator/issues/101. ResponseBuilder is called with empty histories to trigger a valid but mostly-empty response. * Control validating the config options via a boolean flag (#116) * Control validating the config options via a boolean flag - parseOptions() function now validates the parsed options based on the validate argument * Minor syntactic fix * JS bindings for loading model and shortlist files as bytes (#117) * Bindings to load model and shortlist files as bytes * Modified wasm test page for byte based loading of files * Updates wasm README for byte loading based usage of TranslationModel * Make wasm test page work with bergamot-models repository - bergamot-models now contains lexical shortlist bin files as well * Better error logging for wasm test page * Update to marian-dev master * Full windows support with ssplit from browsermt, not a fork (#109) * Update marian-dev to the newest mac version * Attempt windows workflow * force workflow rerun * Separate id * Attempt 3 at github action * Marian dev submodule now compiles with apple clang * Updated ssplit version to something more recent * Attempt to fix compile on wasm * Do not compile subproject tests * Fix emscripten compilation on Mac * 99% on the way to windows compile * Try with a different generator * Build release not debug * Revert CMakeLists.txt hacks * Fix sse2 compilation failure * MSVC settings for WIN32 * Add nodefaultlib LIBCMT * Do not compile ssplit.cpp as it contains sys/mman.h * Revert ab56b9aa4f4360b0ab98d5806658d4302f31db1d * Update paths * Set the build type to release if not set previously * Attempt to build release with the windows workflow * Attempt 5 at VS studio release build * Attempt 6 at getting release build on MSVC generator * The windows build is debug at the moment... 
* fix ssplit for ubuntu 16.04 * Fix compilation with clang * Compile on ubuntu16.04 * Explain what is going on * Updated ssplit and workflow * Enabled gemm-precision in wasm test page - This increases the inference speed while providing models as bytes to the translation engine (it wasn't needed while providing models as files) * Updated wasm/README file with instructions for byte loading APIs * WASM Bindings collapse (#87) * Safe transfer of bindings through typedefs * Removing Translation* files and bringing in counterparts * Remove previously commented out code * Removing commented out include * Absorb Translation* documentation Co-authored-by: abhi-agg <[email protected]> * Improve script to patch wasm artifacts and load EN->DE vocabulary in wasm test (#125) * Improved script that patches wasm artifacts to enable wormhole - Made the regex pattern ignore multiple whitespaces b/w words of the matching pattern * Fix for loading EN->DE vocabularies in wasm test page - Loading vocabularies for EN->DE was failing because of the new structure of bergamot-models * Improved wasm scripts and README (#128) * Minor README change - Changed "browsermt" to "mozilla" * Updating ci scripts for the latest upstream changes - The upstream browsermt/bergamot-translator builds the wasm artifacts in top level build folder now * Extension desired changes (#129) * Enable worker file system * Avoid node.js-code in emscripten glue-code * Extension desired changes (#129) * Enable worker file system * Avoid node.js-code in emscripten glue-code * Fix busy loop in windows (#131) * Fix busy loop in windows * Nick wants the while loop gone * Fix continue leftover Co-authored-by: Nikolay Bogoychev <[email protected]> * Making bytearray a commandline switch (#127) * Adding bytearray option * collapse intermediate for bytearray apps * Removing service-cli-bytearray * Removing the bergamot bytearray app * Bumping updates to brt collapsing apps * Reasonable defaults and hard check when cmd enabled 
* Update documentation for flags * Bump brt with MKL check and skip * Bumping BRT with MKL_FOUND instead of USE_MKL * Bumping BRT with no mkl enforce * Bumping BRT with ssse3 output * Let's try disabling OpenBLAS * Trying to disable apple accelerate * Using WASM compatible BLAS can enable intgemm * Adding a CMake -L to see what exactly is the diff * Revert "Let's try disabling OpenBLAS" This reverts commit 9a6b9bc53bf7dec956889f6e0b7047e5388e1b7e. * Revert "Using WASM compatible BLAS can enable intgemm" This reverts commit 936a592e18431c279e6c5952a278d012d18ff295. * Restricting mac tests through tags and on GitHub CI * Using only check-bytearray * Bumping BRT with change of default behaviour * Faithful to source-structure translation (#115) * First draft of faithful translation * Comments explaining pre and post * Comments on response_builder * Updating bergamot-translator-tests with new outputs * Cosmetic changes in response target text construction * Replacing &(x[0]) -> x.data() to avoid illegal indices * Removing nullptr given both branches init pointer with legal values * pre, post -> gap(i) addressing review comments Functions which were pre and post before are subsumed by gap(i), and the algorithm in ResponseBuilder adjusted to fix. `x = nullptr` is back, should be harmless. 
* Updating brt with paragraph outputs * Bumping brt with updated outputs, buffer text at begin as well * Bumping BRT with sync after bytearray collapse merge * Pointing BRT to main after merge Co-authored-by: Nikolay Bogoychev <[email protected]> * Enable vocabs pass as byte arrays (#122) * first attempt to enable vocabs pass as byte arrays * pass vocabs bytes as AlignedMemory * add vocabIndices to avoid double loading * small fix on parameter names and documentation * fix windows build plus tiny update on documentation * update marian-dev submodule * move validate model bytearray in BatchTranslator * small refactors on validateBinaryModel() * switch vocab memories to std::vector<marian::Ptr<AlignedMemory>> * update marian-dev submodule * replace marian::Ptr to std::shared_ptr for vocab memories * add note for vocab memories * Update ssplit submodule, removing absl (#132) * Update ssplit submodule, removing absl * Fix ssplit variables * Update ssplit branch * Fix emscripten compilaiton * Update tests * Minor rename: sentence_ranges -> annotation (#134) * Target master of ssplit-cpp * Remove unused used types TokenRanges, SentenceTokenRanges, UPtr (#137) * Change USE_WASM_COMPATIBLE_SOURCE =OFF by default on native, force on for WASM (#138) * Change WASM_COMPATIBLE_SOURCE=OFF by default The default was WASN_COMPATIBLE_SOURCE=ON COMPILE_WASM=OFF which is a testing configuration, not a sensible default for native or wasm. 
* Always USE_WASM_COMPATIBLE_SOURCE with COMPILE_WASM * Set CMP0077 to fix variable handling * Export "addOnPreMain" function from wasm module - This is required in the extension while using wasm module in a worker environment * Enable Debugging information in wasm module builds - Added "-g2" flag furing linking step * JS bindings for vocabularies as bytes * Updated wasm test page to pass vocabulary files as bytes * Refactoring TranslationModelBindings class - typdef AlignedMemory for code readability - Added documentation for one of the binding function * Avoid packaging vocab files into wasm binary in CI builds - We don't need to package vocab files into wasm binary any more as a sync with upstream enabled passing vocabs as bytes * Updated wasm README to update for passing vocabs as bytes - Updated Using JS APIs section to pass vocabs as bytes * Updated README to remove packaging steps for wasm compilation - We don't need to package model, shortlist or vocab files into wasm binary at build time * Updated CMakeLists.txt to remove packaging steps for wasm compilation - Removed PACKAGE_DIR cmake option - Removed Workerfs, FORCE_FILESYSTEM=1 in wasm builds -- File system support is not needed any more (since model, shortlist and vocabs are being passed as bytes now) * Bundle AlignedMemory inputs with MemoryBundle (#147) * Enabling ccache on github builds for Ubuntu (#95) * CI Changes to add tiny regression tests * Adding an inspect cache step * Removing ccache, pursue in another * Incorporating Nick's changes through submodule merge * Submodule now points to master * Restoring ccache enabled workflow file * Restoring ccache enabled CMakeLists * cache -> ccache typo fix * Moving CCACHE setup to GitHub runner file * Find also uses CCACHE dir * Updating CMakeLists not to override env * Cache compiler binary's contents * Changing a few names to trigger new build; Testing cache looks fun * USE_CCACHE=on, -L for inspection * Adding a ccache_cmd, but will only use in next 
commit * Using ccache_cmd * Removing " * Adding compiler hash script * Bunch of absolute paths * GITHUB_WORKSPACE typo * Nah, I'll keep -L and trigger another build * Trying something with compiler hash on cache key backup as well * builtin, bash it seems * Empty commit #1 * Move ccache stats to after compile * Reshuffling ccache vars * No comments * Updates to Github output set syntax * Empty Commit 1 * Empty Commit 2 * Empty commit 3 * /bin/bash -> bash; ccache_cmd for consistency * Adding ccache -s before and after build * Adding comments to compiler-hash script * Let's build cached and non-cached variants together for comparison * Fixing quotes, /bin/bash -> bash * Minor var/env adjustment * Adding ccache -z before the job * Reverting CMakeLists.txt without CCACHE * Switching to CMAKE_LANG_COMPILER_LAUNCHER instead of CMakeLists.txt rule * 5G -> 1G cache size * 1G -> 2G; Hyperparameter tuning * Refactor vocabs in Service (#143) Co-authored-by: Nikolay Bogoychev <[email protected]> * Rewrite annotation class to remove corner cases (#135) * Added cmake file to compute version information - Reads BERGAMOT_VERSION file for generating various strings for versioning * Import GetVersionFromFile cmake file in root level CMakeLists.txt * Modified wasm cmake file to include version information in built artifacts * Generate project version file for native builds - The header file exposes a function that provides version information for native binaries * Bumped version to 0.3.0 - This brings the version info in sync with the various releases of extension * Corrected the version number - To be in sync with versioning in mozilla/bergamot-translator repo * Marian submodule with unified loading (#157) * Collapsing TranslationRequest -> ResponseOptions (#139) * Rewriting batching for threadsafety (#155) This does make the batcher a critical section across job submission and cleaving though. 
If that becomes a problem, we should go back to incoming and outgoing queues with a batcher thread. Also removes blocking mode from native compiles. Note that translateMultiple no longer guarantees great batching. Guess we could lease the mutex from ThreadsafeBatcher and create a session. There is the risk that one sentence comes in at a time and each thread grabs one sentence at a time instead of better batching. Not sure what to do about that other than some sort of Nagle algorithm. Due to non-deterministic batching, even with one thread, the regression tests will go haywire. * Use binary lexical shortlist in documentation (#152) * Use binary lexical shortlist in documentation * MKL/AppleAccelerate note Co-authored-by: Nikolay Bogoychev <[email protected]> Co-authored-by: Jerin Philip <[email protected]> * initialise MemoryBundle members (#167) * Adding clang-format and updating existing sources to adhere (#151) * Adding a first version of clang-format * Adding run-clang-format.py * Adding coding styles to workflow * Fix indentation on coding-styles workflow * run-clang-format.'py' * -style -> --style in python * Updating ColumnLimit: 120 * Format update with clang-format * Revert "Format update with clang-format" This reverts commit 5340b19eae8fcc91a2a79205e0b3dd65ad61ad4c. 
* Apply update after sync * Removing a few empty lines * Removing one more empty line * Removing empty in workflow file * Updating README with coding style instructions * clang-format-* provided in this repository doc update Co-authored-by: Nikolay Bogoychev <[email protected]> * Pin emsdk version to the same one used in Circle CI (#165) * GitHub action to push browsermt/main branch to mozilla/bergamot-translator every hour (#160) * Create push-browsermt-main-to-mozilla-main.yml * Update .github/workflows/push-browsermt-main-to-mozilla-main.yml Co-authored-by: Graeme <[email protected]> * Tweaks * Fix yaml syntax * Parametrized the workflow based on @jerinphilip's example Co-authored-by: Graeme <[email protected]> * Update tests * Bumping BRT for hotfixes (#169) * Bumping BRT for hotfixes * updating brt to point to main * Remove O(N^2) reallocation (#171) * Adding documentation action (#168) Adds a GitHub workflow that builds documentation from sources through doxygen through sphinx on push to the main branch or on push of any semantic version tags. The built documentation is deployed at https://github.com/browsermt/docs@gh-pages, which is rendered at https://browser.mt/docs/<suffix>, where <suffix> is 'main' or a tag vM.m.p corresponding to a semantic version. On pull request artifacts are uploaded for reviewers to inspect if need be. 
* Fix failures when loading text shortlist (#154) * Updating marian dev RelwithDebInfo -> Release (#178) * Updating marian dev RelwithDebInfo -> Release * Updating submodule to point to master * Single executable (#175) * Collapsing executables * Adding new test executable * Deleting old executable sources * Updating brt to operate with modes * cli-framework -> cli * Updating workflows to check for bergamot instead of bergamot-translator-app * Adding documentation * Making fn pure virtual * Shuffling apps into app namespace, alongside class documentation * Include app folder in documentation * BRT update service-cli -> native * parser.h: service-cli -> native * Updates to marian-integration.md * Cleanup: Remove templates, interface proper * change 4 to 2 cores for build instructions * service-cli -> native * Commenting the string constructor explanation * Not doing halfway interface / inheritance * Nick hates state, let's try this one * Revert "Nick hates state, let's try this one" This reverts commit e56db9f474b1906e62af0b06afb7c7d9e08ea9c8. * class -> struct before trying std::function stuff * oop -> functional? * Hints on what is happening * app::ftable -> app::REGISTRY * We have if-else and functions now. And we won't have test apps. * Doc linking to usage examples in brt * Remove unordered_map * Documentation updates * Fix warning * Deploy generated documentation only if browsermt (#179) * Including WASM documentation in sphinx build toc (#176) * Updating marian-dev: intgemm with env variable matmul switches (#187) * Remove addSentenceWithPriority (#186) * Update native (ubuntu, mac) workflows with ccache (#181) * Matrix is now more organized, Ubuntu 20.04-gcc9.3, Ubuntu-18.04-gcc7.5 is added. * ccache is extended to MacOS, and brings down CI run times to <5m when ccache works. * The compiler hash scripts are gone, ccache already covers most ground by default. The shell script is unnecessary. 
Cache works by preprocessor mode output of running the compiler with -E, which includes the necessary information. ccache-docs:How the cache works. * BRT if failed prints the final 20 lines of the test*.log to inspect what's going wrong without having to artifact download. * Pull request on any branch triggers workflow. * Push on main and ci-sandbox triggers workflow. * Replace resize with possible negative range with pop_back() (#189) * Consistent EMSDK version and parallel make jobs in README and github actions - Set EMSDK version to 2.0.9 to make it consistent everywhere in repo - Set parallel make jobs to 2 * CMake fixes: Generate project.h in binary dir, fix GetVersionFromFile for use as submodule. (#193) * Use CMAKE_CURRENT_SOURCE_DIR instead of CMAKE_SOURCE_DIR for project bound version string * marian-dev cmake fix * Generate project.h in binary dir * We don't want people asking about extra spaces * Fixing if syntax with YAML var subsitution (#188) * Generating cmake configured project version (.js) file in build folder (#194) - Earlier this file was being generated in folder containing actual sources - Fixes https://github.com/browsermt/bergamot-translator/issues/161 * Partial test-apps and tolerance in evaluations (#184) * Partial test applications Previously service-cli was used to generate output and accomplish regression testing for all of: (1) translated-text (2) alignment tokens + scores (3) quality scores (4) indirectly annotation and tokenizations. The --mode native now only outputs a faithful to source translated text of the input source on stdin. Test apps are separated into testing only individual functionalities. This can help in independently testing ssplit-cpp, quality-scores for the quality estimation implementation etc. Separating numbers and text have the advantage of being able to compare one with tolerance using BLEU (text) and some allowed error-rates (numbers). 
* Removing #mac tag * Moving test apps to src/tests * Tests are always on for CI Unit tests are turned off looking for WASM_COMPATIBLE_SOURCES. * Fixing WASM_COMPATIBLE_SOURCE -> USE_WASM_COMPATIBLE_SOURCE * Workaround for now; CMakeLists.txt horrors are starting to bite * BRT: use bergamot-test instead of bergamot now * This should fix issues: CMakeLists.txt has so many paths * Casing to camelCase and removing legacyServiceCli * removing leftover service-cli declaration, some doc updates * #pragma once is starting to look easier * All the more reasons to do #pragma once * Updating marian-dev with intgemm::kCPU print, resolved from INTGEMM_CPUID * BRT: Use --gemm-highest-arch instead of python script * Adding intgemm resolve here, where always(?) have intgemm on? * intgemm-resolve in default binary directory * BRT: Update to use intgemm-resolve * marian-dev: Reset to without --gemm-highest-precision Co-authored-by: Kenneth Heafield <[email protected]> * Removing alignments and quality-scores test-code (#196) * Removing alignments and quality-scores test-code * BRT: Update to main * Refactor wasm bindings to use consistent interface names as in native (#195) * Refactored wasm bindings code - Replaced TranslationModel, TranslationRequest and TranslationResult with Service, ResponseOptions and Response - Corresponding documentation changes - Names of the bindings files changed - Moved Vector<Response> definition in Response specific bindings file * Account for EOS in both source and target annotations (#190) * Load sentence-splitter (non-breaking prefixes) from ByteArray Service now allows loading Sentence-Splitter (non-breaking prefix file) from ByteArray. Behaviour is consistent with the rest of the ByteArray loads (model, shortlist), where first the ByteArray is checked if empty, if not fall back to loading from file-path. 
Adds regression test to check if source-sentences in constructed Response match expected behaviour when the non-breaking-prefixes file is provided. Bonus refactoring to remove an extra layer that existed for no reason. * maxLengthBreak_ -> wrapStep bugfix (#200) * Change ResponseBuilder to accept callback instead of future (#142) * Change ResponseBuilder to accept callback Breaks things everywhere, now we follow the compiler to fix and convert the std::future -> callback. * More std::future -> callback * std::future out of service.{h,cpp} * compile is working, so is callback * Some reshuffling of args * Fixing merge error * Fixing signature conflicts out of merge * Fixing that test duct-taping future * Minor adjustment to get that future back * Add documentation for the new callback function * Applying clang-format after update * Using default responseOptions * Remove future references from documentation * translateMultiple only for WASM (#177) * BRT: update to main; fresh-failures hopefully * Converting test translateFromStdin to use callback * BRT: Add fresh #native and #wasm tags * future from promise, fix error * Adding #native to GitHub CI Co-authored-by: Nikolay Bogoychev <[email protected]> * Added public methods in Response class to return sentences - Refactored ByteRange struct and moved it to definition.h * JS bindings to return sentence byte ranges * Wasm: Enabled sentence byte ranges in the wasm test page - Use JS bindings to print all sentences individually on console * Windows workflow: run-vcpkg7.{3->4}; vcpkg master (#208) A cmake change has caused vcpkg to fail without much error message, which is causing windows workflow runs to fail. Details in the following link: * https://github.com/microsoft/vcpkg/issues/18718 To fix, we're going with a version bump in vcpkg. 
Seeing that run-vcpkg also seems to have gotten an update, updating run-vcpkg from 7.3 to 7.4 Playing with fire: vcpkg master commit * Added build instructions to run on other browsers - Disabled compiling with wormhole which is Firefox specific feature * Add a clang-tidy run (#214) Adds a clang-tidy run in addition to the existing clang-format checks. The clang-tidy checks are not enforced, but is potentially useful to point to during review. * Wasm test page using web workers now (#218) * Updated marian submodule to latest commit of master * Wasm builds without SharedArrayBuffer * Circle CI wasm artifacts for non-wormhole builds * BRT: Update sacrebleu to get tests back working (#217) Co-authored-by: Nikolay Bogoychev <[email protected]> * QualityEstimation: Preliminary Implementation (#197) Unifies quality estimation with an interface, refactors previously available quality scores to fit this interface. Adds a new class of model with Logistic Regression powering the predictions as an implementation of said interface. QE now provides annotations on words using subwords to word rule-based algorithms working with space characters. QualityEstimation ----------------- Implementations of QE are bound together by a `QualityEstimator` Interface. 1. The log-probabilities from the machine-translation model re-interpreted as quality scores are crafted as an implementation of QualityEstimator. 2. A Logistic-Regression based model is added. This class of models is trained supervised with scores labeled by a human annotator. Handcrafted features - number of words, log probs from MT model and statistics over the sequence are used to generate the numeric features. LogisticRegressor, Matrix (to hold features) are added. The creation of an instance is switched by the `AlignedMemory` supplied (be it loaded from the file-system or supplied as a parameter). 
An empty AlignedMemory leads to quality scores from NMT while supplying weights of a trained logistic-regression model in binary format as the contents lead to an additional pass through the said model to provide more refined scores. Both the above now transform subwords into "words" using a heuristic algorithm, scanning for spaces. This allows the client to work with "words" to denote quality instead of subwords, as the former is more sensible to the user. Testing ------- 1. BRT now has two new test apps to check the QE outputs in text (covers subword to words) and numbers domain (covers quality scores). These are tested with en-et models for which QualityEstimation is available now, on a new input to avoid architecture/compiler issues. 2. Unit test for LogisticRegression model is added. Docs ---- Doxygen now supports MathJax properly to render explanations for Logistic Regressions' reductions in place to make computation more efficient correctly. Co-authored-by: Felipe C. Dos Santos <[email protected]> Co-authored-by: Jerin Philip <[email protected]> * Multiple TranslationModels Implementation (#210) For outbound translation, we require having multiple models in the inventory at the same time and abstracting the "how-to-translate" using a model out. Reorganization: TranslationModel + Service. The new entity which contains everything required to translate in one direction is `TranslationModel`. The how-to-translate blocking single-threaded mode of operation or async multi-threaded mode of operation is decoupled as `BlockingService` and `AsyncService`. There is a new regression-test using multiple models in conjunction added, also serving as a demonstration for using multiple models in Outbound Translation. WASM: WebAssembly due to the inability to use threads uses `BlockingService. Bindings are provided with a new API to work with a Service, and multiple TranslationModels which the client (JS extension) can inventory and maintain. 
Ownership of a given `TranslationModel` is shared while translations using the model are active in the internal mechanism. Config-Parsing: So far bergamot-translator has been hijacking marian's config-parsing mechanisms. However, in order to support multiple models, it has become impractical to continue this approach and a new config-parsing that is bergamot specific is provisioned for command-line applications constituting tests. The original marian config-parsing tooling is only associated with a subset of `TranslationModel` now. The new config-parsing for the library manages workers and other common options (tentatively). There is a known issue of: Inefficient placing of workspaces, leading to more memory usage than what's necessary. This is to be fixed trickling down from marian-dev in a later pull request. This PR also brings in BRT changes which fix speed-tests that were broken and also fixes some QE outputs which were different due to not using shortlist. * Adapted wasm test page for new Service interface (#224) - The new interface now supports running multiple TranslationModels * Wasm test page UI for translating b/w non-English language pairs (#231) * Updated Wasm test page UI for translating b/w non-English language pairs * Both "from" and "to" language dropdowns now allow non-English languages * Import matrix-multiply from a separate wasm module (#232) * Updated marian-dev submodule * Import wasm gemm from a separate wasm module - The fallback implementation of gemm is currently being imported dynamically for wasm target * Updated CI scripts and README to import GEMM from a separate wasm module * Setting model config to int8shiftAlphaAll in wasm test page * JS bindings for Quality Estimation (#239) * Quality Score bindings complete * Updated wasm test page to test the bindings - Word and sentence scores can be seen in browser console * Cache for translations (#227) Sets a cache to operate for each sentence that a TranslationModel process caching the 
corresponding marian::History for a {TranslationModel::Id, marian::Words} key. Cache is thus shared across multiple TranslationModels bound to the lifetime of a Service. Cache gracefully downgrades in the case of WebAssembly. * Set PR to any branch to trigger workflows (#230) * [ssplit-cpp] Enable position independent library when compiled from sources (#240) * EXCLUDE_FROM_ALL for marian and ssplit-cpp 3rd-party libraries (#243) * Update config "skip-cost" to enable log probabilities for QE scores (#247) - Updated wasm test page * Recover logging (#226) * Deprecate hardAlignment in favour of softAlignment (#250) * Updated marian submodule (#256) * Update ssplit cpp, pcre2 source compile to fix broken builds (#258) * Update ssplit cpp, pcre2 source compile to fix tests * Syncing with browsermt/ssplit-cpp * Removing accidental binary inclusion * Removing brt accidental update by git add -u * Fix windows workflow, vcpkg is broken use our cmake route * [ssplit-cpp] Try searching different library names for Windows * Fixes windows workflow for PCRE2 (#260) * Fix badge to point to this repo instead mozilla's (#261) * Make script run from any directory (#262) * Make script run from any directory * Import optimized gemm implementation (when available) for wasm target (#265) * Enable importing optimized gemm module for wasm - Updated emscripten generated JS code to -- import and use the optimized gemm module when available, otherwise use fallback gemm implementation * Added logging for gemm implementation being used for wasm target * HTML input (#253) Co-authored-by: Jelmer van der Linde <[email protected]> Co-authored-by: Abhishek Aggarwal <[email protected]> * HTML handling improvements (#266) * Fix out-of-bounds error when determining alignment for whole word If token at offset 0 was a continuation (which it always is, since the first word of a sentence does not start with a space) it would jump to (unsigned) -1 which is probably out of bounds. 
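The out-of-bounds case in the last fix above can be illustrated with a small sketch. It assumes, hypothetically, that every token carries a flag saying whether it continues the previous word; `word_start` and `is_continuation` are illustrative names, not functions from bergamot-translator.

```python
def word_start(is_continuation, index):
    """Walk left from `index` to the first token of the word it belongs to.

    `is_continuation[i]` is True when token i does not begin with a space.
    Token 0 is normally flagged as a continuation, because a sentence does
    not start with a space, so the loop has to stop at 0 instead of stepping
    to -1 (which wraps around for an unsigned index in C++).
    """
    while index > 0 and is_continuation[index]:
        index -= 1
    return index


# Tokens could be e.g. "Hel", "lo", " wor", "ld" (assumed example).
flags = [True, True, False, True]
starts = [word_start(flags, i) for i in range(len(flags))]
```

The `index > 0` guard is the whole point: without it, the walk from token 0 would leave the valid range.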
  * Don't segfault if alignment info is not available
    When alignment info is requested, but the model is missing `alignment: soft`, you'd get empty alignment info for every target token.
  * Partial fix for handling empty elements
    This fixes a parse error when dealing with something like `<p>...<br></p>` or `...<br>` where there is no text after the last empty element. This also prevents losing empty elements in the source side of the translation. Empty elements are not yet transferred correctly to the target side.
  * Fix formatting
  * Updated marian-dev submodule

* Updated configuration for html text translation to work in wasm test page (#269)
  * Updated translator configuration in wasm test page
    - Added alignment: soft
  * Set ResponseOptions::alignment to "true"
    - Had to be set for html text translation to work

* More robust logic to import wasm gemm (#276)
  - Import optimized gemm implementation only if all the necessary functions are provided by it, otherwise use the fallback gemm

* Constrain mistune to fix docs CI (#278)

* Additional logs in JS translation worker (#277)
  - Print source text received in the response
  - Print no. of block elements in the input

* Proper arch setting on win32 (#275)
  * Proper arch detection on win32
  * Whoops

* Remove value length limit from HTML parser & interpolated alignments (#274)
  * Remove InterpolateAlignment
    And some code improvements
  * Replace the fixed value buffer with a std::string backing
  * Fix tests that had no alignment info
    These depended on the linear interpolation that I removed
  * Remove arbitrary limits on tag and attribute names
    This might also fix a bug caused by the eager lower casing of tag names, which could break `<![CDATA`, `<style>` and `<script>`
  * Remove equals() in favour of operator==()
    I trust the compiler can come up with better optimisations than I can.
* Expose std::strings instead of their data Should save us some std::strlen() calls * Add & remove headers and no-longer-defined functions from header files * Remove all string buffers from xh_scanner It now directly refers to either the input stream or constant strings * Replace custom string_view with even lighter struct that's only used internally To the outside world we just expose std::string_view * Remove __builtin_sub_overflow for MSVC * ABORT if trying to restore HTML when no alignment info is available * Add test cases specifically for xh_scanner Both good for testing regression, and as a little example/reference for what behaviour to expect from it. * Add --html option to bergamot for tests This should make it easier to have some integration tests for HTML input * Add test and fix for empty inputs failing due to alignment check Co-authored-by: Jerin Philip <[email protected]> * Disabled importing optimized gemm module (#282) - Until the optimized gemm module stops requiring Shared Array Buffer, we can't really use it in Firefox * Adding circle ci job to push the wasm artifacts to github releases (#280) * Adding circle ci job to push the wasm artifacts to github releases. * Updated config.yml * Increase HTML test coverage (#279) * Fix bug in HasAlignments check When fixing it to allow empty sentences, it no longer caught misconfigured models. I've added a test that triggers this scenario, and a fix in HasAlignments for it. * Add more unit tests for xh_scanner Trying to increase that code coverage to 100% * Add test for whitespaces around attributes * Make accessing value(), attr_name() and tag_name() at the wrong time safer * Fix bug in <style> and <script> parsing The end tag was never found * Fix parsing of mix of valueless and quoteless attributes * Sync list of void tags with Firefox' implementation of outerHTML and innerHTML Also lets use their name for it: IsVoidTag instead of IsEmptyElement. Empty was a bit ambiguous. 
* Bring back support for processing instructions support in xh_scanner I noticed in https://searchfox.org/mozilla-central/source/dom/base/nsContentUtils.cpp#8961 that these can be produced by innerHTML under some circumstances. * More permanent link * Use CamelCase for the internal functions I added * Rename *_PI to *_PROCESSING_INSTRUCTION Your IDE will do the typing for you anyway * Match symbol naming of the rest of code base CapitalCase for classes, camelCase for functions, snake_case for variables still. * Missed one 😴 * Change xhscanner's variable case also to camelCase * Partially fix case variables in html.cpp * Better command-line with isolation for both Services and co-located defaults and parsing (#252) * CLI Rework * Consolidate common tests, template specialize CLI * Remove remnant cache stuff * [BRT]: Run BRT with new cli * Formalizing bridge * Removing stuff from parsing and moving to TestSuite * Template includes, everything consolidating at tests * Inlining readFromStdin * Removing unnecessary headers * Checking in template implementation which was missing * Sane defaults, some catches at BRT * BRT: Install fixes * Updating marian-dev to point to main * Removing the enum indirection, using strings at one place, directly * Fix typo; * [BRT] test blocking service via native * Conservative defaults for workers and cache-mutex buckets in AsyncService * Create proper barriers for cmdline app * Build failure fixes * Moving common, common-impl to a familiar structure * Binary reorganization: async, blocking, wasm - async tests AsyncService - blocking tests BlockingService - wasm arranges tests for things that are Mozilla requirements. eg: - bytearray - multiple sentences in same translate request workflow. 
* [brt] updates to adapt to cli rework * [brt] updates to adapt to cli rework, all working * Empty commit, sync brt online and run GitHub CI * Switch for parser to have multiple mode or not * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Fix for --bergamot-mode being removed from CLI app * [brt]: Removing remnant faithful translation test from blocking/ * HTML transfer empty elements (#283) * Fix test case This should now be implemented * Remove FilterEmpty This path wasn't used anymore anyway, empty tags just got their own spans, and never reached the stack. * Insert skipped empty source spans into target HTML Also refactor variable names to better match their contents and be more consistent with each other. This implementation passes all test cases, finally! * Fix remaining style changes * Move HTML formatting to its own section That code had become exact copies in three different places * CI: Circle CI config script update (#287) - Robust artifact presence check - Variable name refactoring - Storing only those artifacts that are required - Remove commit sha from the names of the Github Releases - Use BERGAMOT_VERSION file contents for Git Tag names * GitHub CI: Update YAML to run all tests on marian-full (#292) Previously there were #native tags and #wasm tags separating the two. There is now a clear separation between async, blocking and wasm. 
* HTML basic integration tests (#291) * Fix typo in BRT args on CI runs (#294) * Turn logging off by default, allow turning on via config/cmdline (#295) * Turn logging off by default, allow turning on via config/cmdline * No need to store config in member variable if things are decided at construction time * cache: threadsafety-fixes; optional stats collection (#245) * Make stats hits misses atomic to guard when mutex has multiple buckets * Use compile time switch for cache-stats-collection bound to COMPILE_TESTS cmake variable * -DENABLE_CACHE_STATS on if COMPILE_TESTS otherwise optional * Make stats() call without enabling build fatal abort * Have alignments placed if HTML is on (#296) * HTML transfer script/style/etc elements (#285) * CI guaranteed example documentation (#300) * Convert marian-integration markdown to rst * Convert native run into a script, include in rst * Check with CI that the native running example works without fail * Defer model loading to parallel worker thread (#303) * Treat most HTML elements as word-breaking (#286) * First class pivot translation capability (#236) Translates a text from source-language to target-language through a pivot-language. Effectively runs models in series, while having the following additional benefits compared to when `Service::translate(...)` would be used repeatedly. 1. Consistency in sentences between source and target. Consistent creation of the alignment matrix for use in downstream tasks like tag-translation. 2. Efficient sentence-splitting (does not sentence-split twice, creating inconsistencies). 3. The `Response` generated can be used as if it were coming through `translate(...)`, eliminating any need for additional code for clients in JS or python or C++. `AsyncService::pivot(...)` is provisioned for C++ multi-threaded setting and `BlockingService::pivotMultiple(...)` provisioned for blocking use-case targeted at WebAssembly. 
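To make point 1 of the pivot description more concrete, here is a minimal sketch of one way two soft alignments could be combined into a single source-to-target alignment. It assumes each alignment is a row-stochastic matrix and that the combination is a plain matrix product; the source does not say how bergamot-translator actually does this, and `compose_alignments` is a hypothetical name.

```python
def compose_alignments(pivot_to_target, source_to_pivot):
    """Return C with C[t][s] = sum_p pivot_to_target[t][p] * source_to_pivot[p][s].

    pivot_to_target[t] is a distribution over pivot tokens for target token t;
    source_to_pivot[p] is a distribution over source tokens for pivot token p.
    """
    n_source = len(source_to_pivot[0])
    return [
        [sum(row[p] * source_to_pivot[p][s] for p in range(len(row)))
         for s in range(n_source)]
        for row in pivot_to_target
    ]


def is_distribution(matrix, tol=1e-9):
    """True when every row is non-negative and sums to 1."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) < tol for row in matrix)


# Made-up numbers: 3 pivot tokens x 2 source tokens, 2 target tokens x 3 pivot tokens.
source_to_pivot = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
pivot_to_target = [[0.7, 0.2, 0.1], [0.0, 0.5, 0.5]]
combined = compose_alignments(pivot_to_target, source_to_pivot)
```

Under these assumptions each row of `combined` is again a probability distribution over source tokens given a target token, which is the kind of property a regression test can assert after pivoting.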
# [BRT]: Test additions, accompanying fixes

For `AsyncService`, a test-case involving en->es and es->en is added (same vocabulary; another pair might give more coverage but is too much work).

1. Asserts the Alignment generated after pivoting is a probability distribution over source tokens given target.
2. Outputs the sentences going from en->en, which should stay consistent over continuous development to ensure nothing breaks.
3. An accuracy minimum of 70% of token matches from source to target calibrated on the standard bergamot input text is additionally present, ensuring that the English tokens at start and end match exactly.

# HTML Pipeline

This PR reworks the HTML translation pipeline to be outside response-construction via callbacks.

* Accept XHTML-style self-closing void tags (#305)

Allow the self-closing `/>` end for void tags. For non-void tags these were already "allowed" due to how the HTML parser works, but for elements where they actually occur, like `<br/>`, they caused a parse error. Support for them was not implemented since we only expect valid HTML5, e.g. the output of Firefox' Element.innerHTML.

Use case: TranslateLocally uses Qt's HTML representation of rich text. That HTML uses self-closing tags like `<meta .../>` and `<br/>`. Implementing a string replace operation that would only match these elements without parsing HTML is tricky. Fixing it in bergamot-translator is not.

Implementation: Currently `<img>` is marked as a void tag (an element which cannot have children or text) and is therefore treated differently. Since void tags normally have no close tag, they are treated as immediately closed. The HTML parser we use reads `<img/>` as `<img></img>`, which causes a problem since we now close an element that was never open to begin with. This fix ignores the `TT_TAG_END` token from the parser when the tag name is that of a void tag.
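The void-tag fix described for #305 can be sketched as follows. This is a simplified model, not the actual parser code: tokens are hypothetical `(kind, name)` tuples, `"TAG_END"` stands in for `TT_TAG_END`, and `VOID_TAGS` is only an illustrative subset.

```python
VOID_TAGS = {"br", "img", "meta", "hr", "input", "wbr"}  # illustrative subset


def open_elements_after(tokens):
    """Replay parser tokens and return the stack of elements still open."""
    stack = []
    for kind, name in tokens:
        if kind == "TAG_START":
            # Void tags are treated as immediately closed, so they are never pushed.
            if name not in VOID_TAGS:
                stack.append(name)
        elif kind == "TAG_END":
            if name in VOID_TAGS:
                continue  # the fix: ignore the end token produced for `<br/>`
            if not stack or stack[-1] != name:
                raise ValueError(f"closing <{name}> that was never opened")
            stack.pop()
    return stack


# `<p>a<br/>b</p>` as a parser that reads `<br/>` as `<br></br>` might emit it.
tokens = [("TAG_START", "p"), ("TEXT", "a"), ("TAG_START", "br"),
          ("TAG_END", "br"), ("TEXT", "b"), ("TAG_END", "p")]
remaining = open_elements_after(tokens)
```

Without the `continue` branch, the `("TAG_END", "br")` token would try to close an element that was never pushed, which is the parse error the commit describes.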
* Streamline memory-bundle loads (#307)

Provides an additional constructor which takes care of the bundle loading inside the boundary of the source here, when a configuration file is supplied from a client like translateLocally or python bindings. Once the config file is read, we have access to the information required to construct the MemoryBundle.

- The command-line application supplied from here, app/bergamot, is configured to use the fast-load path now.
- Changes to binary-loading additionally revealed a bug in the example-run script used in docs and tied to CI, and the fix is included.
- Shortlist is made optional in the memory bundle, making changes to getModelMemoryFromConfig.

Fixes #304. Fixes #306.

See also: XapaJIaMnu/translateLocally#82.

* Add API to trigger fast shutdown of AsyncService (#297)

Add a way to AsyncService to shut down without finishing the full queue through `AsyncService::clear()`.

The default behaviour is that `AsyncService::~AsyncService()` will wait for any pending translation requests to finish. One can call `AsyncService::clear()` before the destructor runs to ensure there is no work for the service to finish before the workers can stop and join.

Marian batches that are already in progress will not stop. We are not trying to cause interrupts in threads or something that complex. However, these single batches often do not take that long to complete.

Changes:

- Add clear() to AsyncService
- Add clear() to BatchingPool
- Documentation

See also: XapaJIaMnu/translateLocally#80

* Speed up Windows CI with ccache (#308)

Use https://github.com/cristianadam/ccache/releases/ to speed up windows compilation. Remove /Zi as it is unsupported by ccache at the moment. This is a debug flag that was removed in upstream marian-dev https://github.com/browsermt/marian-dev/pull/43. However, the bergamot CMakeLists.txt which was originally taken from marian maintained this under MSVC.
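The fast-shutdown behaviour described for #297 above (drop what is queued, let the batch in progress finish) can be modelled with a toy service. This is a sketch in Python rather than the C++ `AsyncService`; `TinyAsyncService`, `submit` and `shutdown` are made-up names, and only `clear()` mirrors a name from the commit.

```python
import queue
import threading


class TinyAsyncService:
    """Toy stand-in: one worker thread draining a request queue."""

    def __init__(self):
        self._pending = queue.Queue()
        self.completed = []
        self._worker = threading.Thread(target=self._run)
        self._worker.start()

    def _run(self):
        while True:
            job = self._pending.get()
            if job is None:  # sentinel: stop the worker
                return
            name, work = job
            work()  # a batch already in progress is never interrupted
            self.completed.append(name)

    def submit(self, name, work=lambda: None):
        self._pending.put((name, work))

    def clear(self):
        """Drop every request that has not started yet; return how many."""
        dropped = 0
        while True:
            try:
                self._pending.get_nowait()
                dropped += 1
            except queue.Empty:
                return dropped

    def shutdown(self):
        """Like the destructor: finish whatever is still queued, then join."""
        self._pending.put(None)
        self._worker.join()


started, release = threading.Event(), threading.Event()


def slow_batch():
    started.set()
    release.wait()


service = TinyAsyncService()
service.submit("A", slow_batch)
service.submit("B")
service.submit("C")
started.wait()  # A is in progress; B and C are still queued
dropped = service.clear()
release.set()
service.shutdown()
```

Calling `clear()` before `shutdown()` means the join only waits for the one batch that had already started, which matches the trade-off the commit describes.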
* Remove unused compiler hash script (#309)

* Batteries included python package (#310)

Imports python bindings and associated sources incubated in https://github.com/jerinphilip/lemonade to bergamot-translator. Adds a pybind11 dependency for python bindings. Following the import, the python build is integrated into the existing CMake based build system here.

There is a command-line application provided through python which provides the ability to fetch and prepare models from model-repositories (like browsermt/students or OPUS).

Wheels built for a few common operating systems are provided via GitHub releases through automated actions configured to run at tagged semantic versions and pushes to main.

The documentation for python is also integrated into our existing documentation setup. The previous documentation GitHub action is now configured to run behind python builds in Ubuntu 18.04 Python3.7, in order to pick up the bergamot module packaged as a wheel and build the sphinx documentation using the python module.

Formatting checks of black, isort with profile black and a pytype type checker are configured for the python component residing in this repository.

* BRT: Update to fix QE download failures (#321)

* Fix HTML with pivoting (#323)

Previously BlockingService pivoting missed preproc and postproc for HTML, leading to issues in the WebAssembly API. This change adds fixes for the same, along with test coverage for the functionality over both async and blocking services.

* Remove obsolete workflow transferring source across forks (#326)

* Wasm/JS: Pivot translation API JS binding and test page update (#327)

* emscripten: ccache and artefact upload (#325)

Enables ccache for emscripten. The configuration uses pyodide for a reference (https://github.com/pyodide/pyodide/pull/1805).

Two workflows to run on macOS and Ubuntu, reduced to one on Ubuntu. As emscripten and the target are cross-platform, and macOS runners are limited, it makes sense to have this removed.
Upload artefact enabled in preparation for a release action to be scheduled which will upload the bergamot*.wasm and bergamot*.js for consumption. * Consolidate release artefacts (#329) Brings in the previously wasm.yml into python.yml and the new file is renamed as build.yml. python.yml already had a version and pre-release jobs. These jobs downloaded artefacts from prior ran jobs (python wheel builds). The newly attached emscripten build now uploads artefacts of a WebAssembly binary and javascript file which are fed into the release and pre-release jobs in addition to the existing python builds. * Increment version to v0.4.0 (#328) * Make default throw exception on abort for python (#333) This also allows conversion of exiting aborts into runtime errors in python, providing informative messages to the user via pybind11 existing tooling. * Revert "Make default throw exception on abort for python (#333)" This reverts commit 97bd6e36dbdec3519133d91289d7fd31816cb09a. As discussed, we need messages for debugging in -fno-exceptions. * Revert "Revert "Make default throw exception on abort for python (#333)"" This reverts commit 62ff781ed4ea642912878145beaf3157123520fe. Sorry I should have realized Jerin was only amending python and therefore this didn't break WASM. Apologies to Jerin on this. * JS/WASM: Re-enable importing optimized gemm module for (#336) - Re-enabled the code that imports optimized gemm module for wasm when available * Print errors by default in WASM build (#343) * Remove BadHTML exception in favour of ABORT macro `ABORT()` gives us readable error messages, even when exception support is disabled. 
* Control marian exception global setting in tests through fixture * WASM: construct BlockingService with critical logging by default This log level is only used by ABORT() See also: - mozilla/firefox-translations#65, - mozilla/firefox-translations#68 - mozilla/firefox-translations#70 - mozilla/firefox-translations#56 * Add ability to load `.npz` models (#342) Changes `ABORT` on non `.bin` model to an additional check for a `.npz` extension. If `.bin`, the fast load path is activated by returning `AlignedMemory`. Otherwise, the return of empty `AlignedMemory` causes fallback to filesystem-based loads. BRT: A test that checks if translation using `.npz` is approximately similar to that of default CLI translation is checked in to ensure stability going ahead. Previously, we only supported `.bin` models' loading via a fast mmap path. While we had the underlying capability to load non `.bin` models, this was not exposed, encouraging fast loads. Loading `.npz` models are helpful for quick debugging and broader coverage of models available, which will enhance user experience at translateLocally and python bindings. Fixes #341. See also: XapaJIaMnu/translateLocally#89 * Allow per-input options (#346) Changes signature of BlockingService::{translate,pivot}Multiple functions to take per input options, so a mix of HTML and plaintext can be sent from the extension. Templating over testing is adjusted to allow for continuous evaluations by modifying the test code. Updates WebAssembly bindings to reflect the change in signature and the javascript test-page to work with the new bindings. This change lacks an accompanying test specific to the mixed HTML and plaintext inputs. 
Fixes: #345 See also: mozilla/firefox-translations#94 Co-authored-by: Jelmer van der Linde <[email protected]> * JS/WASM: Passing ResponseOptions for every item for translation batch api (#348) - Now translate() JS API accepts ResponseOptions per batch item - Fixed the logic to create vector<ResponseOption> * Update aligned vector following intgemm 1b8cbd6f611c21011325cfe0312940f0635dea33 (#334) Fixes memory leak ifdef for -fno-exceptions including clang-cl Move spacing back to intgemm upstream Co-authored-by: Jerin Philip <[email protected]> * Improve cache (#347) Hide `cache-mutex-buckets` from the user. Now configured to be equal to number of workers. Python bindings which had exposed these are modified to reflect the API change. `std::optional` enabled on cache, constructed only if enabled. Pointers used are replaced with an equivalent `std::optional.` Fixes: #317 * JS: Refactoring wasm test page (#354) * Free all the objects properly that were constructed for translation api * Refactored pivot detection mechanism * Create github release via CircleCI only for mozilla fork (#349) * Create github release via circleci only for mozilla fork - The extension uses mozilla fork for translator artifacts -- Hence create github release via circleci only when running in mozilla fork * Small refactoring in ci script * Bump version to 0.4.1 (#356) * Improve handling HTML special cases (#312) - Prefer spreading markup over a full word. - Ignore certain tags that are unlikely to be supposed to be translated, such as `<code>` and `<samp>`. - Never treat `<wbr>` as a space. - Allow for inconsistent cases in tag names. - Fix bug where void elements were inserted multiple times. - Better handling of whitespace around punctuation. - Ignore parsing `<noscript>` to be compatible with Firefox. - Improvements to documentation and readability of `HTML` and `Scanner` classes. 
Fixes: #313, #339 * Simplify cache config and bind for use in JS (#359) Deprecates cacheEnabled parameter to be replaced with cacheSize=0. Python bindings, Documentation in comments and tests updated to reflect this change. Exposes the fields corresponding to cache via embind as a value object. The equivalent object-based syntax in worker.js allows propagation from JS. Fixes: #351 See also: mozilla/firefox-translations#96 * Embed quality-scores as HTML tag attributes (#358) Quality scores for HTML translation exposed as <font x-bergamot-sentence-score=""> and <font x-bergamot-word-score=""> tags in the HTML output. While this increases the size of the HTML returned, the resulting rendered HTML can easily be styled to show the scores. With Javascript or CSS, developers can easily have some interface based on these extra attributes. Also includes updates to the test page to show a proof-of-concept demonstration. Fixes: #355 * Enable dependabot to automate updating dependencies (#365) Following marian-nmt/marian-dev. * Use right range and threshold for showing "bad" words/sentences (#370) * Use ln(0.5) as the threshold * Use right range for showing "bad" words/sentences * Bump version to 0.4.2 (#371) * Bump 3rd_party/marian-dev from `08b1544` to `7e67124` (#372) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `08b1544` to `7e67124`. - [Commits](https://github.com/browsermt/marian-dev/compare/08b1544636fe13eaf1fbacb17c6fb050abfb8d42...7e67124ae0bc11b42f2e6373489831c9a2498499) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * JS: Reuse Model registry from firefox-translation-models for test page (#377) * JS: Reuse Model registry from firefox-translation-models repo for test page - https://github.com/mozilla/firefox-translations-models/blob/main/registry.json is reused - Removed existing registry * JS: Using supervised QE models for available language pairs (#378) * JS: Refactored model loading - Passing single vocab memory via JS * JS: Use supervised QE models when available * Ran clang format * Bump 3rd_party/marian-dev from `7e67124` to `844800e` (#382) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `7e67124` to `844800e`. - [Release notes](https://github.com/browsermt/marian-dev/releases) - [Commits](https://github.com/browsermt/marian-dev/compare/7e67124ae0bc11b42f2e6373489831c9a2498499...844800efccba6e670250caac1735ca2c8c8e508e) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * JS: Update languages & use Intl API for their display names (#379) Got the languages from registry.json, including non-prod models. Code now calls into `Intl.DisplayNames()`[1] to make life easier. [1] (http://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/DisplayNames/DisplayNames) * JS: Fix swap button on test-page (#388) * Docs: Pin Jinja2 to last known working version (#389) Fixes the docs workflow which is failing after pip is picking up Jinja 3.20. We only need >=2.3, this one sets it to 3.0.3 builds were successful last. * Bump version to 0.4.3 (#392) * Bump bergamot-translator-tests from `d03a9d3` to `7984d14` (#394) Bumps [bergamot-translator-tests](https://github.com/browsermt/bergamot-translator-tests) from `d03a9d3` to `7984d14`. 
- [Release notes](https://github.com/browsermt/bergamot-translator-tests/releases) - [Commits](https://github.com/browsermt/bergamot-translator-tests/compare/d03a9d316d40ba45c475018287971523666bf51e...7984d140aef00489699d0b7711fa942816224294) --- updated-dependencies: - dependency-name: bergamot-translator-tests dependency-type: direct:production ... * Fix call to `isspace` (#396) Documentation is explicit about only calling it with unsigned char, and Windows runtime is checking this. * Bump 3rd_party/ssplit-cpp from `a08d6bc` to `49fde6d` (#408) Bumps [3rd_party/ssplit-cpp](https://github.com/browsermt/ssplit-cpp) from `a08d6bc` to `49fde6d`. - [Release notes](https://github.com/browsermt/ssplit-cpp/releases) - [Commits](https://github.com/browsermt/ssplit-cpp/compare/a08d6bce20619a8475736832d5418458c14db9d4...49fde6df7ee9199aedb9571be800448192e3515c) --- updated-dependencies: - dependency-name: 3rd_party/ssplit-cpp dependency-type: direct:production ... * Update and fix windows CI (#410) * Use a more vanilla windows workflow from translateLocally, remove the complicated lukka/*. Also removes port overrides in the overall upgrade. * Disable vcpkg binary caching * Remove PCRE library hacks after upstream ssplit improvements * Upgrade emsdk to 3.1.8 (#414) * Rework WASM compilation options Necessary to work with newer versions of emscripten that are more picky about which option goes to the compiler, and which to the linker. Also took the opportunity to remove the need for the patching of the bergamot-translation-worker.js file, this can now easily be done through supported apis. Furthermore, I tried to downsize the generated javascript and wasm code a bit. Initial estimates show that bergamot-translator compiled with emscripten 3.0.0 runs at about 3x the speed of 2.0.9 (when using embedded intgemm). Speed-up when using mozIntGemm is less dramatic. 
* Updated marian-dev submodule * Revert changes specific to patching external gemm modules for wasm * Better Compilation and Link flags - Added "-O3" optimization flag for linking as well - "-g2" only for release and debug builds - "-g1" for release builds - Replaced deprecated "--bind" flag with "-lembind" - Removed redundant link flag * Upgraded emsdk to 3.1.8 * Enclosed EXPORTED_FUNCTIONS values in a list * Fixed the remaining 2.0.9 reference in circle ci build script * Updated README Co-authored-by: Jelmer van der Linde <[email protected]> * Bump version to 0.4.4 (#415) * Bump 3rd_party/marian-dev from `199201e` to `e88c1aa` (#416) Bumps [3rd_party/marian-dev](https://github.com/browsermt/marian-dev) from `199201e` to `e88c1aa`. - [Release notes](https://github.com/browsermt/marian-dev/releases) - [Commits](https://github.com/browsermt/marian-dev/compare/199201eb89b2941afdadb14164e936d412f897ad...e88c1aa5d5c5622cb52c7df09fbb7c3d7f4b5b5a) --- updated-dependencies: - dependency-name: 3rd_party/marian-dev dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Set up python packaging for pypi distribution (#424) Old GitHub CI using Ubuntu and MacOS explicitly and building wheels have been removed in favour of the more portable pypa specified builds. These wheels should work just as well across a wider range of distributions. pybind11:CMakeLists.txt requires Development.Module instead of Development.* to avoid Embed from getting in the way of manylinux builds. manylinux_x86_64 builds are added for cp3.6 - 3.10. The linux build uses an old image via docker. Since the docker images are able to use shared ccache folder, builds quite fast on warm starts. ccache usage in setup.py is now triggered by an environment variable. This allows for builds not to fail if ccache not present. 
On tag pushes corresponding to versions, CI is configured to deliver built wheels to PyPI, reading from repository secrets.

Improves setup.py including documentation and some formatting, and additional links to source.

Fixes: #315

* Basic HTML property testing for WebAssembly (#425)

Import https://gist.github.com/jelmervdl/a4c8b6b92ad88a885e1cbd51c6ad4902 and attach it to CI.

NodeJS-14 is failing on trying to use the WebAssembly binary, so we use a node-16 that is set up independently.

This paves the way for more complicated testing for WebAssembly bindings in the future.

* Bump version to 0.4.5 (#427)

* Python package: pyyaml >= 5.1 (#429)

Fixes an issue on Colab, where the vanilla YAML install (3.x) does not have yaml.FullLoader (https://stackoverflow.com/a/55553392/4565794).

Fix a broken link for presentation in PyPI.

* Python: Work offline if models are available (#431)

Check first whether models.json is already downloaded; if it is, use it. If not, fall back to attempting to fetch it from the network.

Fixes: #430

* MacOS Wheels (#432)
  * Remove trailing whitespace
  * Additional MacOS wheels: Wheels for python 3.6 to 3.10 with a minimum target of MacOS 10.9
  * Install bergamot package from wheel directory
  * Remove no-index as we need dependencies
  * update download path
  * try to update coding_styles workflow
  * Latest and greatest clang-format

* Bump qs and express in /wasm/test_page (#444)

Bumps [qs](https://github.com/ljharb/qs) to 6.11.0 and updates ancestor dependency [express](https://github.com/expressjs/express). These dependencies need to be updated together.
Updates `qs` from 6.7.0 to 6.11.0 - [Release notes](https://github.com/ljharb/qs/releases) - [Changelog](https://github.com/ljharb/qs/blob/main/CHANGELOG.md) - [Commits](https://github.com/ljharb/qs/compare/v6.7.0...v6.11.0) Updates `express` from 4.17.1 to 4.18.2 - [Release notes](https://github.com/expressjs/express/releases) - [Changelog](https://github.com/expressjs/express/blob/master/History.md) - [Commits](https://github.com/expressjs/express/compare/4.17.1...4.18.2) --- updated-dependencies: - dependency-name: qs dependency-type: indirect - dependency-name: express dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Arm updated (#443) * ARM Support using ruy and simd_utils * Adding ARM build on GitHub CI * Add workflow and successful build ssplit-cpp modified to get cross compiled android on GitHub CI working. * Client side fixes for int8 no shift on ARM [python] * Revert "Client side fixes for int8 no shift on ARM [python]" This reverts commit 020af05a8b1f4b4ef46373e6e61dcd32869fc1b1. * moving int8shift no-op inside the library * Bump 3rd-party/marian-dev * update the marian branch test * arm backend works * Latest and greatest clang-format Co-authored-by: Jerin Philip <[email protected]> * Apply security update and formatting * Expand the node-test.js example code with documentation (#434) * Expand the node-test.js example code with documentation Is there a better way to document code than by providing an annotated & working example of it? Just listing all the exposed methods feels like giving people a box of bricks and expecting them to build a house with it. 
* Use @Jerin's feedback to simplify node-test.js explanations * Use native `console.assert` instead See #426 for an explanation * Fix comment Co-authored-by: Nikolay Bogoychev <[email protected]> * More portable WASM demo (#437) * Replace most of the wasm demo page with code from the firefox extension This code should be more generic and copy/pastable into other projects. Maybe one day it will be an npm package? * Fix Ukrainian model support * Add quality estimation output Automatically enabled when the model(s) support it * Little "Translating…" indicator * Don't make Safari fail on something tiny * Rewire lots of async state to be able to predictably know when the translator is working or not Previously so much was lazy loaded that it was not easy to catch lack of SIMD support. Now I can just enable the interface only after it has properly loaded. * No need for a two-stage setup for the worker. Just promise to call `initialize()`! * More (correct) types and comments for code * Keyboard shortcuts for input area for bold, italic and underline. Enough to demo mark-up translation * Fix `delete()` * Move javascript glue code into its own npm package * Add nodejs support and test to package * More stand-alone build command …for now, not really used by anything I think * Ignore build packages * Use local filesystem for build so it is automatically cached * fix overflow on demo page But this might break the mobile demo? I'll have to check into that * Bring back integrity check, except for NodeJS for now * Make `build` part of `prepare` so we always make sure we build a complete package * Move worker code into its own folder This way I can mark it as a commonjs module which will help cause nodejs treat the files the same as WebWorkers do right now. Firefox doesn't implement `{type: 'module'}` yet for WebWorkers. 
* Add README * Fix paths * Add npm publish automation * Make sure webpack ignores node compatibility code * Add missing webpack:ignore around a worker * Default to getting models from S3 * Separate "loading" and "translating" indicators * Bump npm package version * Add credits * Don't block on the worker loading * Not just Mozilla, but Bergamot! * Make individual translation requests cancelable * Swap button turns vertically when in skyscraper mode * Make it easier to debug errors from inside the worker * Don't bork on deleting a failed worker * Don't bork on calling translate() with a failed worker * Handle compilation error with more grace * `contenteditable=true` seems to work better with some browser extensions Looking at you, Vimium! * Clean up abort promise * Bump npm package version * Remove `workerUrl` option in favour of better webpack support With that option it was hard for Webpack to figure out dependencies, and it did not enter my worker script for rewriting. With the hardcoded url it does, and with a bit of `new webpack.DefinePlugin({'typeof self': JSON.stringify('object')}),` we can have webpack remove node-specific code on build! * Bump version Minor API change hehe Co-authored-by: Nikolay Bogoychev <[email protected]> * Fix comp…

eu9ene · 2024-10-07T21:43:04Z

I have a hypothesis that it's just harder to train Balto-Slavic languages.

Those comprise the Baltic:

Latvian
Lithuanian

and the Slavic ones:

Czech
Slovak
Slovenian
Ukrainian
Russian
Polish
Serbo-Croatian (Serbian, Croatian, Bosnian, Montenegrin)
Bulgarian

This list almost matches the languages we had problems with in the big training.
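
One way to check this hypothesis against the evals would be to group the teacher-student BLEU gaps by language family and compare the averages. A minimal sketch, assuming ISO 639-1 language codes (Montenegrin is left out because it has no two-letter code) and a hypothetical helper `gap_by_family`. The only scores filled in are the en-hu and en-nl figures quoted at the top of this issue, neither of which is Balto-Slavic, so the real per-pair numbers would still need to be added:

```python
from collections import defaultdict
from statistics import mean

# Groups follow the lists above; the two-letter codes are an assumption.
BALTIC = {"lv", "lt"}
SLAVIC = {"cs", "sk", "sl", "uk", "ru", "pl", "sr", "hr", "bs", "bg"}


def family(lang: str) -> str:
    """Return a coarse family label for a language code."""
    if lang in BALTIC or lang in SLAVIC:
        return "balto-slavic"
    return "other"


def gap_by_family(scores: dict) -> dict:
    """Map {pair: (teacher_bleu, student_bleu)} to {family: mean BLEU gap}."""
    gaps = defaultdict(list)
    for pair, (teacher, student) in scores.items():
        src, trg = pair.split("-")
        # Classify each pair by its non-English side.
        non_en = trg if src == "en" else src
        gaps[family(non_en)].append(teacher - student)
    return {fam: round(mean(vals), 2) for fam, vals in gaps.items()}


# Only the two pairs with numbers quoted in this issue; extend with real evals.
scores = {"en-hu": (30.2, 24.8), "en-nl": (28.3, 27.0)}
print(gap_by_family(scores))
```

If the Balto-Slavic average stays clearly above the rest once all pairs are in, that would support the hypothesis. Corpus size would need to be controlled for as well, since the earlier comments point to data volume as a competing explanation.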

eu9ene added the quality Improving robustness and translation quality label Oct 24, 2023

eu9ene self-assigned this Oct 24, 2023

eu9ene mentioned this issue Oct 31, 2023

[meta] General translation quality improvements #216

Open

eu9ene mentioned this issue Nov 20, 2023

Distillation is broken #272

Closed

eu9ene mentioned this issue Jul 19, 2024

English to Lithuanian did not meet our quality bar #756

Open

This was referenced Jul 29, 2024

Reduce monolingual data for da-en to investigate distillation performance #771

Closed

Investigate improving en-lt student distillation by adding more data #772

Closed

Figure out the behavior of OpusTrainer augmentation on student distillation gap #773

Closed

eu9ene assigned gregtatum Aug 28, 2024

eu9ene removed their assignment Sep 10, 2024

eu9ene mentioned this issue Sep 12, 2024

Consider rebalancing datasets with clustering #844

Open
