Parameters set using createWorker config argument overwritten by default arguments #975

Open
Balearica opened this issue Nov 25, 2024 · 2 comments

Comments

@Balearica
Member

The createWorker config argument allows for setting parameters prior to initialization. While this argument was originally added to support a handful of init-only parameters (notably load_system_dawg, load_number_dawg, and load_punc_dawg), it should be able to support all parameters, and nothing in the documentation indicates that it only supports specific parameters.

However, at present, any settings provided in this config argument that conflict with the default parameters defined in defaultParams.js are overwritten by the defaults. It looks like this only impacts tessedit_pageseg_mode and tessedit_char_whitelist, as these are the only Tesseract parameters in the defaults file.

params = defaultParams;
await setParameters({ payload: { params } });
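The overwrite can be illustrated with a minimal sketch that does not use tesseract.js itself. The `engineParams` object and this synchronous `setParameters` are hypothetical stand-ins for the worker's parameter store, and the parameter values are illustrative assumptions rather than values copied from defaultParams.js.

```javascript
// Hypothetical stand-in for the engine's parameter store (not tesseract.js code).
const engineParams = {};
const setParameters = (params) => Object.assign(engineParams, params);

// Illustrative defaults; the values here are assumptions.
const defaultParams = {
  tessedit_pageseg_mode: '6',
  tessedit_char_whitelist: '',
};

// Init-time config supplied by the user.
const config = {
  tessedit_pageseg_mode: '11',
  load_system_dawg: '0',
};

setParameters(config);        // applied during initialization
setParameters(defaultParams); // applied afterwards, as in the snippet above

// The user's tessedit_pageseg_mode has been replaced by the default,
// while load_system_dawg survives because it has no entry in the defaults.
console.log(engineParams);
```

Only keys present in both objects are affected, which matches the observation that the problem is limited to tessedit_pageseg_mode and tessedit_char_whitelist.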

I will investigate the commit history before making a change; however, I currently believe the code that sets the default Tesseract parameters can be cut entirely. Both values we are setting are already the defaults for the Tesseract API, so it's unclear why we are setting them manually.

@Balearica
Member Author

Upon a brief review, it looks like setting the default parameters here may have served a couple of different purposes in the past.

  1. At specific points in this repo's history, our default arguments have been different from those of Tesseract.
    1. E.g. this version of the file sets user_defined_dpi to 300, which is not the default behavior.
  2. A previous version of the repo combined the defaults with user-defined parameters, which makes much more sense than what happens now.
    1. /**
       * handleParams
       *
       * @name handleParams
       * @function hanlde params from users
       * @access private
       * @param {string} langs - lang string for Init()
       * @param {object} customParams - an object of params
       */
      const handleParams = (langs, customParams) => {
        const {
          tessedit_ocr_engine_mode,
          ...params
        } = {
          ...defaultParams,
          ...customParams,
        };
        api.Init(null, getLangsStr(langs), tessedit_ocr_engine_mode);
        Object.keys(params).forEach((key) => {
          api.SetVariable(key, params[key]);
        });
      };
    2. I don't think it's necessary to implement this, however, as I do not believe our defaults are any different from the Tesseract defaults.
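For reference, the merge semantics of that older code can be shown in isolation. This is only a sketch: `mergeParams` is a hypothetical helper name, and the default values are illustrative assumptions.

```javascript
// Illustrative defaults; the values here are assumptions.
const defaultParams = {
  tessedit_pageseg_mode: '6',
  tessedit_char_whitelist: '',
};

// Later spreads win, so user-supplied values take precedence over defaults.
const mergeParams = (defaults, customParams = {}) => ({
  ...defaults,
  ...customParams,
});

const merged = mergeParams(defaultParams, { tessedit_pageseg_mode: '11' });

// merged keeps the user's page segmentation mode and falls back to the
// default whitelist, which the user did not set.
console.log(merged);
```

With this ordering, a default only applies when the user has not set the same key, which is the opposite of the current behavior.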

I am now fairly confident that this can be cut without consequence, so I will do so.

@Balearica
Member Author

If we cut the settings discussed above, the only things left in the defaultParams.js file are the tessjs_create_hocr/tessjs_create_tsv/etc. settings, which were deprecated multiple major releases ago. Therefore, we should be able to cut that entire file. The only thing to confirm is that the default output formats stay the same before and after the change, as otherwise this would be a breaking change.
