Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement a new thread-safe interface. #1

Merged
merged 7 commits into from
Jun 22, 2016

Conversation

markbrown
Copy link
Contributor

The new API is as discussed on the libthai forum.

I've changed the names from what was first suggested, however, because I found the suggested names very confusing (th_brk_brk, th_brk_wbrk, etc).

I've also exported functions to access the shared dictionary used by the backwards compatible part of the implementation. These can be used to force loading as th_ensure_dict_loaded() did in an earlier proposal, and to safely free up resources if the application knows they are not needed. The shared dictionary has a const type to prevent it being explicitly deleted.

With this API you can use the word breaking functions in two ways:

  • Create your own dictionary (using a mutex) and use the new word breaking functions. This allows you to optionally specify the dictionary location.
  • Continue using the old word breaking functions, which use a shared dictionary. This dictionary always comes from the default location.

In other words, if you want a non-default dictionary you have to manage the sharing yourself.

The new API is as discussed on the libthai forum.

* include/thai/thbrk.h:
* include/thai/thwbrk.h:
* src/thbrk/thbrk.c
* src/thwbrk/thwbrk.c
  - New type ThDict for a loaded dictionary.
  - Define exported functions th_dict_new(), th_dict_delete(),
    th_dict_brk(), th_dict_brk_line(), th_dict_wbrk() and
    th_dict_wbrk_line() for using explicit dictionaries.
  - Define exported functions th_dict_get_shared() and
    th_dict_free_shared() for accessing a shared dictionary.
  - (th_brk, th_brk_line, th_wbrk, th_wbrk_line):
    Use the new functions with the shared dictionary.

* src/thbrk/brk-maximal.c:
* src/thbrk/brk-maximal.h:
  - (BrkEnv):
    Include a reference to the dictionary in the thread-local
    environment.
  - (brk_root_pool):
    Use the dictionary from the environment rather than always
    using the shared dictionary.

* src/thbrk/brk-common.c:
* src/thbrk/brk-common.h:
  - Define the ThDict data type, with brk_dict_new(), brk_dict_delete()
    and brk_dict_trie() operations.
  - Get rid of now-unneessary brk_get_dict() and brk_on_unload().

*src/libthai.c:
  - Call th_dict_free_shared() from the destructor.
@thep
Copy link
Contributor

thep commented May 16, 2016

I don't quite like exposing the concept of dictionary via a dedicated type ThDict. Thai word segmentation can be implemented with several approaches, including rule-based, dictionary-based, trigram-based, or other statistical ones. That's why I used a generic name like ThBrk in the discussion. The dictionary should be hidden, or at least the API should not emphasize it.

With this, the free list in BrkEnv can also be moved into ThBrk instance, to promote more resource reuse.

In src/libthai.map:

+LIBTHAI_0.1.24svn {
...
+} LIBTHAI_0.1.19;

Please remove the "svn" suffix from symbol versioning.

Thanks for your work!

markbrown added 2 commits May 18, 2016 08:35
As suggested by the library maintainer.

* Types ThDict and struct _ThDict are renamed to ThBrk and struct _ThBrk.
* Variables of these types are renamed from "dict" to "brk".
* Functions named th_dict_* are renamed to th_brk_*.
@markbrown
Copy link
Contributor Author

On Mon, May 16, 2016 at 5:18 PM, thep [email protected] wrote:

I don't quite like exposing the concept of dictionary via a dedicated type
ThDict. Thai word segmentation can be implemented with several approaches,
including rule-based, dictionary-based, trigram-based, or other statistical
ones. That's why I used a generic name like ThBrk in the discussion. The
dictionary should be hidden, or at least the API should not emphasize it.

Ah, I see. I have changed the names to what you originally suggested.

With this, the free list in BrkEnv can also be moved into ThBrk instance,
to promote more resource reuse.

Isn't this still needed to be separate for thread safety? I expect
applications will create a single ThBrk for use by all threads, but each
thread will have to create its own BrkEnv including its own free list. Same
reason why BrkEnv was created in the first place, if I understand correctly.

In src/libthai.map:

+LIBTHAI_0.1.24svn {
...
+} LIBTHAI_0.1.19;

Please remove the "svn" suffix from symbol versioning.

Fixed.

Thanks for your work!

I'm happy to contribute :)

@thep
Copy link
Contributor

thep commented May 18, 2016

On Wed, May 18, 2016 at 6:18 AM, markbrown [email protected] wrote:

On Mon, May 16, 2016 at 5:18 PM, thep [email protected] wrote:

With this, the free list in BrkEnv can also be moved into ThBrk instance,
to promote more resource reuse.

Isn't this still needed to be separate for thread safety? I expect
applications will create a single ThBrk for use by all threads, but each
thread will have to create its own BrkEnv including its own free list. Same
reason why BrkEnv was created in the first place, if I understand correctly.

Ah, you're right. Let's leave it separated, then.

In src/libthai.map:

+LIBTHAI_0.1.24svn {
...
+} LIBTHAI_0.1.19;

Please remove the "svn" suffix from symbol versioning.

Fixed.

Just want to make sure that you have also bumped the version
to 0.1.25.

Regards,

Theppitak Karoonboonyanan
http://linux.thai.net/~thep/

@markbrown
Copy link
Contributor Author

On Wed, May 18, 2016 at 3:05 PM, Theppitak Karoonboonyanan <
[email protected]> wrote:

Just want to make sure that you have also bumped the version
to 0.1.25.

Fixed now.

@markbrown
Copy link
Contributor Author

I sent that rather quickly. Did you mean bump the version in configure.ac also?

@thep
Copy link
Contributor

thep commented May 19, 2016

On Thu, May 19, 2016 at 3:20 AM, markbrown [email protected] wrote:

I sent that rather quickly. Did you mean bump the version in configure.ac also?

No. The version in configure.ac is only bumped on new releases.

Theppitak Karoonboonyanan
http://linux.thai.net/~thep/

@thep
Copy link
Contributor

thep commented May 19, 2016

--- a/src/libthai.c
+++ b/src/libthai.c
@@ -53,8 +53,14 @@

  • th_wcisldvowel(), th_wcisflvowel(), th_wcisupvowel(), th_wcisblvowel(),
  • th_wcchlevel(), th_wciscombchar()
    *

s/ThDictBrk/ThBrk/

  • * @subsection ThBrk Functions for word segmentation with the shared dictionary
  • * th_brk_get_shared(), th_brk_free_shared(),
    • th_brk(), th_brk_line(), th_wbrk(), th_wbrk_line()

I think the shared dictionary should be hidden from user.
Its purpose is just to retain old function behaviors internally.

So, please rename th_brk_get_shared() and th_brk_free_shared()
to brk_get_shared_dict() and brk_free_shared_dict() and make
them private.

--- a/src/thbrk/thbrk.c
+++ b/src/thbrk/thbrk.c
@@ -33,12 +33,46 @@
#include "thbrk-utils.h"
#include "brk-ctype.h"
#include "brk-maximal.h"
+#include "brk-common.h"

#define MAX_ACRONYM_FRAG_LEN 3

/**

  • * @brief Create a word break dictionary instance
  • * @param dictpath : the dictionary path, or NULL for default
  • * @return the created instance, or NULL on failure
  • * Loads the dictionary from the given file and returns the created word
  • * break dictionary instance. If @A dictpath is NULL, first searches in
  • * the directory given by the LIBTHAI_DICTDIR environment variable,
  • * then in the library installation directory. Returns NULL if the
  • * dictionary file is not found or cannot be loaded.
  • */
    +ThBrk *
    +th_brk_new (const char *dictpath)

In the documentation, please describe ThBrk as
"word breaker" instead of "word break dictionary".
This includes the 'brk' argument in th_brk_* functions.

As we have discussed, ThBrk represents a generic idea
of word breaking engines.

As an exception, the documentation for th_brk_new()
may describe it as "dictionary-based word breaker" for now.

Imagine when other approaches are added, we can
create other constructors and reimplement the th_brk_brk() et. al.
with polymorphism. So, the document should not mention
the dictionary.

Thanks!

* Do not expose th_brk_get_shared() and th_brk_free_shared():
  - Remove them from the header files and documentation.
  - Move the functions and static variable to src/thbrk/brk-common.c,
    and name the functions brk_get_shared_dict() and
    brk_free_shared_dict().
  - Remove the "const" from ThBrk in places where it was used. This
    helped users avoid accidentally freeing the shared dictionary,
    but no longer serves that purpose.

* Documentation fixes:
  - Describe ThBrk as a "word breaker".
  - Instead of separate ThDictBrk and ThBrk groups, just group them
    under ThBrk.
@markbrown
Copy link
Contributor Author

I've addressed your latest comments.

I've also removed the "const" from ThBrk in the places where I had added it. The reason I put it there in the first place was to help users avoid accidentally deleting the shared dictionary (instead of using the function for that purpose, which also clears the static variable), but we don't need to worry about that if the shared dictionary is not exposed.

Thanks for the feedback!

@thep
Copy link
Contributor

thep commented May 22, 2016

I feel a bit uneasy to see ThBrk-related stuffs taken apart in two files,
thbrk.c and brk-common.c.

Previously, brk-common.c held 2 functionalities regarding the dictionary,
loading default and keeping the shared instance. Trying to keep both
in the old place makes us put struct _ThBrk details in brk-common, while
the more natural place for it should be thbrk.c.

Let's implement ThBrk in thbrk.c, including keeping the shared instance.
brk-common.c should only hold the brk_load_default_dict() function and
it shouldn't know anything about ThBrk.

Thanks!

@markbrown
Copy link
Contributor Author

markbrown commented May 22, 2016 via email

@thep
Copy link
Contributor

thep commented Jun 11, 2016

Sorry that I somehow missed the notification, until I log into GitHub to check it.

thbrk-utils.h is meant for thbrk-module-wide utilities, while the ThBrk structure is not. Let's add a new header file thbrk-priv.h to hold the non-public declarations.

This keeps the imeplementation in one place, and means that brk-common.c
does not need to know about ThBrk.

* src/thbrk/thbrk.c:
* src/thbrk/brk-common.c:
  - Move the implementation of ThBrk from brk-common.c to thbrk.c,
    including the shared instance.

* src/thbrk/thbrk-priv.h:
  - New header file for non-public declarations of thbrk.c.

* src/thbrk/brk-common.h:
  - Export brk_load_default_dict().
  - Remove declarations for accessing ThBrk.

* src/libthai.c:
* src/thbrk/brk-maximal.c:
  - Use thbrk-priv.h instead of brk-common.h.

* src/thbrk/Makefile.am:
  - Add the new header file.
@markbrown
Copy link
Contributor Author

Done. How does it look to you?

@thep
Copy link
Contributor

thep commented Jun 21, 2016

Looks great!

BTW, I just realize that the names for brk_{get,free}_shared_dict() should be brk_{get,free}_shared_brk() instead, to reflect the ThBrk object they manipulate.

Otherwise, I think it's ready for merging.

* src/libthai.c:
* src/thbrk/thbrk-priv.h:
* src/thbrk/thbrk.c:
  - Rename functions brk_{get,free}_shared_dict() to
    brk_{get,free}_shared_brk, respectively.
  - Rename the static variable brk_shared_dict -> brk_shared_brk.
@markbrown
Copy link
Contributor Author

Fixed.

@thep thep merged commit b34b0af into tlwg:master Jun 22, 2016
thep added a commit that referenced this pull request Jun 22, 2016
Implement a new thread-safe interface for word break.

To achieve more thread-safety without depending on mutex mechanisms,
a new set of APIs is added so that the client can create
a shared instance of word break engine by him/herself
under appropriate mutex. Then, the word break functions
can be safely called in parallel using the shared engine.

* include/thai/thbrk.h:
* include/thai/thwbrk.h:
* src/libthai.c:
* src/libthai.def:
* src/libthai.map:
  - Add new exported APIs:
    th_brk_new(), th_brk_delete(),
    th_brk_brk(), th_brk_brk_line(),
    th_brk_wbrk(), th_brk_wbrk_line().

* src/thbrk/brk-common.h, src/thbrk/brk-common.c
  (-brk_on_unload, -brk_get_dict, +brk_load_default_dict):
  - Remove old shared dict management. It's to be as part of
    ThBrk implementation in ThBrk layer instead.
  - The logic for finding and loading dictionary at
    default paths is still retained here.

* src/thbrk/thbrk.c (th_brk_new, th_brk_delete):
  - Implement ThBrk (de)allocation, with dictionary loading
    at specified path or at default paths if not specified.

* src/thbrk/brk-maximal.h, src/thbrk/brk-maximal.c
  (struct _BrkEnv, brk_env_new):
  - Add ThBrk engine as a member of BrkEnv.
* src/thbrk/brk-maximal.c (brk_root_pool):
  - Access dict trie from ThBrk in BrkEnv instead of getting
    shared dict directly.

* src/thbrk/thbrk.c
    (th_brk -> th_brk_brk, th_brk_line -> th_brk_brk_line):
* src/thwbrk/thwbrk.c
    (th_wbrk -> th_brk_wbrk, th_wbrk_line -> th_brk_wbrk_line):
  - Modify old functions to new ones by adding ThBrk* parameter.

* src/thbrk/Makefile.am, +src/thbrk/thbrk-priv.h,
  src/thbrk/thbrk.c (brk_get_shared_brk, brk_free_shared_brk):
  - Add functions for managing the shared engine to preserve
    old behavior.
* src/libthai.c (_libthai_on_unload):
  - Call brk_free_shared_brk() on unload.
* src/thbrk/thbrk.c (th_brk, th_brk_line):
* src/thwbrk/thwbrk.c (th_wbrk, th_wbrk_line):
  - Provide old APIs as wrappers to the new APIs,
    for backward compatibility.

Merging pull request #1.
@thep
Copy link
Contributor

thep commented Jun 22, 2016

Merged. Thank you very much for your contribution!

@markbrown markbrown deleted the thread-safe-dict branch June 23, 2016 21:24
@markbrown
Copy link
Contributor Author

On Wed, Jun 22, 2016 at 6:26 PM, Theppitak Karoonboonyanan <
[email protected]> wrote:

Merged. Thank you very much for your contribution!

You're welcome!

Looking forward to the next release. Do you have an expected time for that?

Mark

@thep
Copy link
Contributor

thep commented Jun 24, 2016

It should be released soon, after I finish updating other parts related to the API addition.

@thep
Copy link
Contributor

thep commented Jun 25, 2016

Before they get public, I decide to rename the ThBrk methods to be more meaningful:
- th_brk_brk() -> th_brk_find_breaks()
- th_brk_brk_line() -> th_brk_insert_breaks()
- th_brk_wbrk() -> th_brk_find_breaks_wc()
- th_brk_wbrk_line() -> th_brk_insert_breaks_wc()

thep added a commit that referenced this pull request Jun 25, 2016
The methods for ThBrk I proposed to Mark Brown in our discussion
were too confusing. Before they get public, let's pick more
meaningful names instead:
  - th_brk_brk()       -> th_brk_find_breaks()
  - th_brk_brk_line()  -> th_brk_insert_breaks()
  - th_brk_wbrk()      -> th_brk_find_breaks_wc()
  - th_brk_wbrk_line() -> th_brk_insert_breaks_wc()

* include/thai/thbrk.h:
* include/thai/thwbrk.h:
* src/libthai.c:
* src/libthai.def:
* src/libthai.map:
* src/thbrk/thbrk.c:
* src/thwbrk/thwbrk.c:
* tests/test_thbrk.c:
* tests/test_thwbrk.c:
  - Rename functions as listed above.
  - Rename the 'n' argument in the functions to indicate
    whose size it describes.

Cc: pull request #1
thep added a commit that referenced this pull request Jun 26, 2016
To be consistent with other modules, wide-char APIs should be
prefixed, not suffixed with 'wc'.

* include/thai/thwbrk.h:
* src/libthai.c:
* src/libthai.def:
* src/libthai.map:
* src/thwbrk/thwbrk.c:
* tests/test_thwbrk.c:
  - Rename th_brk_find_breaks_wc() -> th_brk_wc_find_breaks().
  - Rename th_brk_insert_breaks_wc() -> th_brk_wc_insert_breaks().

Cc: pull request #1
@thangta thangta mentioned this pull request Aug 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants