Implement a new thread-safe interface. #1
The new API is as discussed on the libthai forum.

* include/thai/thbrk.h:
* include/thai/thwbrk.h:
* src/thbrk/thbrk.c:
* src/thwbrk/thwbrk.c:
  - New type ThDict for a loaded dictionary.
  - Define exported functions th_dict_new(), th_dict_delete(), th_dict_brk(), th_dict_brk_line(), th_dict_wbrk() and th_dict_wbrk_line() for using explicit dictionaries.
  - Define exported functions th_dict_get_shared() and th_dict_free_shared() for accessing a shared dictionary.
  - (th_brk, th_brk_line, th_wbrk, th_wbrk_line): Use the new functions with the shared dictionary.

* src/thbrk/brk-maximal.c:
* src/thbrk/brk-maximal.h:
  - (BrkEnv): Include a reference to the dictionary in the thread-local environment.
  - (brk_root_pool): Use the dictionary from the environment rather than always using the shared dictionary.

* src/thbrk/brk-common.c:
* src/thbrk/brk-common.h:
  - Define the ThDict data type, with brk_dict_new(), brk_dict_delete() and brk_dict_trie() operations.
  - Get rid of now-unnecessary brk_get_dict() and brk_on_unload().

* src/libthai.c:
  - Call th_dict_free_shared() from the destructor.
I don't quite like exposing the concept of dictionary via a dedicated type ThDict. Thai word segmentation can be implemented with several approaches, including rule-based, dictionary-based, trigram-based, or other statistical ones. That's why I used a generic name like ThBrk in the discussion. The dictionary should be hidden, or at least the API should not emphasize it.

With this, the free list in BrkEnv can also be moved into ThBrk instance, to promote more resource reuse.

In src/libthai.map:

    +LIBTHAI_0.1.24svn {

Please remove the "svn" suffix from symbol versioning.

Thanks for your work!
As suggested by the library maintainer.

* Types ThDict and struct _ThDict are renamed to ThBrk and struct _ThBrk.
* Variables of these types are renamed from "dict" to "brk".
* Functions named th_dict_* are renamed to th_brk_*.
On Mon, May 16, 2016 at 5:18 PM, thep [email protected] wrote:
Ah, I see. I have changed the names to what you originally suggested.
Isn't this still needed to be separate for thread safety? I expect
Fixed.
I'm happy to contribute :)
On Wed, May 18, 2016 at 6:18 AM, markbrown [email protected] wrote:
Ah, you're right. Let's leave it separated, then.
Just want to make sure that you have also bumped the version

Regards,
Theppitak Karoonboonyanan
On Wed, May 18, 2016 at 3:05 PM, Theppitak Karoonboonyanan <
Fixed now.
I sent that rather quickly. Did you mean bump the version in configure.ac also?
On Thu, May 19, 2016 at 3:20 AM, markbrown [email protected] wrote:
No. The version in configure.ac is only bumped on new releases.

Theppitak Karoonboonyanan
s/ThDictBrk/ThBrk/
I think the shared dictionary should be hidden from user. So, please rename th_brk_get_shared() and th_brk_free_shared()
In the documentation, please describe ThBrk as
As we have discussed, ThBrk represents a generic idea
As an exception, the documentation for th_brk_new()
Imagine when other approaches are added, we can
Thanks!
* Do not expose th_brk_get_shared() and th_brk_free_shared():
  - Remove them from the header files and documentation.
  - Move the functions and static variable to src/thbrk/brk-common.c, and name the functions brk_get_shared_dict() and brk_free_shared_dict().
  - Remove the "const" from ThBrk in places where it was used. This helped users avoid accidentally freeing the shared dictionary, but no longer serves that purpose.

* Documentation fixes:
  - Describe ThBrk as a "word breaker".
  - Instead of separate ThDictBrk and ThBrk groups, just group them under ThBrk.
I've addressed your latest comments. I've also removed the "const" from ThBrk in the places where I had added it. The reason I put it there in the first place was to help users avoid accidentally deleting the shared dictionary (instead of using the function for that purpose, which also clears the static variable), but we don't need to worry about that if the shared dictionary is not exposed.

Thanks for the feedback!
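For readers following the "const" reasoning above, here is a minimal sketch of the idea. It assumes a hypothetical `Engine` type and hypothetical `engine_*` functions; none of these names come from libthai. The shared getter returns a pointer-to-const, so passing it to the deleter (which takes a non-const pointer) draws a compiler diagnostic unless the caller casts.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the word-break type; not libthai API. */
typedef struct Engine { int loaded; } Engine;

Engine *engine_new(void)
{
    Engine *e = malloc(sizeof *e);
    if (e)
        e->loaded = 1;
    return e;
}

/* Takes a non-const pointer: handing it a `const Engine *` discards the
   qualifier, which compilers report. */
void engine_delete(Engine *e)
{
    free(e);
}

static Engine *shared = NULL;

/* Callers only get a read-only view of the shared instance. */
const Engine *engine_get_shared(void)
{
    if (!shared)
        shared = engine_new();
    return shared;
}

/* The intended way to release the shared instance. It also clears the
   static pointer, so a later engine_get_shared() creates a fresh
   instance instead of returning a dangling pointer. */
void engine_free_shared(void)
{
    engine_delete(shared);
    shared = NULL;
}
```

Once the shared instance is no longer reachable from the public headers, the qualifier has nothing left to protect, which matches the reason given for removing it.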
I feel a bit uneasy to see ThBrk-related stuffs taken apart in two files,
Previously, brk-common.c held 2 functionalities regarding the dictionary,
Let's implement ThBrk in thbrk.c, including keeping the shared instance.
Thanks!
That's a good idea. brk-common.c won't need to depend on thbrk at all, but
there are three other dependencies to deal with:
- libthai.c needs to call brk_free_shared_dict()
- brk-maximal needs to be able to get the trie out of ThBrk
- thwbrk needs to get the shared dictionary.
The last of these I have circumvented by allowing callers to pass NULL to
th_brk_brk to get the shared dictionary, although this is undocumented. To
deal with the dependencies properly I'll need to declare non-public
functions for thbrk in a header file. Should I use thbrk-utils.h for this
purpose, or create a new header file such as thbrk-priv.h?
Thanks for the reviews.
Sorry that I somehow missed the notification until I logged into GitHub to check it.

thbrk-utils.h is meant for thbrk-module-wide utilities, while the ThBrk structure is not. Let's add a new header file thbrk-priv.h to hold the non-public declarations.
This keeps the implementation in one place, and means that brk-common.c does not need to know about ThBrk.

* src/thbrk/thbrk.c:
* src/thbrk/brk-common.c:
  - Move the implementation of ThBrk from brk-common.c to thbrk.c, including the shared instance.

* src/thbrk/thbrk-priv.h:
  - New header file for non-public declarations of thbrk.c.

* src/thbrk/brk-common.h:
  - Export brk_load_default_dict().
  - Remove declarations for accessing ThBrk.

* src/libthai.c:
* src/thbrk/brk-maximal.c:
  - Use thbrk-priv.h instead of brk-common.h.

* src/thbrk/Makefile.am:
  - Add the new header file.

Done. How does it look to you?
Looks great!
BTW, I just realize that the names for
Otherwise, I think it's ready for merging.
* src/libthai.c:
* src/thbrk/thbrk-priv.h:
* src/thbrk/thbrk.c:
  - Rename functions brk_{get,free}_shared_dict() to brk_{get,free}_shared_brk(), respectively.
  - Rename the static variable brk_shared_dict -> brk_shared_brk.

Fixed.
Implement a new thread-safe interface for word break.

To achieve more thread-safety without depending on mutex mechanisms, a new set of APIs is added so that the client can create a shared instance of the word break engine by him/herself under an appropriate mutex. Then, the word break functions can be safely called in parallel using the shared engine.

* include/thai/thbrk.h:
* include/thai/thwbrk.h:
* src/libthai.c:
* src/libthai.def:
* src/libthai.map:
  - Add new exported APIs: th_brk_new(), th_brk_delete(), th_brk_brk(), th_brk_brk_line(), th_brk_wbrk(), th_brk_wbrk_line().

* src/thbrk/brk-common.h, src/thbrk/brk-common.c (-brk_on_unload, -brk_get_dict, +brk_load_default_dict):
  - Remove old shared dict management. It is to be part of the ThBrk implementation in the ThBrk layer instead.
  - The logic for finding and loading the dictionary at default paths is still retained here.

* src/thbrk/thbrk.c (th_brk_new, th_brk_delete):
  - Implement ThBrk (de)allocation, with dictionary loading at the specified path, or at default paths if not specified.

* src/thbrk/brk-maximal.h, src/thbrk/brk-maximal.c (struct _BrkEnv, brk_env_new):
  - Add ThBrk engine as a member of BrkEnv.

* src/thbrk/brk-maximal.c (brk_root_pool):
  - Access dict trie from ThBrk in BrkEnv instead of getting shared dict directly.

* src/thbrk/thbrk.c (th_brk -> th_brk_brk, th_brk_line -> th_brk_brk_line):
* src/thwbrk/thwbrk.c (th_wbrk -> th_brk_wbrk, th_wbrk_line -> th_brk_wbrk_line):
  - Modify old functions to new ones by adding ThBrk* parameter.

* src/thbrk/Makefile.am, +src/thbrk/thbrk-priv.h, src/thbrk/thbrk.c (brk_get_shared_brk, brk_free_shared_brk):
  - Add functions for managing the shared engine to preserve old behavior.

* src/libthai.c (_libthai_on_unload):
  - Call brk_free_shared_brk() on unload.

* src/thbrk/thbrk.c (th_brk, th_brk_line):
* src/thwbrk/thwbrk.c (th_wbrk, th_wbrk_line):
  - Provide old APIs as wrappers to the new APIs, for backward compatibility.

Merging pull request #1.

Merged. Thank you very much for your contribution!
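The overall shape of the merged change (new functions that take the engine explicitly, old functions kept as wrappers over a privately held shared engine) can be sketched as follows. This is a toy under stated assumptions, not libthai code: `ToyBrk`, the `toy_*` functions and the separator-based "breaking" rule are all hypothetical, chosen only so the example compiles without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy engine standing in for ThBrk: it "breaks" after every
   occurrence of `sep`. None of these names are libthai API. */
typedef struct ToyBrk { char sep; } ToyBrk;

ToyBrk *toy_brk_new(char sep)
{
    ToyBrk *brk = malloc(sizeof *brk);
    if (brk)
        brk->sep = sep;
    return brk;
}

void toy_brk_delete(ToyBrk *brk)
{
    free(brk);
}

/* New-style call: the engine is an explicit parameter and the function
   reads no global state. Stores at most pos_sz break positions and
   returns how many were stored. */
int toy_brk_find_breaks(const ToyBrk *brk, const char *s,
                        int pos[], size_t pos_sz)
{
    size_t n = 0;
    assert(brk && s);
    for (size_t i = 0; s[i] != '\0' && n < pos_sz; i++) {
        if (s[i] == brk->sep && s[i + 1] != '\0')
            pos[n++] = (int)(i + 1);
    }
    return (int)n;
}

/* Shared engine kept only for the old entry point. Lazy creation is
   unsynchronized in this sketch. */
static ToyBrk *shared_brk = NULL;

static ToyBrk *get_shared_brk(void)
{
    if (!shared_brk)
        shared_brk = toy_brk_new(' ');
    return shared_brk;
}

/* Would be called from the library's unload hook. */
void free_shared_brk(void)
{
    toy_brk_delete(shared_brk);
    shared_brk = NULL;
}

/* Old-style call, preserved as a thin wrapper over the new one. */
int toy_find_breaks(const char *s, int pos[], size_t pos_sz)
{
    ToyBrk *brk = get_shared_brk();
    return brk ? toy_brk_find_breaks(brk, s, pos, pos_sz) : 0;
}
```

Existing callers keep using the wrapper unchanged, while callers that need parallel use create their own engine once and pass it in.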
On Wed, Jun 22, 2016 at 6:26 PM, Theppitak Karoonboonyanan <
You're welcome! Looking forward to the next release. Do you have an expected time for that?

Mark

It should be released soon, after I finish updating other parts related to the API addition.

Before they go public, I have decided to rename the ThBrk methods to be more meaningful:
The methods for ThBrk I proposed to Mark Brown in our discussion were too confusing. Before they get public, let's pick more meaningful names instead:

  - th_brk_brk() -> th_brk_find_breaks()
  - th_brk_brk_line() -> th_brk_insert_breaks()
  - th_brk_wbrk() -> th_brk_find_breaks_wc()
  - th_brk_wbrk_line() -> th_brk_insert_breaks_wc()

* include/thai/thbrk.h:
* include/thai/thwbrk.h:
* src/libthai.c:
* src/libthai.def:
* src/libthai.map:
* src/thbrk/thbrk.c:
* src/thwbrk/thwbrk.c:
* tests/test_thbrk.c:
* tests/test_thwbrk.c:
  - Rename functions as listed above.
  - Rename the 'n' argument in the functions to indicate whose size it describes.

Cc: pull request #1
To be consistent with other modules, wide-char APIs should be prefixed, not suffixed with 'wc'.

* include/thai/thwbrk.h:
* src/libthai.c:
* src/libthai.def:
* src/libthai.map:
* src/thwbrk/thwbrk.c:
* tests/test_thwbrk.c:
  - Rename th_brk_find_breaks_wc() -> th_brk_wc_find_breaks().
  - Rename th_brk_insert_breaks_wc() -> th_brk_wc_insert_breaks().

Cc: pull request #1
The new API is as discussed on the libthai forum.
I've changed the names from what was first suggested, however, because I found the suggested names very confusing (th_brk_brk, th_brk_wbrk, etc).
I've also exported functions to access the shared dictionary used by the backwards compatible part of the implementation. These can be used to force loading as th_ensure_dict_loaded() did in an earlier proposal, and to safely free up resources if the application knows they are not needed. The shared dictionary has a const type to prevent it being explicitly deleted.
With this API you can use the word breaking functions in two ways:
In other words, if you want a non-default dictionary you have to manage the sharing yourself.
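As an illustration of "managing the sharing yourself", here is a minimal sketch in which the application owns one engine and guards its creation with a mutex. Everything in it is an assumption made for the example: the `Engine` type, `engine_new()`, and the `app_*` helpers are hypothetical and merely stand in for whatever constructor the library exposes; only the POSIX mutex calls are real.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for an engine loaded from a dictionary path. */
typedef struct Engine { const char *dict_path; } Engine;

Engine *engine_new(const char *dict_path)
{
    Engine *e = malloc(sizeof *e);
    if (e)
        e->dict_path = dict_path;
    return e;
}

/* Application-owned shared engine and the lock that protects it. */
static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;
static Engine *app_engine = NULL;

/* Create-once accessor: the first caller loads the engine, later callers
   get the same instance and their dict_path argument is ignored. */
Engine *app_get_engine(const char *dict_path)
{
    Engine *e;
    pthread_mutex_lock(&app_lock);
    if (!app_engine)
        app_engine = engine_new(dict_path);
    e = app_engine;
    pthread_mutex_unlock(&app_lock);
    return e;
}

/* Release the engine. The application must ensure no thread is still
   using it when this is called. */
void app_free_engine(void)
{
    pthread_mutex_lock(&app_lock);
    free(app_engine);
    app_engine = NULL;
    pthread_mutex_unlock(&app_lock);
}
```

Only creation and destruction take the lock here; the design discussed in this thread relies on the break functions themselves being safe to call in parallel on an already created engine.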