Build instruction\requirements windows? #3

Closed
DoumanAsh opened this issue Feb 6, 2016 · 20 comments

@DoumanAsh
Contributor

subj :)

@DoumanAsh DoumanAsh changed the title Build instruction\requirements? Build instruction\requirements windows? Feb 6, 2016
@tsurai
Owner

tsurai commented Feb 6, 2016

MeCab doesn't have any special requirements in itself. There are two ways to build it on Windows.

  • If you are using the MSVC ABI version of Rust, all you need to do is download the Windows binaries of MeCab and put the library into your search path. Note that the MSVC ABI requires at least Visual Studio 2013 to build.
  • The GNU ABI is the one recommended by Rust, but it brings its own problems. The GNU ABI cannot link the Windows library binaries, and no binaries built for the GNU toolchain are available, so you would have to build MeCab with the GNU toolchain yourself. In the end you would probably need Cygwin or MSYS to configure and build it.

To be honest, developing on Windows is always troublesome, and I don't have a development environment set up on Windows to test it right now. I hope these few tips help; otherwise I'll set up a little test environment in a virtual machine.
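
As a rough illustration of "put the library into your search path", a crate's build script could emit the link directives itself. This is only a sketch under assumptions: the `MECAB_LIB_DIR` variable and the fallback path `C:/mecab/sdk` are made up for the example, and the library names (`libmecab` for MSVC, `mecab` for GNU) are a guess at how the Windows binaries are named, not something confirmed in this thread.

```rust
// build.rs (sketch)
use std::env;

/// Build the cargo directives for linking MeCab dynamically.
/// `target_env` is "msvc" or "gnu"; `lib_dir` is where the library files live.
fn link_directives(target_env: &str, lib_dir: &str) -> Vec<String> {
    // Assumption: the MSVC import library is called libmecab.lib,
    // while a GNU linker is asked for -lmecab.
    let lib_name = if target_env == "msvc" { "libmecab" } else { "mecab" };
    vec![
        format!("cargo:rustc-link-search=native={}", lib_dir),
        format!("cargo:rustc-link-lib=dylib={}", lib_name),
    ]
}

fn main() {
    // Cargo sets CARGO_CFG_TARGET_ENV for build scripts.
    let target_env = env::var("CARGO_CFG_TARGET_ENV").unwrap_or_default();
    // MECAB_LIB_DIR is a hypothetical variable used only in this sketch.
    let lib_dir = env::var("MECAB_LIB_DIR").unwrap_or_else(|_| String::from("C:/mecab/sdk"));
    for line in link_directives(&target_env, &lib_dir) {
        println!("{}", line);
    }
}
```

Keeping the directive logic in a plain function makes the MSVC/GNU difference easy to check without running a full build.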

@DoumanAsh
Contributor Author

I actually started my experiments, but so far I was able to build mecab only with the MSVC compiler... I suppose I'm going to switch to the MSVC ABI to see if it works fine, since I wasn't able to work out how to build mecab with gcc.

For 64-bit MSVC there are instructions from another project.

UPD: Right now with gcc I'm struggling with this:

note: E:\Downloads\Git\temp\mecab-test\target\debug/libmecab.a(libmecab.cpp.obj):libmecab.cpp:(.text+0x8): undefined reference to `std::ios_base::Init::~Init()'
E:\Downloads\Git\temp\mecab-test\target\debug/libmecab.a(libmecab.cpp.obj):libmecab.cpp:(.text.startup+0xc): undefined reference to `std::ios_base::Init::Init()'
ld: E:\Downloads\Git\temp\mecab-test\target\debug/libmecab.a(libmecab.cpp.obj): bad reloc address 0xc in section `.text.startup'

But it seems to be more or less possible to build libmecab itself.
I'm not able to build the binaries, though.

P.S. I wasn't able to configure with msys, so I just wrote a CMake config:

cmake_minimum_required(VERSION 2.6)

project(Mecab)

SET( CMAKE_C_FLAGS  "${CMAKE_C_FLAGS} -Os -Wall -DNDEBUG -lstdc++")
SET( CMAKE_CXX_FLAGS  "${CMAKE_CXX_FLAGS} -Os -Wall -DNDEBUG")

include_directories("${PROJECT_BINARY_DIR}")
add_definitions(-DVERSION="0.996")
add_definitions(-DMECAB_DEFAULT_RC="mecabrc")
add_definitions(-DPACKAGE="mecab")
add_definitions(-DDIC_VERSION=102)
add_definitions(-D_WIN32_IE=0x0900)
add_definitions(-DMECAB_USE_THREAD -D_CRT_SECURE_NO_DEPRECATE -DHAVE_WINDOWS_H -DDLL_EXPORT -DHAVE_GETENV -DUNICODE -D_UNICODE)

add_library(mecab STATIC mecabrc viterbi.cpp tagger.cpp  utils.cpp utils.h eval.cpp iconv_utils.cpp iconv_utils.h dictionary_rewriter.h dictionary_rewriter.cpp dictionary_generator.cpp dictionary_compiler.cpp context_id.h context_id.cpp winmain.h thread.h connector.cpp nbest_generator.h nbest_generator.cpp connector.h writer.h writer.cpp mmap.h ucs.h string_buffer.h string_buffer.cpp tokenizer.h stream_wrapper.h common.h darts.h char_property.h ucstable.h freelist.h viterbi.h param.cpp tokenizer.cpp ucstable.h char_property.cpp dictionary.h scoped_ptr.h param.h mecab.h dictionary.cpp feature_index.cpp  feature_index.h  lbfgs.cpp lbfgs.h  learner_tagger.cpp  learner_tagger.h  learner.cpp  learner_node.h libmecab.cpp)

add_executable(mecab-dict-index  mecab-dict-index.cpp)
target_link_libraries(mecab-dict-index mecab)

add_executable(mecab-dict-gen mecab-dict-gen.cpp)
target_link_libraries(mecab-dict-gen mecab)

add_executable(mecab-system-eval mecab-system-eval.cpp)
target_link_libraries(mecab-system-eval mecab)

add_executable(mecab-cost-train mecab-cost-train.cpp)
target_link_libraries(mecab-cost-train mecab)

add_executable(mecab-test-gen mecab-test-gen.cpp)
target_link_libraries(mecab-test-gen mecab)

add_executable(mecab-cli mecab.cpp)
target_link_libraries(mecab-cli mecab)

UPD 2:
Removing "winmain.h" from the binaries solves the build issue with msys2.
Now I only need to solve the issue with linking the C++ standard library.

@DoumanAsh
Contributor Author

OK, adding stdc++ and pthread to the linker options solves the build issue with gcc.
The examples are panicking, though....
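
The comment does not say where the linker options were added. If they were supplied from the Rust side, a build script along these lines could do it. This is a sketch with assumptions: the static archive is taken to be `libmecab.a` sitting in `target/debug` (as in the error log above), and the function name is invented for the example.

```rust
// build.rs (sketch)

/// Directives for linking a static libmecab.a plus the runtimes it needs.
fn static_link_directives(lib_dir: &str) -> Vec<String> {
    vec![
        format!("cargo:rustc-link-search=native={}", lib_dir),
        // The static archive comes first ...
        "cargo:rustc-link-lib=static=mecab".to_string(),
        // ... followed by the libraries it depends on:
        // the C++ standard library and the thread library.
        "cargo:rustc-link-lib=stdc++".to_string(),
        "cargo:rustc-link-lib=pthread".to_string(),
    ]
}

fn main() {
    // Assumed location of libmecab.a, taken from the paths in the log.
    for line in static_link_directives("target/debug") {
        println!("{}", line);
    }
}
```

Listing `stdc++` after `mecab` matters for GNU linkers, which resolve symbols left to right.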

@tsurai
Owner

tsurai commented Feb 7, 2016

What exactly is it panicking on? Can you give me a backtrace?

@DoumanAsh
Contributor Author

It panics in the parse_str method.

I don't really understand what is wrong as of now, but it happens here:

  let mut result = tagger.parse_str(input);
  println!("RESULT: {}", result);

For some reason result contains an invalid UTF-8 string...

thread '<main>' panicked at 'failed printing to stdout: text was not valid unicode', ../src/libstd\io\stdio.rs:605

@tsurai
Owner

tsurai commented Feb 7, 2016

Hmm.... could it be that the input string you provide is not UTF-8 encoded but ASCII or Shift-JIS? This especially applies to the Windows command line and PowerShell. Neither uses UTF-8, and that could cause this error. Or your actual Rust source file is saved as ASCII.

@DoumanAsh
Contributor Author

I'm using the example from the repository, which is saved as UTF-8: https://github.com/tsurai/mecab-rs/blob/master/examples/simple.rs
And here is the lossy conversion result.
Complete trash....

�       ����,���,*,*,*,*,*
�       ����,���,*,*,*,*,*
郎�     ����,���,*,*,*,*,*
��      �L��,���,*,*,*,*,*
�       ����,���,*,*,*,*,�,�R�_�},�R�_�}
�       ����,�T�ϐڑ�,*,*,*,*,*
郎�     ����,���,*,*,*,*,*
��      �L��,���,*,*,*,*,��,��,��
�       ����,���,*,*,*,*,*
�っ     ����,�T�ϐڑ�,*,*,*,*,*
�       ����,���,*,*,*,*,*
�       ����,���,*,*,*,*,*
�       ����,���,*,*,*,*,*
��      �L��,���,*,*,*,*,*
��      �L��,�A���t�@�x�b�g,*,*,*,*,��,�P�C,�P�C
�       ����,���,*,*,*,*,*
�       ����,���,*,*,*,*,*
を�     ����,���,*,*,*,*,*
��      ����,����,*,*,��i,�A�p�`,����,�q,�q
子�     ����,���,*,*,*,*,*
��      �L��,���,*,*,*,*,��,��,��
�       ����,���,*,*,*,*,�,�e��,�e��
�       ����,�T�ϐڑ�,*,*,*,*,*
し�     ����,���,*,*,*,*,*
��      �L��,���,*,*,*,*,��,��,��
�       ����,�ŗL����,�g�D,*,*,*,*
�       �L��,���,*,*,*,*,*
EOS

The input itself is printed correctly, but something goes wrong inside mecab :(
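
For readers wondering where the replacement characters come from: bytes that are not valid UTF-8 make a strict conversion fail, while a lossy conversion swaps each offending byte sequence for U+FFFD. A minimal sketch with the standard library follows. The four sample bytes are assumed to be the Shift-JIS encoding of 名詞; that is an illustration, not something extracted from the actual MeCab output in this issue, and the helper names are invented.

```rust
use std::borrow::Cow;
use std::str::Utf8Error;

/// Strict decoding: returns an error instead of panicking on bad bytes.
fn decode_checked(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decoding: invalid sequences become U+FFFD.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

fn main() {
    // Assumed Shift-JIS bytes for 名詞; none of them can start a UTF-8 sequence.
    let sample: [u8; 4] = [0x96, 0xBC, 0x8E, 0x8C];

    match decode_checked(&sample) {
        Ok(text) => println!("valid UTF-8: {}", text),
        Err(err) => println!("not valid UTF-8: {}", err),
    }

    let lossy = decode_lossy(&sample);
    let replaced = lossy.chars().filter(|c| *c == '\u{FFFD}').count();
    println!("lossy result has {} replacement characters", replaced);
}
```

Handling the `Result` from the strict conversion lets a caller report the encoding mismatch instead of panicking while printing.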

@DoumanAsh
Contributor Author

Anyway, from my experiments with msys2 it seems there is actually no need to build with gcc at all. You can link against the Windows DLL without problems even with Rust on the gcc toolchain, but the binary will require this DLL to run.

UPD:
C printf can print the result just fine, but you can see some problems in the output:

螟      名詞,一般,*,*,*,*,*
ェ       名詞,一般,*,*,*,*,*
驛弱    名詞,一般,*,*,*,*,*
・      記号,一般,*,*,*,*,*
谺      名詞,一般,*,*,*,*,谺,コダマ,コダマ
。       名詞,サ変接続,*,*,*,*,*
驛弱    名詞,一般,*,*,*,*,*
′      記号,一般,*,*,*,*,′,′,′
謖      名詞,一般,*,*,*,*,*
√▲    名詞,サ変接続,*,*,*,*,*
縺      名詞,一般,*,*,*,*,*
ヲ       名詞,一般,*,*,*,*,*
縺      名詞,一般,*,*,*,*,*
・      記号,一般,*,*,*,*,*
k      記号,アルファベット,*,*,*,*,k,ケイ,ケイ
譛      名詞,一般,*,*,*,*,*
ャ       名詞,一般,*,*,*,*,*
繧定    名詞,一般,*,*,*,*,*
干      動詞,自立,*,*,一段,連用形,干る,ヒ,ヒ
蟄舌    名詞,一般,*,*,*,*,*
↓      記号,一般,*,*,*,*,↓,↓,↓
貂      名詞,一般,*,*,*,*,貂,テン,テン
。       名詞,サ変接続,*,*,*,*,*
縺励    名詞,一般,*,*,*,*,*
◆      記号,一般,*,*,*,*,◆,◆,◆
縲      名詞,固有名詞,組織,*,*,*,*
・記号,一般,*,*,*,*,*
EOS

@tsurai
Owner

tsurai commented Feb 7, 2016

I've set up a gcc toolchain test environment and am currently investigating the problem. There seems to be a problem with the String conversion before passing it to mecab. Some functions work fine and others return an invalid UTF-8 string. I will post an update when I find something out.

Edit:
Found the cause. Trying to figure out a solution right now.

@DoumanAsh
Contributor Author

Thanks, let me know if you need any help.

@tsurai
Owner

tsurai commented Feb 9, 2016

It turns out that a lifetime problem was the cause of the errors and panics. I've tested the new version on my Windows machine and it works fine now. Please check whether this fixes your problems, and I'll upload a new crate version.
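
The actual fix is not shown in this thread. As a general illustration of the kind of lifetime problem that bites FFI code, here is a sketch of keeping a `CString` alive for as long as C reads its pointer. `c_side_len` is a made-up stand-in for a C function such as `mecab_sparse_tostr`, and `len_seen_by_c` is likewise hypothetical.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Stand-in for a C function: reads a NUL-terminated string and
/// returns its length in bytes.
unsafe fn c_side_len(ptr: *const c_char) -> usize {
    let s = unsafe { CStr::from_ptr(ptr) };
    s.to_bytes().len()
}

fn len_seen_by_c(input: &str) -> usize {
    // Bind the CString to a local so it outlives the FFI call.
    let c_input = CString::new(input).expect("input must not contain NUL bytes");

    // Dangling variant to avoid:
    //   let ptr = CString::new(input).unwrap().as_ptr();
    // The temporary CString is freed at the end of that statement,
    // so `ptr` would point at released memory.

    unsafe { c_side_len(c_input.as_ptr()) }
}

fn main() {
    println!("C sees {} bytes", len_seen_by_c("太郎"));
}
```

A pointer taken from a temporary `CString` often still "works" until the freed memory gets reused, which would fit the observation that some functions worked and others returned garbage.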

@DoumanAsh
Contributor Author

Hm... Strange, but it doesn't work for me.
Were you using a 32-bit environment and the pre-built DLL from the mecab installation?

INPUT: 太郎は次郎が持っている本を花子に渡した。
thread '<main>' panicked at 'invalid utf-8: invalid byte near index 0', src\mecab.rs:760

It seems the input's ptr is still getting corrupted in my case.
I added simple traces with printf to see the actual input and output:

    pub fn parse_str<T: Into<Vec<u8>>>(&self, input: T) -> String {
        unsafe {
            let ptr_to_parse = str_to_ptr(&CString::new(input).unwrap());
            printf("parse=%s".as_ptr() as *const i8, ptr_to_parse);
            ptr_to_string(mecab_sparse_tostr(self.inner, ptr_to_parse))
        }
    }

Result:

INPUT: 太郎は次郎が持っている本を花子に渡した。
parse=螟ェ驛弱・谺。驛弱′謖√▲縺ヲ縺・k譛ャ繧定干蟄舌↓貂。縺励◆縲・アM螟 名詞,一般,*,*,*,*,*
ェ       名詞,一般,*,*,*,*,*
驛弱    名詞,一般,*,*,*,*,*
・      記号,一般,*,*,*,*,*
谺      名詞,一般,*,*,*,*,谺,コダマ,コダマ
。       名詞,サ変接続,*,*,*,*,*
驛弱    名詞,一般,*,*,*,*,*
′      記号,一般,*,*,*,*,′,′,′
謖      名詞,一般,*,*,*,*,*
√▲    名詞,サ変接続,*,*,*,*,*
縺      名詞,一般,*,*,*,*,*
ヲ       名詞,一般,*,*,*,*,*
縺      名詞,一般,*,*,*,*,*
・      記号,一般,*,*,*,*,*
k      記号,アルファベット,*,*,*,*,k,ケイ,ケイ
譛      名詞,一般,*,*,*,*,*
ャ       名詞,一般,*,*,*,*,*
繧定    名詞,一般,*,*,*,*,*
干      動詞,自立,*,*,一段,連用形,干る,ヒ,ヒ
蟄舌    名詞,一般,*,*,*,*,*
↓      記号,一般,*,*,*,*,↓,↓,↓
貂      名詞,一般,*,*,*,*,貂,テン,テン
。       名詞,サ変接続,*,*,*,*,*
縺励    名詞,一般,*,*,*,*,*
◆      記号,一般,*,*,*,*,◆,◆,◆
縲      名詞,固有名詞,組織,*,*,*,*
・記号,一般,*,*,*,*,*

UPD:
The problem with printf is most likely due to my environment, and I suspect the same goes for the mecab functionality...
Not sure how to deal with that.
Rust strings are always UTF-8 regardless of the environment, and CString passes those bytes on unchanged.
But your environment determines how C works with strings...

UPD 2:
I have an idea for solving the issue; I'll let you know later on.

UPD 3:
Tried to encode as wide Unicode just in case. It makes no difference... no idea what is wrong in my case.

@tsurai
Owner

tsurai commented Feb 10, 2016

Please bear with me as I express my love for Windows....

This is definitely not a problem with the code or Rust itself. My whole system is set to Japanese and the code page of my command line is set to 932 (Japanese Shift-JIS), which is UTF-8 compatible (?).

Now this is where the fun begins... Changing the Japanese font MS ゴシック to the TrueType font Consolas causes the program to print the usual missing-character symbols, as expected. But changing it to the default bitmap font causes the program to panic and crash while PRINTING the characters.

Setting the code page to 65001 (UTF-8) should also work, but the problem is that Windows only offers a few fonts (none of which support Japanese). Adding a new font can only be done by changing a registry entry, which requires a REBOOT to apply. Please note that there are numerous reports of bugs and undefined behavior with the UTF-8 code page, because Windows has a slightly different standard to support some legacy behavior.

That sums things up. Your best bet is to change the code page to 932 if you want to get it to work with the standard command line. Other than that, you could use a different console program with proper UTF-8 support. Sadly, UTF-8 is a second-class citizen on Windows.
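
To see which code page a console is actually using before blaming the program, one could query it from Rust. A minimal sketch, assuming a direct FFI declaration of the Windows API function `GetConsoleOutputCP`; the helper names `console_output_code_page` and `describe_code_page` are invented for this example. On non-Windows targets the query simply returns `None`.

```rust
#[cfg(windows)]
extern "system" {
    // Windows console API: returns the active output code page, or 0 on failure.
    fn GetConsoleOutputCP() -> u32;
}

/// Active console output code page, if it can be determined.
#[cfg(windows)]
fn console_output_code_page() -> Option<u32> {
    let cp = unsafe { GetConsoleOutputCP() };
    if cp == 0 { None } else { Some(cp) }
}

/// There is no console code page to query outside Windows.
#[cfg(not(windows))]
fn console_output_code_page() -> Option<u32> {
    None
}

/// Human-readable label for the two code pages discussed in this issue.
fn describe_code_page(cp: u32) -> &'static str {
    match cp {
        932 => "Shift-JIS (Japanese)",
        65001 => "UTF-8",
        _ => "another code page",
    }
}

fn main() {
    match console_output_code_page() {
        Some(cp) => println!("console code page {}: {}", cp, describe_code_page(cp)),
        None => println!("no console code page available"),
    }
}
```

This only reports the setting; switching it is still done with `chcp 932` or `chcp 65001` in the console itself.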

@DoumanAsh
Contributor Author

I suppose so. Anyway, I think we need these requirements documented for Windows, and since it works just fine we can close this issue :)

@tsurai
Owner

tsurai commented Feb 11, 2016

Thanks a lot for your research into this. I wish I could have helped you more, but Windows is a bit stubborn in this case. I added a few notes to the README and linked this issue for transparency's sake.

@tsurai tsurai closed this as completed Feb 11, 2016
@DoumanAsh
Contributor Author

Np.
Btw, it is possible to use the pre-built binaries with the GCC toolchain if you're using msys2.
It seems the msys2 linker is capable of properly linking against Windows dynamic libraries.

Here is a link to my final CMake config for the mecab sources, for reference:
https://drive.google.com/file/d/0B7w3ZGc8CTgqRVo0Snp2ZzBTNkk/view?usp=sharing

64-bit library and binaries:
MSVC: https://drive.google.com/file/d/0B7w3ZGc8CTgqSmtrM2JCd3VXaVk/view?usp=sharing
GCC: https://drive.google.com/file/d/0B7w3ZGc8CTgqUjJweENpa2dvcG8/view?usp=sharing

@tsurai
Owner

tsurai commented Feb 11, 2016

I just tested it with my gcc toolchain and it seems that you don't even need msys2 to use the pre-built binaries (if you are using the 32-bit toolchain). Just copy the pre-built DLL to /lib/rustlib//lib and the compiler will handle the rest by itself.
So basically you only have to compile it from source if you want it to be 64-bit.

@DoumanAsh
Contributor Author

That's even better!
Note that the produced executable will require the DLL to be bundled with it.

@tsurai
Owner

tsurai commented Feb 11, 2016

I'm pretty sure that you will need to bundle the two together for it to work, or at least have the DLL somewhere reachable via your PATH variable.
Can I link directly to your CMake file and binaries? I would like to avoid having binaries inside the repo if possible.

@DoumanAsh
Contributor Author

Yes, of course, feel free to use these links.
