I am getting stuck at 500,000 chars no matter what I set the max length to:
    def _get_content(self, content):
        """Read text from a file to chunk."""
        extractor = Extractor()
        extractor.set_extract_string_max_length(1000)
        result, metadata = extractor.extract_file_to_string(content)
        print(len(result))
        print(metadata)
        return result, metadata
Result:
    500000
    {'X-TIKA:Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Encoding': ['UTF-8'], 'resourceName': ['pg2000.txt'], 'X-TIKA:Parsed-By-Full-Set': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'Content-Type': ['text/plain; charset=UTF-8'], 'Content-Length': ['2184315']}
Any idea what is going on?
I figured it out.
set_extract_string_max_length does not modify the extractor it is called on. It returns a new Extractor with the updated setting, so the return value has to be assigned:

    extractor = extractor.set_extract_string_max_length(1000)
You should change the readme.
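For anyone hitting the same thing, here is a minimal sketch of why discarding the return value leaves the limit unchanged. It does not use the real library: `StandInExtractor` and its `extract` method are hypothetical stand-ins written only to mimic the "setter returns a new object" behaviour described above, and the 500,000 default is simply the value observed in this report.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StandInExtractor:
    """Hypothetical stand-in, NOT the real Extractor class."""

    # Default taken from the length observed in the report above.
    max_length: int = 500_000

    def set_extract_string_max_length(self, max_length: int) -> "StandInExtractor":
        # Return a modified copy; the object the method was called on is untouched.
        return replace(self, max_length=max_length)

    def extract(self, text: str) -> str:
        # Hypothetical helper: truncate the text to the configured limit.
        return text[: self.max_length]


text = "x" * 2_000_000

# Pattern from the original report: the returned object is thrown away.
ignored = StandInExtractor()
ignored.set_extract_string_max_length(1000)
truncated_default = ignored.extract(text)

# Pattern from the fix: keep the returned object.
assigned = StandInExtractor().set_extract_string_max_length(1000)
truncated_custom = assigned.extract(text)

print(len(truncated_default))  # 500000
print(len(truncated_custom))   # 1000
```

With this style of API the calls can also be chained, as in the `assigned` line, which avoids the stale-object mistake entirely.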