Add FTP crawler #1203
* fix: gitignore
* doc: writing documentation
* doc: how to test specific module
* feat: ftp crawler
* doc: ftp settings
...va/fr/pilato/elasticsearch/crawler/fs/test/integration/elasticsearch/FsCrawlerTestFTPIT.java
The PR looks amazing. Thanks a lot for adding this to the project.
Could we try to fix the integration tests?
Also, we should keep everything FTP-related (such as dependencies) in the new ftp module only.
Is it possible to move most of the things there?
...va/fr/pilato/elasticsearch/crawler/fs/test/integration/elasticsearch/FsCrawlerTestFTPIT.java
...va/fr/pilato/elasticsearch/crawler/fs/test/integration/elasticsearch/FsCrawlerTestFTPIT.java
...-ftp/src/test/java/fr/pilato/elasticsearch/crawler/fs/crawler/ftp/FileAbstractorFTPTest.java
I just found that FTP crawling is very slow compared to SSH/SMB. Sometimes it can't parse document content that SSH/SMB mode parses without trouble. Something still needs to be fixed, so this PR is switched to WIP. Edit: It has been fixed by
...wler-ftp/src/main/java/fr/pilato/elasticsearch/crawler/fs/crawler/ftp/FileAbstractorFTP.java
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
I left some comments.
We are getting super close now.
Excited to see this change. Thanks again!
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
core/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsParserAbstract.java
crawler/crawler-ftp/pom.xml
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
Could you move it before the test dependency, before MockFtpServer?
logger.debug("Opening FTP connection to {}@{}", server.getUsername(), server.getHostname());

ftp = new FTPClient();
ftp.addProtocolCommandListener(ftpListener);
I believe this is used for debugging, right?
If so, should we test if we are in the debug level instead of always adding this listener?
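To illustrate the suggestion, here is a minimal, self-contained sketch of guarding the listener registration behind a debug-level check. It is not the code from this PR: `FakeFtpClient`, `CommandListener`, and `listenersAt` are hypothetical stand-ins so the example runs without Commons Net or Log4j on the classpath, and `java.util.logging` (`Level.FINE`) stands in for the project's logger. With Log4j 2, the equivalent check would be `logger.isDebugEnabled()`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugListenerSketch {

    /** Hypothetical stand-in for a protocol command listener. */
    interface CommandListener {
        void onCommand(String command);
    }

    /** Hypothetical stand-in for the FTP client; it only tracks listeners. */
    static class FakeFtpClient {
        private final List<CommandListener> listeners = new ArrayList<>();

        void addProtocolCommandListener(CommandListener listener) {
            listeners.add(listener);
        }

        int listenerCount() {
            return listeners.size();
        }
    }

    /** Creates a client and attaches the listener only if debug output is enabled. */
    static FakeFtpClient open(Logger logger) {
        FakeFtpClient ftp = new FakeFtpClient();
        // FINE is java.util.logging's closest equivalent to a debug level.
        if (logger.isLoggable(Level.FINE)) {
            ftp.addProtocolCommandListener(
                    command -> logger.fine("FTP command: " + command));
        }
        return ftp;
    }

    /** Hypothetical helper: how many listeners get attached at a given level. */
    static int listenersAt(Level level) {
        Logger logger = Logger.getLogger("ftp-sketch-" + level.getName());
        logger.setLevel(level);
        return open(logger).listenerCount();
    }

    public static void main(String[] args) {
        System.out.println("INFO -> " + listenersAt(Level.INFO) + " listener(s)");
        System.out.println("FINE -> " + listenersAt(Level.FINE) + " listener(s)");
    }
}
```

The trade-off of this approach is that the check happens once, when the connection is opened, so changing the log level afterwards would not attach or detach the listener on an existing connection.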
Damn! I missed that comment... Need to see what is happening...
Some logs for you.
/usr/bin/fscrawler: 47: /usr/bin/fscrawler: ps: not found
ERROR StatusLogger Reconfiguration failed: No configuration found for '55054057' at 'null' in 'null'
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.apache.logging.log4j.core.appender.ConsoleAppender.addFilter(org.apache.logging.log4j.core.Filter)" because "console" is null
at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:132)
Thanks a ton! #1224 is coming.
...wler-ftp/src/main/java/fr/pilato/elasticsearch/crawler/fs/crawler/ftp/FileAbstractorFTP.java
Just a tiny last change ;)
...wler-ftp/src/main/java/fr/pilato/elasticsearch/crawler/fs/crawler/ftp/FileAbstractorFTP.java
Great work @helsonxiao! Thank you for adding a new feature to the project ❤️