Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FTP crawler #1203

Merged
merged 16 commits into from
Aug 4, 2021
Merged

Add FTP crawler #1203

merged 16 commits into from
Aug 4, 2021

Conversation

helsonxiao
Copy link
Contributor

No description provided.

@helsonxiao helsonxiao mentioned this pull request Jul 26, 2021
@helsonxiao helsonxiao mentioned this pull request Jul 26, 2021
Copy link
Owner

@dadoonet dadoonet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks amazing. Thanks a lot for adding this to the project.

Could we try to fix the integration tests?
Also we should have all the FTP related things (like dependencies) only in the new ftp module.

Is it possible to move most of the things there?

README.md Outdated Show resolved Hide resolved
docs/source/admin/fs/ftp.rst Show resolved Hide resolved
docs/source/admin/fs/ssh.rst Outdated Show resolved Hide resolved
docs/source/admin/fs/ssh.rst Outdated Show resolved Hide resolved
docs/source/dev/build.rst Outdated Show resolved Hide resolved
docs/source/index.rst Outdated Show resolved Hide resolved
framework/pom.xml Outdated Show resolved Hide resolved
@dadoonet dadoonet changed the title feat: FTP crawler Add FTP crawler Jul 29, 2021
@dadoonet dadoonet added the new For new features or options label Jul 29, 2021
This was linked to issues Jul 29, 2021
@dadoonet dadoonet mentioned this pull request Jul 29, 2021
@helsonxiao helsonxiao requested a review from dadoonet July 31, 2021 16:33
@helsonxiao helsonxiao changed the title Add FTP crawler WIP - Add FTP crawler Jul 31, 2021
@helsonxiao
Copy link
Contributor Author

helsonxiao commented Jul 31, 2021

I just found that FTP crawling is very slow comparing to SSH/SMB. Sometimes it can't parse docs content which can be parsed in SSH/SMB mode efficiently. There is still something need to be fixed so this PR is switched to WIP.

Edit: It has been fixed by 9d8b489.

Copy link
Owner

@dadoonet dadoonet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments.
We are getting super close now.

Excited to see this change. Thanks again!

<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.logging.log4j</groupId>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move it before the test dependency, before MockFtpServer?

logger.debug("Opening FTP connection to {}@{}", server.getUsername(), server.getHostname());

ftp = new FTPClient();
ftp.addProtocolCommandListener(ftpListener);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is used for debugging, right?
If so, should we test if we are in the debug level instead of always adding this listener?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't run docker container with the latest master branch. Could you fix that? I reverted these two commits in our dev branch and it works again. It seems cli is broken by console logger.
f0bc2ad
b147f17

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn! I missed that comment... Need to see what is happening...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some logs for you.

/usr/bin/fscrawler: 47: /usr/bin/fscrawler: ps: not found
ERROR StatusLogger Reconfiguration failed: No configuration found for '55054057' at 'null' in 'null'
Exception in thread "main" java.lang.NullPointerException: Cannot invoke "org.apache.logging.log4j.core.appender.ConsoleAppender.addFilter(org.apache.logging.log4j.core.Filter)" because "console" is null
        at fr.pilato.elasticsearch.crawler.fs.cli.FsCrawlerCli.main(FsCrawlerCli.java:132)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a ton! #1224 is coming.

@dadoonet dadoonet marked this pull request as draft August 2, 2021 09:45
@dadoonet dadoonet changed the title WIP - Add FTP crawler Add FTP crawler Aug 2, 2021
@dadoonet dadoonet added this to the 2.7 milestone Aug 2, 2021
@helsonxiao helsonxiao marked this pull request as ready for review August 2, 2021 16:25
Copy link
Owner

@dadoonet dadoonet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a tiny last change ;)

@dadoonet dadoonet merged commit cc612a4 into dadoonet:master Aug 4, 2021
@dadoonet
Copy link
Owner

dadoonet commented Aug 4, 2021

Great work @helsonxiao! Thank you for adding a new feature to the project ❤️

@helsonxiao helsonxiao deleted the crawler-ftp branch August 4, 2021 14:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new For new features or options
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FTP crawler A forward slash '/' is lost crawling smb share Ability to lookup files over ftp
2 participants