Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support S3-compatible provider as storage backend #317

Merged
merged 8 commits into from
Jun 13, 2022

Conversation

Contextualist
Copy link
Contributor

@Contextualist Contextualist commented Jun 10, 2022

Currently, this is a working implementation, but still work and discussion are needed. This resolves #159 and achieve part of the goals in #180.

To try this out, set a config as the following:

[server]
hostname = "https://example-bucket-name.s3-website-us-west-2.amazonaws.com"

[storage]
  type = "s3"
  [storage.s3]
  endpoint = "https://s3.us-west-2.amazonaws.com"
  region = "us-west-2"
  bucket = "example-bucket-name"

# ...

And set the access key and secret in environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (credential should also works per aws-sdk-go's documentation, but I haven't tried it out).

Optional: use CLI arg --headless to quit after first round of updates.

To be discussed:

  • Should --update-once be a config entry instead? Should we rename it? Renamed to --headless
  • Where should the s3 config entry be located? In the new storage config section
  • Default values for the s3 config entries?
  • Should the fs.Storage interface include http.FileSystem?
  • Dependency aws/aws-sdk-go is quite heavy (compiled binary size is increased from 4MB to 22MB); should we use a lighter implementation for the S3 API (e.g. rhnvrm/simples3)? No change

To do:

  • tests for fs.S3
  • documentation (README, example config)

Any other discussions are welcome. I hope this PR can contribute some ideas on the S3 backend. If you already have something better, feel free to close this in favor of another PR!

@mxpv
Copy link
Owner

mxpv commented Jun 10, 2022

This is awesome!
Could you please run go mod tidy to update go mod files, to make CI happy?
Other than that, LGTM.

@mxpv mxpv mentioned this pull request Jun 10, 2022
6 tasks
@mxpv
Copy link
Owner

mxpv commented Jun 10, 2022

Should --update-once be a config entry instead? Should we rename it?

How about headless mode? This one will be useful when running in AWS Lambda, for instance.

Where should the s3 config entry be located?

I'd define storage_type=local|s3

Dependency aws/aws-sdk-go is quite heavy (compiled binary size is increased from 4MB to 22MB); should we use a lighter implementation for the S3 API (e.g. rhnvrm/simples3)?

I'd stick with official AWS SDK

@Contextualist
Copy link
Contributor Author

I'd define storage_type=local|s3

Do you mean something like this?

storage_type = "local" # default value if omitted

[server]
hostname = "https://example-host.com"
port = 8080
data_dir = "/data/podsync/"

# ...
storage_type = "s3"

[server]
hostname = "https://example-bucket-name.s3-website-us-west-2.amazonaws.com"

[s3]
endpoint = "https://s3.us-west-2.amazonaws.com"
region = "us-west-2"
bucket = "example-bucket-name"

# ...

@mxpv
Copy link
Owner

mxpv commented Jun 11, 2022

May be something like

[storage]
type="local|s3"

  [storage.s3]
  ...
  [storage.local]
  ...

WDYT?

@Contextualist
Copy link
Contributor Author

What would be under storage.local? If we move server.data_dir to storage.local.data_dir, we introduce a backward incompatibility.

@Contextualist
Copy link
Contributor Author

I think storage.local.data_dir makes more sense in the current model anyway. Maybe we could deprecate server.data_dir: emit a warning to the user and encourage the following settings

[storage]
# type="local" by default
  [storage.local]
  data_dir = "/data/podsync/"

and remove server.data_dir in the future. These changes will be transparent to the Docker users.

@mxpv
Copy link
Owner

mxpv commented Jun 12, 2022 via email

This also deprecate `server.data_dir` in favor of `storage.local.data_dir`
@Contextualist
Copy link
Contributor Author

Everything is ready except for documentation updates (README.md, config.toml.example, cloud_formation.yml). Since the storage section is a major change to the config, should we postpone the documentation change until the next release?

@mxpv
Copy link
Owner

mxpv commented Jun 12, 2022

Next release will include these changes, so it’s fine to include documentation updates too.

@Contextualist
Copy link
Contributor Author

Ok, done!

@mxpv
Copy link
Owner

mxpv commented Jun 13, 2022

Thanks!

@mxpv mxpv merged commit 2aa7309 into mxpv:main Jun 13, 2022
@Contextualist Contextualist deleted the feat-s3 branch June 13, 2022 04:37
mxpv added a commit that referenced this pull request Jun 15, 2022
Fix two regressions introduced by PR #317
@mxpv
Copy link
Owner

mxpv commented Oct 11, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: crawler mode without web server
2 participants