Add support for S3 Datastore #159
Force-pushed from 88de41e to c41d199.
This is much, much better. Thank you for working on this. My main comments are associated with URI handling. I will try to help you out by augmenting #167 such that you can simplify your code here.
# format the downloaded bytes into appropriate object directly, or via
# tempfile (when formatter does not support to/from/Bytes)
try:
We can fix this later, but I'm worried that we are setting ourselves up here to read in multiple gigabytes of data, then decide that we can't actually convert from bytes, and then write it all to disk. If we knew that fromBytes was not going to work, we could do the efficient incremental write to disk and then call the formatter to read from disk (which might result in only a subset of the file being read if we wanted a component), with much less memory overhead.
I don't know how big a problem this would actually be, in the sense that the fix seems really easy and that, currently, the only Formatters supporting direct download from bytes are YAML, JSON, and Pickle. I'm not seeing how those would be gigabytes in size.
On the other hand, I agree that there is a lot of room for optimization here, especially when downloading files to disk, because boto3 offers a variety of options for very fast large-file downloads. For example, there's the overarching configuration in TransferConfig, which sets chunk sizes, the number of concurrent download threads, etc. when using boto3's TransferManager.
And of course the same applies to uploads, which would ideally be batched together after a certain number of new files are created (or some such) and then uploaded with multipart_upload, as the speedups seem significant.
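For illustration, a rough sketch of the boto3 knobs mentioned above (the bucket/key/file names and threshold values are placeholders, not project settings):

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Chunk sizes and concurrency for large transfers; multipart kicks in above
# the threshold. Values here are illustrative only.
transfer_config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                                 multipart_chunksize=16 * 1024 * 1024,
                                 max_concurrency=8)

client = boto3.client("s3")
client.download_file("some-bucket", "some/key.fits", "/tmp/key.fits",
                     Config=transfer_config)
client.upload_file("/tmp/key.fits", "some-bucket", "some/key.fits",
                   Config=transfer_config)
```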
This was discussed in the AWS POC meetings, where it was decided to table the issue until the POC is done, and then to experiment with whether uploads should be optimized from within boto3 or via condor uploads/downloads, and at which data sizes one approach might beat the other, if at all.
But a very good point to make, for sure.
EDIT: A point I forgot to make: the solution is trivial because we know the file size in advance. The way s3CheckFileExists works is that it returns True/False for file existence, plus the size if the file does exist. That information is in response['ContentLength'], which you can see a couple of lines above. So we know the file size in advance.
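A minimal sketch of the decision this enables, assuming the `(exists, size)` return shape described above; the import path, `MAX_IN_MEMORY_BYTES`, and the `formatter.fromBytes`/`formatter.read` names are illustrative, not the actual API:

```python
import tempfile

# Import path assumed for illustration only.
from lsst.daf.butler.core.s3utils import s3CheckFileExists

MAX_IN_MEMORY_BYTES = 512 * 1024 * 1024  # placeholder cut-off


def get_object(client, bucket, key, formatter):
    exists, size = s3CheckFileExists(key, bucket=bucket, client=client)
    if exists and size < MAX_IN_MEMORY_BYTES:
        # Small enough: pull the bytes straight into memory and let the
        # formatter convert them directly.
        data = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        return formatter.fromBytes(data)
    # Otherwise download incrementally to disk and read from the file.
    with tempfile.NamedTemporaryFile() as tmp:
        client.download_file(bucket, key, tmp.name)
        return formatter.read(tmp.name)
```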
But we could support FITS files as bytes, so who knows what could happen. It's up to the formatter, which is outside the control of the datastore.
Force-pushed from 4696e1c to a10a396.
Here is an early review of the new code. I've not covered everything, but I'm about to go out for the day on vacation so I want to send something.
Force-pushed from de65225 to b0901f4.
Force-pushed from 0e4a452 to 5886708.
Thanks for doing the requested fixes. My main concern is still the URI mangling. There is also a huge amount of duplication between PosixDatastore and S3Datastore that I will have to think about at some point.
return rndstr + '/'


def setUp(self):
    config = Config(self.configFile)
Shouldn't this be a ButlerConfig? (So that the doubled-up datastore.datastore is removed.)
Not sure how to make this a ButlerConfig, since I need to call makeRepo, the first line of which is

if isinstance(config, (ButlerConfig, ConfigSubset)):
    raise ValueError("makeRepo must be passed a regular Config without defaults applied.")

I guess I could try to recast it as a Config when I make the call?
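A minimal sketch of that recast, assuming the `makeRepo`/`Butler` call signatures from context (not verified against this revision of the code):

```python
def setUp(self):
    config = Config(self.configFile)           # plain Config, no defaults applied
    Butler.makeRepo(self.root, config=config)  # passes the isinstance check above
    self.butler = Butler(self.root)            # ButlerConfig/defaults applied here
```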
Leave it for now. I'll have a think. We shouldn't be doing datastore.datastore because it's confusing.
@@ -894,7 +894,7 @@ def updateParameters(configType, config, full, toUpdate=None, toCopy=None, overw
 for key in toCopy:
     if key in localConfig and not overwrite:
         log.debug("Not overriding key '%s' from defaults in config %s",
-                  key, localConfig.__class__.__name__)
+                  key, value, localConfig.__class__.__name__)
The log message has two placeholders, so you cannot put three parameters in the call here (i.e., I think you were right the first time -- or you need to edit the log message).
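To make the mismatch concrete, the two options would look like this (the second message text is only an illustration of adding a third placeholder):

```python
# Two '%s' placeholders -> exactly two arguments after the format string.
log.debug("Not overriding key '%s' from defaults in config %s",
          key, localConfig.__class__.__name__)

# To log the value as well, the message needs a third placeholder.
log.debug("Not overriding key '%s' (value '%s') from defaults in config %s",
          key, value, localConfig.__class__.__name__)
```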
I removed value here; I can make a different PR if you want the changes made immediately, or just make them yourself if that's easier and faster.
There's a lot to disentangle in the URI discussion, and this is likely not the best venue for it at this point. The important point, though, is that you cannot ever represent a relative file path using a … ButlerURI was written specifically to deal with the case where you want a scheme-less file system path to be usable as well as … I'm not sure I understand the issue with … It might make life simpler for you if ButlerURI had an … Does this help?
Force-pushed from 3dd5dd7 to e7bc780.
(I couldn't figure out how to reply to your comment, so I'm making a new one.) I have reverted the ButlerURI changes as you requested. The thing that confused me is the difference between the original/current and the proposed parsing:

file | localhost | rootDir/absolute/file.ext | /rootDir/absolute/file.ext | /rootDir/absolute/file.ext | file://localhost/rootDir/absolute/file.ext
file | localhost | rootDir/absolute/file.ext | /rootDir/absolute/file.ext | /rootDir/absolute/file.ext | file://localhost/rootDir/absolute/file.ext
file | localhost | rootDir/absolute/file.ext | /rootDir/absolute/file.ext | /rootDir/absolute/file.ext | file://localhost/rootDir/absolute/file.ext
file | localhost | rootDir/absolute/ | /rootDir/absolute/ | /rootDir/absolute/ | file://localhost/rootDir/absolute/
file | localhost | tmp/relative/file.ext | /tmp/relative/file.ext | /tmp/relative/file.ext | file://localhost/tmp/relative/file.ext
file | localhost | tmp/relative/directory/ | /tmp/relative/directory/ | /tmp/relative/directory/ | file://localhost/tmp/relative/directory/
file | relative | file.ext | /file.ext | /file.ext | file://relative/file.ext
file | localhost | absolute/directory/ | /absolute/directory/ | /absolute/directory/ | file://localhost/absolute/directory/
file | localhost | tmp/relative/file.ext | /tmp/relative/file.ext | /tmp/relative/file.ext | file://localhost/tmp/relative/file.ext
 | | relative/file.ext | relative/file.ext | relative/file.ext | relative/file.ext
s3 | bucketname | rootDir/relative/file.ext | /rootDir/relative/file.ext | /rootDir/relative/file.ext | s3://bucketname/rootDir/relative/file.ext
file | localhost | home/dinob/relative/file.ext | /home/dinob/relative/file.ext | /home/dinob/relative/file.ext | file://localhost/home/dinob/relative/file.ext
file | localhost | home/dinob/relative/file.ext | /home/dinob/relative/file.ext | /home/dinob/relative/file.ext | file://localhost/home/dinob/relative/file.ext
file | localhost | tmp/relative/file.ext | /tmp/relative/file.ext | /tmp/relative/file.ext | file://localhost/tmp/relative/file.ext
 | | relative/file.ext | relative/file.ext | relative/file.ext | relative/file.ext

which does exactly what you describe, and the 3 cases where a relative path was provided were the three that were not parsed as URIs, or were parsed correctly (as they would have been intended by an imaginary user; correct in the sense of the rules). The rest were promoted to fully qualified URIs. In any case, I changed it back as requested and am dropping further discussion. I kept the … The other changes were adding support for …
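For reference, the scheme/netloc/path splits in the table can be reproduced with the standard library alone (ButlerURI appears to build on this kind of parsing; the example URIs are taken from the rows above):

```python
from urllib.parse import urlparse

# A scheme-less relative path keeps an empty scheme and netloc, while file://
# and s3:// URIs carry a netloc and a rooted path.
for uri in ("relative/file.ext",
            "file://localhost/rootDir/absolute/file.ext",
            "s3://bucketname/rootDir/relative/file.ext"):
    parts = urlparse(uri)
    print(f"{parts.scheme!r:8} {parts.netloc!r:14} {parts.path!r}")
```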
Force-pushed from e7bc780 to 01d7289.
I've had a quick look at the URI stuff and it's looking better now. I've made some more comments, since I'm not entirely sure how you are handling scheme-less relative paths still. You should consider adding some tests for relativeToPathRoot.
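A sketch of the kind of test being suggested; the import location and the exact semantics of `relativeToPathRoot` are assumptions here, not verified against this revision:

```python
import unittest

from lsst.daf.butler import ButlerURI  # import path assumed


class RelativeToPathRootTestCase(unittest.TestCase):
    def testS3(self):
        uri = ButlerURI("s3://bucketname/rootDir/relative/file.ext")
        # Assumed behaviour: path relative to the root, without a leading "/".
        self.assertEqual(uri.relativeToPathRoot, "rootDir/relative/file.ext")

    def testSchemeless(self):
        uri = ButlerURI("relative/file.ext")
        self.assertEqual(uri.relativeToPathRoot, "relative/file.ext")


if __name__ == "__main__":
    unittest.main()
```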
Force-pushed from 37084e0 to 964e55d.
If you deal with the scheme-less vs file vs other comments I just made, I'm happy to approve this change. Any further issues I can clean up later. I wonder if we should rename the branch to tickets/DM-13361 (which I think I can assign to you, and which would better reflect your work on this). This would require a new PR, of course.
One minor complication is that #176 just merged and that changed the Formatter API a little. This will cause you some merge conflicts with your read/write bytes API. Sorry about that. Let me know if you need help disentangling things.
Force-pushed from 964e55d to 743401f.
The Formatters were not a big deal. I saw the PR and I was ready for the changes. It actually helped me notice how out of date the … was. When I separated Draft PR #147, the advice I got from @r-owen was that "PRs are cheap". If you want, I can close this PR.
EDIT: Also, if you look back through all the review comments, there are some I have left open; even though they state "outdated", there are still comments/answers/pings that could be made under them to the appropriate people, whom I assume you would know.
Force-pushed from 743401f to 75156aa.
@DinoBektesevic - I think it's ready for you to push this directly to the daf_butler repo on a ticket branch. That would allow us to run tests on jenkins to make sure nothing is broken. Yes, that means a new PR. I've added you as a collaborator so you can push directly.
Hardcoded Butler.get from S3 storage for simplest of Datasets with no Dimensions.

Changes
-------
Added path parsing utilities to the utilities module that figure out whether a path is an S3 URI or a filesystem path.

ButlerConfig checks at instantiation (when it is not being instantiated from another ButlerConfig instance) whether the given path is an S3 URI. If it is, the butler YAML configuration file is downloaded and a ButlerConfig is instantiated from that. The download location is hard-coded and won't work on a different machine.

Edited the fromConfig instantiation method of Registry. If a registry is instantiated from a general Config class, it checks whether the path to the db is an S3 URI or a filesystem path. If it is an S3 URI, it will download the database locally to a hard-coded path that will only work on my machine. The remainder of instantiation is then executed as normal from the downloaded file. The RegistryConfig is updated to reflect the local downloaded path.

S3Datastore has a different config than the default PosixDatastore one. It manually appends s3:// to the default datastore templates. I do not know up front whether this is really required or not.

Initialization of the S3Datastore is different compared to PosixDatastore. Attributes service, bucket and rootpath are added, identifying the 's3://' element, the bucket name and the path to the root of the repository, respectively. Attributes session and client provide access, through the boto3 API, to upload and download functionality.

The file-exists and file-size checks were changed to query the S3 API. Currently this is performed by making a listing request to the S3 Bucket. This is allegedly the fastest approach, but for searching for the name of a file among many; it is also 12.5 times as expensive as directly making a request for the object and catching the exception. I do not know if there are data that can have multiple different paths returned. I suspect not, so this might not be worth it.

There is some functionality that will create a bucket if one doesn't exist in the S3Datastore init method, but this is just showcase code - it doesn't work or make sense, because Registry has to be instantiated before Datastore and we need that YAML config file to exist - so the bucket must exist too.

The get method was not particularly drastically changed, because the mechanism was pushed to the formatters. The LocationFactory used in the Datastores was changed to return an S3Location if it was instantiated from an S3 URI. The S3Location keeps the repository root, the bucket name and the full original bucket URI stored as attributes, so it is possible to figure out the bucket path from which to download the file in the Formatters.

The S3FitsExposureFormatter readFull method uses the S3Location to download the exposure to a baked-in path that will, again, only work for me. The mapped DatasetType is then used to instantiate from that file and return the appropriate in-memory (hopefully) object. All downloaded Datasets are given the same name to avoid clutter, but this should be done differently anyhow, so the hacky solution will do for now.
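The two existence-check strategies compared above would look roughly like this (a sketch only, with placeholder names, not the project's s3CheckFileExists code):

```python
import boto3
from botocore.exceptions import ClientError

client = boto3.client("s3")


def exists_via_listing(bucket, key):
    # Listing request filtered by prefix: fast when scanning many keys, but
    # priced as a LIST request (the ~12.5x figure quoted above).
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


def exists_via_head(bucket, key):
    # Direct request for the object, catching the exception when it is absent.
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False
```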
Supports a local Registry and an S3-backed Datastore, or an RDS database service Registry and an S3-backed Datastore. Tests are not yet mocked!

CHANGES
-------
S3Datastore:
- The call to ingest at the end of butler.put now takes a location.uri.
- In ingest, fixed several wrong if statements that would lead to errors being raised in correct scenarios.
- Liberally added comments to help me remember.
- TODO: Fix the checksum generation; fix the patch for parsePath2Uri.

schema.yaml:
- Increased the maximum allowed length of the instrument name. The default instrument name length was too short (8), which would raise errors on Instrument insert in PostgreSQL. Apparently, Oracle was only tested on 'HSC' and SQLite doesn't care, so the short name went unnoticed.
- Increased the length attribute on all instances of "name: instrument" entries.
- Is there a better way to define global length values than manually changing a lot of entries?

Butler:
- Refactored the makeRepo code into two functions. Previously this was one classmethod with a lot of ifs. It should now be easier to see what steps need to happen to create a new repo in both cases. This is a temporary fix, as indicated by the comments in the code, until someone tells me how I should solve the issue: 2 Butlers, 1 dynamically constructed Butler, multiple classmethods, etc.
- Removed any and all hard-coded download locations.
- Added a check that the boto3 import was successful, but I don't know if this is the correct style to do that in. I left a comment.
- Liberal application of comments.
- TODO: BUTLER_ROOT_TAG is still being ignored in the case of an S3 Datastore and an in-memory Registry.

ButlerConfig:
- Removed the hard-coded local paths where I used to download YAML config files. They are now downloaded as a bytestring and then loaded with yaml. No need for tmpfiles.
- There is a case to be made for splitting ButlerConfig into different classes, since there could be a proliferation of if-else statements in its __init__.

Registry:
- Changed the instantiation logic: previously I would create a local SqlRegistry and then upload it to the S3 Bucket. This was deemed a useless idea, so it was scratched. The current paradigm is to have an S3-backed Datastore and an in-memory SQLite Registry, or an S3-backed Datastore and an RDS-backed Registry. Since SQLAlchemy is capable of always creating a new in-memory DB (no name clashes etc.), and considering that the RDS service must already have been created and exist, we have no need to check whether the Registry exists or to create persistent local SqlRegistries and upload them.

Utils:
- Removed some unnecessary comments.
- Tried fixing a mistake in parsePath2Uri. It seems that 's3://bucket/root' and 's3://bucket/root/' are parsed differently by the function. This is a mistake. I am sure it's fixable here, but I confused myself, so I patched it in S3Datastore. The problem is that it is impossible to discern whether URIs like 's3://bucket/root' point to a directory or a file.
- TODO: Revisit the URI parsing issue.

views.py:
- Added a PostgreSQL compiler extension to create views. In PostgreSQL, views are created via 'CREATE OR REPLACE VIEW'.

FileFormatter:
- Code refactoring: the _assembleDataset method now contains most of the duplicated code that existed in read and readBytes.

YamlFormatter:
- I think it was only the order of methods that changed, plus some code beautification.

oracleRegistry.py:
- In experimentation I kept the code here, since PostgreSQL seems to share a lot of similarities with Oracle. That code has now been excised and moved to postgresqlRegistry.py.

PostgreSqlRegistry:
- Wrote a new class to separate the previously mixed OracleRegistry and PostgreSqlRegistry.
- Wrote additional code that reads a `~/.rds/credentials` file, where it expects to find a nickname under which the RDS username and password credentials are stored. The credentials are read at SQLAlchemy engine creation time, so they should not be visible anywhere in the code. The connection string can now be given as `dialect+driver://NICKNAME@host:port/database`, as long as the (case-sensitive) NICKNAME exists in `~/.rds/credentials` and is given in the following format:

    [NICKNAME]
    username = username
    password = password

- The class eventually ended up being an exact copy of the Oracle one again, because the username and password must be read as close to engine creation as possible, so there we go and here we are.

butler-s3store.yaml:
- Now includes an sqlRegistry.yaml configuration file. This configures the test_butler.py S3DatastoreButlerTestCase to use an S3-backed Datastore and an in-memory SQLite Registry. Practical for S3Datastore testing.

sqlRegistry.yaml:
- Targets the registry to an in-memory SQLite registry.

butler-s3rds.yaml:
- Now includes an rdsRegistry.yaml configuration file. This configures the test_butler.py S3RdsButlerTestCase to use an S3-backed Datastore and an (existing) RDS-backed Registry.

rdsRegistry.yaml:
- Targets the registry to an (existing) RDS database. The connection string used by default is 'postgresql://[email protected]:5432/gen3registry'. This means that an RDS identifier with the name gen3registry must exist (the first 'gen3registry'), and in it a **Schema that is on the search_path must exist and that Schema must be owned by the username under the DEFAULT nickname**. This is very important.

test_butler.py:
- Home to 2 new tests: S3DatastoreButlerTestCase and S3RdsButlerTestCase. Both tests passed at the time of this commit.
- S3DatastoreButlerTestCase will use butler-s3store.yaml to connect to a Bucket, to which it authorizes by looking at `~/.aws/credentials` to find the aws_access_key_id and aws_secret_access_key. The name of the bucket to which it connects is set by the S3DatastoreButlerTestCase bucketName class variable. The permRoot class variable sets the root directory in that Bucket, only when useTempRoot is False. The Registry is an in-memory SQLite registry. This is very practical for testing the S3Datastore. This test seems mockable; I just haven't succeeded at it yet.
- S3RdsButlerTestCase will use the butler-s3rds.yaml file to connect to a Bucket, to which it authorizes by looking at `~/.aws/credentials`, expecting to find the aws `access_key` and `secret_access_key`. The name of the bucket is set by the S3RdsButlerTestCase bucketName class variable. The permRoot class variable is only used when useTempRoot is False. The Registry is an RDS service identified by a "generalized" connection string given in the rdsRegistry.yaml configuration file in the test directory. The DEFAULT represents a nickname defined in the `~/.rds/credentials` file, under which the username and password of a user with sufficient privileges (enough to create and drop databases) are expected. The tests are conducted by creating many DBs, each of which is assigned to a particular Registry instance tied to a particular test. This test seems absolutely impossible to mock.

test_butlerfits.py:
- New S3DatastoreButlerTestCase added. The test is an S3-backed Datastore using a local in-memory Registry. Obviously this extends trivially to the S3RdsButlerTestCase, since only the S3Datastore is really being tested here - but no such test case exists, because I suspect the way setUp and tearDown are performed in that case will cause a lot of consternation, so I'll defer that until a later time when I know what people want/expect.
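A hypothetical helper mirroring the credential-lookup behaviour described above (the function name, arguments and file layout are assumptions, not the PostgreSqlRegistry code):

```python
import configparser
import os


def rds_connection_string(nickname, host, port, database,
                          dialect_driver="postgresql"):
    # Resolve dialect+driver://NICKNAME@host:port/database into a real
    # connection string by reading the username/password stored under the
    # case-sensitive [NICKNAME] section of ~/.rds/credentials, as late as
    # possible (i.e. just before engine creation).
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser("~/.rds/credentials"))
    section = parser[nickname]
    return (f"{dialect_driver}://{section['username']}:{section['password']}"
            f"@{host}:{port}/{database}")
```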
General:
- Updated to incorporate recent master branch changes.
- Removed NO_BOTO and replaced imports with a nicer boto3=None style.

YAML config files:
- Fixed S3-related default config files to lowercase tablenames.
- Refactored the formatters section from the Posix and S3 default yaml files into a formatters.yaml.

Butler:
- Renamed parsePath2Uri --> parsePathToUriElements.
- Removed example makeRepo functions:
  - Uploading is now handled from within the Config class.
  - makeButler is once again a method of the Butler class.
- Removed del and close methods.

ButlerConfig:
- Removed the code that downloaded the butler.yaml file in init.

Config:
- Added a dumpToS3 method that uploads a new config file to a Bucket.
- Added an initFromS3File method.
- Modified the initFromFile method to check whether a file is a local one or an S3 one.

Location:
- Removed unnecessary comments.
- Fixed awkward newline mistakes.

S3Location:
- Removed unnecessary comments; corrected all references from Location to S3Location in the docstrings.

utils.py:
- Moved all S3-related utility functionality into s3utils.py.
- Added more docstrings, removed stale comments and elaborated on unclear ones.

parsePathToUriElements:
- Refactored the if-nastiness into something more legible and correct.

test_butler:
- Moved testPutTemplates into the generic Butler test class as a not-tested-for method.
- Added tested-for versions of that method to both PosixDatastoreButlerTestCase and S3DatastoreButlerTestCase.
- Added more generic checkFileExists functionality that discerns between S3 and POSIX files.
- Removed a lot of stale comments.
- Improved the way S3DatastoreButlerTestCase does tear-down.
- Added mocked no-op functionality and test skipping for the case when boto3 does not exist.
- Refactored Formatters: read/write/File is now implemented over read/write/Bytes.
- Removed all things RDS-related from the commit.
- Refactored test_butler and test_butlerFits.
- Added JSON from- and to-bytes methods.
- Fixed all the flake8 errors.
- Wrote tests for s3utils.
- Rewrote parsePathToUriElements to make more sense. Added examples, better documentation, and split-on-root capability.
- Refactored how the read/write-to-file methods work for formatters. No more serialization code duplication.
- A fix in the S3Datastore ingest functionality, together with the path-URI changes, made it possible to kick out the duplicate path-removing code from S3LocationFactory.
All Formatters now have both read/write and to/fromBytes methods. Those specific formatter implementations that also implement the _fromBytes/_toBytes methods will default to directly downloading the bytes to memory; the rest will be downloaded/uploaded to/from a temporary file. The checks for whether a file exists in S3Datastore were changed, since we now definitely incur a GET charge for a Key's header every time, so there is no need to duplicate the checks with s3CheckFileExists calls.
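A sketch of that dispatch under stated assumptions (the `_fromBytes`/`fromBytes`/`read` names are illustrative, not the exact Formatter API):

```python
import tempfile


def fetch_dataset(client, bucket, key, formatter):
    # Formatters that implement the bytes path receive the object directly in
    # memory; the rest get it downloaded to a temporary file and read from disk.
    if hasattr(formatter, "_fromBytes"):
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        return formatter.fromBytes(body)
    with tempfile.NamedTemporaryFile() as tmp:
        client.download_file(bucket, key, tmp.name)
        return formatter.read(tmp.name)
```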
- Rebased to include the newest changes to daf_butler (namely, ButlerURI).
- Removed parsePathToUriElements, S3Location and S3LocationFactory, and replaced all path handling with ButlerURI and its test cases.
- Added transaction checks back into S3Datastore. Unsure about proper usage.
- Added more path manipulation methods/properties to LocationFactory and Location:
  * bucketName returns the name of the bucket in the Location/LocationFactory.
  * Location has a pathInBucket property that converts POSIX-style paths to S3-protocol-style paths. The main difference concerns leading and trailing separators: S3 interprets `/path/`, `/path`, `path/` and `path` keys differently, even though some of them are equivalent on a POSIX-compliant system. So what would be `/path/to/file.ext` on a POSIX system reads `path/to/file.ext` on S3, and the bucket is referenced separately with boto3.
- For saving Config as a file, moved all the URI handling logic into Config and out of Butler makeRepo. The only logic left there is root directory creation.
  * The call to save the config as a file at the butler root directory is now done through dumpToUri, which then resolves the appropriate backend method to call.
- Improved(?) on the proposed scheme for checking whether we are dealing with a file or a directory in the absence of a trailing path separator.
  * Noted some differences between the generality of code requested in the review for writing to files and the inits of the config classes. For inits it seems there is a 2+ year old commit limiting Config files to `yaml` type files only, whereas the review implied that `dumpToFile` on the Config class should be file-format independent. So, for the `dumpTo*` methods, to check whether we have a dir or a file I only inspect whether the path ends in a 'something.something' style. Since I can count on files being `yaml` type and having `yaml` extensions in inits, I use simplified logic there to determine whether we have a dir or a file. It is possible to generalize inits to other filetypes, as long as they have an extension added to them.
  * I assume we do not want to force users to be mindful of trailing separators.
- Now raising errors on unrecognized schemes in all `if scheme ==` patterns.
- Closed the StreamIO in Config.dumpToFile.
- Fixed up the Formatters to return bytes instead of strings. The fromBytes methods now expect bytes as well. JSON and YAML were the main culprits. Fixed the docs for them. At this point I am confident I just overwrote the fixes when rewinding changes on rebase by accident, because I have done that before, twice.
- Added a different way to check if files exist, cheaper but possibly slower. From boto/botocore#1248 it is my understanding that this should not be an issue anymore, but the newer boto3 versions are slow to hit package managers.
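The key-vs-path conversion described above amounts to something like this hypothetical helper (not the actual Location.pathInBucket implementation):

```python
def path_in_bucket(posix_path):
    # S3 keys have no leading "/": the POSIX path "/path/to/file.ext"
    # corresponds to the key "path/to/file.ext", with the bucket itself
    # referenced separately through boto3.
    return posix_path.lstrip("/")
```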
- `__initFromS3File` is now renamed `__initFromS3YamlFile`.
- `__initFromS3YamlFile` now uses the `__initFromYaml` method to make the format dependency explicitly clear.
- Changes to method docs.
- `dumpToS3` renamed to `dumpToS3File` to better match naming to existing functionality.
- In `dumpToS3File` and `ButlerConfig` it is assumed that files must have extensions to be files, in the cases where it's not possible to resolve whether a string points to a dir or a file.
Changes
-------
- Added an `ospath` property to ButlerURI that localizes the posix-like uri.path.
- Updated docstrings in the ButlerURI class to correctly describe what the properties/methods do.
- File vs Dir is now resolved by checking whether "." is present in the path, for both Config and ButlerConfig. If not a Dir, the path is no longer forced to be the top-level directory before updating the file name.
- Fixed various badly merged Formatter docstrings and file IO.
- Removed the file extension update call in the 'to/from'Bytes methods in Formatters.
- Restored the `super().setConfigRoot` call in SqliteRegistry.
- Removed an extra boto3 presence check from test_butler.py.
- Fixed badly formatted docstrings in s3utils.
- Renamed the `cheap` kwarg to `slow` in `s3CheckFileExists`.
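The file-vs-dir heuristic mentioned above could look roughly like this (a hypothetical helper, not the project code):

```python
import posixpath


def looks_like_file(path):
    # In the absence of a trailing separator, treat the path as a file only if
    # its last component contains a "." (i.e. has an extension); otherwise
    # assume it names a directory.
    if path.endswith("/"):
        return False
    return "." in posixpath.basename(path)
```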
Changes
-------
- Added relativeToNetloc to ButlerURI, since we already force the netloc to always exist. This removes some uri.path.lstrip calls.
- Fixed s3utils docstrings.
- Added a Raises section to the fileFormatter docstrings.
Scheme-less URIs are handled more smoothly in ButlerURI. Removed s3CheckFileExistsGET and s3CheckFileExistsLIST in favour of a single s3CheckFileExists function, as botocore is now up to date on conda and on pip, so GET/HEAD requests make no performance difference anymore. Live testing shows that an HTTP 403 response can be returned for HEAD requests when the file does not exist and the user does not have s3:ListBucket permissions; added a more informative error message for that case. Changed the s3utils function signatures to look more alike. Changed an instance of os.path to a posixpath join in test_butler, to be more consistent with what is expected from the URIs used. Corrected Config.__initFromS3Yaml to use a URL instead of a path and Config.__initFromYaml to use ospath.
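A sketch of the behaviour described above (illustrative only; the real s3CheckFileExists signature and return values may differ):

```python
from botocore.exceptions import ClientError


def s3_file_exists(client, bucket, key):
    # HEAD the key and return (exists, size); surface a clearer message when a
    # 403 hides a missing s3:ListBucket permission.
    try:
        response = client.head_object(Bucket=bucket, Key=key)
        return True, response["ContentLength"]
    except ClientError as err:
        status = err.response["ResponseMetadata"]["HTTPStatusCode"]
        if status == 403:
            raise PermissionError(
                "Forbidden HEAD request; this can mean the key does not exist "
                "and the user lacks s3:ListBucket permission on the bucket."
            ) from err
        if status == 404:
            return False, -1
        raise
```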
Updated formatters to use fileDescriptors as attributes. Updated S3Datastore to match the new formatters. Fixed mistakes in the S3Datastore ingest functionality; removed hardcoded path manipulations. Fixes to the S3Datastore put functionality; missing constraints checks added in. Prettification and correction of comments in the get functionality; additional checks added. Fixes to the getUri functionality - somehow it seems it never got updated when the ButlerURI class was written. Implemented review requests in ButlerURI.
Force-pushed from 75156aa to 0be5af4.
As was suggested to me by @r-owen and @timj, the draft PR for S3Datastore and PostgreSqlRegistry together was too large and clunky to be reasonable. Draft PR #147 is now closed and substituted by two PRs, separating S3Datastore and PostgreSqlRegistry.

Changes

Based on review comments from the draft PR, the following was changed compared to the draft PR:
- Added toBytes and fromBytes functionality to JsonFormatter.
- The toFile and fromFile methods now internally use to/fromBytes on all formatters that support it.
- S3Datastore now downloads all files as bytes and attempts to read them. If that fails, it stores them in a temporary file and then attempts to read them from there.
- The Config class was made aware of S3, and the work-around code in butlerConfig and Butler.makeRepo was removed.
- Butler.makeRepo was re-factored back to (almost) the way it was and is now pretty again.
- Butler.computeChecksum, as I never made it to work.
- Moved the S3-related utilities from daf.butler.core.utils to daf.butler.core.s3utils.
- Added tests for s3utils.
- Removed S3Location by fixing parsePathToUriElements.
- Added boto3 import protection, plus mocks and skips for tests when moto or boto3 is missing.

Unresolved/Other
- The <butlerRoot> tag is still not supported for the current S3Datastore. From what I can see, the main effect of the butler relocation code is that the sqlite:///:memory: registry location is replaced by 'sqlite:///<butlerRoot>/gen3.sqlite3', which just does not make practical sense when working with S3Datastore: we cannot talk to an sqlite DB in a Bucket, while, for S3Datastore tests at least, it does make sense to have an ephemeral Registry. This makes sense for cases where the db string in the registry YAML config can be specified directly and won't be changed, so I'm deferring this issue for later.
- S3Datastore.ingest was changed as requested. There are several things slightly unclear to me here, so I haven't quite polished up the large block of if-else code containing a lot of path and string manipulations that figures out what type of ingest needs to happen and how. It is on my todo list.
- boto3.client.delete_object(bucket_name, object_name) is just bad. Deleting objects from a Bucket deletes the current version (S3 versions a file when an upload overwrites it), so technically a delete can just revert to an older file version when versioned Buckets are used. That is not as bad, since almost no files are allowed to be rewritten. The problem is the following, still unresolved, issue: "Boto3 delete_objects response does not match documented response" boto/boto3#507. In testing, the only failure I can get from delete_object is the one where I give it a non-existing Bucket; in all other cases I just get an HTTP 200 OK response, even when the files don't exist.
- The consequence for S3Datastore.remove is that I need to find the location of the file before I can remove it, so while at it, why not check for errors. This is an exact duplicate of PosixDatastore except for the remove call.

The rest of the comments were either fixed, or enough has changed, I believe, that they are no longer relevant. Thanks to all the input I got, I think this is a much better iteration of S3Datastore than what was in the original PR - so, many thanks to everyone who gave input.
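For illustration, the "locate first, then delete" approach described above could look like this (a hypothetical helper, not the S3Datastore.remove implementation):

```python
from botocore.exceptions import ClientError


def remove_object(client, bucket, key):
    # delete_object returns HTTP 200 even for keys that do not exist, so check
    # that the object is actually there and raise if it is missing, then delete.
    try:
        client.head_object(Bucket=bucket, Key=key)
    except ClientError as err:
        raise FileNotFoundError(
            f"No such key '{key}' in bucket '{bucket}'") from err
    client.delete_object(Bucket=bucket, Key=key)
```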