Add support for S3 Datastore #159
Force-pushed from 88de41e to c41d199.
This is much, much better. Thank you for working on this. My main comments are associated with URI handling. I will try to help you out by augmenting #167 such that you can simplify your code here.
# format the downloaded bytes into appropriate object directly, or via
# tempfile (when formatter does not support to/from/Bytes)
try:
We can fix this later, but I'm worried that we are setting ourselves up here to read in multiple gigabytes of data, then decide that we can't actually convert from bytes, and then write it all to disk. If we knew that fromBytes was not going to work, we could do the efficient incremental write to disk and then call the formatter to read from disk (which might result in only a subset of the file being read if we wanted a component), with much less memory overhead.
I don't know how big a problem this would actually be, in the sense that the fix seems really easy and that, currently, the only Formatters supporting direct download from bytes are YAML, JSON, and Pickle. I'm not seeing how those would be gigabytes in size.
On the other hand, I agree that there is a lot of room for optimization here, especially when downloading files to disk, because boto3 offers a variety of options for very fast large-file downloads. For example, there's the overarching configuration in TransferConfig, which sets chunk sizes, the number of concurrent download threads, etc. when using boto3's TransferManager.
And of course the same applies to uploads, which would ideally be batched together after a certain number of new files are created (or some such) and then uploaded with multipart_upload, as the speedups seem significant.
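For illustration, a rough sketch of the boto3 knobs mentioned above (the bucket/key/file names and threshold values are placeholders, not project settings):

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Chunk sizes and concurrency for large transfers; multipart kicks in above
# the threshold. Values here are illustrative only.
transfer_config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                                 multipart_chunksize=16 * 1024 * 1024,
                                 max_concurrency=8)

client = boto3.client("s3")
client.download_file("some-bucket", "some/key.fits", "/tmp/key.fits",
                     Config=transfer_config)
client.upload_file("/tmp/key.fits", "some-bucket", "some/key.fits",
                   Config=transfer_config)
```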
This was discussed in the AWS POC meetings, where it was decided to table the issue until the POC is done, and then to experiment with whether uploads should be optimized from within boto3 or via condor uploads/downloads, and at which data sizes one approach might beat the other, if at all.
But a very good point to make, for sure.
EDIT: A point I forgot to make: the solution is trivial because we know the file size in advance. The way s3CheckFileExists works is that it returns True/False for file existence, plus the size if the file does exist. That information is in response['ContentLength'], which you can see a couple of lines above. So we know the file size in advance.
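A minimal sketch of the decision this enables, assuming the `(exists, size)` return shape described above; the import path, `MAX_IN_MEMORY_BYTES`, and the `formatter.fromBytes`/`formatter.read` names are illustrative, not the actual API:

```python
import tempfile

# Import path assumed for illustration only.
from lsst.daf.butler.core.s3utils import s3CheckFileExists

MAX_IN_MEMORY_BYTES = 512 * 1024 * 1024  # placeholder cut-off


def get_object(client, bucket, key, formatter):
    exists, size = s3CheckFileExists(key, bucket=bucket, client=client)
    if exists and size < MAX_IN_MEMORY_BYTES:
        # Small enough: pull the bytes straight into memory and let the
        # formatter convert them directly.
        data = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        return formatter.fromBytes(data)
    # Otherwise download incrementally to disk and read from the file.
    with tempfile.NamedTemporaryFile() as tmp:
        client.download_file(bucket, key, tmp.name)
        return formatter.read(tmp.name)
```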
But we could support FITS files as bytes, so who knows what could happen. It's up to the formatter, which is outside the control of the datastore.
Force-pushed from 4696e1c to a10a396.
Here is an early review of the new code. I've not covered everything, but I'm about to go out for the day on vacation so I want to send something.
Force-pushed from de65225 to b0901f4.
Force-pushed from 0e4a452 to 5886708.
Thanks for doing the requested fixes. My main concern is still the URI mangling. There is also a huge amount of duplication between PosixDatastore and S3Datastore that I will have to think about at some point.
return rndstr + '/'


def setUp(self):
    config = Config(self.configFile)
Shouldn't this be a ButlerConfig? (So that the doubled-up datastore.datastore is removed.)
Not sure how to make this a ButlerConfig, since I need to call makeRepo, the first line of which is

if isinstance(config, (ButlerConfig, ConfigSubset)):
    raise ValueError("makeRepo must be passed a regular Config without defaults applied.")

I guess I could try to recast it as a Config when I make the call?
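A minimal sketch of that recast, assuming the `makeRepo`/`Butler` call signatures from context (not verified against this revision of the code):

```python
def setUp(self):
    config = Config(self.configFile)           # plain Config, no defaults applied
    Butler.makeRepo(self.root, config=config)  # passes the isinstance check above
    self.butler = Butler(self.root)            # ButlerConfig/defaults applied here
```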
Leave it for now. I'll have a think. We shouldn't be doing datastore.datastore because it's confusing.
@@ -894,7 +894,7 @@ def updateParameters(configType, config, full, toUpdate=None, toCopy=None, overw
 for key in toCopy:
     if key in localConfig and not overwrite:
         log.debug("Not overriding key '%s' from defaults in config %s",
-                  key, localConfig.__class__.__name__)
+                  key, value, localConfig.__class__.__name__)
The log message has two placeholders, so you cannot put three parameters in the call here (i.e., I think you were right the first time -- or you need to edit the log message).
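To make the mismatch concrete, the two options would look like this (the second message text is only an illustration of adding a third placeholder):

```python
# Two '%s' placeholders -> exactly two arguments after the format string.
log.debug("Not overriding key '%s' from defaults in config %s",
          key, localConfig.__class__.__name__)

# To log the value as well, the message needs a third placeholder.
log.debug("Not overriding key '%s' (value '%s') from defaults in config %s",
          key, value, localConfig.__class__.__name__)
```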
I removed value here; I can make a different PR if you want the changes made immediately, or just make them yourself if that's easier and faster.
There's a lot to disentangle in the URI discussion, and this is likely not the best venue for it at this point. The important point, though, is that you cannot ever represent a relative file path using a … ButlerURI was written specifically to deal with the case where you want a scheme-less file system path to be usable as well as … I'm not sure I understand the issue with … It might make life simpler for you if ButlerURI had an … Does this help?
Force-pushed from 3dd5dd7 to e7bc780.
(I couldn't figure out how to reply to your comment, so I'm making a new one.) I have reverted the ButlerURI changes as you requested. The thing that confused me is the difference between the original/current and the proposed parsing:

file | localhost | rootDir/absolute/file.ext | /rootDir/absolute/file.ext | /rootDir/absolute/file.ext | file://localhost/rootDir/absolute/file.ext
file | localhost | rootDir/absolute/file.ext | /rootDir/absolute/file.ext | /rootDir/absolute/file.ext | file://localhost/rootDir/absolute/file.ext
file | localhost | rootDir/absolute/file.ext | /rootDir/absolute/file.ext | /rootDir/absolute/file.ext | file://localhost/rootDir/absolute/file.ext
file | localhost | rootDir/absolute/ | /rootDir/absolute/ | /rootDir/absolute/ | file://localhost/rootDir/absolute/
file | localhost | tmp/relative/file.ext | /tmp/relative/file.ext | /tmp/relative/file.ext | file://localhost/tmp/relative/file.ext
file | localhost | tmp/relative/directory/ | /tmp/relative/directory/ | /tmp/relative/directory/ | file://localhost/tmp/relative/directory/
file | relative | file.ext | /file.ext | /file.ext | file://relative/file.ext
file | localhost | absolute/directory/ | /absolute/directory/ | /absolute/directory/ | file://localhost/absolute/directory/
file | localhost | tmp/relative/file.ext | /tmp/relative/file.ext | /tmp/relative/file.ext | file://localhost/tmp/relative/file.ext
 | | relative/file.ext | relative/file.ext | relative/file.ext | relative/file.ext
s3 | bucketname | rootDir/relative/file.ext | /rootDir/relative/file.ext | /rootDir/relative/file.ext | s3://bucketname/rootDir/relative/file.ext
file | localhost | home/dinob/relative/file.ext | /home/dinob/relative/file.ext | /home/dinob/relative/file.ext | file://localhost/home/dinob/relative/file.ext
file | localhost | home/dinob/relative/file.ext | /home/dinob/relative/file.ext | /home/dinob/relative/file.ext | file://localhost/home/dinob/relative/file.ext
file | localhost | tmp/relative/file.ext | /tmp/relative/file.ext | /tmp/relative/file.ext | file://localhost/tmp/relative/file.ext
 | | relative/file.ext | relative/file.ext | relative/file.ext | relative/file.ext

which does exactly what you describe, and the 3 cases where a relative path was provided were the three that were not parsed as URIs, or were parsed correctly (as they would have been intended by an imaginary user; correct in the sense of the rules). The rest were promoted to fully qualified URIs. In any case, I changed it back as requested and am dropping further discussion. I kept the … The other changes were adding support for …
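For reference, the scheme/netloc/path splits in the table can be reproduced with the standard library alone (ButlerURI appears to build on this kind of parsing; the example URIs are taken from the rows above):

```python
from urllib.parse import urlparse

# A scheme-less relative path keeps an empty scheme and netloc, while file://
# and s3:// URIs carry a netloc and a rooted path.
for uri in ("relative/file.ext",
            "file://localhost/rootDir/absolute/file.ext",
            "s3://bucketname/rootDir/relative/file.ext"):
    parts = urlparse(uri)
    print(f"{parts.scheme!r:8} {parts.netloc!r:14} {parts.path!r}")
```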
Force-pushed from e7bc780 to 01d7289.
I've had a quick look at the URI stuff and it's looking better now. I've made some more comments, since I'm not entirely sure how you are handling scheme-less relative paths still. You should consider adding some tests for relativeToPathRoot.
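A sketch of the kind of test being suggested; the import location and the exact semantics of `relativeToPathRoot` are assumptions here, not verified against this revision:

```python
import unittest

from lsst.daf.butler import ButlerURI  # import path assumed


class RelativeToPathRootTestCase(unittest.TestCase):
    def testS3(self):
        uri = ButlerURI("s3://bucketname/rootDir/relative/file.ext")
        # Assumed behaviour: path relative to the root, without a leading "/".
        self.assertEqual(uri.relativeToPathRoot, "rootDir/relative/file.ext")

    def testSchemeless(self):
        uri = ButlerURI("relative/file.ext")
        self.assertEqual(uri.relativeToPathRoot, "relative/file.ext")


if __name__ == "__main__":
    unittest.main()
```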
Force-pushed from 37084e0 to 964e55d.
If you deal with the scheme-less vs file vs other comments I just made, I'm happy to approve this change. Any further issues I can clean up later. I wonder if we should rename the branch to tickets/DM-13361 (which I think I can assign to you, and which would better reflect your work on this). This would require a new PR, of course.
One minor complication is that #176 just merged and that changed the Formatter API a little. This will cause you some merge conflicts with your read/write bytes API. Sorry about that. Let me know if you need help disentangling things.
Force-pushed from 964e55d to 743401f.
The Formatters were not a big deal. I saw the PR and I was ready for the changes. It actually helped me notice how out of date the … was. When I separated Draft PR #147, the advice I got from @r-owen was that "PRs are cheap". If you want, I can close this PR.
EDIT: Also, if you look back through all the review comments, there are some I have left open; even though they state "outdated", there are still comments/answers/pings that could be made under them to the appropriate people, whom I assume you would know.
Force-pushed from 743401f to 75156aa.
@DinoBektesevic - I think it's ready for you to push this directly to the daf_butler repo on a ticket branch. That would allow us to run tests on jenkins to make sure nothing is broken. Yes, that means a new PR. I've added you as a collaborator so you can push directly.
Hardcoded Butler.get from S3 storage for simplest of Datasets with no Dimensions.

Changes
-------
Added path parsing utilities to the utilities module that figure out whether a path is an S3 URI or a filesystem path.

ButlerConfig checks at instantiation (when it is not being instantiated from another ButlerConfig instance) whether the given path is an S3 URI. If it is, the butler YAML configuration file is downloaded and a ButlerConfig is instantiated from that. The download location is hard-coded and won't work on a different machine.

Edited the fromConfig instantiation method of Registry. If a registry is instantiated from a general Config class, it checks whether the path to the db is an S3 URI or a filesystem path. If it is an S3 URI, it will download the database locally to a hard-coded path that will only work on my machine. The remainder of instantiation is then executed as normal from the downloaded file. The RegistryConfig is updated to reflect the local downloaded path.

S3Datastore has a different config than the default PosixDatastore one. It manually appends s3:// to the default datastore templates. I do not know up front whether this is really required or not.

Initialization of the S3Datastore is different compared to PosixDatastore. Attributes service, bucket and rootpath are added, identifying the 's3://' element, the bucket name and the path to the root of the repository, respectively. Attributes session and client provide access, through the boto3 API, to upload and download functionality.

The file-exists and file-size checks were changed to query the S3 API. Currently this is performed by making a listing request to the S3 Bucket. This is allegedly the fastest approach, but for searching for the name of a file among many; it is also 12.5 times as expensive as directly making a request for the object and catching the exception. I do not know if there are data that can have multiple different paths returned. I suspect not, so this might not be worth it.

There is some functionality that will create a bucket if one doesn't exist in the S3Datastore init method, but this is just showcase code - it doesn't work or make sense, because Registry has to be instantiated before Datastore and we need that YAML config file to exist - so the bucket must exist too.

The get method was not particularly drastically changed, because the mechanism was pushed to the formatters. The LocationFactory used in the Datastores was changed to return an S3Location if it was instantiated from an S3 URI. The S3Location keeps the repository root, the bucket name and the full original bucket URI stored as attributes, so it is possible to figure out the bucket path from which to download the file in the Formatters.

The S3FitsExposureFormatter readFull method uses the S3Location to download the exposure to a baked-in path that will, again, only work for me. The mapped DatasetType is then used to instantiate from that file and return the appropriate in-memory (hopefully) object. All downloaded Datasets are given the same name to avoid clutter, but this should be done differently anyhow, so the hacky solution will do for now.
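The two existence-check strategies compared above would look roughly like this (a sketch only, with placeholder names, not the project's s3CheckFileExists code):

```python
import boto3
from botocore.exceptions import ClientError

client = boto3.client("s3")


def exists_via_listing(bucket, key):
    # Listing request filtered by prefix: fast when scanning many keys, but
    # priced as a LIST request (the ~12.5x figure quoted above).
    response = client.list_objects_v2(Bucket=bucket, Prefix=key)
    return any(obj["Key"] == key for obj in response.get("Contents", []))


def exists_via_head(bucket, key):
    # Direct request for the object, catching the exception when it is absent.
    try:
        client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError:
        return False
```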
Supports a local Registry and an S3-backed Datastore, or an RDS database service Registry and an S3-backed Datastore. Tests are not yet mocked!

CHANGES
-------
S3Datastore:
- The call to ingest at the end of butler.put now takes a location.uri.
- In ingest, fixed several wrong if statements that would lead to errors being raised in correct scenarios.
- Liberally added comments to help me remember.
- TODO: Fix the checksum generation; fix the patch for parsePath2Uri.

schema.yaml:
- Increased the maximum allowed length of the instrument name. The default instrument name length was too short (8), which would raise errors on Instrument insert in PostgreSQL. Apparently, Oracle was only tested on 'HSC' and SQLite doesn't care, so the short name went unnoticed.
- Increased the length attribute on all instances of "name: instrument" entries.
- Is there a better way to define global length values than manually changing a lot of entries?

Butler:
- Refactored the makeRepo code into two functions. Previously this was one classmethod with a lot of ifs. It should now be easier to see what steps need to happen to create a new repo in both cases. This is a temporary fix, as indicated by the comments in the code, until someone tells me how I should solve the issue: 2 Butlers, 1 dynamically constructed Butler, multiple classmethods, etc.
- Removed any and all hard-coded download locations.
- Added a check that the boto3 import was successful, but I don't know if this is the correct style to do that in. I left a comment.
- Liberal application of comments.
- TODO: BUTLER_ROOT_TAG is still being ignored in the case of an S3 Datastore and an in-memory Registry.

ButlerConfig:
- Removed the hard-coded local paths where I used to download YAML config files. They are now downloaded as a bytestring and then loaded with yaml. No need for tmpfiles.
- There is a case to be made for splitting ButlerConfig into different classes, since there could be a proliferation of if-else statements in its __init__.

Registry:
- Changed the instantiation logic: previously I would create a local SqlRegistry and then upload it to the S3 Bucket. This was deemed a useless idea, so it was scratched. The current paradigm is to have an S3-backed Datastore and an in-memory SQLite Registry, or an S3-backed Datastore and an RDS-backed Registry. Since SQLAlchemy is capable of always creating a new in-memory DB (no name clashes etc.), and considering that the RDS service must already have been created and exist, we have no need to check whether the Registry exists or to create persistent local SqlRegistries and upload them.

Utils:
- Removed some unnecessary comments.
- Tried fixing a mistake in parsePath2Uri. It seems that 's3://bucket/root' and 's3://bucket/root/' are parsed differently by the function. This is a mistake. I am sure it's fixable here, but I confused myself, so I patched it in S3Datastore. The problem is that it is impossible to discern whether URIs like 's3://bucket/root' point to a directory or a file.
- TODO: Revisit the URI parsing issue.

views.py:
- Added a PostgreSQL compiler extension to create views. In PostgreSQL, views are created via 'CREATE OR REPLACE VIEW'.

FileFormatter:
- Code refactoring: the _assembleDataset method now contains most of the duplicated code that existed in read and readBytes.

YamlFormatter:
- I think it was only the order of methods that changed, plus some code beautification.

oracleRegistry.py:
- In experimentation I kept the code here, since PostgreSQL seems to share a lot of similarities with Oracle. That code has now been excised and moved to postgresqlRegistry.py.

PostgreSqlRegistry:
- Wrote a new class to separate the previously mixed OracleRegistry and PostgreSqlRegistry.
- Wrote additional code that reads a `~/.rds/credentials` file, where it expects to find a nickname under which the RDS username and password credentials are stored. The credentials are read at SQLAlchemy engine creation time, so they should not be visible anywhere in the code. The connection string can now be given as `dialect+driver://NICKNAME@host:port/database`, as long as the (case-sensitive) NICKNAME exists in `~/.rds/credentials` and is given in the following format:

    [NICKNAME]
    username = username
    password = password

- The class eventually ended up being an exact copy of the Oracle one again, because the username and password must be read as close to engine creation as possible, so there we go and here we are.

butler-s3store.yaml:
- Now includes an sqlRegistry.yaml configuration file. This configures the test_butler.py S3DatastoreButlerTestCase to use an S3-backed Datastore and an in-memory SQLite Registry. Practical for S3Datastore testing.

sqlRegistry.yaml:
- Targets the registry to an in-memory SQLite registry.

butler-s3rds.yaml:
- Now includes an rdsRegistry.yaml configuration file. This configures the test_butler.py S3RdsButlerTestCase to use an S3-backed Datastore and an (existing) RDS-backed Registry.

rdsRegistry.yaml:
- Targets the registry to an (existing) RDS database. The connection string used by default is 'postgresql://[email protected]:5432/gen3registry'. This means that an RDS identifier with the name gen3registry must exist (the first 'gen3registry'), and in it a **Schema that is on the search_path must exist and that Schema must be owned by the username under the DEFAULT nickname**. This is very important.

test_butler.py:
- Home to 2 new tests: S3DatastoreButlerTestCase and S3RdsButlerTestCase. Both tests passed at the time of this commit.
- S3DatastoreButlerTestCase will use butler-s3store.yaml to connect to a Bucket, to which it authorizes by looking at `~/.aws/credentials` to find the aws_access_key_id and aws_secret_access_key. The name of the bucket to which it connects is set by the S3DatastoreButlerTestCase bucketName class variable. The permRoot class variable sets the root directory in that Bucket, only when useTempRoot is False. The Registry is an in-memory SQLite registry. This is very practical for testing the S3Datastore. This test seems mockable; I just haven't succeeded at it yet.
- S3RdsButlerTestCase will use the butler-s3rds.yaml file to connect to a Bucket, to which it authorizes by looking at `~/.aws/credentials`, expecting to find the aws `access_key` and `secret_access_key`. The name of the bucket is set by the S3RdsButlerTestCase bucketName class variable. The permRoot class variable is only used when useTempRoot is False. The Registry is an RDS service identified by a "generalized" connection string given in the rdsRegistry.yaml configuration file in the test directory. The DEFAULT represents a nickname defined in the `~/.rds/credentials` file, under which the username and password of a user with sufficient privileges (enough to create and drop databases) are expected. The tests are conducted by creating many DBs, each of which is assigned to a particular Registry instance tied to a particular test. This test seems absolutely impossible to mock.

test_butlerfits.py:
- New S3DatastoreButlerTestCase added. The test is an S3-backed Datastore using a local in-memory Registry. Obviously this extends trivially to the S3RdsButlerTestCase, since only the S3Datastore is really being tested here - but no such test case exists, because I suspect the way setUp and tearDown are performed in that case will cause a lot of consternation, so I'll defer that until a later time when I know what people want/expect.
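A hypothetical helper mirroring the credential-lookup behaviour described above (the function name, arguments and file layout are assumptions, not the PostgreSqlRegistry code):

```python
import configparser
import os


def rds_connection_string(nickname, host, port, database,
                          dialect_driver="postgresql"):
    # Resolve dialect+driver://NICKNAME@host:port/database into a real
    # connection string by reading the username/password stored under the
    # case-sensitive [NICKNAME] section of ~/.rds/credentials, as late as
    # possible (i.e. just before engine creation).
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser("~/.rds/credentials"))
    section = parser[nickname]
    return (f"{dialect_driver}://{section['username']}:{section['password']}"
            f"@{host}:{port}/{database}")
```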
General:
- Updated to incorporate recent master branch changes.
- Removed NO_BOTO and replaced imports with a nicer boto3=None style.

YAML config files:
- Fixed S3-related default config files to lowercase tablenames.
- Refactored the formatters section from the Posix and S3 default yaml files into a formatters.yaml.

Butler:
- Renamed parsePath2Uri --> parsePathToUriElements.
- Removed example makeRepo functions:
  - Uploading is now handled from within the Config class.
  - makeButler is once again a method of the Butler class.
- Removed del and close methods.

ButlerConfig:
- Removed the code that downloaded the butler.yaml file in init.

Config:
- Added a dumpToS3 method that uploads a new config file to a Bucket.
- Added an initFromS3File method.
- Modified the initFromFile method to check whether a file is a local one or an S3 one.

Location:
- Removed unnecessary comments.
- Fixed awkward newline mistakes.

S3Location:
- Removed unnecessary comments; corrected all references from Location to S3Location in the docstrings.

utils.py:
- Moved all S3-related utility functionality into s3utils.py.
- Added more docstrings, removed stale comments and elaborated on unclear ones.

parsePathToUriElements:
- Refactored the if-nastiness into something more legible and correct.

test_butler:
- Moved testPutTemplates into the generic Butler test class as a not-tested-for method.
- Added tested-for versions of that method to both PosixDatastoreButlerTestCase and S3DatastoreButlerTestCase.
- Added more generic checkFileExists functionality that discerns between S3 and POSIX files.
- Removed a lot of stale comments.
- Improved the way S3DatastoreButlerTestCase does tear-down.
- Added mocked no-op functionality and test skipping for the case when boto3 does not exist.
- Refactored Formatters: read/write/File is now implemented over read/write/Bytes.
- Removed all things RDS-related from the commit.
- Refactored test_butler and test_butlerFits.
- Added JSON from- and to-bytes methods.
- Fixed all the flake8 errors.
- Wrote tests for s3utils.
- Rewrote parsePathToUriElements to make more sense. Added examples, better documentation, and split-on-root capability.
- Refactored how the read/write-to-file methods work for formatters. No more serialization code duplication.
- A fix in the S3Datastore ingest functionality, together with the path-URI changes, made it possible to kick out the duplicate path-removing code from S3LocationFactory.
All Formatters now have both read/write and to/fromBytes methods. Those specific formatter implementations that also implement the _fromBytes/_toBytes methods will default to directly downloading the bytes to memory; the rest will be downloaded/uploaded to/from a temporary file. The checks for whether a file exists in S3Datastore were changed, since we now definitely incur a GET charge for a Key's header every time, so there is no need to duplicate the checks with s3CheckFileExists calls.
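A sketch of that dispatch under stated assumptions (the `_fromBytes`/`fromBytes`/`read` names are illustrative, not the exact Formatter API):

```python
import tempfile


def fetch_dataset(client, bucket, key, formatter):
    # Formatters that implement the bytes path receive the object directly in
    # memory; the rest get it downloaded to a temporary file and read from disk.
    if hasattr(formatter, "_fromBytes"):
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
        return formatter.fromBytes(body)
    with tempfile.NamedTemporaryFile() as tmp:
        client.download_file(bucket, key, tmp.name)
        return formatter.read(tmp.name)
```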
- Rebased to include the newest changes to daf_butler (namely, ButlerURI).
- Removed parsePathToUriElements, S3Location and S3LocationFactory, and replaced all path handling with ButlerURI and its test cases.
- Added transaction checks back into S3Datastore. Unsure about proper usage.
- Added more path manipulation methods/properties to LocationFactory and Location:
  * bucketName returns the name of the bucket in the Location/LocationFactory.
  * Location has a pathInBucket property that converts POSIX-style paths to S3-protocol-style paths. The main difference concerns leading and trailing separators: S3 interprets `/path/`, `/path`, `path/` and `path` keys differently, even though some of them are equivalent on a POSIX-compliant system. So what would be `/path/to/file.ext` on a POSIX system reads `path/to/file.ext` on S3, and the bucket is referenced separately with boto3.
- For saving Config as a file, moved all the URI handling logic into Config and out of Butler makeRepo. The only logic left there is root directory creation.
  * The call to save the config as a file at the butler root directory is now done through dumpToUri, which then resolves the appropriate backend method to call.
- Improved(?) on the proposed scheme for checking whether we are dealing with a file or a directory in the absence of a trailing path separator.
  * Noted some differences between the generality of code requested in the review for writing to files and the inits of the config classes. For inits it seems there is a 2+ year old commit limiting Config files to `yaml` type files only, whereas the review implied that `dumpToFile` on the Config class should be file-format independent. So, for the `dumpTo*` methods, to check whether we have a dir or a file I only inspect whether the path ends in a 'something.something' style. Since I can count on files being `yaml` type and having `yaml` extensions in inits, I use simplified logic there to determine whether we have a dir or a file. It is possible to generalize inits to other filetypes, as long as they have an extension added to them.
  * I assume we do not want to force users to be mindful of trailing separators.
- Now raising errors on unrecognized schemes in all `if scheme ==` patterns.
- Closed the StreamIO in Config.dumpToFile.
- Fixed up the Formatters to return bytes instead of strings. The fromBytes methods now expect bytes as well. JSON and YAML were the main culprits. Fixed the docs for them. At this point I am confident I just overwrote the fixes when rewinding changes on rebase by accident, because I have done that before, twice.
- Added a different way to check if files exist, cheaper but possibly slower. From boto/botocore#1248 it is my understanding that this should not be an issue anymore, but the newer boto3 versions are slow to hit package managers.
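The key-vs-path conversion described above amounts to something like this hypothetical helper (not the actual Location.pathInBucket implementation):

```python
def path_in_bucket(posix_path):
    # S3 keys have no leading "/": the POSIX path "/path/to/file.ext"
    # corresponds to the key "path/to/file.ext", with the bucket itself
    # referenced separately through boto3.
    return posix_path.lstrip("/")
```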
- `__initFromS3File` is now renamed `__initFromS3YamlFile`.
- `__initFromS3YamlFile` now uses the `__initFromYaml` method to make the format dependency explicitly clear.
- Changes to method docs.
- `dumpToS3` renamed to `dumpToS3File` to better match naming to existing functionality.
- In `dumpToS3File` and `ButlerConfig` it is assumed that files must have extensions to be files, in the cases where it's not possible to resolve whether a string points to a dir or a file.
Changes
-------
- Added an `ospath` property to ButlerURI that localizes the posix-like uri.path.
- Updated docstrings in the ButlerURI class to correctly describe what the properties/methods do.
- File vs Dir is now resolved by checking whether "." is present in the path, for both Config and ButlerConfig. If not a Dir, the path is no longer forced to be the top-level directory before updating the file name.
- Fixed various badly merged Formatter docstrings and file IO.
- Removed the file extension update call in the 'to/from'Bytes methods in Formatters.
- Restored the `super().setConfigRoot` call in SqliteRegistry.
- Removed an extra boto3 presence check from test_butler.py.
- Fixed badly formatted docstrings in s3utils.
- Renamed the `cheap` kwarg to `slow` in `s3CheckFileExists`.
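The file-vs-dir heuristic mentioned above could look roughly like this (a hypothetical helper, not the project code):

```python
import posixpath


def looks_like_file(path):
    # In the absence of a trailing separator, treat the path as a file only if
    # its last component contains a "." (i.e. has an extension); otherwise
    # assume it names a directory.
    if path.endswith("/"):
        return False
    return "." in posixpath.basename(path)
```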
Changes
-------
- Added relativeToNetloc to ButlerURI, since we already force the netloc to always exist. This removes some uri.path.lstrip calls.
- Fixed s3utils docstrings.
- Added a Raises section to the fileFormatter docstrings.
Scheme-less URIs are handled more smoothly in ButlerURI. Removed s3CheckFileExistsGET and s3CheckFileExistsLIST in favour of a single s3CheckFileExists function, as botocore is now up to date on conda and on pip, so GET/HEAD requests make no performance difference anymore. Live testing shows that an HTTP 403 response can be returned for HEAD requests when the file does not exist and the user does not have s3:ListBucket permissions; added a more informative error message for that case. Changed the s3utils function signatures to look more alike. Changed an instance of os.path to a posixpath join in test_butler, to be more consistent with what is expected from the URIs used. Corrected Config.__initFromS3Yaml to use a URL instead of a path and Config.__initFromYaml to use ospath.
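A sketch of the behaviour described above (illustrative only; the real s3CheckFileExists signature and return values may differ):

```python
from botocore.exceptions import ClientError


def s3_file_exists(client, bucket, key):
    # HEAD the key and return (exists, size); surface a clearer message when a
    # 403 hides a missing s3:ListBucket permission.
    try:
        response = client.head_object(Bucket=bucket, Key=key)
        return True, response["ContentLength"]
    except ClientError as err:
        status = err.response["ResponseMetadata"]["HTTPStatusCode"]
        if status == 403:
            raise PermissionError(
                "Forbidden HEAD request; this can mean the key does not exist "
                "and the user lacks s3:ListBucket permission on the bucket."
            ) from err
        if status == 404:
            return False, -1
        raise
```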
Updated formatters to use fileDescriptors as attributes. Updated S3Datastore to match the new formatters. Fixed mistakes in the S3Datastore ingest functionality; removed hardcoded path manipulations. Fixes to the S3Datastore put functionality; missing constraints checks added in. Prettification and correction of comments in the get functionality; additional checks added. Fixes to the getUri functionality - somehow it seems it never got updated when the ButlerURI class was written. Implemented review requests in ButlerURI.
Force-pushed from 75156aa to 0be5af4.
As was suggested to me by @r-owen and @timj, the draft PR for S3Datastore and PostgreSqlRegistry together was too large and clunky to be reasonable. Draft PR #147 is now closed and substituted by two PRs, separating S3Datastore and PostgreSqlRegistry.

Changes

Based on review comments from the draft PR, the following was changed compared to the draft PR:
- Added toBytes and fromBytes functionality to JsonFormatter.
- The toFile and fromFile methods now internally use to/fromBytes on all formatters that support it.
- S3Datastore now downloads all files as bytes and attempts to read them. If that fails, it stores them in a temporary file and then attempts to read them from there.
- The Config class was made aware of S3, and the work-around code in butlerConfig and Butler.makeRepo was removed.
- Butler.makeRepo was re-factored back to (almost) the way it was and is now pretty again.
- Butler.computeChecksum, as I never made it to work.
- Moved the S3-related utilities from daf.butler.core.utils to daf.butler.core.s3utils.
- Added tests for s3utils.
- Removed S3Location by fixing parsePathToUriElements.
- Added boto3 import protection, plus mocks and skips for tests when moto or boto3 is missing.

Unresolved/Other
- The <butlerRoot> tag is still not supported for the current S3Datastore. From what I can see, the main effect of the butler relocation code is that the sqlite:///:memory: registry location is replaced by 'sqlite:///<butlerRoot>/gen3.sqlite3', which just does not make practical sense when working with S3Datastore: we cannot talk to an sqlite DB in a Bucket, while, for S3Datastore tests at least, it does make sense to have an ephemeral Registry. This makes sense for cases where the db string in the registry YAML config can be specified directly and won't be changed, so I'm deferring this issue for later.
- S3Datastore.ingest was changed as requested. There are several things slightly unclear to me here, so I haven't quite polished up the large block of if-else code containing a lot of path and string manipulations that figures out what type of ingest needs to happen and how. It is on my todo list.
- boto3.client.delete_object(bucket_name, object_name) is just bad. Deleting objects from a Bucket deletes the current version (S3 versions a file when an upload overwrites it), so technically a delete can just revert to an older file version when versioned Buckets are used. That is not as bad, since almost no files are allowed to be rewritten. The problem is the following, still unresolved, issue: "Boto3 delete_objects response does not match documented response" boto/boto3#507. In testing, the only failure I can get from delete_object is the one where I give it a non-existing Bucket; in all other cases I just get an HTTP 200 OK response, even when the files don't exist.
- The consequence for S3Datastore.remove is that I need to find the location of the file before I can remove it, so while at it, why not check for errors. This is an exact duplicate of PosixDatastore except for the remove call.

The rest of the comments were either fixed, or enough has changed, I believe, that they are no longer relevant. Thanks to all the input I got, I think this is a much better iteration of S3Datastore than what was in the original PR - so, many thanks to everyone who gave input.
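For illustration, the "locate first, then delete" approach described above could look like this (a hypothetical helper, not the S3Datastore.remove implementation):

```python
from botocore.exceptions import ClientError


def remove_object(client, bucket, key):
    # delete_object returns HTTP 200 even for keys that do not exist, so check
    # that the object is actually there and raise if it is missing, then delete.
    try:
        client.head_object(Bucket=bucket, Key=key)
    except ClientError as err:
        raise FileNotFoundError(
            f"No such key '{key}' in bucket '{bucket}'") from err
    client.delete_object(Bucket=bucket, Key=key)
```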