Storing fixity auditing outcomes #847

Open
mjordan opened this issue Jun 21, 2018 · 59 comments
Labels
Type: question asks for support (asks a question)

Comments

@mjordan
Contributor

mjordan commented Jun 21, 2018

I would like to start work on fixity auditing (checksum verification) in CLAW. In 7.x, we have the Checksum and Checksum Checker modules, plus the PREMIS module, which serializes the results of Checksum Checker into PREMIS XML and HTML. Now is a good time to start thinking about how we will carry this functionality over to CLAW so that on migration, we can move PREMIS event data from the source 7.x to CLAW.

In 7.x, we rely on FCREPO 3.x's ability to verify a checksum. In a Drupal or server-side cron job, Checksum Checker issues a query to each datastream's validateChecksum REST endpoint (/objects/{pid}/datastreams/{dsID} ? [asOfDateTime] [format] [validateChecksum]) and we store this fixity event outcome in the object's AUDIT datastream. The Fedora API Specification, on the other hand, does not require validation of a binary resource's fixity but instead requires implementations to return a binary resource's checksum to the requesting client, allowing the checksum value to "be used to infer persistence fixity by comparing it to previously-computed values."

Therefore, in CLAW, to perform fixity validation, we need to store the previously-computed value ourselves. In order to ensure long-term durability and portability of the data, we should avoid managing it using implementation-specific features. Two general options for storing fixity auditing event data that should apply to all implementations of the Fedora spec are

  • store the outcome data within the Fedora repository as RDF
  • store the outcome data external to the repository in a triplestore, RDBMS, or key-value store
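Whichever storage option is chosen, the comparison step itself is small. The following is a minimal sketch, assuming the trusted digest is kept by us as a hex string and the current digest has already been obtained from the repository; the function name `check_fixity` and the sample values are hypothetical and not part of any Fedora or Islandora API.

```python
import hashlib


def check_fixity(trusted_digest: str, current_digest: str) -> bool:
    """Return True when the digest reported now matches the stored, trusted one."""
    # Normalize case and whitespace so "ABC..." and "abc..." compare as equal.
    return trusted_digest.strip().lower() == current_digest.strip().lower()


# Hypothetical values: the digest we recorded when the binary was first stored ...
trusted = hashlib.sha1(b"example binary content").hexdigest()
# ... and digests a repository might report during later verification cycles.
unchanged = hashlib.sha1(b"example binary content").hexdigest()
corrupted = hashlib.sha1(b"example binary c0ntent").hexdigest()

print(check_fixity(trusted, unchanged))  # True
print(check_fixity(trusted, corrupted))  # False
```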

Fixity event data can accumulate rapidly. The 7.x Checksum Checker module's README documents the effects of adding event outcomes to an object's AUDIT datastream, but in general, each fixity verification event on a binary resource will generate one outcome, which includes the timestamp of the event and a binary value (passed/failed). For example, in a repository that contains 100,000 binary resources, each verification cycle will generate 100,000 new outcomes that need to be persisted somewhere. In our largest Islandora instance, which contains over 600,000 newspaper page objects, we have completed 14 full fixity verification cycles, resulting in approximately 8,400,000 outcome entries.
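The growth is linear in both the number of binaries and the number of cycles. A quick worked check of the figures above, assuming exactly one outcome per binary per cycle (the helper name is hypothetical):

```python
def total_outcomes(binary_resources: int, cycles: int) -> int:
    """One outcome per binary resource per verification cycle."""
    return binary_resources * cycles


print(total_outcomes(100_000, 1))   # 100000
print(total_outcomes(600_000, 14))  # 8400000
```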

I would like to know what people think are the pros and cons of storing this data both within the repository as RDF and external to the repository using the triplestore or a database. My initial take on this question is:

  • Internal to repo
    • Pro: data is encapsulated as close to resources as possible, increasing long-term durability
    • Con: over time, will result in a large increase in number of RDF resources in repo, which could impact performance
  • In triplestore
    • Pro: will be stored with other data about resources, allowing consistent and simpler querying by external clients, etc.
    • Con: over time, will result in a large increase in number of triples, possibly impacting CLAW's performance
    • Con: is "external" to resources, reducing long-term durability
    • Con: if triplestore is lost, fixity audit trail cannot be rebuilt from repository
  • In external database
    • Pro: will keep number of resources in repo, and triples in triplestore, low
    • Pro: large amount of data should not impact CLAW's performance
    • Con: is "external" to resources, reducing long-term durability
    • Con: if database is lost, fixity audit trail cannot be rebuilt from repository

One possible mitigation against the loss of an RDBMS is to periodically dump the data as a text file and persist it into the repository; that way, if the database is lost, it can be recovered easily. The same strategy could be applied to data stored in the triplestore.
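A minimal sketch of that kind of dump, assuming the events live in a single table; the table name `fixity_event`, its columns, and the sample rows are hypothetical, and SQLite stands in for whatever RDBMS a site actually uses. The resulting text could then be persisted into the repository as a binary resource.

```python
import csv
import io
import sqlite3


def dump_events_as_csv(conn: sqlite3.Connection) -> str:
    """Serialize every fixity event as CSV text, header row first."""
    cursor = conn.execute(
        "SELECT resource_id, checked_at, outcome FROM fixity_event "
        "ORDER BY checked_at, resource_id"
    )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fixity_event (resource_id TEXT, checked_at TEXT, outcome TEXT)"
)
conn.executemany(
    "INSERT INTO fixity_event VALUES (?, ?, ?)",
    [
        ("resource/1", "2018-06-01T02:00:00Z", "passed"),
        ("resource/2", "2018-06-01T02:00:05Z", "failed"),
    ],
)
dump = dump_events_as_csv(conn)
print(dump)
```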

If we can come to consensus on where we should store this data, we can then move on to migration of fixity event data, implementing periodic validation ("checking"), serialization, etc.

@ajs6f

ajs6f commented Jun 21, 2018

Couple of simple thoughts:

In order to ensure long-term durability and portability of the data, we should avoid managing it using implementation-specific features.

+1000.

As for where to put it, if this checksum info is stored in support of durability, it should be treated at least as well as other durable information: stored in multiple places, for each operation (as opposed to occasional bulk updates from one location to another). Which location is used as an authoritative source seems to me mostly to depend on pragmatic considerations (i.e. how to keep the architecture simple and performant).

@mjordan
Contributor Author

mjordan commented Jun 21, 2018

Thinking this through a bit more, if we store event outcome data as RDF in either FCREPO or the triplestore, we'd need not just one triple but two for each event: the timestamp and the outcome (passed/failed), assuming the trusted checksum value need only be stored once, not per verification event. So a verification cycle of 100,000 resources would result in 200,000 new triples.

If that's the case, maybe storing both pieces of info in one row in a database table is more efficient, replicating the info by periodically persisting the db table(s) into Fedora as a binary resource. Doing that at least ensures there are two copies, although not necessarily robustly distributed copies. But databases are pretty easy to replicate, so if we want distributed replication, that's also an option.
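To make the two-triples-versus-one-row comparison concrete, here is a rough sketch. The predicate names (`ex:checkedAt`, `ex:outcome`) are placeholders rather than terms from a real vocabulary, and using the resource itself as the subject is a simplification.

```python
from typing import NamedTuple


class FixityEvent(NamedTuple):
    resource: str
    checked_at: str
    passed: bool


def as_triples(event: FixityEvent) -> list:
    """Two triples per event: one for the timestamp, one for the outcome."""
    outcome = "passed" if event.passed else "failed"
    return [
        (event.resource, "ex:checkedAt", event.checked_at),  # placeholder predicate
        (event.resource, "ex:outcome", outcome),             # placeholder predicate
    ]


def as_row(event: FixityEvent) -> tuple:
    """One database row carries both pieces of information."""
    return (event.resource, event.checked_at, int(event.passed))


# One verification cycle over a hypothetical repository of 100,000 binaries.
cycle = [
    FixityEvent(f"resource/{n}", "2018-06-21T00:00:00Z", True)
    for n in range(100_000)
]
triple_count = sum(len(as_triples(event)) for event in cycle)
row_count = len([as_row(event) for event in cycle])
print(triple_count, row_count)  # 200000 100000
```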

@ajs6f

ajs6f commented Jun 21, 2018

The problem seems to me to be that a fixity check is transactional information. But the pattern suggested here persists it long after the transaction is done. Why store the events at all? Why not publish them like any other events in the system, and if a given site wants to store them or do something else with them, cool, they can figure out how to do that together with other sites that share that interest. But why make everyone pay the cost of persisting fixity check events? Speaking only for SI, we certainly don't need or want to do that.

Does anything other than a single checksum per non-RDF resource actually need to be stored?

@mjordan
Contributor Author

mjordan commented Jun 21, 2018

Currently, in 7.x, enabling the Checksum and Checksum Checker modules is optional, and I'm not suggesting that similar functionality in CLAW is any different. Sorry I didn't state that explicitly. Any functionality in CLAW to generate, manage, and report on fixity auditing would be implemented as Drupal contrib modules.

We would want to store events so we can express the history of a resource in PREMIS (for example). In our use case, we want to be able to document that uninterrupted history across migrations between repositories, from 7.x to CLAW.

@dannylamb
Contributor

dannylamb commented Jun 21, 2018

I think what's getting wrapped up in here is the auditing functionality. If we just need to check fixity, stick it on as a Drupal field. It'll wind up in the drupal db, the triplestore, and fedora. If you want to persist audit events, I'd model that as a content type in drupal and it'll get persisted to all three as well by default. Of course, you could filter it with context and make it populate only what you want (e.g. just the triplestore and not fedora).

@mjordan
Contributor Author

mjordan commented Jun 21, 2018

@dannylamb I hadn't thought about modelling fixity events as a Drupal content type. One downside to doing that is adding all those nodes to Drupal. I'm concerned that over time, the number of events will grow very large, with dire effects on performance.

@dannylamb
Contributor

@mjordan And after thinking about this some more, if you're worried about performance, your best bet is usually something like a finely tuned postgres. Just putting it in Drupal, and not Fedora or the triplestore may be your best bet. I'd just be sure to snapshot the db. That's a perfectly acceptable form of durability if you ask me.

@dannylamb
Contributor

Ha, needed to refresh.

@dannylamb
Contributor

@mjordan Yes, that's certainly a concern. That threshold of "how much is too much for Drupal" is looming out there. It'd be nice to find out where that really is.

@mjordan
Contributor Author

mjordan commented Jun 21, 2018

I agree with @ajs6f's characterization of fixity verification as transactional, which is why I'm resisting modelling the events as Drupal nodes.

We should do some thorough scalability testing, for sure. Maybe we should open an issue for that now, as a placeholder?

@dannylamb
Contributor

I see what you're saying. It's not like you're going to be scouring that history all the time, so there's no point in having it bog down everything else. If it's too hamfisted to model them as nodes, then having a drupal module just to emit them onto a queue is super sensible. And sandboxing it to its own storage is even more so. As for what that is/should be?

I guess that depends on what you're going to do with it and how you want to access it. I presume you'd want to be able to query it? That at least narrows down the choice to either SQL or the triplestore if you wanna stay with the systems already in the stack.

@dannylamb
Contributor

...or Solr.

@mjordan
Contributor Author

mjordan commented Jun 21, 2018

Yeah, we're going to want to query it. If we store the SHA1 checksum as a field in the binary node (which sounds like a great idea), we'll want to query the events to serialize them as PREMIS, for example ("give me all the fixity verification events for the resource X, sorted by timestamp would be nice").
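If the events end up in a relational store, that query is straightforward. A minimal sketch, assuming a hypothetical `fixity_event` table with ISO 8601 UTC timestamps (which sort correctly as plain text); SQLite is used here only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per fixity verification event.
conn.execute(
    "CREATE TABLE fixity_event ("
    " resource_id TEXT NOT NULL,"
    " checked_at TEXT NOT NULL,"
    " outcome TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO fixity_event VALUES (?, ?, ?)",
    [
        ("resource/X", "2018-06-15T02:00:00Z", "passed"),
        ("resource/Y", "2018-06-08T02:00:00Z", "passed"),
        ("resource/X", "2018-06-01T02:00:00Z", "passed"),
        ("resource/X", "2018-06-08T02:00:00Z", "failed"),
    ],
)


def events_for(conn: sqlite3.Connection, resource_id: str) -> list:
    """All fixity events for one resource, oldest first."""
    return conn.execute(
        "SELECT checked_at, outcome FROM fixity_event "
        "WHERE resource_id = ? ORDER BY checked_at",
        (resource_id,),
    ).fetchall()


for checked_at, outcome in events_for(conn, "resource/X"):
    print(checked_at, outcome)
```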

@seth-shaw-unlv
Contributor

We weren't necessarily planning on using CLAW to manage fixity. I'm actually interested in what UMD proposed, which includes using a separate graph in the triplestore specifically for audit data. Even if you were using the same triplestore for both, placing them in separate graphs should preserve performance on the CLAW one.

@ajs6f

ajs6f commented Jun 21, 2018

I guess that depends on what you're going to do with it and how you want to access it.

Can't agree enough!

@seth-shaw-unlv Did you mean separate datasets? Because in most triplestores (depends a bit, but Jena is a good example) putting them in separate named graphs in one dataset isn't going to do anything for performance. (Putting one in the default graph and one in a named graph would do a little, but not anything much compared to putting them in separate datasets.)

Generally, my experience has been that in non-SQL stores (be they denormalized like BigTable descendants or "hypernormalized" like RDF stores) query construction makes the biggest difference in performance, and should dictate data layout.

@mjordan Sorry about the misunderstanding-- I thought you were talking about workflow to which every install would have to subscribe. Add-on/optional stuff, no problem!

@seth-shaw-unlv
Contributor

@ajs6f, yes, you are right. I was, admittedly, speaking based on an assumption that separate graphs would improve performance due to a degree of separation. I don't have experience scaling them yet.

@ajs6f

ajs6f commented Jun 21, 2018

@seth-shaw-unlv I think we're all going to learn a bunch in the next few years about managing huge piles of RDF!

@mjordan
Contributor Author

mjordan commented Jun 22, 2018

@seth-shaw-unlv the UMD strategy looks good, but it's specific to fcrepo. I think it's important that Islandora not rely on features of a specific Fedora API implementation. Also, I'm hoping that we can implement fixity auditing in a single Drupal module, without any additional setup (which is what we have in Islandora 7.x).

@ajs6f no problem, we're all so focussed on getting core functionality right that I should have made it clear I was shifting to optional functionality.

@whikloj
Member

whikloj commented Jun 22, 2018

I think the UMD plan could be simplified to:

  1. Do fixity (as defined in the Fedora API) on some scheduled process.
  2. Store fixity check result somewhere.*
  * This could be abstracted so that the information is stored as a repository object, as triples in a triplestore, as entries in a Redis store, or in the SQL/PostgreSQL DB.
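One way to picture that abstraction is a small storage interface with interchangeable backends. This is only a sketch under stated assumptions: `FixityEventStore`, `InMemoryStore`, and `record_check` are hypothetical names, and a real backend would wrap a repository, triplestore, Redis, or SQL client instead of a dictionary.

```python
from __future__ import annotations

from typing import Protocol


class FixityEventStore(Protocol):
    """What the fixity process needs from any storage backend."""

    def save(self, resource_id: str, checked_at: str, passed: bool) -> None: ...

    def history(self, resource_id: str) -> list[tuple[str, bool]]: ...


class InMemoryStore:
    """Stand-in backend; it satisfies FixityEventStore structurally."""

    def __init__(self) -> None:
        self._events: dict[str, list[tuple[str, bool]]] = {}

    def save(self, resource_id: str, checked_at: str, passed: bool) -> None:
        self._events.setdefault(resource_id, []).append((checked_at, passed))

    def history(self, resource_id: str) -> list[tuple[str, bool]]:
        return list(self._events.get(resource_id, []))


def record_check(
    store: FixityEventStore,
    resource_id: str,
    checked_at: str,
    trusted: str,
    current: str,
) -> bool:
    """Compare digests and persist the outcome, without knowing the backend."""
    passed = trusted == current
    store.save(resource_id, checked_at, passed)
    return passed


store = InMemoryStore()
record_check(store, "resource/1", "2018-06-22T02:00:00Z", "abc123", "abc123")
record_check(store, "resource/1", "2018-06-29T02:00:00Z", "abc123", "def456")
print(store.history("resource/1"))
```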

I'd like to keep the processing of fixity off of the Drupal server if possible as this is a process that for large repositories could be always running.

@mjordan
Contributor Author

mjordan commented Jun 22, 2018

@whikloj yes I was starting to think about abstracting the storage out so individual sites could store it where they want. About keeping the processing off the Drupal server, you're right, the process would be running constantly. But I don't see where issuing a bunch of requests for checksums, then comparing them to the previous value, then persisting the results somewhere would put a huge load on the Drupal server. It's the Fedora server that I think will get the biggest hit, since, if my understanding is correct, it needs to read the resource into memory to respond to the request for the checksum. A while back I did some tests on a Fedora 3.x server to see how long it took to verify a checksum and found that "the length of time it takes to validate a checksum is proportionate to the size of the datastream"; I assume this is also true to a certain extent with regard to RAM usage although I didn't test for that.
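As a general illustration (not a description of how any Fedora version computes digests), the time cost scales with size because every byte has to pass through the hash function, while memory use depends on how the bytes are read. A minimal sketch of hashing a stream in fixed-size chunks; the function name and chunk size are arbitrary choices for the example.

```python
import hashlib
import io
from typing import BinaryIO


def sha1_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary stream chunk by chunk so memory use stays bounded."""
    digest = hashlib.sha1()
    # read() returns b"" at end of stream, which stops the iteration.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


data = b"x" * (3 * 1024 * 1024 + 17)  # a little over 3 MiB of sample bytes
streamed = sha1_of_stream(io.BytesIO(data))
print(streamed == hashlib.sha1(data).hexdigest())  # True
```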

@mjordan
Contributor Author

mjordan commented Jul 3, 2018

Following up on @whikloj's suggestion of moving the fixity checking service off the Drupal server, would implementing it as an external microservice be an option? That way, non-Islandora sites might be able to use it as well. It kind of complicates where the data is stored (maybe that could be abstracted such that Islandora stores it in the Drupal db, Samvera stores it somewhere else, etc.). Such a microservice could be containerized if desired.

@dannylamb
Contributor

👍 Doing it as a microservice will indeed abstract away all those details. The web interface you design for it will allow individual implementors to use whatever internal storage they want.

@mjordan
Contributor Author

mjordan commented Jul 3, 2018

Sounds like a plan - anyone object to moving forward on this basis? The "Islandora" version of this would be a module that consumed data generated by the microservice to provide reports, checksum mismatch warnings, etc.

@dannylamb
Contributor

The "Islandora" version of this would be a module that consumed data generated by the microservice to provide reports, checksum mismatch warnings, etc.

Reading this, my gut is telling me the microservice should stuff everything into its own SQL db and we point views at it in Drupal to generate reports/dashboards.

@jonathangreen
Contributor

I totally agree with the microservice idea for doing fixity checks.

Not sure if we should handle it in this issue, or in another issue, but one thing we are missing (and missing completely in 7.x) is the ability to provide a checksum on ingest, and have it verified once the object is in storage, failing the upload if the fixity check fails.

This is the most common fixity feature I'm asked for in Islandora 7.x, and it covers the statistically most likely case of the file getting mangled in transit, rather than when sitting on disk.
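A rough sketch of what verify-on-ingest could look like, assuming the client supplies a SHA-1 hex digest alongside the file; `verify_upload` and `ChecksumMismatch` are hypothetical names, not existing Islandora functions.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when an uploaded file does not match the digest supplied with it."""


def verify_upload(data: bytes, expected_sha1: str) -> str:
    """Return the computed digest, or raise so the caller can fail the upload."""
    actual = hashlib.sha1(data).hexdigest()
    if actual != expected_sha1.strip().lower():
        raise ChecksumMismatch(f"expected {expected_sha1}, got {actual}")
    return actual


payload = b"page-0001.tif contents"
good_digest = hashlib.sha1(payload).hexdigest()
verify_upload(payload, good_digest)  # passes silently

try:
    # Simulate a file that lost its last byte in transit.
    verify_upload(payload[:-1], good_digest)
    upload_rejected = False
except ChecksumMismatch:
    upload_rejected = True
print(upload_rejected)  # True
```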

@dannylamb
Contributor

dannylamb commented Jul 3, 2018

@jonathangreen We're halfway there on transmission fixity. We cover it on the way into Fedora from Drupal, but not from upload into Drupal. We can add that as a separate issue to add it to the REST api and wherever else (text field on upload form?).

@jonathangreen
Contributor

@dannylamb sounds good to me.

@ajs6f

ajs6f commented Jul 3, 2018

Just to be clear, this would be a service that produces checksums for the frontend via its own path to persistence, not a service to which the binaries are transmitted over HTTP for a checksum, right?

@dannylamb
Contributor

@jonathangreen #867

@mjordan
Contributor Author

mjordan commented Jul 3, 2018

Thanks @DiegoPino, I'll pass that advice on to him when I see him.

@DiegoPino
Contributor

@jonathangreen for the 7.x version of the ticket, I feel it would be good to note somewhere in it, for whoever ends up writing that, that some chunked-transmission implementations like plupload could have issues with a user-provided hash submitted via a form (e.g., where to put it, and how or when to trigger the check, since assembly of the final upload happens somewhere else...)

@jonathangreen
Contributor

@DiegoPino here is the ticket for 7.x if you want to add some notes: https://jira.duraspace.org/projects/ISLANDORA/issues/ISLANDORA-2261

@mjordan
Contributor Author

mjordan commented Jul 11, 2018

During the 2018-07-11 CLAW tech call, @rosiel asked about checksums on CLAW binaries stored external to Fedora, e.g. stored in S3, Dropbox, etc. Getting Fedora to provide a checksum on these binaries could be quite expensive since it pulls down the content to run a fixity check on it. One idea that came up in the discussion was that if we are using an external microservice to manage/store fixity checking, we could set up rules to verify checksums on those remote binaries. The microservice would normally need to pull the binary down to do its check, but if the storage service provided an API to get a checksum on a binary, our microservice could query that API instead.
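A minimal sketch of that rule: ask the storage service for a checksum when it offers one, and fall back to downloading otherwise. Both callables are hypothetical stand-ins for real clients, and the sketch assumes the service reports a SHA-1 hex digest, which a given storage provider may not do.

```python
import hashlib
from typing import Callable, Optional, Tuple


def current_digest(
    resource_id: str,
    remote_checksum: Optional[Callable[[str], Optional[str]]],
    download: Callable[[str], bytes],
) -> Tuple[str, str]:
    """Return (digest, source), preferring the storage service's own checksum API."""
    if remote_checksum is not None:
        digest = remote_checksum(resource_id)
        if digest is not None:
            return digest, "storage-api"
    # No API, or the API had nothing for this binary: pull the content down.
    return hashlib.sha1(download(resource_id)).hexdigest(), "download"


# Hypothetical fakes standing in for a remote storage service.
remote_files = {"bucket/a.tif": b"aaa", "bucket/b.tif": b"bbb"}
known_checksums = {"bucket/a.tif": hashlib.sha1(b"aaa").hexdigest()}

print(current_digest("bucket/a.tif", known_checksums.get, remote_files.__getitem__)[1])
print(current_digest("bucket/b.tif", known_checksums.get, remote_files.__getitem__)[1])
```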

@DiegoPino
Contributor

DiegoPino commented Jul 11, 2018 via email

@mjordan
Contributor Author

mjordan commented Jul 11, 2018

@DiegoPino yes, that's what I meant. Sorry if that was not clear.

@DiegoPino
Contributor

Sorry, my fault! Re-reading and totally agree, sorry again @mjordan

@jonathangreen
Contributor

I'd like to see a way here to have a trust but verify approach, where you can pull checksums from an external API, but maybe at some lower frequency you still want to pay for the bandwidth to do some verification of the checksums. Could just be some configuration options.
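As a sketch of how such a configuration option might behave, assuming a single hypothetical setting `full_verify_every` that says how often a cycle should download and hash the content itself rather than trust the external API:

```python
def verification_mode(cycle_number: int, full_verify_every: int) -> str:
    """Pick how to obtain digests for a given verification cycle.

    full_verify_every is a hypothetical setting: every Nth cycle downloads the
    binaries and hashes them locally; 0 disables full verification entirely.
    """
    if full_verify_every > 0 and cycle_number % full_verify_every == 0:
        return "download"
    return "api"


print([verification_mode(n, 4) for n in range(1, 9)])
# ['api', 'api', 'api', 'download', 'api', 'api', 'api', 'download']
```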

@DiegoPino
Contributor

@jonathangreen I agree it could be useful in order to provide a platform that is more compliant with what is expected of a preservation platform, but in terms of implementation, how would you propose we avoid false positives of corruption caused by timed-out, stalled, or even failed downloads? Not something that keeps me awake at night right now, but HTTP(S), which is what most APIs provide for downloading assets, tends to be hit and miss in that respect. As I said, I agree this is needed; I just don't know how to deal with it at an implementation level in a safe and reliable way.

@mjordan
Contributor Author

mjordan commented Jul 11, 2018

One approach to handling false mismatches is to retry the request if it fails, and see what the results are. A one-off failure can be discarded, but if they all fail, the problem is probably legit.
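A minimal sketch of that retry approach, assuming a hypothetical `fetch_digest` callable that either returns a digest or raises `OSError` on a network problem. A mismatch is only reported when every attempt fails.

```python
from typing import Callable


def passes_with_retries(
    fetch_digest: Callable[[], str], trusted: str, attempts: int = 3
) -> bool:
    """Return True as soon as one attempt matches; False only if all of them fail."""
    for _ in range(attempts):
        try:
            if fetch_digest() == trusted:
                return True
        except OSError:
            # Timeouts and dropped connections count as a failed attempt.
            continue
    return False


# Hypothetical digests: the first download was truncated, the retry succeeded.
flaky = iter(["truncated-download-digest", "abc123", "abc123"])
print(passes_with_retries(lambda: next(flaky), "abc123"))  # True

# Every attempt disagrees with the trusted value, so the problem looks real.
always_bad = iter(["bad1", "bad2", "bad3"])
print(passes_with_retries(lambda: next(always_bad), "abc123"))  # False
```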

@mjordan
Contributor Author

mjordan commented Jul 11, 2018

Not to increase scope here too much, but keep the use and edge cases coming. I'll be working on this (and related preservation stuff in CLAW) pretty much full time this fall.

@ajs6f

ajs6f commented Jul 11, 2018

This is starting to sound like an application on top of something like iRODS. I'm not seriously suggesting that; I'm wondering whether for the MVP, would it be enough to have a simple µservice that just retrieves a checksum from the backend in use, on the assumption that such a checksum is available?

I'm not at all trying to discourage people from recording use cases and I think it's awesome that @mjordan is thinking through this, it's just that when you're facing problems like: the character of reliable transport for mass data across networks of unknown character... that's a pretty big scope.

@mjordan
Contributor Author

mjordan commented Jul 11, 2018

@ajs6f MVP is a good way to frame it. I don't have the cycles to propose one this week but need to prepare a poster for iPres so will need to do that soon (next couple weeks?). We can build it in a way that can expand on the MVP.

@rosiel
Member

rosiel commented Jul 12, 2018

Maybe this could be one consideration in a repo manager's choice of storage solution. Since now we have all these options, we're going to need to make educated decisions on which one to use. Just because something's cool, doesn't mean it's the right tool for your job, and if you need reliable, regular, automated, locally-performed checksumming (and maybe that's a preservation best practice?) then S3 might not be the ideal storage location for you?

@bradspry

What if it was stored as a standard Islandora datastream, like any other derivative, with each outcome appended to the same versioned file?

@mjordan
Contributor Author

mjordan commented Jul 20, 2018

@bradspry so each binary resource would have an accompanying text/JSON/XML or other resource containing its validation history... Doing that would avoid having triples for each event and also store the fixity checking history in Fedora. Definitely worth exploring.
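A rough sketch of that append-only history, using one JSON object per line. A local temporary file stands in for the accompanying resource here; in the repository, each append would presumably become a new version of that resource. The function and field names are hypothetical.

```python
import json
import os
import tempfile


def append_outcome(path: str, checked_at: str, passed: bool) -> None:
    """Append one fixity outcome as a single JSON line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps({"checked_at": checked_at, "passed": passed}) + "\n")


def read_history(path: str) -> list:
    """Read back every recorded outcome, oldest first."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as workdir:
    history_path = os.path.join(workdir, "fixity-history.jsonl")
    append_outcome(history_path, "2018-07-01T02:00:00Z", True)
    append_outcome(history_path, "2018-07-08T02:00:00Z", False)
    history = read_history(history_path)

print(len(history))  # 2
```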

@ajs6f

ajs6f commented Jul 20, 2018

That can be a relatively lean pattern, but I would argue that it shouldn't be a default because it creates new resources in the repository. For sites like SI which won't be relying on this layer of technology for fixity functions, that's a huge number of empty useless resources. Maybe as an optional function in the proposed µservice?

@mjordan
Contributor Author

mjordan commented Jul 20, 2018

@ajs6f totally. I'm currently leaning to an external microservice that does the actual auditing but that can talk to a variety of optional (site-configurable) storage options, e.g. relational db, triplestore, or @bradspry's approach.

@rtilla1

rtilla1 commented Aug 23, 2018

I did not read this whole thread, but there is a schema for putting technical metadata into RDF: https://spdx.org/rdf/spdx-terms-v2.1/ Are there x-paths to the technical metadata datastream elements that we can use to map to these terms?

@mjordan
Contributor Author

mjordan commented Aug 23, 2018

@rtilla1, that spec looks very useful but it appears to be specific to describing software packages. That said, there's no reason we couldn't use some of its properties for all types of content.

The plan so far for fixity verification event data is to provide options for storing the data in several ways, e.g., in the Fedora repository as entities, in a relational database, in a CSV binary resource associated with the primary entity as @bradspry is suggesting, etc. (we'll probably start off with a relational database managed by an external microservice). One advantage of storing the data in a db or CSV file is that it can be converted to a specific schema on demand. But the same is true even if it is stored as RDF. For example, if a consumer of fixity data wanted PREMIS, we should be able to provide it to them in that vocabulary; same goes for SPDX. So I don't think we need to choose a specific ontology for fixity data right now.
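To illustrate conversion on demand, here is a sketch that turns one stored row into a PREMIS-style event. It borrows element names from the PREMIS Event entity but is deliberately simplified: it omits namespaces and required units such as the event identifier, so the output is not schema-valid PREMIS. The function name and the outcome strings are assumptions.

```python
import xml.etree.ElementTree as ET


def row_to_premis_like_event(resource_id: str, checked_at: str, passed: bool) -> str:
    """Serialize one stored fixity outcome as a simplified PREMIS-style event."""
    event = ET.Element("event")
    ET.SubElement(event, "eventType").text = "fixity check"
    ET.SubElement(event, "eventDateTime").text = checked_at
    outcome_info = ET.SubElement(event, "eventOutcomeInformation")
    # The outcome strings are illustrative, not a controlled vocabulary.
    ET.SubElement(outcome_info, "eventOutcome").text = "success" if passed else "fail"
    linking = ET.SubElement(event, "linkingObjectIdentifier")
    ET.SubElement(linking, "linkingObjectIdentifierValue").text = resource_id
    return ET.tostring(event, encoding="unicode")


xml_text = row_to_premis_like_event("resource/1", "2018-08-23T02:00:00Z", True)
print(xml_text)
```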

@mjordan
Contributor Author

mjordan commented Oct 3, 2018

I think my implementation of a fixity checking microservice is ready to get some additional eyes on:

https://github.com/mjordan/riprap

This is only a Minimum Viable Product, but does take the following requirements from this issue into account:

  • it's an independent microservice that can run anywhere
  • it's implemented as a Symfony console command so you can run it in server-side cron
  • it uses plugins (which are themselves Symfony console commands) to get a list of resources to check, to ask an external service (e.g., Fedora) for a digest, to persist the results of the check (to a database, back into Fedora, etc.) and to do stuff after the check (email an admin, migrate legacy fixity events, etc.)
  • it has a basic REST API for getting fixity events for a resource (works now) and for adding / updating events
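The four plugin roles in the list above can be pictured as a simple pipeline. Riprap itself implements them as Symfony console commands; the Python sketch below only mirrors the roles as described here, and every name in it is hypothetical rather than taken from Riprap's code.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_fixity_cycle(
    fetch_resource_list: Callable[[], Iterable[str]],
    fetch_digest: Callable[[str], str],
    trusted_digest: Callable[[str], str],
    persist: Callable[[str, str, bool], None],
    post_check: Callable[[str, bool], None],
) -> None:
    """Run one cycle: list resources, get digests, persist outcomes, follow up."""
    for resource_id in fetch_resource_list():
        current = fetch_digest(resource_id)
        passed = current == trusted_digest(resource_id)
        persist(resource_id, current, passed)
        post_check(resource_id, passed)


# Hypothetical in-memory stand-ins for each plugin.
trusted: Dict[str, str] = {"resource/1": "aaa", "resource/2": "bbb"}
reported: Dict[str, str] = {"resource/1": "aaa", "resource/2": "zzz"}
saved: List[Tuple[str, str, bool]] = []
alerts: List[str] = []


def alert_on_failure(resource_id: str, passed: bool) -> None:
    """Example post-check action: collect resources an admin should look at."""
    if not passed:
        alerts.append(resource_id)


run_fixity_cycle(
    fetch_resource_list=lambda: sorted(trusted),
    fetch_digest=reported.__getitem__,
    trusted_digest=trusted.__getitem__,
    persist=lambda rid, digest, ok: saved.append((rid, digest, ok)),
    post_check=alert_on_failure,
)
print(saved)
print(alerts)
```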

I've already started some issues... but it's fairly far along. I'd love some feedback on the direction Riprap is taking. I'll also probably need some help getting Symfony's ActiveMQ listener working at some point...

@DiegoPino
Contributor

I was just thinking about you guys @mjordan @dannylamb while doing some preservation action related work. Have you all seen this?
https://github.com/medusa-project/medusa-fixity

@mjordan
Contributor Author

mjordan commented Oct 10, 2018

Interesting but doesn't offer the features we want. From what I can tell it only logs digests and doesn't compare logged results with new fixity checks. Also doesn't support fetching a digest via HTTP, which is what the Fedora specification requires.

@kstapelfeldt kstapelfeldt added Type: question asks for support (asks a question) and removed question labels Sep 25, 2021