Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ODRC: implement Document creation/upload #19

Closed
2 tasks done
Tracked by #2
sergei-maertens opened this issue Sep 3, 2024 · 6 comments · Fixed by #106, #117 or #127
Closed
2 tasks done
Tracked by #2

ODRC: implement Document creation/upload #19

sergei-maertens opened this issue Sep 3, 2024 · 6 comments · Fixed by #106, #117 or #127

Comments

@sergei-maertens
Copy link
Contributor

sergei-maertens commented Sep 3, 2024

Acceptance criteria

Copied/moved from #2

  • Het format wordt gevalideerd wanneer het bestand geüpload wordt.
  • Via een API kan een document (data-object) gecreëerd, geraadpleegd, gemuteerd en verwijderd worden.
  • Bij een document moet exact één bestand (bijvoorbeeld een PDF) worden opgeslagen.
  • De creatie, mutatie of verwijdering van een document wordt gelogd (zie ODRC: Set up foundation for logging/audit trails #16)
  • De API specificatie is bijgewerkt (ReDoc, Swagger)
  • De documentatie is bijgewerkt (Read the Docs)

Document creation + uploads

We make use of the file parts mechanism of the Documenten API, always.

  • ODRC will proxy to the underlying documenten API

Preparing a document upload

The client must pass the necessary metadata and then based on the response of that,
cut up the file upload in parts that can be submitted individually.

The request body schema of a Document POST would look something like:

Document:
  type: object
  required:
    - identifier
  properties:
    publicatie:  # UUID of the publication it belongs to
      type: string
      format: uuid
    identifier:  # the 'primary' identifier
      type: string
    creatiedatum:
      type: string
      format: date
    officieleTitel:  # DiWoo doesn't seem to apply a max length
      type: string
    verkorteTitel:
      type: string
    omschrijving:
      type: string
    # from waardelijst -> expose options in separate endpoint (!)
    # POST (write) operations should be able to just provide the identifier IRI instead of this complex object if ICATT desires this
    bestandsformaat:  
      type: object
      properties:
        identifier:  # IRI from waardelijst
          type: string
          format: uri
        mimeType:  # e.g. application/pdf
          type: string
        naam:
          type: string  # e.g. "PDF"

This translates to a request of ODRC -> Documenten API with schema:

EnkelvoudigInformatieobject:
  type: object
  properties:
    identificatie:  # primary identifier of Document OR let the Documenten API generate one?
      type: string
    bronorganisatie:  # fixed, global configuration parameter in ODRC initially, could become 'smart' in the future
      type: string
    creatiedatum:  # taken from Document.creatiedatum
      type: string
      format: date
    titel:
      type: string
      minLength: 1
      maxLength: 200
    auteur:  # taken from Document.publicatie.organization
      type: string
    status:
      const: definitief  # archiving will move this to gearchiveerd, later
    formaat:  # derived from Document.bestandsformaat
      type: string
    taal:  # derived from Document.taal -> convert to/from ISO 639-2/B
      type: string
      enum:
        - dut
        - eng
    bestandsnaam:  # taken from Document.bestandsnaam
      type: string
    bestandsomvang:  # taken from Document.bestandsomvang, prepares the file parts
      type: number
    indicatieGebruiksrecht:
      const: false
    informatieobjecttype:  # points to /catalogi/api/v1/informatieobjecttypen/:uuid for the Document.publicatie.informatiecategorie
      type: string
      format: uri

The Documenten API will return a lock and list of BestandsDelen for upload, each
bestandsdeel will have the shape:

BestandsDeel:
  type: object
  properties:
    url:  # URL to PUT to
      type: string
      format: uri
    volgnummer:
      type: integer
    omvang:
      type: integer
    voltooid:
      const: false
    lock: # ??
      type: string

The ODRC will then expose endpoints for these part uploads so the publication component
can upload the parts:

URL: /api/v1/documenten/:uuid/bestandsdeel/:index

A bestandsdeel wil simply be multipart/form-data, with the API key as auth header.

The request will be transformed by the ODRC, which adds the lock ID & JWT for the
Documenten API, and streams the file part down to the Documenten API.

Once all parts are received, we unlock the created document.

Tasks

  • Implement API endpoint/serializer for POST /api/v1/documenten
    • Part of the metadata is stored in our database
    • Other metadata that we can store in Documenten API, we store there
    • Record the requested bestandsdelen from the Documenten API
    • Record the lock ID to finalize the upload/creation
    • Resolve the internal "Catalogi API" endpoint for the informatieobjecttype
  • Implement the API endpoint/serializer for PUT /api/v1/documenten/:uuid/bestandsdelen/:index
    • Use the stored lock ID & other configuration/metadata to all the Documenten API /api/v1/bestandsdelen/:uuid
    • If all parts are completely uploaded, unlock the document
    • Emit in the response body documentFinalized: true|false so that the client can be informed that they can refresh the document resource if needed, as file uploads will likely happen in parallel
    • Nice to have: check if we can make use of proxying/streaming (timebox: 1 day)
  • The first part of the document contains the magic bytes that can be used to validate against the format in the metadata. Check the upload validation in Open Forms for inspiration (and edge cases). postponed to ODRC: add model/admin interface for 'formaat' waardelijst #36
@sergei-maertens sergei-maertens changed the title Adding a document prepares the metadata + bestandsdeel URLs (TODO: spec ticket!) ODRC: implement Document creation/upload Sep 3, 2024
@sergei-maertens sergei-maertens added this to the ODRC: Basis milestone Sep 3, 2024
@MarcoKlerks MarcoKlerks moved this to Backlog in GPP-Woo Sep 12, 2024
@MarcoKlerks MarcoKlerks moved this from Backlog to Refinement in GPP-Woo Sep 17, 2024
@MarcoKlerks MarcoKlerks moved this from Refinement to Ready in GPP-Woo Oct 9, 2024
sergei-maertens added a commit that referenced this issue Oct 30, 2024
The Documents API integration fields do not hold any semantic meaning,
so it's easier to understand them when grouped together with related
fields and track them separately from the 'normal' model fields.

Eventually, the lock ID will need to get stored too, the code is
currently commented out since PR #96 needs to be merged first, as it
also has new migrations.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
Rather than embedding this everywhere and repeating the credentials,
we can simplify the test setup by using a factory with a trait for
our Open Zaak docker compose configuration. If/when we alter the
credentials/fixture, we only have a single place to update, while
keeping the semantic meaning in tests clear enough through the
parameter name.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
Added the initial step to create a document in the Documents API,
using the large file upload parts mechanism.

This sets up the public API to use in the future API endpoints,
allowing us to keep reasoning (mostly) in our own Python domain without
worrying about implementation details and API versions.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
The file uploads that we will receive in our endpoint can be proxied
to the Documents API from the client, which has the API credentials
and receives the lock value from our local state.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
This is the file step to be able to 'use' the document - once all
document parts are uploaded (via our proxy endpoint), we can ensure
that the document is unlocked and then ready for use.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
The Documents API integration fields do not hold any semantic meaning,
so it's easier to understand them when grouped together with related
fields and track them separately from the 'normal' model fields.

Eventually, the lock ID will need to get stored too, the code is
currently commented out since PR #96 needs to be merged first, as it
also has new migrations.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
Rather than embedding this everywhere and repeating the credentials,
we can simplify the test setup by using a factory with a trait for
our Open Zaak docker compose configuration. If/when we alter the
credentials/fixture, we only have a single place to update, while
keeping the semantic meaning in tests clear enough through the
parameter name.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
Added the initial step to create a document in the Documents API,
using the large file upload parts mechanism.

This sets up the public API to use in the future API endpoints,
allowing us to keep reasoning (mostly) in our own Python domain without
worrying about implementation details and API versions.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
The file uploads that we will receive in our endpoint can be proxied
to the Documents API from the client, which has the API credentials
and receives the lock value from our local state.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
This is the file step to be able to 'use' the document - once all
document parts are uploaded (via our proxy endpoint), we can ensure
that the document is unlocked and then ready for use.
sergei-maertens added a commit that referenced this issue Oct 30, 2024
Since multiple requests will be involved in the document creation and
file uploads, we need some persistent storage for the document lock
value, created by the Document create call.
sergei-maertens added a commit that referenced this issue Oct 31, 2024
The Documents API integration fields do not hold any semantic meaning,
so it's easier to understand them when grouped together with related
fields and track them separately from the 'normal' model fields.

Eventually, the lock ID will need to get stored too, the code is
currently commented out since PR #96 needs to be merged first, as it
also has new migrations.
sergei-maertens added a commit that referenced this issue Oct 31, 2024
Rather than embedding this everywhere and repeating the credentials,
we can simplify the test setup by using a factory with a trait for
our Open Zaak docker compose configuration. If/when we alter the
credentials/fixture, we only have a single place to update, while
keeping the semantic meaning in tests clear enough through the
parameter name.
sergei-maertens added a commit that referenced this issue Oct 31, 2024
Added the initial step to create a document in the Documents API,
using the large file upload parts mechanism.

This sets up the public API to use in the future API endpoints,
allowing us to keep reasoning (mostly) in our own Python domain without
worrying about implementation details and API versions.
sergei-maertens added a commit that referenced this issue Nov 8, 2024
* Test for invalid upload data, leading to validation error
* Test for generic IO-layer error (can't connect to host)
* Test for error response returned from Documenten API
sergei-maertens added a commit that referenced this issue Nov 8, 2024
* After registration of the metata, we expect to be able to upload the
  binary content of all file parts
* The uploads should be forwarded to the underlying Documents API
* Once all parts are complete, the document must be unlocked
* When we download the file data from the Documents API, this must
  match what we uploaded.
sergei-maertens added a commit that referenced this issue Nov 8, 2024
Using the new paths + path converters makes it easier to deal with
viewset actions for additional routes without needing to validate
the path parameter manually, and keeps everything consistent.

This requires disabling the 'format' endpoint variations, otherwise
they show up in the API specification (looks like drf-spectacular can't
filter those out properly yet), which requires setting the attribute
on the router as there is no __init__ kwarg for it.

For proper type casting and annotations, we need to specify the
converter on the viewset too.
sergei-maertens added a commit that referenced this issue Nov 8, 2024
Wire up the detail route to accept PUT requests with file part upload
data. The parser must be set to multipart to properly handle the file
uploads from the request data, and accordingly the serializer must
mark the appropriate fields as read/write only.

We opt to return a generic status code of 204, as the response data
of the upstream request has little value and this simplifies having to
relay the upstream 200 or 201 depending on whether the part data was
created or updated.
sergei-maertens added a commit that referenced this issue Nov 8, 2024
To write tests with multiple chunks that still remain fast, we need to
significantly lower the chunk size from 4GB to 100B (for example) so
that we don't have to send multiple GB of data and record that in the
VCR cassettes.
sergei-maertens added a commit that referenced this issue Nov 8, 2024
Added a test for an upload with more than one part and assert that the
response data of the chunk upload properly communicates the document
status as a whole.
sergei-maertens added a commit that referenced this issue Nov 8, 2024
* Test for invalid upload data, leading to validation error
* Test for generic IO-layer error (can't connect to host)
* Test for error response returned from Documenten API
@github-project-automation github-project-automation bot moved this from In review to Done in GPP-Woo Nov 8, 2024
@github-project-automation github-project-automation bot moved this from Done to In progress in GPP-Woo Nov 8, 2024
@sergei-maertens
Copy link
Contributor Author

@MarcoKlerks enige dat jij kan nalopen is de API documentatie, het technische testen zal voor @felixcicatt en collega's zijn :)

@sergei-maertens sergei-maertens moved this from In review to Testing in GPP-Woo Nov 8, 2024
@MarcoKlerks
Copy link
Contributor

@sergei-maertens Moeten we nog aparte user stories aanmaken voor de identificatie en validatie van het bestandformaat?

@felixcicatt Wil je nu al testen of pas wanneer jullie het ODPC hieraan gaan koppelen?

@felixcicatt
Copy link

@MarcoKlerks als het voor @sergei-maertens geen probleem is als de feedback iets later komt, heeft het mijn voorkeur om te wachten met testen tot we ODPC #42 oppakken

@sergei-maertens
Copy link
Contributor Author

@MarcoKlerks ze staan als task/checkbox in #36, dus ik denk het niet

@MarcoKlerks
Copy link
Contributor

Het is in DiWoo inderdaad een optioneel veld, dus een lagere prio dan P0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment