ODRC: implement Document creation/upload #19

sergei-maertens · 2024-09-03T13:12:09Z

Acceptance criteria

Copied/moved from #2

Het format wordt gevalideerd wanneer het bestand geüpload wordt.
Via een API kan een document (data-object) gecreëerd, geraadpleegd, gemuteerd en verwijderd worden.
Bij een document moet exact één bestand (bijvoorbeeld een PDF) worden opgeslagen.
De creatie, mutatie of verwijdering van een document wordt gelogd (zie ODRC: Set up foundation for logging/audit trails #16)
De API specificatie is bijgewerkt (ReDoc, Swagger)
De documentatie is bijgewerkt (Read the Docs)

Document creation + uploads

We make use of the file parts mechanism of the Documenten API, always.

ODRC will proxy to the underlying documenten API

Preparing a document upload

The client must pass the necessary metadata and then based on the response of that,
cut up the file upload in parts that can be submitted individually.

The request body schema of a Document POST would look something like:

Document:
  type: object
  required:
    - identifier
  properties:
    publicatie:  # UUID of the publication it belongs to
      type: string
      format: uuid
    identifier:  # the 'primary' identifier
      type: string
    creatiedatum:
      type: string
      format: date
    officieleTitel:  # DiWoo doesn't seem to apply a max length
      type: string
    verkorteTitel:
      type: string
    omschrijving:
      type: string
    # from waardelijst -> expose options in separate endpoint (!)
    # POST (write) operations should be able to just provide the identifier IRI instead of this complex object if ICATT desires this
    bestandsformaat:  
      type: object
      properties:
        identifier:  # IRI from waardelijst
          type: string
          format: uri
        mimeType:  # e.g. application/pdf
          type: string
        naam:
          type: string  # e.g. "PDF"

This translates to a request of ODRC -> Documenten API with schema:

EnkelvoudigInformatieobject:
  type: object
  properties:
    identificatie:  # primary identifier of Document OR let the Documenten API generate one?
      type: string
    bronorganisatie:  # fixed, global configuration parameter in ODRC initially, could become 'smart' in the future
      type: string
    creatiedatum:  # taken from Document.creatiedatum
      type: string
      format: date
    titel:
      type: string
      minLength: 1
      maxLength: 200
    auteur:  # taken from Document.publicatie.organization
      type: string
    status:
      const: definitief  # archiving will move this to gearchiveerd, later
    formaat:  # derived from Document.bestandsformaat
      type: string
    taal:  # derived from Document.taal -> convert to/from ISO 639-2/B
      type: string
      enum:
        - dut
        - eng
    bestandsnaam:  # taken from Document.bestandsnaam
      type: string
    bestandsomvang:  # taken from Document.bestandsomvang, prepares the file parts
      type: number
    indicatieGebruiksrecht:
      const: false
    informatieobjecttype:  # points to /catalogi/api/v1/informatieobjecttypen/:uuid for the Document.publicatie.informatiecategorie
      type: string
      format: uri

The Documenten API will return a lock and list of BestandsDelen for upload, each
bestandsdeel will have the shape:

BestandsDeel:
  type: object
  properties:
    url:  # URL to PUT to
      type: string
      format: uri
    volgnummer:
      type: integer
    omvang:
      type: integer
    voltooid:
      const: false
    lock: # ??
      type: string

The ODRC will then expose endpoints for these part uploads so the publication component
can upload the parts:

URL: /api/v1/documenten/:uuid/bestandsdeel/:index

A bestandsdeel wil simply be multipart/form-data, with the API key as auth header.

The request will be transformed by the ODRC, which adds the lock ID & JWT for the
Documenten API, and streams the file part down to the Documenten API.

Once all parts are received, we unlock the created document.

Tasks

Implement API endpoint/serializer for POST /api/v1/documenten
- Part of the metadata is stored in our database
- Other metadata that we can store in Documenten API, we store there
- Record the requested bestandsdelen from the Documenten API
- Record the lock ID to finalize the upload/creation
- Resolve the internal "Catalogi API" endpoint for the informatieobjecttype
Implement the API endpoint/serializer for PUT /api/v1/documenten/:uuid/bestandsdelen/:index
- Use the stored lock ID & other configuration/metadata to all the Documenten API /api/v1/bestandsdelen/:uuid
- If all parts are completely uploaded, unlock the document
- Emit in the response body documentFinalized: true|false so that the client can be informed that they can refresh the document resource if needed, as file uploads will likely happen in parallel
- Nice to have: check if we can make use of proxying/streaming (timebox: 1 day)
~~The first part of the document contains the magic bytes that can be used to validate against the format in the metadata. Check the upload validation in Open Forms for inspiration (and edge cases).~~ postponed to ODRC: add model/admin interface for 'formaat' waardelijst #36

The text was updated successfully, but these errors were encountered:

The Documents API integration fields do not hold any semantic meaning, so it's easier to understand them when grouped together with related fields and track them separately from the 'normal' model fields. Eventually, the lock ID will need to get stored too, the code is currently commented out since PR #96 needs to be merged first, as it also has new migrations.

Rather than embedding this everywhere and repeating the credentials, we can simplify the test setup by using a factory with a trait for our Open Zaak docker compose configuration. If/when we alter the credentials/fixture, we only have a single place to update, while keeping the semantic meaning in tests clear enough through the parameter name.

Added the initial step to create a document in the Documents API, using the large file upload parts mechanism. This sets up the public API to use in the future API endpoints, allowing us to keep reasoning (mostly) in our own Python domain without worrying about implementation details and API versions.

The file uploads that we will receive in our endpoint can be proxied to the Documents API from the client, which has the API credentials and receives the lock value from our local state.

This is the file step to be able to 'use' the document - once all document parts are uploaded (via our proxy endpoint), we can ensure that the document is unlocked and then ready for use.

The Documents API integration fields do not hold any semantic meaning, so it's easier to understand them when grouped together with related fields and track them separately from the 'normal' model fields. Eventually, the lock ID will need to get stored too, the code is currently commented out since PR #96 needs to be merged first, as it also has new migrations.

Rather than embedding this everywhere and repeating the credentials, we can simplify the test setup by using a factory with a trait for our Open Zaak docker compose configuration. If/when we alter the credentials/fixture, we only have a single place to update, while keeping the semantic meaning in tests clear enough through the parameter name.

Added the initial step to create a document in the Documents API, using the large file upload parts mechanism. This sets up the public API to use in the future API endpoints, allowing us to keep reasoning (mostly) in our own Python domain without worrying about implementation details and API versions.

The file uploads that we will receive in our endpoint can be proxied to the Documents API from the client, which has the API credentials and receives the lock value from our local state.

This is the file step to be able to 'use' the document - once all document parts are uploaded (via our proxy endpoint), we can ensure that the document is unlocked and then ready for use.

Since multiple requests will be involved in the document creation and file uploads, we need some persistent storage for the document lock value, created by the Document create call.

The Documents API integration fields do not hold any semantic meaning, so it's easier to understand them when grouped together with related fields and track them separately from the 'normal' model fields. Eventually, the lock ID will need to get stored too, the code is currently commented out since PR #96 needs to be merged first, as it also has new migrations.

Rather than embedding this everywhere and repeating the credentials, we can simplify the test setup by using a factory with a trait for our Open Zaak docker compose configuration. If/when we alter the credentials/fixture, we only have a single place to update, while keeping the semantic meaning in tests clear enough through the parameter name.

Added the initial step to create a document in the Documents API, using the large file upload parts mechanism. This sets up the public API to use in the future API endpoints, allowing us to keep reasoning (mostly) in our own Python domain without worrying about implementation details and API versions.

…izer

* Test for invalid upload data, leading to validation error * Test for generic IO-layer error (can't connect to host) * Test for error response returned from Documenten API

* After registration of the metata, we expect to be able to upload the binary content of all file parts * The uploads should be forwarded to the underlying Documents API * Once all parts are complete, the document must be unlocked * When we download the file data from the Documents API, this must match what we uploaded.

Using the new paths + path converters makes it easier to deal with viewset actions for additional routes without needing to validate the path parameter manually, and keeps everything consistent. This requires disabling the 'format' endpoint variations, otherwise they show up in the API specification (looks like drf-spectacular can't filter those out properly yet), which requires setting the attribute on the router as there is no __init__ kwarg for it. For proper type casting and annotations, we need to specify the converter on the viewset too.

Wire up the detail route to accept PUT requests with file part upload data. The parser must be set to multipart to properly handle the file uploads from the request data, and accordingly the serializer must mark the appropriate fields as read/write only. We opt to return a generic status code of 204, as the response data of the upstream request has little value and this simplifies having to relay the upstream 200 or 201 depending on whether the part data was created or updated.

To write tests with multiple chunks that still remain fast, we need to significantly lower the chunk size from 4GB to 100B (for example) so that we don't have to send multiple GB of data and record that in the VCR cassettes.

Added a test for an upload with more than one part and assert that the response data of the chunk upload properly communicates the document status as a whole.

…izer

* Test for invalid upload data, leading to validation error * Test for generic IO-layer error (can't connect to host) * Test for error response returned from Documenten API

sergei-maertens · 2024-11-08T16:24:29Z

@MarcoKlerks enige dat jij kan nalopen is de API documentatie, het technische testen zal voor @felixcicatt en collega's zijn :)

MarcoKlerks · 2024-11-12T16:21:50Z

@sergei-maertens Moeten we nog aparte user stories aanmaken voor de identificatie en validatie van het bestandformaat?

@felixcicatt Wil je nu al testen of pas wanneer jullie het ODPC hieraan gaan koppelen?

felixcicatt · 2024-11-13T07:57:49Z

@MarcoKlerks als het voor @sergei-maertens geen probleem is als de feedback iets later komt, heeft het mijn voorkeur om te wachten met testen tot we ODPC #42 oppakken

sergei-maertens · 2024-11-13T09:37:13Z

@MarcoKlerks ze staan als task/checkbox in #36, dus ik denk het niet

MarcoKlerks · 2024-11-13T12:59:04Z

Het is in DiWoo inderdaad een optioneel veld, dus een lagere prio dan P0.

sergei-maertens mentioned this issue Sep 3, 2024

ODRC: Publicatie & Document API resources #2

Closed

11 tasks

sergei-maertens changed the title ~~Adding a document prepares the metadata + bestandsdeel URLs (TODO: spec ticket!)~~ ODRC: implement Document creation/upload Sep 3, 2024

sergei-maertens added this to the ODRC: Basis milestone Sep 3, 2024

sergei-maertens added this to GPP-Woo Sep 3, 2024

MarcoKlerks moved this to Backlog in GPP-Woo Sep 12, 2024

felixcicatt mentioned this issue Sep 17, 2024

ODPC: Minimaal document in publicatieformulier GPP-Woo/GPP-app#22

Closed

MarcoKlerks assigned sergei-maertens Sep 17, 2024

MarcoKlerks moved this from Backlog to Refinement in GPP-Woo Sep 17, 2024

sergei-maertens mentioned this issue Sep 18, 2024

ODRC: Add Document data model #37

Closed

7 tasks

MarcoKlerks mentioned this issue Sep 18, 2024

ODPC: minimaal document versturen naar Registratiecomponent GPP-Woo/GPP-app#42

Closed

6 tasks

MarcoKlerks moved this from Refinement to Ready in GPP-Woo Oct 9, 2024

MarcoKlerks mentioned this issue Oct 16, 2024

ODBP: Minimale integratie Woo-index GPP-Woo/GPP-burgerportaal#9

Closed

4 tasks

sergei-maertens added the MVP label Oct 17, 2024

This was referenced Oct 22, 2024

ODRC: publicatie raadplegen (beheer) #50

Closed

ODRC: Set up foundation for logging/audit trails #16

Closed

sergei-maertens mentioned this issue Oct 30, 2024

Document creation and upload - implement Documents API client #106

Merged

sergei-maertens added a commit that referenced this issue Nov 8, 2024

🍱 [#19] Update generated API schema

bbf5a7f

sergei-maertens added a commit that referenced this issue Nov 8, 2024

✨ [#19] Ensure the right endpoint is included in the file part serial…

dc49564

…izer

sergei-maertens added a commit that referenced this issue Nov 8, 2024

✅ [#19] Add test for error flows

878c5f8

* Test for invalid upload data, leading to validation error * Test for generic IO-layer error (can't connect to host) * Test for error response returned from Documenten API

sergei-maertens added a commit that referenced this issue Nov 8, 2024

🍱 [#19] Update generated API schema

69bda93

sergei-maertens added a commit that referenced this issue Nov 8, 2024

✨ [#19] Ensure the right endpoint is included in the file part serial…

314bced

…izer

sergei-maertens added a commit that referenced this issue Nov 8, 2024

✅ [#19] Add test for error flows

b462977

* Test for invalid upload data, leading to validation error * Test for generic IO-layer error (can't connect to host) * Test for error response returned from Documenten API

sergei-maertens closed this as completed in #127 Nov 8, 2024

github-project-automation bot moved this from In review to Done in GPP-Woo Nov 8, 2024

sergei-maertens reopened this Nov 8, 2024

github-project-automation bot moved this from Done to In progress in GPP-Woo Nov 8, 2024

sergei-maertens assigned MarcoKlerks and unassigned sergei-maertens Nov 8, 2024

sergei-maertens assigned felixcicatt Nov 8, 2024

sergei-maertens moved this from In review to Testing in GPP-Woo Nov 8, 2024

MarcoKlerks closed this as completed Nov 13, 2024

github-project-automation bot moved this from Testing to Done in GPP-Woo Nov 13, 2024

felixcicatt mentioned this issue Nov 20, 2024

ODBP: endpoints voor robots.txt, sitemap index en DiWoo-sitemap GPP-Woo/GPP-burgerportaal#21

Closed

MarcoKlerks unassigned MarcoKlerks and felixcicatt Dec 16, 2024

MarcoKlerks added this to the Plateau 0 (MVP) milestone Dec 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ODRC: implement Document creation/upload #19

ODRC: implement Document creation/upload #19

sergei-maertens commented Sep 3, 2024 •

edited

Loading

sergei-maertens commented Nov 8, 2024

MarcoKlerks commented Nov 12, 2024

felixcicatt commented Nov 13, 2024

sergei-maertens commented Nov 13, 2024

MarcoKlerks commented Nov 13, 2024

ODRC: implement Document creation/upload #19

ODRC: implement Document creation/upload #19

Comments

sergei-maertens commented Sep 3, 2024 • edited Loading

Document creation + uploads

Preparing a document upload

sergei-maertens commented Nov 8, 2024

MarcoKlerks commented Nov 12, 2024

felixcicatt commented Nov 13, 2024

sergei-maertens commented Nov 13, 2024

MarcoKlerks commented Nov 13, 2024

sergei-maertens commented Sep 3, 2024 •

edited

Loading