
Add an ADR to keep document history in sync #9666

Open
pezholio wants to merge 1 commit into main from content-modelling/688-add-adr-for-adding-a-message-queue-consumer
Conversation

pezholio
Contributor

This adds an ADR to document our decision to add a RabbitMQ consumer to Whitehall to allow us to keep a record of when document updates have been triggered by an update to a dependent content block.
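For illustration, here's a minimal sketch of what such a consumer might look like on the Whitehall side, assuming the usual GOV.UK message queue processor interface. The class name, payload fields, association names and system user are assumptions rather than the final design, which is what the ADR itself discusses:

```ruby
require "json"

# Hypothetical Whitehall-side processor: payload fields and the history
# mechanism are assumptions for illustration, not the final design.
class ContentBlockUpdateProcessor
  def process(message)
    payload = JSON.parse(message.payload)

    document = Document.find_by(content_id: payload["content_id"])
    if document
      # Record the republish in the document's history, e.g. as an editorial
      # remark against the latest edition (assumed association names).
      document.latest_edition.editorial_remarks.create!(
        body: "Republished due to an update to a content block",
        author: User.find_by(name: "Scheduled Publishing Robot"), # assumed system user
      )
    end

    # Acknowledge so the message is removed from the queue.
    message.ack
  end
end
```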

@@ -0,0 +1,64 @@
# 5. Keep Document history in sync with RabbitMQ
Contributor

I wonder if it's worth grouping or labelling this and our last ADR somehow, to show that these are changes relating specifically to the Content Block Manager? Or referencing them in our engine? Just aware that if the engine was moved out of Whitehall, they might lose their context.

Contributor Author

I think in this case it's more related to actual Whitehall code. I do think anything related directly to the Content Block Manager should go elsewhere, though.

@ryanb-gds (Contributor) left a comment

Looks good! Couple of suggestions from me

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch from 3932918 to c868f93 on November 28, 2024 10:45
@pezholio pezholio requested a review from ryanb-gds November 28, 2024 10:46
@ChrisBAshton (Contributor) left a comment

Thanks for writing this up. A few comments below.

More broadly - I wonder if there's a simpler alternative. If we consider two sources of 'history':

  1. Changenotes/editorial remarks in Whitehall, pushed to Publishing API
  2. Republishing of content via content-block-updates in Publishing API, pushed back to Whitehall

Have we considered moving all history into Publishing API? On the surface of it, that seems cleaner to me (albeit we'd have the complexity of a one-off data migration from Whitehall to Publishing API):

  1. Public changenotes already exist in Publishing API - we'd just need to find a way of pushing 'internal' changenotes up too (editorial remarks etc)
  2. Let Publishing API worry about consolidating the history of public/internal/content-block-driven updates
  3. On page load of a document, Whitehall can pull a document's entire history from Publishing API

(This idea, or any others you've considered, would make a good "Alternatives considered" section at the bottom of the ADR, as we often do in RFCs 🙏 ).
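To make the shape of that alternative concrete, a rough sketch of the Whitehall side, assuming a hypothetical Publishing API endpoint that returns a document's consolidated history; the path, response shape and lack of auth handling are illustrative assumptions only:

```ruby
require "net/http"
require "json"
require "plek"

# Hypothetical: Publishing API exposes the consolidated history (public
# changenotes, internal changenotes, content-block-driven updates) for a
# document. The /v2/documents/:content_id/history path is an assumption.
def document_history(content_id)
  uri = URI("#{Plek.find('publishing-api')}/v2/documents/#{content_id}/history")
  response = Net::HTTP.get_response(uri) # auth omitted for brevity
  raise "Unexpected response: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# On page load of a document, Whitehall would render this list instead of
# (or merged with) its locally stored history.
```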

of when a change to a document has been triggered by an update to a content block.

In order to do this, we need to update the Publishing API to record an event when a document has been
republished as a result of a change to a content block. We can then have an endpoint that allows us to
Contributor

Worth being explicit that this would be a second change to Publishing API (the first to record the UpdateContentBlock events, and the second to expose a new API route for fetching UpdateContentBlock events for a given document by its content ID)?

(You refer to the events endpoint in the "Decision" section - is that what you're proposing to call it? If so, would you also be looking to expose other event types?)
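For illustration only, that second change might look something like this in Publishing API, assuming standard Rails conventions; the route, controller, query params and Event columns are assumptions, not the existing API:

```ruby
# config/routes.rb - assumed addition
get "/v2/events/:content_id" => "events#index"

# app/controllers/events_controller.rb - assumed controller; Publishing API's
# real Event model and columns may differ from this sketch.
class EventsController < ApplicationController
  def index
    events = Event.where(content_id: params[:content_id])
    # Optionally narrow to a single event type (e.g. content block updates).
    events = events.where(action: params[:event_type]) if params[:event_type]

    render json: events
  end
end
```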

Contributor Author

I've got a draft PR here (alphagov/publishing-api#2993), but I didn't want to date the ADR by linking to a PR that will have either been closed or merged by then. Will add a bit more context.

However, we still need a way to include these events in the history. Whitehall is particularly complex as
the document history is stored in the database and [paginated][1]. This means we can't fetch the events and
weave them into the history, as we don't have the entire history to hand to ensure we add the events to the
right place within the history.
Contributor

Did you consider changing Whitehall to make it possible to retrieve its entire history in one call? The PR that introduces the pagination doesn't really explain why they went down the pagination route. One could assume performance optimisation, but do we know if the query would be particularly slow without pagination?


Included in the events payload will be information about the triggering content block. We did consider
sending this information as part of the payload, but concluded that we should make the effort to make
the payload as small as possible, minimising bandwidth and reducing complexity in the Publishing API
Contributor

I don't quite follow this? Just thinking it through to see if I understand:

  1. A "content block update" call is made to Publishing API, sending full payload info of the nature of the change
  2. All documents consuming that content block are republished with the new "UpdatedViaContentBlock" event described earlier...
  3. ...and put onto the "published_documents" queue
  4. Whitehall, subscribed to the queue, grabs each item on the queue that has been republished as a result of a content block update
  5. For each item, make an API call to the /events endpoint in Publishing API to grab the full content block update details, and generate an EditorialRemark.

I'm confused about step 3. Line 56 sounds like you're wanting to put a minimal payload on the queue, but line 41 suggests you're going to reuse the existing queue (& its presumably large / unaltered payload)?

@ryanb-gds (Contributor)

> Have we considered moving all history into Publishing API? On the surface of it, that seems cleaner to me (albeit we'd have the complexity of a one-off data migration from Whitehall to Publishing API):
>
>   1. Public changenotes already exist in Publishing API - we'd just need to find a way of pushing 'internal' changenotes up too (editorial remarks etc)
>   2. Let Publishing API worry about consolidating the history of public/internal/content-block-driven updates
>   3. On page load of a document, Whitehall can pull a document's entire history from Publishing API

FWIW this was my first suggestion when I discussed it with Stu as well. I quite like the idea of Publishing API being responsible for all the data about, well, publishing. There would probably be some challenges though, e.g. change notes are part of Whitehall model validation at the moment

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch from c868f93 to 73a0177 on November 28, 2024 14:46
@ChrisBAshton (Contributor)

To add another voice to the suggestion of changenotes living in Publishing API: we've just had a request for internal changenotes in Manuals Publisher. We could obviously replicate the internal changenotes feature in Whitehall, but it feels like there's a more widespread need here that Publishing API could accommodate.

@pezholio (Contributor Author) commented Dec 2, 2024

I'm definitely of the opinion that this should be the direction of travel going forward, but given that Content Modelling is a small team, and the Whitehall team already have their own OKRs to focus on, I think it might be too big a piece of work to bite off at this point. It'd be good for us to find a way forward that addresses the need in this ADR, but also gets us towards the goal of having a single source of truth for document history...

@ChrisBAshton (Contributor)

@pezholio we've just chatted this through at our team's Dev Huddle and we think there's a way forward that addresses the need in the ADR and also goes down an architectural route that's more scalable long term.

The nice thing about the below approach is that you could stop at the MVP stage and it wouldn't be too bad, but there's also a nice end state in sight without too much additional effort (he says optimistically!). Shall we see about having a chat this week (perhaps with our TA or Lead Dev in attendance to advise)?

MVP

All that would need doing right now to deliver an MVP:

  1. Add a new endpoint to Publishing API to see the events for a particular document (as you've suggested in the ADR).
    The endpoint would expose all public changenote history, including the new 'content block update' changenotes.
  2. Query the endpoint when loading the Edition screen in Whitehall, and merge the document histories.
    NB, we'd only really care about merging the first page's worth of document history, if that makes it any simpler (rather than worrying about pagination).

It does mean making an extra call to Publishing API on document load, but that's the direction we feel it would be beneficial to go all-in on further down the line (and it has been done a lot elsewhere, e.g. Specialist Publisher is all Publishing API calls).
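As a rough sketch of the merging step, assuming both the local history page and the events returned by Publishing API expose a timestamp (the method and field names are assumptions):

```ruby
# Merge the first page of Whitehall's local document history with the
# content-block events fetched from Publishing API, newest first.
# Both collections are assumed to expose a created_at timestamp.
def merged_history(local_history_page, remote_events)
  window_start = local_history_page.map(&:created_at).min

  # Only interleave remote events that fall within the first page's window;
  # anything older belongs on a later page (see the caveat discussed below).
  in_window = remote_events.select { |event| event.created_at >= window_start }

  (local_history_page + in_window).sort_by(&:created_at).reverse
end
```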

Future work

The remaining work could be tackled in three phases (and not necessarily by the Content Modelling team!).

First, supporting editorial remarks / internal changenotes in Publishing API:

  1. Make a change to Publishing API so that one can send both 'public' and 'internal' changenotes...
  2. ...and ensure both kinds are included in the endpoint added earlier.

Second, sending all new Whitehall changenote history to Publishing API:

  1. Start sending internal changenotes to Publishing API, and...
  2. ...stop storing any new changenotes (public or internal) in Whitehall.

Then (finally) the last phase would be to consolidate and then remove all unused code:

  1. Migrate all historic public/internal changenotes out of Whitehall and into Publishing API
  2. Delete local changenotes from Whitehall, and remove the 'changenote merging' logic added earlier, as all changenotes should now be pulled from Publishing API

@pezholio (Contributor Author)

@ChrisBAshton Sorry, was deep in user research last week, but picking this up now. Happy to give that approach a go, but the only issue I see with the approach of only merging the first page of results is that we won't necessarily know what date window(s) to cover. For example:

A document has the following range of event datetimes for the first page:

2024-03-23T09:23:00
.....
2023-12-10T11:13:00

And a range of event datetimes for the second page

2023-11-22T12:27:00
...
2023-09-12T15:17:00

If we have an event that happens between 2023-11-22T12:27:00 (the newest event on the second page) and 2023-12-10T11:13:00 (the oldest event on the first page), it won't get picked up, because it doesn't occur within either page's range of events. I guess there's something we could do with getting the timestamp of the first item of the next page, but it'd be good to have a chat anyway.
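To make the gap concrete, a small sketch using the example timestamps above, showing how filtering events to one page's window at a time can drop an event that falls between pages:

```ruby
require "time"

first_page  = Time.parse("2023-12-10T11:13:00")..Time.parse("2024-03-23T09:23:00")
second_page = Time.parse("2023-09-12T15:17:00")..Time.parse("2023-11-22T12:27:00")

# A content block update that happened between the two pages' windows:
event_time = Time.parse("2023-12-01T08:00:00")

first_page.cover?(event_time)  # => false
second_page.cover?(event_time) # => false
# Neither window covers the event, so naive per-page filtering would drop it.
```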

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch 2 times, most recently from 3e6e422 to 905691f on December 11, 2024 12:13
@pezholio (Contributor Author)

I've updated this ADR to spell out the issues with interleaving the events, and tweaked the RabbitMQ proposal to use a brand new topic (rather than an existing topic). I'm still leaning towards the RabbitMQ solution, but I'm keeping an open mind. Thoughts welcome!

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch from 905691f to 7a1b27f on December 11, 2024 13:54
@ryanb-gds (Contributor) left a comment

Thanks for adding to this, Stu. I'd be interested to get your thoughts on the comment below.

Comment on lines +97 to +100
This would involve setting up a new RabbitMQ message topic in Publishing API that sends
messages when a content block update triggers a change to a document. This would be a brand new
topic that contains a thin message that includes the `content_id` of the document that has
been updated, when it was updated and information about the content block that triggered the update:
Contributor

This is interesting. In my imagination, we were going to send just a message for the content block and then leave Whitehall to do the lookup of all the documents that link to the block. However, I realise now that would mean searching through Govspeak, because Whitehall doesn't have a structured reference to content blocks. I can see now why the original plan was to use the existing message, because we'd now have to send two messages for each document. Hmm.

Whitehall does a thing where it parses out the references from Govspeak to contacts and to other editions each time an edition is saved (link). We could do the same with content block references perhaps, but I have not really thought through the consequences of that (performance etc.). It would allow us to just send the content block ID to Whitehall and update the document history though.
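For illustration, the parse-on-save idea might look roughly like this, along the lines of the Content Block Tools gem mentioned in the reply below; the embed pattern and method names here are assumptions for the sketch, not the gem's actual API or syntax:

```ruby
# Hypothetical parse-on-save: pull content block references out of an
# edition's Govspeak so Whitehall has a structured record of which blocks a
# document depends on. The {{embed:...}} pattern is an assumption.
EMBED_PATTERN = /{{embed:[a-z_]+:(?<content_id>[0-9a-f-]{36})}}/

def extract_content_block_references(govspeak)
  govspeak.scan(EMBED_PATTERN).flatten.uniq
end

# e.g. run from an after_save callback on Edition, persisting the IDs in a
# join table, so a thin "content block changed" message would be enough to
# find the affected documents.
```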

Contributor Author

We've got a similar thing (contained within the Content Block Tools gem) that we use to extract references from Govspeak (it's actually used in Whitehall at the moment too), so that's entirely something we could do.

4 participants