
Add an ADR to keep document history in sync #9666

Open
pezholio wants to merge 1 commit into main from content-modelling/688-add-adr-for-adding-a-message-queue-consumer
Conversation

pezholio
Contributor

This adds an ADR to document our decision to add a RabbitMQ consumer to Whitehall to allow us to keep a record of when document updates have been triggered by an update to a dependent content block.
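For illustration, here's a minimal sketch of what such a consumer might look like on the Whitehall side, assuming the usual GOV.UK message queue processor interface. The class name, payload fields, association names and system user are assumptions rather than the final design, which is what the ADR itself discusses:

```ruby
require "json"

# Hypothetical Whitehall-side processor: payload fields and the history
# mechanism are assumptions for illustration, not the final design.
class ContentBlockUpdateProcessor
  def process(message)
    payload = JSON.parse(message.payload)

    document = Document.find_by(content_id: payload["content_id"])
    if document
      # Record the republish in the document's history, e.g. as an editorial
      # remark against the latest edition (assumed association names).
      document.latest_edition.editorial_remarks.create!(
        body: "Republished due to an update to a content block",
        author: User.find_by(name: "Scheduled Publishing Robot"), # assumed system user
      )
    end

    # Acknowledge so the message is removed from the queue.
    message.ack
  end
end
```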

@@ -0,0 +1,64 @@
# 5. Keep Document history in sync with RabbitMQ
Contributor

I wonder if it's worth grouping or labelling this and our last ADR somehow, to show that these are changes relating specifically to the Content Block Manager? Or referencing them in our engine? Just aware that if the engine was moved out of Whitehall, they might lose their context.

Contributor Author

I think in this case it's more related to actual Whitehall code. I do think anything related directly to the Content Block Manager should go elsewhere, though.

@ryanb-gds (Contributor) left a comment

Looks good! Couple of suggestions from me

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch from 3932918 to c868f93 on November 28, 2024 10:45
@pezholio pezholio requested a review from ryanb-gds November 28, 2024 10:46
@ChrisBAshton (Contributor) left a comment

Thanks for writing this up. A few comments below.

More broadly - I wonder if there's a simpler alternative. If we consider two sources of 'history':

  1. Changenotes/editorial remarks in Whitehall, pushed to Publishing API
  2. Republishing of content via content-block-updates in Publishing API, pushed back to Whitehall

Have we considered moving all history into Publishing API? On the surface of it, that seems cleaner to me (albeit we'd have the complexity of a one-off data migration from Whitehall to Publishing API):

  1. Public changenotes already exist in Publishing API - we'd just need to find a way of pushing 'internal' changenotes up too (editorial remarks etc)
  2. Let Publishing API worry about consolidating the history of public/internal/content-block-driven updates
  3. On page load of a document, Whitehall can pull a document's entire history from Publishing API

(This idea, or any others you've considered, would make a good "Alternatives considered" section at the bottom of the ADR, as we often do in RFCs 🙏 ).
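To make the shape of that alternative concrete, a rough sketch of the Whitehall side, assuming a hypothetical Publishing API endpoint that returns a document's consolidated history; the path, response shape and lack of auth handling are illustrative assumptions only:

```ruby
require "net/http"
require "json"
require "plek"

# Hypothetical: Publishing API exposes the consolidated history (public
# changenotes, internal changenotes, content-block-driven updates) for a
# document. The /v2/documents/:content_id/history path is an assumption.
def document_history(content_id)
  uri = URI("#{Plek.find('publishing-api')}/v2/documents/#{content_id}/history")
  response = Net::HTTP.get_response(uri) # auth omitted for brevity
  raise "Unexpected response: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# On page load of a document, Whitehall would render this list instead of
# (or merged with) its locally stored history.
```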

of when a change to a document has been triggered by an update to a content block.

In order to do this, we need to update the Publishing API to record an event when a document has been
republished as a result of a change to a content block. We can then have an endpoint that allows us to
Contributor

Worth being explicit that this would be a second change to Publishing API (the first to record the UpdateContentBlock events, and the second to expose a new API route for fetching UpdateContentBlock events for a given document by its content ID)?

(You refer to the events endpoint in the "Decision" section - is that what you're proposing to call it? If so, would you also be looking to expose other event types?)
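For illustration only, that second change might look something like this in Publishing API, assuming standard Rails conventions; the route, controller, query params and Event columns are assumptions, not the existing API:

```ruby
# config/routes.rb - assumed addition
get "/v2/events/:content_id" => "events#index"

# app/controllers/events_controller.rb - assumed controller; Publishing API's
# real Event model and columns may differ from this sketch.
class EventsController < ApplicationController
  def index
    events = Event.where(content_id: params[:content_id])
    # Optionally narrow to a single event type (e.g. content block updates).
    events = events.where(action: params[:event_type]) if params[:event_type]

    render json: events
  end
end
```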

Contributor Author

I've got a draft PR here (alphagov/publishing-api#2993), but I didn't want to date the ADR by linking to a PR that will have either been closed or merged by then. Will add a bit more context.

However, we still need a way to include these events in the history. Whitehall is particularly complex as
the document history is stored in the database and [paginated][1]. This means we can't fetch the events and
weave them into the history, as we don't have the entire history to hand to ensure we add the events to the
right place within the history.
Contributor

Did you consider changing Whitehall to make it possible to retrieve its entire history in one call? The PR that introduces the pagination doesn't really explain why they went down the pagination route. One could assume performance optimisation, but do we know if the query would be particularly slow without pagination?


Included in the events payload will be information about the triggering content block. We did consider
sending this information as part of the payload, but concluded that we should make the effort to make
the payload as small as possible, minimising bandwidth and reducing complexity in the Publishing API
Contributor

I don't quite follow this? Just thinking it through to see if I understand:

  1. A "content block update" call is made to Publishing API, sending full payload info of the nature of the change
  2. All documents consuming that content block are republished with the new "UpdatedViaContentBlock" event described earlier...
  3. ...and put onto the "published_documents" queue
  4. Whitehall, subscribed to the queue, grabs each item on the queue that has been republished as a result of a content block update
  5. For each item, make an API call to the /events endpoint in Publishing API to grab the full content block update details, and generate an EditorialRemark.

I'm confused about step 3. Line 56 sounds like you're wanting to put a minimal payload on the queue, but line 41 suggests you're going to reuse the existing queue (& its presumably large / unaltered payload)?

@ryanb-gds (Contributor)

> Have we considered moving all history into Publishing API? On the surface of it, that seems cleaner to me (albeit we'd have the complexity of a one-off data migration from Whitehall to Publishing API):
>
>   1. Public changenotes already exist in Publishing API - we'd just need to find a way of pushing 'internal' changenotes up too (editorial remarks etc)
>   2. Let Publishing API worry about consolidating the history of public/internal/content-block-driven updates
>   3. On page load of a document, Whitehall can pull a document's entire history from Publishing API

FWIW this was my first suggestion when I discussed it with Stu as well. I quite like the idea of Publishing API being responsible for all the data about, well, publishing. There would probably be some challenges though, e.g. change notes are part of Whitehall model validation at the moment

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch from c868f93 to 73a0177 on November 28, 2024 14:46
@ChrisBAshton (Contributor)

To add another voice to the suggestion of changenotes living in Publishing API: we've just had a request for internal changenotes in Manuals Publisher. We could obviously replicate the internal changenotes feature in Whitehall, but it feels like there's a more widespread need here that Publishing API could accommodate.

@pezholio (Contributor Author) commented Dec 2, 2024

I'm definitely of the opinion that this should be the direction of travel going forward, but given that Content Modelling is a small team, and the Whitehall team already have their own OKRs to focus on, I think it might be too big a piece of work to bite off at this point. It'd be good for us to find a way forward that addresses the need in this ADR, but also gets us towards the goal of having a single source of truth for document history...

@ChrisBAshton (Contributor)

@pezholio we've just chatted this through at our team's Dev Huddle and we think there's a way forward that addresses the need in the ADR and also goes down an architectural route that's more scalable long term.

The nice thing about the below approach is that you could stop at the MVP stage and it wouldn't be too bad, but there's also a nice end state in sight without too much additional effort (he says optimistically!). Shall we see about having a chat this week (perhaps with our TA or Lead Dev in attendance to advise)?

MVP

All that would need doing right now to deliver an MVP:

  1. Add a new endpoint to Publishing API to see the events for a particular document (as you've suggested in the ADR).
    The endpoint would expose all public changenote history, including the new 'content block update' changenotes.
  2. Query the endpoint when loading the Edition screen in Whitehall, and merge the document histories.
    NB, we'd only really care about merging the first page's worth of document history, if that makes it any simpler (rather than worrying about pagination).

It does mean making an extra call to Publishing API on document load, but that's the direction we feel it would be beneficial to go all-in on further down the line (and it has been done a lot elsewhere, e.g. Specialist Publisher is all Publishing API calls).
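As a rough sketch of the merging step, assuming both the local history page and the events returned by Publishing API expose a timestamp (the method and field names are assumptions):

```ruby
# Merge the first page of Whitehall's local document history with the
# content-block events fetched from Publishing API, newest first.
# Both collections are assumed to expose a created_at timestamp.
def merged_history(local_history_page, remote_events)
  window_start = local_history_page.map(&:created_at).min

  # Only interleave remote events that fall within the first page's window;
  # anything older belongs on a later page (see the caveat discussed below).
  in_window = remote_events.select { |event| event.created_at >= window_start }

  (local_history_page + in_window).sort_by(&:created_at).reverse
end
```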

Future work

The remaining work could be tackled in three phases (and not necessarily by the Content Modelling team!).

First, supporting editorial remarks / internal changenotes in Publishing API:

  1. Make a change to Publishing API so that one can send both 'public' and 'internal' changenotes...
  2. ...and ensure both kinds are included in the endpoint added earlier.

Second, sending all new Whitehall changenote history to Publishing API:

  1. Start sending internal changenotes to Publishing API, and...
  2. ...stop storing any new changenotes (public or internal) in Whitehall.

Then (finally) the last phase would be to consolidate and then remove all unused code:

  1. Migrate all historic public/internal changenotes out of Whitehall and into Publishing API
  2. Delete local changenotes from Whitehall, and remove the 'changenote merging' logic added earlier, as all changenotes should now be pulled from Publishing API

@pezholio (Contributor Author)

@ChrisBAshton Sorry, was deep in user research last week, but picking this up now. Happy to give that approach a go, but the only issue I see with the approach of only merging the first page of results is that we won't necessarily know what date window(s) to cover. For example:

A document has the following range of event datetimes for the first page:

2024-03-23T09:23:00
.....
2023-12-10T11:13:00

And a range of event datetimes for the second page

2023-11-22T12:27:00
...
2023-09-12T15:17:00

If we have an event that happens between 2023-11-22T12:27:00 (the newest event on the second page) and 2023-12-10T11:13:00 (the oldest event on the first page), it won't get picked up, because it doesn't occur within either page's range of events. I guess there's something we could do with getting the timestamp of the first item of the next page, but it'd be good to have a chat anyway.
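To make the gap concrete, a small sketch using the example timestamps above, showing how filtering events to one page's window at a time can drop an event that falls between pages:

```ruby
require "time"

first_page  = Time.parse("2023-12-10T11:13:00")..Time.parse("2024-03-23T09:23:00")
second_page = Time.parse("2023-09-12T15:17:00")..Time.parse("2023-11-22T12:27:00")

# A content block update that happened between the two pages' windows:
event_time = Time.parse("2023-12-01T08:00:00")

first_page.cover?(event_time)  # => false
second_page.cover?(event_time) # => false
# Neither window covers the event, so naive per-page filtering would drop it.
```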

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch 2 times, most recently from 3e6e422 to 905691f on December 11, 2024 12:13
@pezholio (Contributor Author)

I've updated this ADR to spell out the issues with interleaving the events, and tweaked the RabbitMQ proposal to use a brand new topic (rather than an existing topic). I'm still leaning towards the RabbitMQ solution, but I'm keeping an open mind. Thoughts welcome!

@pezholio pezholio force-pushed the content-modelling/688-add-adr-for-adding-a-message-queue-consumer branch from 905691f to 7a1b27f on December 11, 2024 13:54
@ryanb-gds (Contributor) left a comment

Thanks for adding to this, Stu. I'd be interested to get your thoughts on the comment below.

Comment on lines +97 to +100
This would involve setting up a new RabbitMQ message topic in Publishing API that sends
messages when a content block update triggers a change to a document. This would be a brand new
topic that contains a thin message that includes the `content_id` of the document that has
been updated, when it was updated and information about the content block that triggered the update:
Contributor

This is interesting. In my imagination, we were going to send just a message for the content block and then leave Whitehall to do the lookup of all the documents that link to the block. However, I realise now that would mean searching through Govspeak, because Whitehall doesn't have a structured reference to content blocks. I can see now why the original plan was to use the existing message, because we'd now have to send two messages for each document. Hmm.

Whitehall does a thing where it parses out the references from Govspeak to contacts and to other editions each time an edition is saved (link). We could do the same with content block references perhaps, but I have not really thought through the consequences of that (performance etc.). It would allow us to just send the content block ID to Whitehall and update the document history though.
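For illustration, the parse-on-save idea might look roughly like this, along the lines of the Content Block Tools gem mentioned in the reply below; the embed pattern and method names here are assumptions for the sketch, not the gem's actual API or syntax:

```ruby
# Hypothetical parse-on-save: pull content block references out of an
# edition's Govspeak so Whitehall has a structured record of which blocks a
# document depends on. The {{embed:...}} pattern is an assumption.
EMBED_PATTERN = /{{embed:[a-z_]+:(?<content_id>[0-9a-f-]{36})}}/

def extract_content_block_references(govspeak)
  govspeak.scan(EMBED_PATTERN).flatten.uniq
end

# e.g. run from an after_save callback on Edition, persisting the IDs in a
# join table, so a thin "content block changed" message would be enough to
# find the affected documents.
```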

Contributor Author

We've got a similar thing (contained within the Content Block Tools gem) that we use to extract references from Govspeak (it's actually used in Whitehall at the moment too), so that's entirely something we could do.

4 participants