AFD Parsing, round one #1563

eric-gade · 2024-08-14T20:15:10Z

What does this PR do? 🛠️

This PR addresses #1526, parsing the AFD raw text product into markup we can use.

It does not deal with styling that markup, which is slated to be handled in #1527.

What does the reviewer need to know? 🤔

Summary of structure

The AFD is divided into the following sections:

Codes and description (we call this the "preamble").
    These lines appear before the first AFD header.
The main body (we call this the "body").
    These lines are made up of a combination of headers,
    text paragraphs, and optionally subheaders.
    Headers are lines that begin with a . and contain
    uppercase labels, with optional ellipses and post-header
    text.
    Subheaders are lines whose text is surrounded on both
    ends by ellipses.
    NOTE: Header sections are generally separated at the end by
    a line with just "&&", but we do not use those in this parser.
Everything else (we call this the "epilogue").
    The main part of the body ends with "$$". There is sometimes
    extra text after this point, usually authorship attribution.
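To make the three regions concrete, here is a minimal sketch that cuts a raw product into preamble, body, and epilogue. The function name `splitAfdRegions` and the sample text are invented for illustration and are not taken from this PR; the actual parser works paragraph by paragraph rather than by offsets.

```php
<?php
// Hypothetical helper (not part of this PR): cut raw AFD text into the
// three regions described above.
function splitAfdRegions(string $text): array
{
    // Assumption: the body starts at the first line that begins with a
    // single dot followed by a non-dot, non-space character.
    $bodyStart = preg_match('/^\.[^.\s]/m', $text, $m, PREG_OFFSET_CAPTURE)
        ? $m[0][1]
        : strlen($text);

    // The body ends at the first "$$" marker found after it starts.
    $end = strpos($text, '$$', $bodyStart);
    $bodyEnd = $end === false ? strlen($text) : $end;

    return [
        'preamble' => trim(substr($text, 0, $bodyStart)),
        'body' => trim(substr($text, $bodyStart, $bodyEnd - $bodyStart)),
        'epilogue' => $end === false ? '' : trim(substr($text, $end + 2)),
    ];
}

// Invented sample text, shaped like the structure described above.
$sample = implode("\n", [
    '000',
    'FXUS61 KOKX 141200',
    'AFDOKX',
    '',
    'Area Forecast Discussion',
    '',
    '.SYNOPSIS...',
    'High pressure builds in.',
    '',
    '&&',
    '',
    '.NEAR TERM /THROUGH TONIGHT/...',
    'Dry weather.',
    '',
    '&&',
    '',
    '$$',
    '',
    'JD',
]);

$regions = splitAfdRegions($sample);
```

In this sketch the "&&" separators stay inside the body region untouched, mirroring the note that the parser does not rely on them.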

Parsing strategy

The general idea is to split the text up into "paragraphs,"
defined as any contiguous chunks of text separated by two
newlines.

We then attempt to parse out headers, subheaders, and guess at the
structure of the subsequent body text.

The parser tracks a mode called currentContentType. It updates this
mode whenever a paragraph's header type/name, or another indicator
token, signals a new kind of content. Currently we have the following types:

preamble (the initial setting / default)
wwa (Watches/Warnings/Advisories header content)
generic (All other header-type content)
epilogue (the epilogue content)
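The mode switching described above could look roughly like the sketch below. Only the name `currentContentType` and the four type names come from this description; the function `tagParagraphs`, the rule used to detect a WWA header, and the sample text are assumptions made for illustration. It assumes PHP 8 for `str_starts_with`.

```php
<?php
// Hypothetical sketch (not the PR's implementation): tag each paragraph
// with the content type that is active when it is encountered.
function tagParagraphs(string $text): array
{
    $currentContentType = 'preamble'; // the initial setting / default
    $tagged = [];

    // Paragraphs are chunks of text separated by two or more newlines.
    foreach (preg_split('/\n{2,}/', trim($text)) as $paragraph) {
        if ($currentContentType !== 'epilogue') {
            if (str_starts_with($paragraph, '$$')) {
                // "$$" closes the body; everything after is epilogue.
                $currentContentType = 'epilogue';
                $paragraph = ltrim(substr($paragraph, 2));
            } elseif (preg_match('/^\.([^.]+)\.{3}/', $paragraph, $m)) {
                // Assumption: a header label mentioning watches, warnings,
                // or advisories selects the "wwa" mode.
                $isWwa = preg_match('/WATCHES|WARNINGS|ADVISORIES/', $m[1]);
                $currentContentType = $isWwa ? 'wwa' : 'generic';
            }
        }
        // Skip empty leftovers and the "&&" separator lines.
        if ($paragraph !== '' && $paragraph !== '&&') {
            $tagged[] = ['type' => $currentContentType, 'text' => $paragraph];
        }
    }

    return $tagged;
}

// Invented sample text for illustration.
$sample = implode("\n\n", [
    'AFDOKX',
    ".SYNOPSIS...\nHigh pressure.",
    '&&',
    ".OKX WATCHES/WARNINGS/ADVISORIES...\nNone.",
    '$$',
    'JD',
]);

$types = array_column(tagParagraphs($sample), 'type');
```

Once the mode reaches epilogue it never switches back, so stray dotted lines in an attribution block are not mistaken for headers.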

Output

Like our other parsers, the expected output of the overall parse()
method is an array of "nodes" (associative arrays with a type property).
These nodes are then given to Twig to render based on their type and
other attributes.

This parser contains an additional helper method that structures
the nodes into separate sections, to make rendering in Twig more
logical.
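As a rough picture of that output, the sketch below shows a flat node list and one way a helper might group it into sections. The node type names, the `content` key, and the function `groupIntoSections` are all hypothetical; the real parser's node shapes and helper may differ.

```php
<?php
// Hypothetical node shapes; the real parser's type names may differ.
$nodes = [
    ['type' => 'preambleText', 'content' => 'Area Forecast Discussion'],
    ['type' => 'header', 'content' => 'SYNOPSIS'],
    ['type' => 'text', 'content' => 'High pressure builds in.'],
    ['type' => 'header', 'content' => 'NEAR TERM'],
    ['type' => 'subheader', 'content' => 'Key Messages'],
    ['type' => 'text', 'content' => 'Dry weather.'],
];

// Group a flat node list so that each header node starts a new section.
function groupIntoSections(array $nodes): array
{
    $sections = [];
    $current = null;

    foreach ($nodes as $node) {
        if ($node['type'] === 'header') {
            // Close the previous section before opening a new one.
            if ($current !== null) {
                $sections[] = $current;
            }
            $current = ['header' => $node['content'], 'nodes' => []];
        } elseif ($current !== null) {
            $current['nodes'][] = $node;
        }
        // Nodes that precede the first header are skipped in this sketch.
    }

    if ($current !== null) {
        $sections[] = $current;
    }

    return $sections;
}

$sections = groupIntoSections($nodes);
```

A template can then loop over sections and render each header with its child nodes, instead of tracking section boundaries itself.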

Alternative stream-based parser

I experimented with an alternative stream-based parser at first. There is possibly some merit to it (especially the LineStream class). I have stashed it in a separate branch that we should keep around in case we need it. Both the (unfinished) version of the parser and the LineStream class have comprehensive passing tests.

Screenshots (if appropriate): 📸

greg-does-weather

A couple of minor questions that I will investigate myself tomorrow, but if you happen to already know, then merge at will.

Anyway, this looks really good. The code and test quality are superb.

greg-does-weather · 2024-08-20T22:31:05Z

web/modules/weather_data/src/Service/AFDParser.php

+        }
+
+        // See if this paragraph contains a top level header
+        $headerRegex = "/^\.(?<header>[^\.]+)[\.]{3}?(?<after>.*)\n/mU";


Are there ever lines that start with a dot and are not top-level headers? In the description, you noted that the headers start with a dot and are in all uppercase. If other kinds of lines can't also start with a dot, then this is all fine, but if it could happen, I'm wondering if we should have a little bit stricter check here.

The subheaders are lines that begin with three dots (they might have whitespace preceding the dots) and end with three dots. That is the only other case I can think of where a line might start with a dot. Subheaders would not be matched by this rule, I don't think.

It's true that headers tend to be \.[A-Z]+. However, sometimes they have other symbols in them, like

.HEADER /IMPORTANT TEXT/...More text after ellipses for some reason

Definitely open to suggestions on how to improve the rx here. Unfortunately there seems to be a lot of behavior "in the wild" that is not documented in the official spec

Let's leave it be for now. This could be useful for finding outliers and then we can address those directly instead of hypothetically.
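To check the cases raised in this thread, the pattern can be run against a few sample lines. The regex below is copied from the diff; the sample strings are invented for illustration, except the second one, which reuses the example header quoted above.

```php
<?php
// The pattern from the diff under review, copied as-is.
$headerRegex = "/^\.(?<header>[^\.]+)[\.]{3}?(?<after>.*)\n/mU";

// Sample lines, one per case discussed in this thread.
$plainHeader = ".SYNOPSIS...\nHigh pressure builds in.\n";
$busyHeader = ".HEADER /IMPORTANT TEXT/...More text after ellipses for some reason\n";
$subheader = "...KEY MESSAGES...\n";

// A plain header: label only, nothing after the ellipsis.
preg_match($headerRegex, $plainHeader, $plain);

// A header with extra symbols and trailing text after the ellipsis.
preg_match($headerRegex, $busyHeader, $busy);

// A subheader starts with three dots, so the header group (which needs at
// least one non-dot character right after the first dot) cannot match.
$subheaderMatched = preg_match($headerRegex, $subheader);

printf("plain: header=[%s] after=[%s]\n", $plain['header'], $plain['after']);
printf("busy:  header=[%s] after=[%s]\n", $busy['header'], $busy['after']);
printf("subheader matched: %d\n", $subheaderMatched);
```

Two details of the pattern are easy to miss: the trailing `?` in `{3}?` only toggles greediness on a fixed count, so the three dots are still required, and `[^\.]` also matches newlines, so a dotted line with no ellipsis can still match if an ellipsis appears on a later line of the same string.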

greg-does-weather · 2024-08-20T22:44:54Z

web/modules/weather_data/src/Service/AFDParser.php

+        if (preg_match("/SYNOPSIS/", $currentString)) {
+            $test = true;
+        }


What's this for?

Good catch! This is an annoying thing I have to do in the debugger in order to get it to break inside a statement block. If there is a blank line inside this if statement, and you put a breakpoint in there, the debugger doesn't break! So annoying. Anyway, fixed in 2b93110

Oh yeah, I hate that. The debugger also doesn't always stop where you expect it to. Very cool. Good debug. Much fun. Wow.

greg-does-weather · 2024-08-20T22:47:54Z

web/modules/weather_data/src/Service/AFDParser.php

+        $indentRegex = "/\n    +/";
+        $currentString = preg_replace($indentRegex, "", $currentString);


I haven't pulled this down and tested it yet, but just one question on observation: does this remove the space between the two lines, so we could end up with

THIS IS A WARNING ABOUT A STORM

becoming

THIS IS A WARNINGABOUT A STORM

I'll pull it down locally tomorrow morning.

I think there is a test that covers it here but it's possible I've overlooked how this gets used and/or over-fit the case to one specific example

Still haven't checked it locally (I'm working on it, I promise), but I think that test covers the epilogue text, which preserves the newlines rather than concatenates them into a single string? Also 100% open to the probability that I'm misreading this. 😅

So I jammed in this text, to see how it would handle it:

And this was the output:

So it looks like it is eating a space character. However, what makes it tricky is that the preceding line break ends with a - and so leaving out the space works well for that (GMZ730-755-765-775 as an unbroken string). I dunno which is the better outcome or how likely either one is, though.
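The trade-off described here can be reproduced in isolation. The pattern below is copied from the diff; the two input strings are reconstructions of the cases mentioned in this thread, with an assumed four-space continuation indent. The function `joinIndentedLines` is a hypothetical alternative for discussion, not the change that was merged.

```php
<?php
// The pattern from the diff: a newline followed by four or more spaces.
$indentRegex = "/\n    +/";

// Assumed inputs: a wrapped sentence and a wrapped zone-code list.
$sentence = "THIS IS A WARNING\n    ABOUT A STORM";
$zones = "GMZ730-755-\n    765-775";

// Replacing with an empty string joins the lines with no separator.
$joinedSentence = preg_replace($indentRegex, "", $sentence);
$joinedZones = preg_replace($indentRegex, "", $zones);

// Hypothetical alternative: join with nothing after a hyphen, and with
// a single space everywhere else.
function joinIndentedLines(string $text): string
{
    // Line wrapped right after a hyphen: keep the code list unbroken.
    $text = preg_replace('/(?<=-)\n {4,}/', '', $text);
    // Any other wrapped line: restore the space between the words.
    return preg_replace('/\n {4,}/', ' ', $text);
}

$fixedSentence = joinIndentedLines($sentence);
$fixedZones = joinIndentedLines($zones);
```

With the original pattern the sentence loses its space while the zone list comes out intact; the two-pass variant keeps both readable, at the cost of assuming that a trailing hyphen always means a continued code list.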

Co-authored-by: Greg Walker <[email protected]>

eric-gade and others added 20 commits August 6, 2024 09:28

Adding initial route, template, and theme handler

5ff61de

First working version

7ad833b

Updating bundler to handle products comprehensively

2753a3c

Adding e2e and test data for sample AFD product structures

69e3e15

Swapping order of markup

c0e38ba

js lints

b9fd732

php lints

1f9b5be

comments / explanations

8618ef4

Merge branch 'main' into eg-1525-initial-afd-page

06d9591

Removing cruft

745f235

Removing caching from the route.

2b67ca6

-- What When it comes to standard routes, Drupal attempts to cache the response _even if_ a querystring param has changed in the incoming URL.

Initial commit, saving progress

e6c774e

Initial experimental line stream based parser

d761124

Updating with testing parity for old/new parser versions

249c64a

Isolating to single normal parser and updating

3662d1d

-- What All of the tests should now be passing. We have also added some parsing for the so-called 'epilogue' section

formatting

92805e1

Removing LineStream and tests

6b9ad52

lints

0b85d94

Adding a very long descriptive comment

8160f0b

This is for our future selves!

Cleaning up file

cd3ff03

eric-gade requested review from greg-does-weather and partly-igor August 14, 2024 20:15

eric-gade and others added 6 commits August 14, 2024 16:41

Merge branch 'main' into eg-1526-afd-markup

e9d5c2f

Fixing call to parse util method plus lints

22ed82c

Merge branch 'main' into eg-1526-afd-markup

3fde603

Removing dupe function declaration

795c0a1

This, once again, seems to be some Github auto-merge error

GRRRR

dbc2cd3

Addign back the include

c57c1ae

Once again, merge hell has undone some work. Here we add back the inclusion of the (parsed) afd partial

greg-does-weather approved these changes Aug 20, 2024

Removing debug code

2b93110

eric-gade and others added 2 commits August 21, 2024 13:17

Removing bad regex and updating tests

70c3e2b

Co-authored-by: Greg Walker <[email protected]>

Merge branch 'main' into eg-1526-afd-markup

9d830bb

eric-gade enabled auto-merge August 21, 2024 19:42

Merge branch 'main' into eg-1526-afd-markup

4f62e97

eric-gade merged commit 0f2d88e into main Aug 21, 2024
17 checks passed

eric-gade deleted the eg-1526-afd-markup branch August 21, 2024 20:23

		$indentRegex = "/\n +/";
		$currentString = preg_replace($indentRegex, "", $currentString);
