Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AFD Parsing, round one #1563

Merged
merged 30 commits into from
Aug 21, 2024
Merged

AFD Parsing, round one #1563

merged 30 commits into from
Aug 21, 2024

Conversation

eric-gade
Copy link
Collaborator

What does this PR do? 🛠️

This PR addresses #1526, parsing the AFD raw text product into markup we can use.

It does not deal with styling that markup, which is slated to be handled in #1527

What does the reviewer need to know? 🤔

Summary of structure

The AFD is divided into the following sections:

  1. Codes and description (we call this the "preamble").
    These lines appear before the first AFD header
  2. The main body (we call this the body).
    These lines are made up of a combination of headers,
    text paragraphs, and optionally sub-headers.
    Headers are lines that begin with a . and contain
    uppercase labels, with optional ellipses and post-header
    text.
    Subheaders are lines whose text is surrounded on both
    ends by ellipses
    NOTE: Header sections are generally separated at the end by
    a line with just "&&", but we do not use those in this parser.
  3. Everything else (we call this the "epilogue").
    The main part of the body ends with "$$". There is sometimes
    extra text after this point, usually authorship attribution.

Parsing strategy

The general idea is to split the text up into "paragraphs,"
defined as any contiguous chunks of text separated by two
newlines.

We then attempt to parse out headers, subheaders, and guess at the
structure of the subsequent body text.

The parser will set a kind of mode called currentContentType when
it encounters a type of paragraph based on an encountered header
type/name or other indicator token. Currently we have the following types:

  • preamble (the initial setting / default)
  • wwa (Watches/Warnings/Advisories header content)
  • generic (All other header-type content)
  • epilogue (the epilogue content)

Output

Like our other parsers, the expected output of the overall parse()
method is an array of "nodes" (associative arrays with a type property).
These nodes are then given to Teig to render based on their type and
other attributes.

This parser contains an additional helper method that structures
the nodes into separate sections, to make rendering in Twig more
logical.

Alternative stream-based parser

I experimented with an alternative stream-based parser at first. There is possibly some merit to it (especially the LineStream class). I have stashed it in a separate branch that we should keep around in case we need it. Both the (unfinished) version of the parser and the LineStream class have comprehensive passing tests.

Screenshots (if appropriate): 📸

localhost_8080_afd_wfo=OKX

eric-gade and others added 6 commits August 14, 2024 16:41
This, once again, seems to be some Github auto-merge error
Once again, merge hell has undone some work. Here we add back the
inclusion of the (parsed) afd partial
Copy link
Collaborator

@greg-does-weather greg-does-weather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple minor questions that I will investigate myself tomorrow, but if you happen to already know, then merge at-will.

Anyway, this looks really good. The code and test quality are superb.

}

// See if this paragraph contains a top level header
$headerRegex = "/^\.(?<header>[^\.]+)[\.]{3}?(?<after>.*)\n/mU";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there ever lines that start with a dot and are not top-level headers? In the description, you noted that the headers start with a dot and are in all uppercase. If other kinds of lines can't also start with a dot, then this is all fine, but if it could happen, I'm wondering if we should have a little bit stricter check here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subheaders are lines that begin with three dots (they might have whitespace preceding the dots) and end with three dots. That is the only other case I can think of where a line might start with a dot. Subheaders would not be matched by this rule, I don't think.

It's true that headers tend to be \.[A-Z]+. However, sometimes they have other symbols in them, like

.HEADER /IMPORTANT TEXT/...More text after ellipses for some reason

Definitely open to suggestions on how to improve the rx here. Unfortunately there seems to be a lot of behavior "in the wild" that is not documented in the official spec

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave it be for now. This could be useful for finding outliers and then we can address those directly instead of hypothetically.

Comment on lines 241 to 243
if (preg_match("/SYNOPSIS/", $currentString)) {
$test = true;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! This is an annoying thing I have to do in the debugger in order to get it to break inside a statement block. If there is a blank line inside this if statement, and you put a breakpoint in there, the debugger doesn't break! So annoying. Anyway, fixed in 2b93110

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I hate that. The debugger also doesn't always stop where you expect it to. Very cool. Good debug. Much fun. Wow.

Comment on lines 244 to 245
$indentRegex = "/\n +/";
$currentString = preg_replace($indentRegex, "", $currentString);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't pulled this down and tested it yet, but just one question on observation: does this remove the space between the two lines, so we could end up with

THIS IS A WARNING
ABOUT A STORM

becoming

THIS IS A WARNINGABOUT A STORM

I'll pull it down locally tomorrow morning.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a test that covers it here but it's possible I've overlooked how this gets used and/or over-fit the case to one specific example

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still haven't checked it locally (I'm working on it, I promise), but I think that test covers the epilogue text, which preserves the newlines rather than concatenates them into a single string? Also 100% open to the probability that I'm misreading this. 😅

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I jammed in this text, to see how it would handle it:

Screenshot 2024-08-21 at 9 47 46 AM

And this was the output:
Screenshot 2024-08-21 at 10 26 29 AM

So it looks like it is eating a space character. However, what makes it tricky is that the preceding line break ends with a - and so leaving out the space works well for that (GMZ730-755-765-775 as an unbroken string). I dunno which is the better outcome or how likely either one is, though.

@eric-gade eric-gade merged commit 0f2d88e into main Aug 21, 2024
17 checks passed
@eric-gade eric-gade deleted the eg-1526-afd-markup branch August 21, 2024 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants