-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AFD Parsing, round one #1563
AFD Parsing, round one #1563
Conversation
-- What When it comes to standard routes, Drupal attempts to cache the response _even if_ a querystring param has changed in the incoming URL.
-- What All of the tests should now be passing. We have also added some parsing for the so-called 'epilogue' section
This is for our future selves!
This, once again, seems to be some Github auto-merge error
Once again, merge hell has undone some work. Here we add back the inclusion of the (parsed) afd partial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor questions that I will investigate myself tomorrow, but if you happen to already know, then merge at-will.
Anyway, this looks really good. The code and test quality are superb.
} | ||
|
||
// See if this paragraph contains a top level header | ||
$headerRegex = "/^\.(?<header>[^\.]+)[\.]{3}?(?<after>.*)\n/mU"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there ever lines that start with a dot and are not top-level headers? In the description, you noted that the headers start with a dot and are in all uppercase. If other kinds of lines can't also start with a dot, then this is all fine, but if it could happen, I'm wondering if we should have a little bit stricter check here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subheaders are lines that begin with three dots (they might have whitespace preceding the dots) and end with three dots. That is the only other case I can think of where a line might start with a dot. Subheaders would not be matched by this rule, I don't think.
It's true that headers tend to be \.[A-Z]+
. However, sometimes they have other symbols in them, like
.HEADER /IMPORTANT TEXT/...More text after ellipses for some reason
Definitely open to suggestions on how to improve the rx here. Unfortunately there seems to be a lot of behavior "in the wild" that is not documented in the official spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's leave it be for now. This could be useful for finding outliers and then we can address those directly instead of hypothetically.
if (preg_match("/SYNOPSIS/", $currentString)) { | ||
$test = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! This is an annoying thing I have to do in the debugger in order to get it to break inside a statement block. If there is a blank line inside this if statement, and you put a breakpoint in there, the debugger doesn't break! So annoying. Anyway, fixed in 2b93110
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, I hate that. The debugger also doesn't always stop where you expect it to. Very cool. Good debug. Much fun. Wow.
$indentRegex = "/\n +/"; | ||
$currentString = preg_replace($indentRegex, "", $currentString); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't pulled this down and tested it yet, but just one question on observation: does this remove the space between the two lines, so we could end up with
THIS IS A WARNING
ABOUT A STORM
becoming
THIS IS A WARNINGABOUT A STORM
I'll pull it down locally tomorrow morning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a test that covers it here but it's possible I've overlooked how this gets used and/or over-fit the case to one specific example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still haven't checked it locally (I'm working on it, I promise), but I think that test covers the epilogue text, which preserves the newlines rather than concatenates them into a single string? Also 100% open to the probability that I'm misreading this. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I jammed in this text, to see how it would handle it:
So it looks like it is eating a space character. However, what makes it tricky is that the preceding line break ends with a -
and so leaving out the space works well for that (GMZ730-755-765-775
as an unbroken string). I dunno which is the better outcome or how likely either one is, though.
Co-authored-by: Greg Walker <[email protected]>
What does this PR do? 🛠️
This PR addresses #1526, parsing the AFD raw text product into markup we can use.
It does not deal with styling that markup, which is slated to be handled in #1527
What does the reviewer need to know? 🤔
Summary of structure
The AFD is divided into the following sections:
These lines appear before the first AFD header
These lines are made up of a combination of headers,
text paragraphs, and optionally sub-headers.
Headers are lines that begin with a . and contain
uppercase labels, with optional ellipses and post-header
text.
Subheaders are lines whose text is surrounded on both
ends by ellipses
NOTE: Header sections are generally separated at the end by
a line with just "&&", but we do not use those in this parser.
The main part of the body ends with "$$". There is sometimes
extra text after this point, usually authorship attribution.
Parsing strategy
The general idea is to split the text up into "paragraphs,"
defined as any contiguous chunks of text separated by two
newlines.
We then attempt to parse out headers, subheaders, and guess at the
structure of the subsequent body text.
The parser will set a kind of mode called currentContentType when
it encounters a type of paragraph based on an encountered header
type/name or other indicator token. Currently we have the following types:
Output
Like our other parsers, the expected output of the overall
parse()
method is an array of "nodes" (associative arrays with a type property).
These nodes are then given to Teig to render based on their type and
other attributes.
This parser contains an additional helper method that structures
the nodes into separate sections, to make rendering in Twig more
logical.
Alternative stream-based parser
I experimented with an alternative stream-based parser at first. There is possibly some merit to it (especially the LineStream class). I have stashed it in a separate branch that we should keep around in case we need it. Both the (unfinished) version of the parser and the
LineStream
class have comprehensive passing tests.Screenshots (if appropriate): 📸