[moebooru] extract 'notes' #3094

Merged
merged 2 commits into from
Oct 28, 2022

Conversation

enduser420
Contributor

related to #3093

$ py -m gallery_dl -K https://lolibooru.moe/post/show/281305/ -o "notes=True"

notes[translation][]
  - Heheh
  - I'm the tallest one here.
  - Ah, then again... Perhaps it's because I'm wearing boots.
  - It seems that I'm the taller one?

  • maybe instead of notes[translation][], a simple notes_translation[] would be enough for this
  • I have joined the text of the <p> tags with a space, but the website separates them with \n\n
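
To make the space vs. \n\n question concrete, here is a rough sketch of both ways of joining the paragraphs. The `note_paragraphs` helper and the sample markup are made up for illustration (the exact lolibooru HTML is an assumption), and this is not code from the PR.

```python
import re

def note_paragraphs(body_html):
    """Return the text of each <p>...</p> in a note body, tags stripped."""
    paragraphs = re.findall(r"<p>(.*?)</p>", body_html, flags=re.S)
    if not paragraphs:
        # no <p> tags at all: treat the whole body as one paragraph
        paragraphs = [body_html]
    # drop any remaining tags inside each paragraph
    return [re.sub(r"<[^>]+>", "", p).strip() for p in paragraphs]

# assumed markup, modelled on the note text shown above
body = "<p>Ah, then again...</p><p>Perhaps it's because I'm wearing boots.</p>"

joined_space = " ".join(note_paragraphs(body))     # what this PR currently does
joined_blank = "\n\n".join(note_paragraphs(body))  # what the website shows
```

Joining with a space keeps each note on one line in `-K` output; joining with "\n\n" stays closer to what the site displays.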

@mikf
Owner

mikf commented Oct 24, 2022

Again, thank you for your quick response, but why not return notes in the same format as gelbooru or danbooru?

# https://gelbooru.com/index.php?page=post&s=view&id=5997331
"notes": [
    {
        "height": 553,
        "body": "Look over this way when you talk~",
        "width": 246,
        "x": 35,
        "y": 72
    },
    {
        "height": 557,
        "body": "Hey~\nAre you listening~?",
        "width": 246,
        "x": 1233,
        "y": 109
    }
]

Width/height/x/y can be taken from CSS attributes, I'd think.
This example from https://yande.re/post/show/993156 also has no <p> tags ...

          <div class="note-box" style="width: 314px; height: 588px; top: 438px; left: 900px;" id="note-box-7095">
            <div class="note-corner" id="note-corner-7095"></div>
          </div>
           <div class="note-body" id="note-body-7095" title="Click to edit">The facts that I love playing games</div>
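
As a rough illustration of taking width/height/x/y from the CSS attributes, the sketch below parses the `style` string of the note-box above with a regular expression. The helper name `parse_note_box_style` is hypothetical, and the mapping of `left`/`top` to `x`/`y` is an assumption based on the gelbooru format shown earlier.

```python
import re

def parse_note_box_style(style):
    """Return width/height/x/y as ints from a note-box 'style' string."""
    # collect every "name: <digits>px" declaration into a dict
    css = {
        name: int(value)
        for name, value in re.findall(r"([\w-]+)\s*:\s*(-?\d+)px", style)
    }
    return {
        "width" : css["width"],
        "height": css["height"],
        "x"     : css["left"],  # assumed: CSS 'left' corresponds to 'x'
        "y"     : css["top"],   # assumed: CSS 'top' corresponds to 'y'
    }

# style attribute copied from the yande.re example above
style = "width: 314px; height: 588px; top: 438px; left: 900px;"
note_geometry = parse_note_box_style(style)
```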

@enduser420
Contributor Author

help, doing extract_iter(notes_data, 'class="note-box"', " </div>") only lets me get w/h/x/y

"""
how do I extract_iter up to the marked '</div>' and then remove_html from
'title="Click to edit">' to the end (now that the '</div>' is gone after the iter)?

<div class="note-box" style="width: 314px; height: 588px; top: 438px; left: 900px;" id="note-box-7095">
    <div class="note-corner" id="note-corner-7095"></div>
</div>
<div class="note-body" id="note-body-7095" title="Click to edit">The facts that I love playing games</div>   <- this '</div>'
"""

@mikf
Owner

mikf commented Oct 25, 2022

You can fetch the entire note_container, split that into single notes, and extract the values from there. Here's some test code that seems to work:

    note_container = text.extract(page, 'id="note-container"', '<img alt=')[0]
    if not note_container:
        return

    notes = []
    for note in note_container.split('class="note-box"')[1:]:
        extr = text.extract_from(note)
        notes.append({
            "width" : int(extr("width:", "p")),
            "height": int(extr("height:", "p")),
            "y"     : int(extr("top:", "p")),
            "x"     : int(extr("left:", "p")),
            "id"    : int(extr('id="note-box-', '"')),
            "body"  : extr('class="note-body', "</div>").partition(">")[2],
        })
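
For trying the same approach outside gallery-dl, here is a self-contained sketch. `extract_from` below is a simplified stand-in written for this example, not gallery-dl's actual `text.extract_from`, and the wrapper markup around the note (the `note-container` div and the trailing `<img alt=`) is assumed from the markers used in the test code above.

```python
def extract_from(txt, pos=0, default=""):
    """Simplified stand-in for gallery-dl's text.extract_from()."""
    def extr(begin, end):
        nonlocal pos
        try:
            first = txt.index(begin, pos) + len(begin)
            last = txt.index(end, first)
        except ValueError:
            return default
        pos = last + len(end)  # continue searching after this match
        return txt[first:last]
    return extr


def parse_notes(page):
    """Split the note container into single notes and extract their values."""
    note_container = extract_from(page)('id="note-container"', '<img alt=')
    if not note_container:
        return []

    notes = []
    for note in note_container.split('class="note-box"')[1:]:
        extr = extract_from(note)
        notes.append({
            "width" : int(extr("width:", "p")),   # " 314" -> 314
            "height": int(extr("height:", "p")),
            "y"     : int(extr("top:", "p")),
            "x"     : int(extr("left:", "p")),
            "id"    : int(extr('id="note-box-', '"')),
            "body"  : extr('class="note-body', "</div>").partition(">")[2],
        })
    return notes


# note markup from the yande.re example; the surrounding container is assumed
PAGE = '''
<div id="note-container">
  <div class="note-box" style="width: 314px; height: 588px; top: 438px; left: 900px;" id="note-box-7095">
    <div class="note-corner" id="note-corner-7095"></div>
  </div>
  <div class="note-body" id="note-body-7095" title="Click to edit">The facts that I love playing games</div>
</div>
<img alt="sample">
'''
notes = parse_notes(PAGE)
```

The calls have to stay in document order (width, height, top, left, id, body), because each `extr` call resumes where the previous one stopped.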

@enduser420
Contributor Author

since some sites contain <p> tags

- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "body"  : text.remove_html(extr('class="note-body', "</div>").partition(">")[2]),

we can also do this, since note-box- and note-body- have the same id

- "id"    : int(extr('id="note-box-', '"')),
- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "id"    : int(extr('id="note-body-', '"')),
+ "body"  : extr(">", "</div>"),
... id="note-box-5225">
<div class="note-body" id="note-body-5225" title="Click to edit"><p>Heheh</p>

@mikf
Owner

mikf commented Oct 25, 2022

since some sites contain <p> tags

- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "body"  : text.remove_html(extr('class="note-body', "</div>").partition(">")[2]),

Sure. It seems that only lolibooru uses HTML tags inside its notes, and only <p>…</p> to separate lines, so just removing all HTML should be fine.

we can also do this, since note-box- and note-body- have the same id

- "id"    : int(extr('id="note-box-', '"')),
- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "id"    : int(extr('id="note-body-', '"')),
+ "body"  : extr(">", "</div>"),

Yeah, that's better. Good catch.
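
Putting both suggestions together, a single note could be parsed roughly like this. `extract_from` and `remove_html` are simplified stand-ins written for this sketch (gallery-dl's own helpers may behave differently in edge cases), `parse_note` is a hypothetical name, and the geometry values in the sample are invented; only the `note-body` line comes from the lolibooru snippet quoted earlier.

```python
import re

def extract_from(txt, pos=0, default=""):
    """Simplified stand-in for gallery-dl's text.extract_from()."""
    def extr(begin, end):
        nonlocal pos
        try:
            first = txt.index(begin, pos) + len(begin)
            last = txt.index(end, first)
        except ValueError:
            return default
        pos = last + len(end)
        return txt[first:last]
    return extr


def remove_html(txt, repl=" ", sep=" "):
    """Simplified stand-in: replace tags with 'repl', collapse whitespace."""
    txt = re.sub(r"<[^>]+>", repl, txt)
    if sep:
        return sep.join(txt.split())
    return txt.strip()


def parse_note(note):
    """Parse one chunk produced by splitting on 'class="note-box"'."""
    extr = extract_from(note)
    return {
        "width" : int(extr("width:", "p")),
        "height": int(extr("height:", "p")),
        "y"     : int(extr("top:", "p")),
        "x"     : int(extr("left:", "p")),
        # note-box-N and note-body-N share the same N, so read it from the
        # body tag and then take everything up to its closing </div>
        "id"    : int(extr('id="note-body-', '"')),
        "body"  : remove_html(extr(">", "</div>")),
    }


# sample chunk: geometry values are invented for illustration
NOTE = ''' style="width: 100px; height: 50px; top: 10px; left: 20px;" id="note-box-5225">
<div class="note-corner" id="note-corner-5225"></div>
</div>
<div class="note-body" id="note-body-5225" title="Click to edit"><p>Heheh</p></div>
'''
note = parse_note(NOTE)
```

Reading the id from `note-body-` leaves the search position just before the end of the opening tag, so the next `extr(">", "</div>")` yields the body without needing `partition`.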

@enduser420 enduser420 changed the title [moebooru] extract 'translation' note [moebooru] extract 'notes' Oct 26, 2022
@mikf mikf merged commit fb2dbb0 into mikf:master Oct 28, 2022
mikf pushed a commit that referenced this pull request Oct 29, 2022
@enduser420 enduser420 deleted the extractor/moebooru branch November 5, 2022 14:03