[moebooru] extract 'notes' #3094

enduser420 · 2022-10-23T16:23:41Z

related to #3093

$ py -m gallery_dl -K https://lolibooru.moe/post/show/281305/ -o "notes=True"

notes[translation][]
  - Heheh
  - I'm the tallest one here.
  - Ah, then again... Perhaps it's because I'm wearing boots.
  - It seems that I'm the taller one?

Maybe instead of notes[translation][], a simple notes_translation[] would be enough for this.
I have separated the tags with a space, but the website uses \n\n.

mikf · 2022-10-24T14:03:14Z

Again, thank you for your quick response, but why not return notes in the same format as gelbooru or danbooru?

# https://gelbooru.com/index.php?page=post&s=view&id=5997331
"notes": [
    {
        "height": 553,
        "body": "Look over this way when you talk~",
        "width": 246,
        "x": 35,
        "y": 72
    },
    {
        "height": 557,
        "body": "Hey~\nAre you listening~?",
        "width": 246,
        "x": 1233,
        "y": 109
    }
]

Width/height/x/y can be taken from CSS attributes, I'd think.
This example from https://yande.re/post/show/993156 also has no HTML tags ...

          <div class="note-box" style="width: 314px; height: 588px; top: 438px; left: 900px;" id="note-box-7095">
            <div class="note-corner" id="note-corner-7095"></div>
          </div>
           <div class="note-body" id="note-body-7095" title="Click to edit">The facts that I love playing games</div>
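As a rough illustration of reading width/height/x/y from the inline CSS, here is a minimal standalone sketch. It uses only the standard library and the sample tag above; `parse_note_geometry` is a hypothetical helper name, not something from gallery-dl, and it assumes every value is given in whole pixels.

```python
import re

# Opening tag copied from the yande.re sample above.
NOTE_BOX = ('<div class="note-box" style="width: 314px; height: 588px; '
            'top: 438px; left: 900px;" id="note-box-7095">')


def parse_note_geometry(tag):
    """Return width/height/x/y taken from a note-box tag's inline style.

    Hypothetical helper for illustration; assumes integer 'px' values.
    """
    style = re.search(r'style="([^"]*)"', tag).group(1)
    # Map each CSS property name to its integer pixel value.
    css = {
        name: int(value)
        for name, value in re.findall(r"([\w-]+):\s*(-?\d+)px", style)
    }
    # CSS 'left'/'top' correspond to the 'x'/'y' keys used for other boorus.
    return {
        "width": css["width"],
        "height": css["height"],
        "x": css["left"],
        "y": css["top"],
    }


geometry = parse_note_geometry(NOTE_BOX)
```

The result uses the same keys as the gelbooru example, so downstream format strings would not need a site-specific case.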

enduser420 · 2022-10-25T04:48:43Z

help, doing extract_iter(notes_data, 'class="note-box"', " </div>") only lets me get w/h/x/y

"""
                                                                              how do I extract_iter this '</div>' and then
remove_html from 'title="Click to edit">' to the end (now that the '</div>' is gone after the iter)?         |
                                                                                                             |
                                                                                                             |
<div class="note-box" style="width: 314px; height: 588px; top: 438px; left: 900px;" id="note-box-7095">      |
    <div class="note-corner" id="note-corner-7095"></div>                                                    |
</div>                                                                                                       |
<div class="note-body" id="note-body-7095" title="Click to edit">The facts that I love playing games</div> <-|
"""

mikf · 2022-10-25T09:48:48Z

You can fetch the entire note_container, split that into single notes, and extract the values from there. Here's some test code that seems to work:

    note_container = text.extract(page, 'id="note-container"', '<img alt=')[0]
    if not note_container:
        return

    notes = []
    for note in note_container.split('class="note-box"')[1:]:
        extr = text.extract_from(note)
        notes.append({
            "width" : int(extr("width:", "p")),
            "height": int(extr("height:", "p")),
            "y"     : int(extr("top:", "p")),
            "x"     : int(extr("left:", "p")),
            "id"    : int(extr('id="note-box-', '"')),
            "body"  : extr('class="note-body', "</div>").partition(">")[2],
        })
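For trying the approach outside the extractor, the following is a self-contained sketch of the same split-then-extract idea. The `extract_from` function here is a simplified stand-in written for this example, not gallery-dl's own implementation, and the surrounding `note-container` wrapper and `<img alt=` line in `PAGE` are assumed markup built around the yande.re sample.

```python
def extract_from(txt, pos=0):
    """Simplified stand-in for a sequential extractor (assumption, not gallery-dl code).

    Each call searches forward from where the previous call stopped.
    """
    def extr(begin, end):
        nonlocal pos
        try:
            first = txt.index(begin, pos) + len(begin)
            last = txt.index(end, first)
        except ValueError:
            return ""
        pos = last + len(end)
        return txt[first:last]
    return extr


# Assumed page fragment: the note markup is the yande.re sample,
# the wrapper and the <img> line are made up for this example.
PAGE = '''
<div id="note-container">
  <div class="note-box" style="width: 314px; height: 588px; top: 438px; left: 900px;" id="note-box-7095">
    <div class="note-corner" id="note-corner-7095"></div>
  </div>
  <div class="note-body" id="note-body-7095" title="Click to edit">The facts that I love playing games</div>
</div>
<img alt="sample" src="sample.jpg">
'''


def extract_notes(page):
    """Split the note container into single notes and read their values in order."""
    start = page.find('id="note-container"')
    end = page.find("<img alt=", start)
    if start < 0 or end < 0:
        return []

    notes = []
    # Everything before the first 'class="note-box"' is container preamble.
    for note in page[start:end].split('class="note-box"')[1:]:
        extr = extract_from(note)
        notes.append({
            "width" : int(extr("width:", "p")),
            "height": int(extr("height:", "p")),
            "y"     : int(extr("top:", "p")),
            "x"     : int(extr("left:", "p")),
            "id"    : int(extr('id="note-box-', '"')),
            "body"  : extr('class="note-body', "</div>").partition(">")[2],
        })
    return notes


notes = extract_notes(PAGE)
```

The order of the `extr` calls matters: each one continues from the end of the previous match, so the keys have to be read in the order they appear in the markup.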

enduser420 · 2022-10-25T12:00:02Z

since some sites contain tags

- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "body"  : text.remove_html(extr('class="note-body', "</div>").partition(">")[2]),

we can also do this, since note-box- and note-body- have the same id

- "id"    : int(extr('id="note-box-', '"')),
- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "id"    : int(extr('id="note-body-', '"')),
+ "body"  : extr(">", "</div>"),

... id="note-box-5225">
<div class="note-body" id="note-body-5225" title="Click to edit"><p>Heheh</p>

mikf · 2022-10-25T17:09:57Z

since some sites contain tags

- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "body"  : text.remove_html(extr('class="note-body', "</div>").partition(">")[2]),

Sure. It seems that only lolibooru uses HTML tags inside its notes, and only … to separate lines, so just removing all HTML should be fine.

we can also do this, since note-box- and note-body- have the same id

- "id"    : int(extr('id="note-box-', '"')),
- "body"  : extr('class="note-body', "</div>").partition(">")[2],
+ "id"    : int(extr('id="note-body-', '"')),
+ "body"  : extr(">", "</div>"),

Yeah, that's better. Good catch.
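To show what "just removing all HTML" does to a lolibooru-style note body, here is a small sketch. `strip_html` is a hypothetical stand-in written for this example; `text.remove_html` may behave differently in its details. The sample body is assumed markup, reconstructed from the `<p>Heheh</p>` fragment and the output in the first comment.

```python
import re


def strip_html(txt, repl=" ", sep=" "):
    """Replace every HTML tag with `repl`, then collapse whitespace runs to `sep`.

    Hypothetical stand-in for illustration only.
    """
    txt = re.sub(r"<[^>]+>", repl, txt)
    return sep.join(txt.split())


# Assumed note body with paragraph tags separated by blank lines.
body_html = "<p>Heheh</p>\n\n<p>I'm the tallest one here.</p>"
body = strip_html(body_html)
```

With these defaults the paragraphs end up joined by a single space, which matches the space-separated output from the first comment; the original line breaks are lost in the process.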

[moebooru] extract 'translation' note

776ce58

[moebooru] update 'notes' extraction

042b35e

enduser420 changed the title ~~[moebooru] extract 'translation' note~~ [moebooru] extract 'notes' Oct 26, 2022

mikf merged commit fb2dbb0 into mikf:master Oct 28, 2022

mikf pushed a commit that referenced this pull request Oct 29, 2022

[moebooru] extract 'notes' (#3094)

de3828e

enduser420 deleted the extractor/moebooru branch November 5, 2022 14:03
