gamebeaker edited this page Nov 25, 2024 · 4 revisions

How to select a sub-range of URLs to include in an EPUB?

  1. Make sure all the URLs you want in the EPUB are selected. (The easiest way to do this is to click the “Select All” button, which will select ALL URLs.)
  2. Click on the “Edit Chapter URLs” button.
  3. A text editor will open with a list of hyperlinks, one for each selected URL. Each hyperlink looks like this: <a href="URL_TO_DOWNLOAD">Title</a> where href is the URL of a chapter to put in the EPUB, and “Title” is the title that will be given to the chapter in the table of contents. The chapters will be placed in the EPUB in the same order they appear in this editor.
  4. Edit the hyperlinks, e.g. delete the URLs that are not wanted. You can also change the order of the hyperlinks or their titles. Note that you may find it easier to copy/paste the hyperlinks into a text editor, modify them there, then paste back the edited list.
  5. Click the “Apply Changes” button.
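
If the list is long, the editing in step 4 can be scripted outside WebToEpub. The following is a minimal sketch, assuming the hyperlinks have been pasted into a string and that only a contiguous range of chapters is wanted. The helper name selectChapterRange and the example.com URLs are made up for illustration; this is not part of WebToEpub.

```javascript
// Hypothetical helper: keep only the hyperlinks whose position in the list
// falls inside [first, last] (1-based, inclusive).
function selectChapterRange(hyperlinkText, first, last) {
    // Match each complete <a href="...">Title</a> entry, one per chapter.
    const links = hyperlinkText.match(/<a\s+href="[^"]*">[\s\S]*?<\/a>/g) ?? [];
    return links.slice(first - 1, last).join("\n");
}

// Example list, as it might be copied from the "Edit Chapter URLs" editor.
const pasted = [
    '<a href="https://example.com/chapter-1">Chapter 1</a>',
    '<a href="https://example.com/chapter-2">Chapter 2</a>',
    '<a href="https://example.com/chapter-3">Chapter 3</a>',
    '<a href="https://example.com/chapter-4">Chapter 4</a>',
].join("\n");

// Keep chapters 2 and 3 only, then paste the printed result back into the editor.
console.log(selectChapterRange(pasted, 2, 3));
```

The output keeps the original order, so the chapters still appear in the EPUB in the order they were listed.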

How to convert a new site using the Default Parser?

Sometimes WebToEpub is unable to figure out which content on a web page should be packed into the EPUB. When this happens, WebToEpub asks you to tell it which element on the web page has the content to pack. You use the default parser page to do this, by giving WebToEpub the Cascading Style Sheet Selector (CSS Selector) for the element containing the wanted content. (If you’re not familiar with CSS Selectors, they’re a shorthand notation for specifying elements on an HTML page. Please see https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors for an excellent description of CSS selectors.)

As seen in the screenshot (DefaultParserScreenshot), the Default Parser has 5 text inputs and 3 buttons. In order, the inputs are:

  1. "Hostname" This is the hostname portion of the web site URL. It's automatically filled in, so just ignore it.
  2. "URL of first chapter" URL to first chapter of the story. If you don't want to test the selector(s) you supply, you can leave this blank.
  3. "CSS Selector for the content" This is where you tell WebToEpub which element holds the content.
  4. "CSS Selector for Title of Chapter" Sometimes the title of each chapter isn't in the same element that holds the rest of a chapter's text, e.g. http://www.ironteethserial.com/dark-fantasy-story/story-interlude/prologue/. When this happens, you can use this to tell WebToEpub which element holds the chapter's title, and WebToEpub will add this element to the front of the text content that it fetches. This field can be left blank if this isn't an issue.
  5. "CSS Selector for Elements to remove" Sometimes the content contains things that are not wanted in the EPUB, e.g. advertisements, share links, etc. This input says which elements are to be removed from the content before packing into the EPUB. This field can be left blank if this isn't an issue.
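
To make the roles of the three selector inputs concrete, here is a rough sketch of how they could be applied to a fetched chapter page. It is only an illustration written against the standard DOM API; the function name extractChapter is hypothetical and this is not WebToEpub's actual implementation.

```javascript
// Sketch only: approximates what the three selector inputs describe.
// It is NOT WebToEpub's own code.
function extractChapter(doc, { content, title = "", remove = "" }) {
    // "CSS Selector for the content": the element holding the chapter text.
    const contentElement = doc.querySelector(content);
    if (contentElement === null) {
        return null; // the selector matched nothing, so it needs fixing
    }
    // "CSS Selector for Elements to remove": strip adverts, share links, etc.
    if (remove !== "") {
        for (const unwanted of contentElement.querySelectorAll(remove)) {
            unwanted.remove();
        }
    }
    // "CSS Selector for Title of Chapter": move the title to the front of the
    // text when it lives outside the content element.
    if (title !== "") {
        const titleElement = doc.querySelector(title);
        if (titleElement !== null && !contentElement.contains(titleElement)) {
            contentElement.prepend(titleElement);
        }
    }
    return contentElement;
}

// In a browser console on a chapter page ("div.share-links" is a made-up
// example of an unwanted element):
// extractChapter(document, { content: "div.post_content",
//     title: "h2.post-title", remove: "div.share-links" });
```

Note that running this in a console changes the live page, so reload the page afterwards.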

The buttons are:

  1. "Help" Brings up this web page.
  2. "Test" Tests the provided CSS Selectors. Clicking this button gets WebToEpub to fetch the first chapter from the internet and run the CSS Selectors against it. The chapter as it would appear in the EPUB is shown in the box below the Test button.
  3. "Finished" Tell WebToEpub you've finished configuring the CSS Selectors.

Worked Example

Let’s assume you want to convert “The Iron Teeth” into an EPUB. Looking at the above page, you can see that the first chapter of this story is at http://www.ironteethserial.com/dark-fantasy-story/story-interlude/prologue/

  1. The first step is to copy the URL of the first chapter into the control labelled "URL of first chapter".
  2. The next step is to discover the HTML element that contains the content to put into each chapter of the EPUB. To do this:
     1. Open a chapter of the story (e.g. http://www.ironteethserial.com/dark-fantasy-story/story-interlude/prologue/) in your web browser of choice.
     2. Open the browser's DOM Inspector. E.g. on Firefox use CTRL+Shift+C; on Chrome open "Developer Tools" and select the "Elements" tab, or press the F12 key. On Chrome, this looks like the FindingContent screenshot.
     3. Find the HTML element that encloses all the text you want in the EPUB. The simplest way to do this is:
        1. On the chapter page (NOT the DOM Inspector), move the mouse to the first word of the chapter's text.
        2. Click the right mouse button, then select "Inspect" from the drop-down menu that appears.
        3. The DOM Inspector will then highlight the element holding this text.
        4. You can then look at the parent elements until you find the first element that holds all the text you are interested in.
        5. If you follow the above, you will find that the element holding all the chapter text is <div class="post_content">
     4. Figure out the CSS Selector for the element. In this case it's a div element with a class, so the CSS Selector is div.post_content
     5. Put this CSS Selector into the "CSS Selector for the content" input.
  3. You can now test the CSS Selector to see if it works. To do this:
     1. Click the "Test" button.
     2. Examine the text that appears in the scroll box below the buttons.
     3. If the output is not what is wanted/expected, either fix the CSS Selector (if wrong) or use the CSS Selector for a different element.
  4. You should now see that the chapter text has appeared, but it's missing the chapter title. If you wish to add the title:
     1. Go back to the browser's DOM Inspector and find the CSS Selector for the element holding the chapter title (in this case it's h2.post-title).
     2. Copy the CSS Selector into the "CSS Selector for Title of Chapter" input.
     3. Run the test again.
  5. If desired, a similar process can be used to remove any elements in the content that are unwanted.
  6. When satisfied with the test results, click the "Finished" button.
  7. You will now go to the usual "WebToEpub" page and you can continue as normal.
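
The "element with a class" rule used in the worked example can be written down as a tiny helper. This is only an illustrative sketch; the name cssSelectorFor is made up, and it assumes the class names contain no characters that need escaping in CSS.

```javascript
// Hypothetical helper: build "tag.class1.class2" from a tag name and class list.
function cssSelectorFor(tagName, classNames = []) {
    return [tagName.toLowerCase(), ...classNames].join(".");
}

// <div class="post_content"> from the worked example:
console.log(cssSelectorFor("DIV", ["post_content"]));
// <h2 class="post-title">:
console.log(cssSelectorFor("H2", ["post-title"]));
```

In the console of the browser's developer tools, the same helper can be applied to the element currently highlighted in the DOM Inspector with cssSelectorFor($0.tagName, [...$0.classList]). Running document.querySelectorAll on the result shows how many elements the selector matches; a selector that matches more than one element is probably too broad.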

How to see file downloading progress?

WebToEpub’s display of progress in downloading the requested URLs is limited. To obtain a much more detailed and more frequently updated download progress update, you can use the Network Monitoring facilities built into Chrome and Firefox. The basic procedure is:

  1. Open the respective developer tools.
  2. On the Developer tools pane, select the networking tab.
  3. Click WebToEpub’s “Pack EPUB” button.

For more detailed instructions, try the following links

Chrome or Firefox

Using Baka-Tsuki “Series Page” parser?

When originally created, WebToEpub only worked for “Full Text” web pages. That is, web pages that had the Full Text of a volume. e.g. https://www.baka-tsuki.org/project/index.php?title=Shinmai_Maou_no_Keiyakusha:Volume_7. In version 0.0.0.43, I’ve added a new parser that will work with “Series Pages” that show all the volumes in a series. e.g. https://www.baka-tsuki.org/project/index.php?title=Shinmai_Maou_no_Keiyakusha. However, there is a problem. I’ve not been able to discover a reliable way for WebToEpub to distinguish between the two page types.

So, by default WebToEpub will use the old “Full Text” parser when it encounters a Baka-Tsuki web page. However, if you browse to a Baka-Tsuki series page, you can use the “Manually Select Parser” drop down control under “Advanced Options” and select the “Baka-Tsuki Series Page” parser.

Alternately, you can check the “Automatic parser select includes Baka-Tsuki Series Page Parser” option. When checked, WebToEpub will make a “best guess” at which of the two Baka-Tsuki parsers should be used for the currently selected Baka-Tsuki page. However, it won’t always pick the correct parser. When this happens, you will need to manually select the correct parser.

How to write a new Parser?

If you have basic knowledge of JavaScript and HTML, creating a new parser for a site that WebToEpub can’t currently handle may take as little as 10 minutes. The basic steps are:

  1. Install from Source, using the instructions here
  2. Copy the file “Template.js” in the folder plugin/js/parsers.
  3. Rename the copied file, based on the site you want to parse.
  4. Add a link to the new file to popup.html.
  5. Text replace “Template” in the file with the new Parser name.
  6. Uncomment the functions of the template you need, modifying the sample implementations as required. Refer to Customizing the Template Parser for a new Web Site for a worked example.
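
As a rough idea of where these steps lead, the sketch below shows what a very small parser might look like. Treat every name in it as an assumption: the parserFactory.register call and the getChapterUrls and findContent method names are written from memory of the template and must be checked against Template.js, while "example.com", ExampleParser and both CSS selectors are placeholders. The Parser and parserFactory definitions at the top are stand-ins so the sketch runs on its own; inside the extension they come from WebToEpub itself.

```javascript
"use strict";

// --- Stand-ins so this sketch runs on its own under Node. ------------------
// Inside the extension these are provided by WebToEpub (js/Parser.js and
// js/ParserFactory.js); delete this section in a real parser file.
class Parser {}
const registered = new Map();
const parserFactory = {
    register: (hostname, constructorFn) => registered.set(hostname, constructorFn),
};
// ----------------------------------------------------------------------------

// Assumed shape of Template.js after replacing "Template" with "Example".
parserFactory.register("example.com", () => new ExampleParser());

class ExampleParser extends Parser {
    // Return the chapters listed on the table of contents page.
    // The { sourceUrl, title } shape is an assumption for illustration.
    async getChapterUrls(dom) {
        return [...dom.querySelectorAll("div.chapter-list a")]
            .map(a => ({ sourceUrl: a.href, title: a.textContent.trim() }));
    }

    // Return the element that holds the text of one chapter.
    findContent(dom) {
        return dom.querySelector("div.post_content");
    }
}
```

The selectors play the same role as those entered in the Default Parser, so working them out in the Default Parser first is a quick way to find the values to hard-code.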

Structure of WebToEpub code?

Overview of Files

  • popup.html and js/main.js provide WebToEpub’s core UI.
  • js/ChapterUrls.js, js/CoverImageUI.js, js/DefaultParserUI.js, js/ProgressBar.js and js/UserPreferences.js provide the rest of the UI functionality.
  • js/ParserFactory.js selects the Parser (derived from js/Parser.js) to use to process each web page.
  • js/ImageColletor.js (and js/Imgur.js) are used by the Parsers to handle processing images from the web page.
  • js/HttpClient.js is used to fetch web pages (and images, JSON or anything else) from the internet.
  • js/EpubPacker.js assembles the EPUB file.
  • js/EpubItem.js and js/EpubItemSupplier.js are a “bridge” to convert the HTML collected by Parsers into items to put into an EPUB.
  • js/Download.js handles saving the EPUB file to the hard drive. This file is called “Download” because it uses the Download API to do the save. (Yup, it’s a hack.)

Files

File                    Description
popup.html              HTML that provides the UI for WebToEpub
js/main.js              Main logic behind popup.html's UI
js/ChapterUrlsUi.js     Logic for the "List of Chapters" on the UI
js/CoverImageUI.js      Logic for the "Select Cover Image" list for Baka-Tsuki on the UI
js/DefaultParserUI.js   Logic for the "Default Parser" on the UI
js/Download.js          Wraps Chrome's Download API. (Handles saving EPUB to hard drive.)
js/EpubItem.js          An item to pack into an EPUB file. (e.g. an XHTML or image file)
js/EpubItemSupplier.js  Converts "files" from internet into EpubItems to pack into an EPUB
js/EpubMetaInfo.js      Container holding metadata for an EPUB.
js/EpubPacker.js        Assembles EPUB using metadata and items from supplier
js/ErrorLog.js          Records errors/warnings and displays to user and/or saves to file
js/Firefox.js           Code that is only needed by Firefox version of WebToEpub
js/HttpClient.js        Wraps making HTTP calls to internet. Retry, decode response, etc.
js/ImageColletor.js     Fetch images from internet, remove duplicates, rewrite image tags for EPUB, etc.
js/Imgur.js             Logic for fetching/processing images and galleries from Imgur
js/Parser.js            Base class for reading a site's HTML and converting into EpubItems
js/ParserFactory.js     Logic to figure out which parser to use for a web page
js/ProgressBar.js       Code to manipulate the Progress Bar on the UI
js/Sanitize.js          Code to clean up HTML when converting it to XHTML
js/UserPreferences.js   UI logic for user to set Options
js/Util.js              Library of miscellaneous functions

Algorithms

Basic steps to create an EPUB

  1. Figure out parser to use for web page(s)
  2. Get URLs of web pages that need to be fetched from internet
  3. For each web page
     1. Fetch from internet
     2. Find content to put in EPUB
     3. Find and fetch any images needed on page
  4. Convert web pages into items for EPUB
     1. Find content to put in EPUB
     2. Remove junk (e.g. Scripts) from content
     3. Fix up hyperlinks (e.g. footnotes), remove next/previous chapter links (where possible)
     4. Rewrite image tags for EPUB
     5. Convert from HTML to XHTML
  5. Assemble the EPUB
     1. Generate Manifest
     2. Generate Table of Contents
     3. Pack items
  6. Save EPUB to hard drive

Note: because hyperlinks that cross chapters need to be fixed up, web pages can’t be converted until all pages have been collected.
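
The reason for collecting everything first can be shown with a small two-phase sketch. All names here (buildEpubItems, fetchPage, convertPage, the chapterN.xhtml file names and the example.com URLs) are hypothetical and only mirror the steps described in this section; the real logic is spread across the files listed earlier.

```javascript
// Hypothetical two-phase pipeline: fetch every page, then convert them all.
async function buildEpubItems(chapterUrls, fetchPage, convertPage) {
    // Phase 1: fetch every page before converting anything.
    const pages = [];
    for (const url of chapterUrls) {
        pages.push({ url, html: await fetchPage(url) });
    }
    // Phase 2: conversion sees the full set of URLs, so a link that points at
    // another chapter can be rewritten to point inside the EPUB.
    const knownUrls = new Map(
        pages.map((page, index) => [page.url, `chapter${index + 1}.xhtml`]));
    return pages.map(page => convertPage(page, knownUrls));
}

// Fake two-chapter site where chapter 1 links to chapter 2.
const site = {
    "https://example.com/1": '<a href="https://example.com/2">next footnote</a>',
    "https://example.com/2": "<p>Chapter two</p>",
};
const fetchPage = async url => site[url];
const convertPage = (page, knownUrls) => {
    let html = page.html;
    for (const [url, fileName] of knownUrls) {
        html = html.split(`href="${url}"`).join(`href="${fileName}"`);
    }
    return { fileName: knownUrls.get(page.url), html };
};

buildEpubItems(Object.keys(site), fetchPage, convertPage)
    .then(items => console.log(items));
```

If chapter 1 were converted as soon as it was fetched, the file name for chapter 2 would not be known yet and the link could not be rewritten.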

Choosing parser for web page

ToDo - include special case of EPUB that has multiple sites with different formats requiring different parsers.

Solutions to site issues that require special coding

Problem                                                         See Parser
Site does not use UTF-8 encoding or state the encoding used     69shuParser.js
Each chapter spans multiple HTML pages                          YushuboParser.js
Chapters "links" in Table of Contents (ToC) are not hyperlinks  ArchiveOfOurOwnParser.js
Walk multiple ToC pages to get all Chapters                     ScribblehubParser.js
ToC requires REST call(s) to list all chapters                  NovelsectParser.js
ToC across multiple HTML pages                                  AsianHobbyist, Novelfull, Shinsori, ZenithNovels
Assemble chapter content from JSON                              NovelsectParser.js
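
As an illustration of one of these problems, walking multiple ToC pages usually comes down to a loop like the one below. This is a generic sketch and is not taken from ScribblehubParser.js or any other parser listed above; collectAllChapters and the fetchTocPage callback, including the shape of the object it returns, are assumptions made for the example.

```javascript
// Hypothetical sketch of "walk multiple ToC pages to get all chapters".
// fetchTocPage(pageNumber) is an assumed callback that resolves to
// { chapters: [...], hasNextPage: boolean } for one ToC page.
async function collectAllChapters(fetchTocPage, maxPages = 1000) {
    const chapters = [];
    // maxPages guards against a site that always claims to have a next page.
    for (let pageNumber = 1; pageNumber <= maxPages; ++pageNumber) {
        const page = await fetchTocPage(pageNumber);
        chapters.push(...page.chapters);
        if (!page.hasNextPage) {
            break;
        }
    }
    return chapters;
}

// Fake ToC split over two pages.
const fakeToc = {
    1: { chapters: ["Chapter 1", "Chapter 2"], hasNextPage: true },
    2: { chapters: ["Chapter 3"], hasNextPage: false },
};
collectAllChapters(async pageNumber => fakeToc[pageNumber])
    .then(chapters => console.log(chapters));
```

The pages are fetched one after another because each page has to be read before it is known whether another one follows.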

Support latest Firefox for Android

At time of writing (2024-11-25), Firefox for Android is supported.

Add ability to reduce speed that WebToEpub fetches pages

A couple of sites do rate limiting. That is, when they detect a browser making a lot of calls quickly, they tell the browser to slow down. Provided the site responds with HTTP 429 messages, WebToEpub will slow down as requested. However, you can also reduce WebToEpub’s request rate by opening the browser’s developer tools and throttling the network speed. This may give better overall performance than relying on the 429 backoff.
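
For readers unfamiliar with the 429 mechanism, the general idea of backing off looks roughly like this. It is a sketch of the technique only, not WebToEpub's HttpClient code; fetchWithBackoff, its parameters and the doubling fallback delay are all assumptions made for the example. The fetch and sleep functions are passed in so the logic can be tried without a network.

```javascript
// Sketch only: retry a request while the server answers 429 (Too Many Requests).
async function fetchWithBackoff(url, fetchFn, sleepFn, maxRetries = 5) {
    for (let attempt = 0; ; ++attempt) {
        const response = await fetchFn(url);
        if (response.status !== 429 || attempt >= maxRetries) {
            return response;
        }
        // Retry-After may be missing or an HTTP date rather than a number of
        // seconds; in those cases fall back to a delay that doubles each time.
        const retryAfter = Number(response.headers.get("Retry-After"));
        const seconds = Number.isFinite(retryAfter) && retryAfter > 0
            ? retryAfter
            : 2 ** attempt;
        await sleepFn(seconds * 1000);
    }
}

// In a browser this could be called as:
// fetchWithBackoff("https://example.com/chapter-1",
//     url => fetch(url),
//     ms => new Promise(resolve => setTimeout(resolve, ms)));
```

After maxRetries attempts the last 429 response is returned to the caller instead of retrying forever.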

Please add support for getting chapter list from NovelUpdates

The NovelUpdates site owners have told me not to do this. So, sorry, I can’t do this.