CLDR-17566 conversion process scripts and readme
chpy04 committed Jun 26, 2024
1 parent ed980db commit 390d041
Showing 4 changed files with 374 additions and 0 deletions.
Binary file not shown.
84 changes: 84 additions & 0 deletions tools/scripts/web/conversion-scripts/README.md
# Scripts to help with CLDR → Markdown Conversion

Part of the [CLDR to Markdown Conversion Process](https://docs.google.com/document/d/1NoQX0zqSYqU4CUuNijTWKQaphE4SCuHl6Bej2C4mb58/edit?usp=sharing), aiming to automate steps 1-3.

NOTE: this does not eliminate all manual work; images, tables, and a general review still require manual attention.

## File 1: cleanup.py

Objective: this file corrects some of the common mistakes that show up when using an HTML-to-Markdown converter on the Google Sites CLDR site. The list is not comprehensive, and mistakes can remain, but it corrects many of the errors that appear consistently, particularly with the specific Markdown converter used in pullFromCLDR.py. Most of the adjustments use regular expressions to find and replace specific text. The functions are as follows:

### Link Correction

- Removing redundant links, e.g. `(https://www.example.com)[https://www.example.com]` → `https://www.example.com`
- Correcting relative links, e.g. `[index](/index)` → `[index](https://cldr.unicode.org/index)`
- Correcting Google redirect links, e.g. `[people](http://www.google.com/url?q=http%3A%2F%2Fcldr-smoke.unicode.org%2Fsmoketest%2Fv%23%2FUSER%2FPeople%2F20a49c6ad428d880&sa=D&sntz=1&usg=AOvVaw38fQLnn3h6kmmWDHk9xNEm)` → `[people](https://cldr-smoke.unicode.org/cldr-apps/v#/fr/People/20a49c6ad428d880)`
- Correcting regular redirect links
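As an illustration, the redundant-link fix boils down to a single substitution; this is a minimal sketch mirroring the pattern used in cleanup.py below:

```python
import re

# (url)[url] pattern produced by the converter; keep only the bare URL
link_pattern = re.compile(r'\((https?:\/\/[^\s\)]+)\)\[\1\]')

text = "See (https://www.example.com)[https://www.example.com] for details."
cleaned = link_pattern.sub(lambda m: m.group(1), text)
# cleaned == "See https://www.example.com for details."
```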

### Common Formatting Issues

- Bullet points and numbered lists have extra spaces after their markers
- Bullet points and numbered lists have extra blank lines between items
- The converter inserts empty anchor links alongside headings, which need to be removed
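The extra-space fixes are simple substitutions as well; a sketch of the two patterns used by fixBullets in cleanup.py:

```python
import re

content = "-   item one\n1.  item two"
# collapse extra spaces after a dash bullet
content = re.sub(r'-\s{3}', '- ', content)
# collapse extra spaces after a numbered bullet
content = re.sub(r'(\d+\.)\s{2}', r'\1 ', content)
# content == "- item one\n1. item two"
```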

### Project specific additions

- Every page gets a YAML front matter block (`---` / `title: PAGE TITLE` / `---`) at the top of the markdown file
- Every page gets the Unicode copyright image `![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)` at the bottom of the markdown file
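For example, the wrapper added around each page looks like this (a sketch mirroring addHeaderAndFooter in cleanup.py; the page title here is hypothetical):

```python
title = "Conversion Scripts"  # normally extracted from the first "# " heading
header = f"---\ntitle: {title}\n---\n"
footer = "\n![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)\n"

# the cleaned markdown body is sandwiched between the two
page = header + "# Conversion Scripts\n\nPage body...\n" + footer
```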

## File 2: pullFromCLDR.py

Objective: this file is used alongside cleanup.py to automate pulling HTML and text from a given CLDR page. It uses libraries to retrieve the HTML as well as the plain text of a given page, convert the HTML into Markdown, clean the Markdown using cleanup.py, and create the .md file and the temporary .txt file in the CLDR site location. A few things to note:

- The nav bar and header are not relevant to this conversion process, so only the HTML within `<div role="main" ...>` is pulled from each page
- To convert the HTML into raw text, the script parses the text and then separates relevant tags with newlines, so the output matches what you get when copy/pasting text from the page
- The script only works with "https://cldr.unicode.org" pages unless line 12 of the file is modified
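The output locations follow directly from the page URL; here is a sketch of the path computation the script performs (the URL and site directory below are placeholders):

```python
url = "https://cldr.unicode.org/index/downloads"   # example page URL
cldr_site_location = "/path/to/cldr/docs/site"     # placeholder for CLDR_SITE_LOCATION

# the path inside the site mirrors the path in the URL
directory_path = url.replace("https://cldr.unicode.org", "")
output_md_file = cldr_site_location + directory_path + ".md"
# the temporary text file is named after the last URL segment
output_text_file = cldr_site_location + "/TEMP-TEXT-FILES/" + url.rsplit('/', 1)[-1] + ".txt"
# output_md_file   == "/path/to/cldr/docs/site/index/downloads.md"
# output_text_file == "/path/to/cldr/docs/site/TEMP-TEXT-FILES/downloads.txt"
```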

## Usage

### Installation

To run this code, you must have Python 3 installed, along with the following Python libraries:

- BeautifulSoup (from `bs4`)
- markdownify
- requests

You can install them using pip:

```bash
pip install beautifulsoup4 markdownify requests
```

### Constants

Line 8 of cleanup.py should contain the URL that is prepended to all relative links (always https://cldr.unicode.org):
```python
#head to place at start of all relative links
RELATIVE_LINK_HEAD = "https://cldr.unicode.org"
```

Line 7 of pullFromCLDR.py should contain the local path of your cloned CLDR site; this is where the generated files will be stored:
```python
#LOCAL LOCATION OF CLDR
CLDR_SITE_LOCATION = "DIRECTORY TO CLDR LOCATION/docs/site"
```

### Running

Before running, ensure that the folders corresponding to the directory of the page you are converting exist within your CLDR site directory, and that it contains a folder named TEMP-TEXT-FILES.

Run with:
```bash
python3 pullFromCLDR.py
```

You will then be prompted to enter the URL of the page you want to convert, after which the script will run.

If you would like to run the unit tests on cleanup.py, or use any of its functions individually, run:
```bash
python3 cleanup.py
```



239 changes: 239 additions & 0 deletions tools/scripts/web/conversion-scripts/cleanup.py
import re
import requests
import urllib.parse
import unittest
from unittest.mock import patch

#head to place at start of all relative links
RELATIVE_LINK_HEAD = "https://cldr.unicode.org"

#sometimes the html --> md conversion puts extra spaces between bullets
def fixBullets(content):
    #remove extra spaces after dash in bullet points
    content = re.sub(r'-\s{3}', '- ', content)
    #remove extra space after numbered bullet points
    content = re.sub(r'(\d+\.)\s{2}', r'\1 ', content)
    #process lines for list handling
    processed_lines = []
    in_list = False
    for line in content.splitlines():
        if re.match(r'^\s*[-\d]', line):
            #check if the current line is part of a list
            in_list = True
        elif in_list and not line.strip():
            #skip empty lines within lists
            continue
        else:
            in_list = False
        processed_lines.append(line)
    processed_content = '\n'.join(processed_lines)

    return processed_content

#html-->md conversion puts link headings into md and messes up titles
def fixTitles(content):
    #link headings regex
    pattern = re.compile(r'(#+)\s*\n*\[\n*\]\(#.*\)\n(.*)\n*')

    #replace matched groups
    def replaceUnwanted(match):
        heading_level = match.group(1) #heading level (ex. ##)
        title_text = match.group(2).strip() #capture and strip the title text
        return f"{heading_level} {title_text}" #return the formatted heading and title on the same line

    #replace the unwanted text using the defined pattern and function
    processed_content = re.sub(pattern, replaceUnwanted, content)
    return processed_content

#add title at top and unicode copyright at bottom
def addHeaderAndFooter(content):
    #get title from top of md file
    title_match = re.search(r'(?<=#\s).*', content)
    if title_match:
        title = title_match.group(0).strip()
    else:
        title = "Default Title" #default if the title couldn't be found

    #header
    header = f"---\ntitle: {title}\n---\n"
    #footer
    footer = "\n![Unicode copyright](https://www.unicode.org/img/hb_notice.gif)\n"

    #look for existing title in the YAML front matter and existing copyright
    title_exists = re.search(r'^---\n.*title:.*\n---', content, re.MULTILINE)
    footer_exists = footer.strip() in content

    #add header
    if not title_exists:
        content = header + content

    #add footer
    if not footer_exists:
        content = content + footer

    return content

#html-->md sometimes produces double bullets on indented lists
def fixIndentedBullets(content):
    #regex pattern to match the double hyphen bullets
    pattern = re.compile(r'^-\s-\s(.*)', re.MULTILINE)

    #split into lines
    lines = content.split('\n')

    #normalize bullets
    normalized_lines = []
    in_list = False

    for line in lines:
        #lines with double hyphens
        match = pattern.match(line)
        if match:
            #normalize the double hyphen bullet
            bullet_point = match.group(1)
            normalized_lines.append(f'- {bullet_point.strip()}')
            in_list = True
        elif in_list and re.match(r'^\s*-\s', line):
            #remove indentation from following bullets in the same list
            normalized_lines.append(line.strip())
        else:
            normalized_lines.append(line)
            in_list = False

    #join back into a single string
    processed_content = '\n'.join(normalized_lines)
    return processed_content

#links on text that is already a link
def removeRedundantLinks(content):
    #(link)[link] regex pattern
    link_pattern = re.compile(r'\((https?:\/\/[^\s\)]+)\)\[\1\]')

    #function to process unwanted links
    def replace_link(match):
        return match.group(1) #return only the first URL

    #replace the links
    processed_content = re.sub(link_pattern, replace_link, content)
    return processed_content

#process links: google redirects, normal redirects, and relative links (takes in a url)
def convertLink(url):
    #relative links
    if url.startswith("/"):
        return RELATIVE_LINK_HEAD + url
    #google redirect links
    elif "www.google.com/url" in url:
        parsed_url = urllib.parse.urlparse(url)
        query_params = urllib.parse.parse_qs(parsed_url.query)
        if 'q' in query_params:
            return query_params['q'][0]
        return url
    #redirects
    else:
        try:
            response = requests.get(url)
            return response.url
        except requests.RequestException as e:
            print(f"Error following redirects for {url}: {e}")
            return url

#finds all links and runs them through convertLink
def process_links(content):
    #regex pattern for md links
    pattern = re.compile(r'\[(.*?)\]\((.*?)\)')

    #replace each link
    def replace_link(match):
        text = match.group(1)
        url = match.group(2)
        new_url = convertLink(url)
        return f'[{text}]({new_url})'

    return pattern.sub(replace_link, content)

#given a file path to an md file, run it through every cleanup function and write the result into sample.md
def fullCleanup(file_path):
    with open(file_path, 'r') as file:
        content = file.read() #read entire file as a string
    content = addHeaderAndFooter(content)
    content = fixTitles(content)
    content = fixBullets(content)
    content = removeRedundantLinks(content)
    content = fixIndentedBullets(content)
    content = process_links(content)
    with open("sample.md", 'w') as file:
        file.write(content)

#given an md string, run it through every cleanup function and return the result
def fullCleanupString(md_string):
    content = addHeaderAndFooter(md_string)
    content = fixTitles(content)
    content = fixBullets(content)
    content = removeRedundantLinks(content)
    content = fixIndentedBullets(content)
    content = process_links(content)
    return content


#TESTS
class TestMarkdownLinkProcessing(unittest.TestCase):
    def test_remove_redundant_links(self):
        #standard use cases
        markdown_content1 = '''
redundant link (https://mail.google.com/mail/u/1/#inbox)[https://mail.google.com/mail/u/1/#inbox].
not redundant link [example](https://www.example.com).
'''
        expected_output1 = '''
redundant link https://mail.google.com/mail/u/1/#inbox.
not redundant link [example](https://www.example.com).
'''
        self.assertEqual(removeRedundantLinks(markdown_content1), expected_output1)

        #edge cases:
        #if the link does not start with http:// or https:// it will not be picked up as a link
        #if the two links are different, it does not get corrected
        markdown_content2 = '''
not link [www.example.com](www.example.com).
Different links (https://mail.google.com/mail/u/1/#inbox)[https://emojipedia.org/japanese-symbol-for-beginner].
'''
        expected_output2 = '''
not link [www.example.com](www.example.com).
Different links (https://mail.google.com/mail/u/1/#inbox)[https://emojipedia.org/japanese-symbol-for-beginner].
'''
        self.assertEqual(removeRedundantLinks(markdown_content2), expected_output2)

    @patch('requests.get')
    def test_replace_links(self, mock_get):
        #mock responses for the redirect lookup in convertLink
        def mock_get_response(url):
            class MockResponse:
                def __init__(self, url):
                    self.url = url
            if url == 'http://www.google.com/url?q=http%3A%2F%2Fwww.typolexikon.de%2F&sa=D&sntz=1&usg=AOvVaw3SSbqyjrSIq8enzBt6Gltw':
                return MockResponse('http://www.typolexikon.de/')
            elif url == 'http://www.example.com/':
                return MockResponse('http://www.example.com/')
            return MockResponse(url)

        mock_get.side_effect = mock_get_response

        #standard use cases
        markdown_content1 = '''
relative link [page](/relative-page).
Google redirect link [typolexikon.de](http://www.google.com/url?q=http%3A%2F%2Fwww.typolexikon.de%2F&sa=D&sntz=1&usg=AOvVaw3SSbqyjrSIq8enzBt6Gltw).
normal link [example.com](http://www.example.com/).
'''
        expected_output1 = '''
relative link [page](https://cldr.unicode.org/relative-page).
Google redirect link [typolexikon.de](http://www.typolexikon.de/).
normal link [example.com](http://www.example.com/).
'''
        cleaned_content = removeRedundantLinks(markdown_content1)
        self.assertEqual(process_links(cleaned_content), expected_output1)

if __name__ == '__main__':
    unittest.main()

#example usage for file path:
#fullCleanup(PATH_TO_FILE)

51 changes: 51 additions & 0 deletions tools/scripts/web/conversion-scripts/pullFromCLDR.py
import requests
from bs4 import BeautifulSoup
import markdownify
from cleanup import fullCleanupString

#LOCAL LOCATION OF CLDR
CLDR_SITE_LOCATION = "DIRECTORY TO CLDR SITE LOCATION"


#fetch HTML from the website
url = input("Enter link to convert: ")
#compute path in cldr using url
directoryPath = url.replace("https://cldr.unicode.org", "")
outputMDFile = CLDR_SITE_LOCATION + directoryPath + ".md"
#compute path for text file using name of page
outputTextFile = CLDR_SITE_LOCATION + "/TEMP-TEXT-FILES/" + url.rsplit('/', 1)[-1] + ".txt"

#get html content of page
response = requests.get(url)
html_content = response.text

#extract html inside <div role="main" ... >
soup = BeautifulSoup(html_content, 'html.parser')
main_div = soup.find('div', {'role': 'main'})
html_inside_main = main_div.encode_contents().decode('utf-8')

#convert html to md with markdownify and settings from conversion doc
markdown_content = markdownify.markdownify(html_inside_main, heading_style="ATX", bullets="-")
#clean md file using cleanup.py
cleaned_markdown = fullCleanupString(markdown_content)

#parse raw text from site
textParser = BeautifulSoup(html_inside_main, 'html.parser')

#add newlines to text content for all newline tags
for block in textParser.find_all(['p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'br']):
    block.append('\n')

#get text content from the parsed HTML
rawText = textParser.get_text()

#remove unnecessary newlines
rawText = '\n'.join(line.strip() for line in rawText.splitlines() if line.strip())

#write files to cldr in proper locations
with open(outputMDFile, 'w', encoding='utf-8') as f:
    f.write(cleaned_markdown)

with open(outputTextFile, 'w', encoding='utf-8') as f:
    f.write(rawText)
