Common Japanese Morphemes in News Latest Update: 16 July 2024
Common Japanese Words in News Latest Update: 16 July 2024
Showcases visualizations and the code base for common Japanese morphemes that appear in news.
Morphemes are the smallest units of meaning in a language.
Data was collected from 'https://www3.nhk.or.jp'
Data collection period: 25 May 2024 - 4 July 2024
Visualizations Latest Update: 5 July 2024
Located in the data folder.
Contains Japanese morpheme data collected from the NHK News website.
Total morphemes collected: 1,015,285
Contains URLs that link to the news articles the morphemes were collected from.
Total URLs collected: 896
URLs in this file are paths that follow https://www3.nhk.or.jp; prepend that host if you want to see the source.
For example: https://www3.nhk.or.jp/news/html/20240523/k10014458551000.html
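A minimal sketch of rebuilding a full link from a stored value. It assumes the file stores paths such as /news/html/... (inferred from the example above, not confirmed by the project); the names NHK_BASE_URL and to_full_url are hypothetical.

```python
from urllib.parse import urljoin

# Host named in this README; the stored values are assumed to be paths under it.
NHK_BASE_URL = "https://www3.nhk.or.jp"


def to_full_url(stored_path: str) -> str:
    """Join the NHK host with a stored path (hypothetical helper)."""
    return urljoin(NHK_BASE_URL, stored_path)


print(to_full_url("/news/html/20240523/k10014458551000.html"))
# https://www3.nhk.or.jp/news/html/20240523/k10014458551000.html
```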
To web-scrape 'https://www3.nhk.or.jp'
- Go to main.py
- Adjust the SQLite database name as needed
sqlite_db = 'japan_news_test.db' # adjust as needed
- Run the script
Processes of main.py
- Fetch the URLs that link to news articles on the NHK News website.
- Check whether those URLs are already in the database, so the script doesn't scrape text from the same source twice.
- Save the new set of URLs to the database.
- Fetch the news article text from those new URLs.
- Extract morphemes, romaji, and part of speech.
- Clean the data and transform it into a pandas DataFrame.
- Save the data and the news URLs to a SQLite database.
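The duplicate check in the steps above can be sketched with the standard sqlite3 module. This is an illustration, not the code in main.py: the NewsUrls table name comes from this README, while the url column and the function names are assumptions.

```python
import sqlite3


def ensure_table(conn: sqlite3.Connection) -> None:
    # Assumed schema: a single url column acting as the primary key.
    conn.execute("CREATE TABLE IF NOT EXISTS NewsUrls (url TEXT PRIMARY KEY)")


def filter_new_urls(conn: sqlite3.Connection, candidate_urls: list[str]) -> list[str]:
    """Return the candidates that are not stored yet, in their original order."""
    ensure_table(conn)
    known = {row[0] for row in conn.execute("SELECT url FROM NewsUrls")}
    new_urls = []
    for url in candidate_urls:
        if url not in known:
            known.add(url)  # also drops duplicates within the same batch
            new_urls.append(url)
    return new_urls


def save_urls(conn: sqlite3.Connection, urls: list[str]) -> None:
    """Store URLs; INSERT OR IGNORE keeps a rerun from failing on known rows."""
    ensure_table(conn)
    conn.executemany("INSERT OR IGNORE INTO NewsUrls (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
save_urls(conn, ["/news/html/a.html"])
print(filter_new_urls(conn, ["/news/html/a.html", "/news/html/b.html"]))
# ['/news/html/b.html']
```

Checking before fetching means only unseen articles are downloaded, which matters when the script runs daily.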
jp_news_scraper_pipeline Package
- Contains the web-scraping pipeline's functions.
- Contains functions for logging configuration.
jp_news_scraper Package
- Contains functions for fetching data from 'https://www3.nhk.or.jp'
- Contains functions for extracting Japanese language data.
- Contains functions for data transformation and cleaning.
- Contains functions for working with the SQLite database.
- Contains utility functions.
Scrapes data from NHK News daily, automated with GitHub Actions.
Showcases visualizations and the code base for common Japanese words that appear in news.
This project is built on top of the Common Japanese Morphemes in News project.
It combines the morphemes collected by the Common Japanese Morphemes in News project into words by looking them up in a dictionary.
Words that aren't in the dictionary are filtered out.
The Japanese dictionary for word-lookup is based on JMdict: https://github.com/themoeway/jmdict-yomitan
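One way to picture the dictionary lookup is a greedy longest-match pass over the morpheme sequence. This README does not state which matching strategy the project uses, so the sketch below is an assumption; the function name, the max_span limit, and the tiny stand-in dictionary are all hypothetical.

```python
def combine_morphemes(morphemes: list[str], dictionary: set[str], max_span: int = 4) -> list[str]:
    """Greedily join adjacent morphemes into the longest dictionary word.

    Morphemes that match no dictionary entry, alone or joined, are dropped.
    """
    words = []
    i = 0
    while i < len(morphemes):
        # Try the longest span first, then shrink down to a single morpheme.
        for span in range(min(max_span, len(morphemes) - i), 0, -1):
            candidate = "".join(morphemes[i:i + span])
            if candidate in dictionary:
                words.append(candidate)
                i += span
                break
        else:
            i += 1  # nothing matched: filter this morpheme out
    return words


# Stand-in for a JMdict-based word list (illustrative entries only).
sample_dictionary = {"東京", "東京都", "知事", "選挙"}
print(combine_morphemes(["東京", "都", "知事", "選挙", "xyz"], sample_dictionary))
# ['東京都', '知事', '選挙']
```

Trying the longest span first prefers 東京都 over 東京, at the cost of up to max_span lookups per position.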
Data collection period: 25 May 2024 - 4 July 2024
Visualizations Latest Update: 16 July 2024
Located in the data folder.
Contains Japanese word data from NHK News.
Total Japanese Words: 426,217
- Contains functions that combine Japanese morphemes into words.
- News URLs from NHK News must already be stored in the SQLite database before you run this script.
- This means you should first run main.py to scrape morphemes from NHK News.
- Go to morpheme_to_word.py
- Adjust the SQLite database name to match the one you used for main.py
sqlite_db = 'japan_news_test.db' # adjust as needed
- Run morpheme_to_word.py
Processes of morpheme_to_word.py
- It fetches the news URLs stored in the NewsUrls table in the database and scrapes the news articles.
- It extracts morphemes from the articles, cleans out non-Japanese characters, and combines the morphemes into words by looking them up in the dictionary.
- Part of speech and romaji are added for each word before the words are transformed into a pandas DataFrame.
- It loads the DataFrame into a SQLite database and cleans the Part of Speech column.
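The step that cleans out non-Japanese characters could look roughly like the sketch below. It assumes "Japanese characters" means hiragana, katakana, and the basic CJK ideograph block plus the iteration mark 々; the project's actual rule may differ, and the function names are hypothetical.

```python
import re

# Matches everything outside hiragana (U+3040-309F), katakana (U+30A0-30FF,
# which includes the long-vowel mark ー), CJK unified ideographs
# (U+4E00-9FFF), and the iteration mark 々 (U+3005).
_NON_JAPANESE = re.compile(r"[^\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3005]")


def keep_japanese(text: str) -> str:
    """Strip Latin letters, digits, punctuation, and other non-Japanese characters."""
    return _NON_JAPANESE.sub("", text)


def clean_morphemes(morphemes: list[str]) -> list[str]:
    """Clean each morpheme and drop the ones that end up empty."""
    cleaned = (keep_japanese(m) for m in morphemes)
    return [m for m in cleaned if m]


print(clean_morphemes(["NHK", "ニュース", "2024", "年", "、"]))
# ['ニュース', '年']
```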