Status

Common Japanese Morphemes in News: 🎉 Project Completed 🎉

Common Japanese Words in News: 🎉 Project Completed 🎉

Latest Update

Common Japanese Morphemes in News Latest Update: 16 July 2024

Common Japanese Words in News Latest Update: 16 July 2024

Common Japanese Morphemes in News

Showcases visualizations and the codebase for common Japanese morphemes that appear in news articles.

Morphemes are the smallest units of meaning in a language.

Data was collected from 'https://www3.nhk.or.jp'

Data collection period: 25 May 2024 - 4 July 2024

Visualizations

Visualizations Latest Update: 5 July 2024

Tableau

Instagram

Facebook

Data

Located in the data folder.

jp_morpheme_data_from_news_as_of_2024-07-04.parquet

Contains Japanese morpheme data collected from the NHK News website.

Total morphemes collected: 1,015,285

news_url_data_from_nhk_as_of_2024-07-04.parquet

Contains the URLs of the news articles that the morphemes were collected from.

Total URLs collected: 896

To view a source article, append the URL from this file to https://www3.nhk.or.jp.

For example: https://www3.nhk.or.jp/news/html/20240523/k10014458551000.html
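
The URL construction described above can be sketched with the standard library. This is only an illustration: it assumes the file stores paths such as `/news/html/...`, and the helper name `to_full_url` is hypothetical, not part of this codebase.

```python
from urllib.parse import urljoin

NHK_BASE_URL = "https://www3.nhk.or.jp"


def to_full_url(relative_url: str) -> str:
    """Join the NHK base URL with a path taken from the URL data file."""
    return urljoin(NHK_BASE_URL, relative_url)


# Path assumed to look like the example above, minus the base URL
print(to_full_url("/news/html/20240523/k10014458551000.html"))
```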

Codebase Details

To web-scrape 'https://www3.nhk.or.jp'

Go to main.py

Adjust the SQLite database name as needed

sqlite_db = 'japan_news_test.db' # adjust as needed

Run the script

Processes of main.py

Fetch the URLs that link to news articles on the NHK News website.
Check whether those URLs are already in the database so that the script doesn't scrape text from the same source twice.
Save the new set of URLs to the database.
Fetch the article text from those new URLs.
Extract morphemes, romaji, and Part of Speech.
Clean the data and transform it into a pandas DataFrame.
Save the data and the news URLs to a SQLite database.
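
The duplicate check and save steps above can be sketched with the standard `sqlite3` module. This is an illustration under assumptions, not the project's actual code: the `NewsUrls` table name appears elsewhere in this README, but the `Url` column, the `save_new_urls` helper, the sample paths, and the in-memory database are all hypothetical.

```python
import sqlite3


def save_new_urls(conn: sqlite3.Connection, urls: list[str]) -> list[str]:
    """Insert only the URLs that are not stored yet and return them."""
    # Column name "Url" is an assumption made for this sketch
    conn.execute("CREATE TABLE IF NOT EXISTS NewsUrls (Url TEXT PRIMARY KEY)")
    existing = {row[0] for row in conn.execute("SELECT Url FROM NewsUrls")}
    # dict.fromkeys drops duplicates within the batch while keeping order
    new_urls = [u for u in dict.fromkeys(urls) if u not in existing]
    conn.executemany("INSERT INTO NewsUrls (Url) VALUES (?)", [(u,) for u in new_urls])
    conn.commit()
    return new_urls


conn = sqlite3.connect(":memory:")  # throwaway database for the demo
first = save_new_urls(conn, ["/news/html/a.html", "/news/html/b.html"])
second = save_new_urls(conn, ["/news/html/b.html", "/news/html/c.html"])
print(first)
print(second)
```

Only the URLs returned by the helper would then be passed on to the article-fetching step, so a second run skips anything already scraped.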

jp_news_scraper_pipeline Package

pipeline.py

Contains the web-scraping pipeline's functions.

configure_logging.py

Contains logging configuration functions.

jp_news_scraper Package

news_scraper.py

Contains functions for fetching data from 'https://www3.nhk.or.jp'

data_extractor.py

Contains functions for extracting data about the Japanese language.

data_transformer.py

Contains functions for data transformation and cleaning.

sqlite_functions.py

Contains functions for working with the SQLite database.

utils.py

Contains utility functions.

automated_news_scraper.py

Scrapes data from NHK News daily, automated with GitHub Actions.

Common Japanese Words in News

Showcases visualizations and the codebase for common Japanese words that appear in news articles.

This project is built on top of the Common Japanese Morphemes in News project.

Morphemes collected in the Common Japanese Morphemes in News project are combined into words by looking them up in a dictionary.

Words that aren't in the dictionary are filtered out.

The Japanese dictionary for word-lookup is based on JMdict: https://github.com/themoeway/jmdict-yomitan
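
One way to picture the combining step is a greedy longest-match lookup: starting at each morpheme, try the longest run of consecutive morphemes whose concatenation is a dictionary entry. The sketch below is an assumption about the general approach, not the logic in `morpheme_to_word.py`; `combine_morphemes`, `max_len`, and the tiny `dictionary` set are hypothetical stand-ins for the real code and the JMdict data.

```python
def combine_morphemes(morphemes: list[str], dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedily merge consecutive morphemes into the longest dictionary words."""
    words = []
    i = 0
    while i < len(morphemes):
        # Try the longest candidate first, then shrink the window
        for size in range(min(max_len, len(morphemes) - i), 0, -1):
            candidate = "".join(morphemes[i:i + size])
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            # Nothing starting here is in the dictionary: filter it out
            i += 1
    return words


# Toy dictionary for illustration only
dictionary = {"東京", "東京都", "知事", "選挙"}
result = combine_morphemes(["東京", "都", "知事", "選挙", "が"], dictionary)
print(result)
```

In this toy run, 東京 and 都 merge into 東京都 because the longer entry exists, while が is dropped because it has no entry in the toy dictionary.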

Data collection period: 25 May 2024 - 4 July 2024

Visualizations

Visualizations Latest Update: 16 July 2024

Tableau

Instagram

Facebook

Data

Located in the data folder.

jp_word_data_from_news_as_of_2024-07-04.parquet

Contains Japanese word data from NHK News.

Total Japanese Words: 426,217

Codebase Details

morpheme_to_word.py

Contains functions that combine Japanese morphemes into words.
The news URLs from NHK News must already be stored in the SQLite database before you run this script.
- This means you should run main.py first to scrape morphemes from NHK News.

To Combine Morphemes to Words

Go to morpheme_to_word.py
Adjust the SQLite database name to match the one you used for main.py
```
sqlite_db = 'japan_news_test.db' # adjust as needed
```
Run morpheme_to_word.py

Processes of morpheme_to_word.py

It fetches the news URLs stored in the NewsUrls table in the database and scrapes the news articles.
It extracts morphemes from the articles, removes non-Japanese characters, and combines the morphemes into words by looking them up in the dictionary.
Part of Speech and romaji are added for each word before the data is transformed into a pandas DataFrame.
It loads the DataFrame into a SQLite database and cleans the Part of Speech column.

