
Showcase visualizations about common Japanese morphemes that appear in the news


sakan811/Find-Common-Japanese-Morphemes-From-News



Status

Common Japanese Morphemes in News: 🎉 Project Completed 🎉

Common Japanese Words in News: 🎉 Project Completed 🎉

CodeQL
Scraper Test
Daily News Scraper

Latest Update

Common Japanese Morphemes in News Latest Update: 16 July 2024

Common Japanese Words in News Latest Update: 16 July 2024

Common Japanese Morphemes in News

Showcase visualizations and code base about the common Japanese morphemes that appear in news.

Morphemes are the smallest units of meaning in a language.

Data was collected from 'https://www3.nhk.or.jp'

Data collection period: 25 May 2024 - 4 July 2024

Visualizations

Visualizations Latest Update: 5 July 2024

Tableau

Instagram

Facebook

Data

Located in the data folder.

Contains Japanese morpheme data collected from the NHK News website.

Total morphemes collected: 1,015,285

Contains the URLs of the news articles that the morphemes were collected from.

Total URLs collected: 896

To view a source article, append a URL from this file to https://www3.nhk.or.jp.

For example: https://www3.nhk.or.jp/news/html/20240523/k10014458551000.html
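As a minimal sketch of rebuilding a full link, the snippet below joins a stored path onto the base URL with the standard library. It assumes the stored values are site-relative paths such as `/news/html/...`; the function name `to_full_url` is hypothetical and not part of this repository.

```python
from urllib.parse import urljoin

BASE_URL = "https://www3.nhk.or.jp"


def to_full_url(stored_path: str) -> str:
    """Join a stored, site-relative path onto the NHK base URL.

    Assumption: the data file stores paths like '/news/html/...'.
    """
    return urljoin(BASE_URL, stored_path)


full_url = to_full_url("/news/html/20240523/k10014458551000.html")
print(full_url)
```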

Codebase Details

To web-scrape 'https://www3.nhk.or.jp'

  • Go to main.py
  • Adjust the SQLite database name as needed
    sqlite_db = 'japan_news_test.db' # adjust as needed
    
  • Run the script

Processes of main.py

  1. Fetch the URLs that link to news articles on the NHK News website.
  2. Check whether those URLs are already in the database, so that the script doesn't scrape text from the same source twice.
  3. Save the new set of URLs to the database.
  4. Fetch the news article text from those new URLs.
  5. Extract morphemes, Romaji, and Part of Speech.
  6. Clean the data and transform it into a Pandas DataFrame.
  7. Save the data and the news URLs to a SQLite database.
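Steps 2 and 3 can be sketched roughly as follows, using only the standard `sqlite3` module and an in-memory database. The `NewsUrls` table name comes from this README; the `Url` column, the schema, and the function names `filter_new_urls` and `save_urls` are assumptions made for illustration and may differ from the actual code in pipeline.py and sqlite_functions.py.

```python
import sqlite3


def filter_new_urls(conn: sqlite3.Connection, urls: list[str]) -> list[str]:
    """Return the URLs that are not yet stored, preserving input order."""
    # Hypothetical schema: a single text column holding the article URL.
    conn.execute("CREATE TABLE IF NOT EXISTS NewsUrls (Url TEXT PRIMARY KEY)")
    seen = {row[0] for row in conn.execute("SELECT Url FROM NewsUrls")}
    new_urls = []
    for url in urls:
        if url not in seen:
            seen.add(url)  # also drops duplicates within the same batch
            new_urls.append(url)
    return new_urls


def save_urls(conn: sqlite3.Connection, urls: list[str]) -> None:
    """Store URLs so that later runs skip them."""
    conn.executemany(
        "INSERT OR IGNORE INTO NewsUrls (Url) VALUES (?)",
        [(url,) for url in urls],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
first_run = filter_new_urls(conn, ["/news/a.html", "/news/b.html"])
save_urls(conn, first_run)
second_run = filter_new_urls(conn, ["/news/b.html", "/news/c.html"])
print(first_run, second_run)
```

On the second run only the URL that was not saved earlier is returned, which is what keeps the scraper from fetching the same article twice.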

pipeline.py

  • Contains the web-scraping pipeline's functions.

configure_logging.py

  • Contains functions for configuring logging.

news_scraper.py

data_extractor.py

  • Contains functions for extracting Japanese language data.

data_transformer.py

  • Contains functions for data transformation and cleaning.

sqlite_functions.py

  • Contains functions for working with the SQLite database.

utils.py

  • Contains utility functions.

Data is scraped from NHK News daily, automated with GitHub Actions.

Common Japanese Words in News

Showcase visualizations and code base about the common Japanese words that appear in news.

This project was built on top of the Common Japanese Morphemes in News project.

It combines the morphemes collected by that project into words by looking them up in a dictionary.

Words that aren't in the dictionary were filtered out.

The Japanese dictionary for word-lookup is based on JMdict: https://github.com/themoeway/jmdict-yomitan
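To illustrate the idea of combining morphemes by dictionary lookup, here is a minimal sketch that uses greedy longest-match. This README does not say which matching strategy morpheme_to_word.py uses, so the strategy, the `max_len` window, the function name `combine_morphemes`, and the toy dictionary are all assumptions; a real run would look words up in the JMdict-based dictionary instead.

```python
def combine_morphemes(morphemes: list[str], dictionary: set[str], max_len: int = 4) -> list[str]:
    """Greedily join consecutive morphemes into the longest dictionary word.

    Morphemes that do not start any dictionary word are dropped, mirroring
    the rule that words missing from the dictionary are filtered out.
    """
    words = []
    i = 0
    while i < len(morphemes):
        # Try the longest window first, then shrink down to one morpheme.
        for size in range(min(max_len, len(morphemes) - i), 0, -1):
            candidate = "".join(morphemes[i:i + size])
            if candidate in dictionary:
                words.append(candidate)
                i += size
                break
        else:
            i += 1  # nothing matched at this position: skip the morpheme
    return words


# Toy stand-in for the real dictionary (hypothetical entries).
toy_dictionary = {"東京", "東京都", "知事", "選挙"}
words = combine_morphemes(["東京", "都", "知事", "選挙", "zzz"], toy_dictionary)
print(words)
```

Longest-match prefers 東京都 over the shorter 東京, and the unknown token is discarded rather than kept as a word.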

Data collection period: 25 May 2024 - 4 July 2024

Visualizations

Visualizations Latest Update: 16 July 2024

Tableau

Instagram

Facebook

Data

Located in the data folder.

Contains Japanese word data from NHK News.

Total Japanese Words: 426,217

Codebase Details

morpheme_to_word.py

  • Contains functions that combine Japanese morphemes into words.
  • Before running this script, the NHK News URLs must already be stored in the SQLite database.
    • This means you should run main.py first to scrape morphemes from NHK News.

To Combine Morphemes to Words

Processes of morpheme_to_word.py

  1. It fetches the news URLs stored in the NewsUrls table in the database and scrapes the news articles.
  2. It extracts morphemes from the articles, removes non-Japanese characters, and combines the morphemes into words by looking them up in the dictionary.
  3. Part of Speech and Romaji are added for each word before the words are transformed into a Pandas DataFrame.
  4. It loads the DataFrame into a SQLite database and cleans the Part of Speech column.
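The character-cleaning part of step 2 could look something like the sketch below. It keeps only Hiragana, Katakana, and the common CJK ideograph block; these Unicode ranges are a simplifying assumption (they exclude, for example, the iteration mark 々 and Japanese punctuation), and the function names are hypothetical rather than taken from the repository.

```python
import re

# Anything outside Hiragana (3040-309F), Katakana (30A0-30FF),
# and CJK Unified Ideographs (4E00-9FFF) is treated as non-Japanese.
NON_JAPANESE = re.compile(r"[^\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def keep_japanese(text: str) -> str:
    """Strip every character outside the assumed Japanese ranges."""
    return NON_JAPANESE.sub("", text)


def clean_morphemes(morphemes: list[str]) -> list[str]:
    """Clean each morpheme and drop the ones that end up empty."""
    cleaned = [keep_japanese(m) for m in morphemes]
    return [m for m in cleaned if m]


result = clean_morphemes(["NHK", "ニュース", "2024", "年", "、"])
print(result)
```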