This repository contains source data collected from the Pravda Network, a collection of Russian news websites across multiple countries and languages. The data is automatically updated hourly, providing a comprehensive dataset for analysis.
Each gzip-compressed CSV file (.csv.gz) in the data/ directory holds the data for one sub-domain of the Pravda Network. The files contain metadata extracted from articles, including:
- URL
- Source title
- Source URL
- Canonical URL
- OG:Title
- OG:Description
- Alternate language versions
- Country
- Publication date
The data is collected using an automated web scraper that:
- Traverses all articles listed on each domain
- Extracts metadata from article pages
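The extraction step can be illustrated with a minimal sketch using only the Python standard library. This is not the scraper used for this repository, whose code is not shown here. The class name MetaExtractor and the helper extract_metadata are hypothetical. The sketch assumes that article pages expose Open Graph meta tags, a canonical link, and hreflang alternate links in the usual HTML form.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects a few head-level metadata fields from one HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {
            "canonical": None,
            "og:title": None,
            "og:description": None,
            "alternates": [],  # list of (href, hreflang) pairs
        }

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta":
            prop = attributes.get("property")
            if prop in ("og:title", "og:description"):
                self.meta[prop] = attributes.get("content")
        elif tag == "link":
            rel = (attributes.get("rel") or "").lower().split()
            href = attributes.get("href")
            if "canonical" in rel:
                self.meta["canonical"] = href
            elif "alternate" in rel and attributes.get("hreflang"):
                self.meta["alternates"].append((href, attributes["hreflang"]))


def extract_metadata(html_text):
    """Parse an HTML string and return the collected metadata dictionary."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.meta
```

Fetching the pages and walking the article listings are left out on purpose, since how the real scraper does this is not documented here.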
data/
├── abkhazia.news-pravda.com.csv.gz
├── albania.news-pravda.com.csv.gz
├── algeria.news-pravda.com.csv.gz
└── ...
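Files laid out as in the tree above can be read directly, without decompressing them first. The following is a minimal sketch using Python's standard gzip and csv modules. The function names are hypothetical, and UTF-8 encoding is an assumption rather than something this README states.

```python
import csv
import gzip
from pathlib import Path

SUFFIX = ".csv.gz"


def subdomain_from_path(path):
    """Return the sub-domain encoded in a file name such as 'albania.news-pravda.com.csv.gz'."""
    name = Path(path).name
    return name[: -len(SUFFIX)] if name.endswith(SUFFIX) else name


def iter_rows(path):
    """Yield each row of a gzip-compressed CSV file as a dict keyed by the header."""
    # Text mode ("rt") decompresses and decodes on the fly; UTF-8 is assumed.
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)


def count_rows(data_dir="data"):
    """Map each sub-domain found in data_dir to its number of rows."""
    return {
        subdomain_from_path(path): sum(1 for _ in iter_rows(path))
        for path in sorted(Path(data_dir).glob("*" + SUFFIX))
    }
```

Calling count_rows() from the repository root would give a per-sub-domain row count. Streaming rows with a generator keeps memory use low for large files.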
This repository is updated hourly via an automated script. The update process:
- Checks each domain for new articles
- Appends new data to the appropriate CSV files
- Commits and pushes changes to this repository
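The append step above could look roughly like the sketch below. It is not the update script used by this repository. It assumes that an article counts as new when its URL is not yet in the file, and the function names existing_urls and append_new_rows are hypothetical. The commit and push step is omitted.

```python
import csv
import gzip
import os

FIELDS = [
    "URL", "Source Title", "Source URL", "Canonical", "OG:Title",
    "OG:Description", "Alternates", "Country", "Publication Date",
]


def existing_urls(path):
    """Return the set of article URLs already stored in a .csv.gz file."""
    if not os.path.exists(path):
        return set()
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        return {row["URL"] for row in csv.DictReader(handle)}


def append_new_rows(path, rows):
    """Append rows whose URL is not yet present; return how many were written."""
    is_new_file = not os.path.exists(path)
    seen = existing_urls(path)
    fresh = []
    for row in rows:
        if row["URL"] not in seen:
            seen.add(row["URL"])  # also de-duplicates within the batch
            fresh.append(row)
    if not fresh:
        return 0
    # Mode "at" adds a new gzip member to the end of the file; Python's
    # gzip module reads multi-member files back as one continuous stream.
    with gzip.open(path, "at", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new_file:
            writer.writeheader()
        writer.writerows(fresh)
    return len(fresh)
```

Appending a gzip member avoids rewriting the whole file on every hourly run, at the cost of slightly worse compression than recompressing everything at once.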
All CSV files use the following header structure:
URL,Source Title,Source URL,Canonical,OG:Title,OG:Description,Alternates,Country,Publication Date
Example row:
https://domain.com/article/123.html,Original Source,https://source.com,https://domain.com/canonical,Title,Description,https://alt1.com(en);https://alt2.com(fr),Country,2024-03-15T14:30:00Z
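As a minimal sketch, the example row above can be parsed with Python's standard csv module. Judging from the example, the Alternates column appears to hold semicolon-separated URLs, each followed by a language code in parentheses. That reading is an assumption, and the helper names parse_alternates and parse_timestamp are hypothetical.

```python
import csv
import io
import re
from datetime import datetime

HEADER = ("URL,Source Title,Source URL,Canonical,OG:Title,OG:Description,"
          "Alternates,Country,Publication Date")
ROW = ("https://domain.com/article/123.html,Original Source,https://source.com,"
       "https://domain.com/canonical,Title,Description,"
       "https://alt1.com(en);https://alt2.com(fr),Country,2024-03-15T14:30:00Z")

# Matches "<url>(<lang>)", where the language code is the final parenthesised group.
ALT_PATTERN = re.compile(r"^(?P<url>.+)\((?P<lang>[^()]+)\)$")


def parse_alternates(field):
    """Split the Alternates column into (url, language) pairs."""
    pairs = []
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue
        match = ALT_PATTERN.match(item)
        if match:
            pairs.append((match.group("url"), match.group("lang")))
        else:
            pairs.append((item, None))  # no language suffix found
    return pairs


def parse_timestamp(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


record = next(csv.DictReader(io.StringIO(HEADER + "\n" + ROW + "\n")))
alternates = parse_alternates(record["Alternates"])
published = parse_timestamp(record["Publication Date"])
```

Using csv.DictReader rather than splitting on commas matters for real rows, since titles and descriptions may contain quoted commas.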
This dataset is provided for research and analysis purposes. When using this data, please cite:
CheckFirst. Pravda Network Data Collection. GitHub Repository. https://github.com/CheckFirstHQ/pravda-network
This repository is automatically updated hourly. The last update timestamp can be found at the top of this README.
Maintained by CheckFirst