This is a simple analysis of the articles scraped from LeMonde.
[ ] Easy filtering for the articles
[ ] Create a list of all the article IDs, with the following fields: day, month, year, category, title
[ ] Create a function that selects articles, filtered by the previous points
[ ] Make unified config files, and environment-dependent ones
- For example, I want to choose a specific data path in the VM
[x] Create a `config.py` file to handle this logic. Replace the direct import of config files with `config.py` (see the sketch after this list)
[ ] Write a script to generate the config files - in a valid format, with valid initial values
[x] Be able to recover from an abrupt interruption in the connection/scraping process
[ ] Make a list of bad articles (those that return 404). I am curious about them.
[ ] Scrape the latest articles (most of March)
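A minimal sketch of the `config.py` idea, assuming one JSON file per environment; the `APP_ENV` variable, the `config.<env>.json` naming, and the `data_path` key are all hypothetical:

```python
# config.py -- pick the environment-specific config at import time.
import json
import os
from pathlib import Path

# "local" and "vm" are assumed environment names.
ENV = os.environ.get("APP_ENV", "local")
_CONFIG_FILE = Path(__file__).parent / f"config.{ENV}.json"

with _CONFIG_FILE.open(encoding="utf-8") as f:
    _cfg = json.load(f)

# The VM's config.vm.json can point this at a different disk.
DATA_PATH = Path(_cfg["data_path"])
```

Other modules would then do `from config import DATA_PATH` instead of importing a config file directly.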
- N-gram analysis: what are the most frequent n-grams?
- TF-IDF: what are the most important words? These differ from the most frequent words (see the sketch after this list)
  - Words that are frequent across all the documents are not informative
- What is the average number of unique words per category? How did the numbers evolve over the years?
- Seasonality:
- Discover stop words
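A hedged sketch of the TF-IDF item, using scikit-learn's `TfidfVectorizer` as one possible implementation (an assumption, not the project's actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def top_tfidf_words(texts, n=10):
    """Return the n highest-weighted words for each document."""
    vectorizer = TfidfVectorizer(max_features=50_000)
    matrix = vectorizer.fit_transform(texts)   # shape: (documents, vocabulary)
    vocab = vectorizer.get_feature_names_out()
    top = []
    for row in matrix:                         # iterate over sparse rows
        weights = row.toarray().ravel()
        best = weights.argsort()[::-1][:n]     # indices of the n largest weights
        top.append([vocab[i] for i in best])
    return top
```

Words that appear in nearly every document get a low IDF and therefore a low weight, which is exactly the "important, not merely frequent" distinction; they are also natural stop-word candidates.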
- Sync all of them in one place, preferably local storage
- I am not clear on what the point of a database would be right now
- It 'might be' interesting to explore BigQuery
- Need to rethink the schema
- To make it easier to identify the articles, why not switch from a unique ID per document to a hash of the URL?
- Maybe it is a bit too late now for this tbh.
- Let's do it anyway. Syncing all the data together will soon become a big problem, and I would like to avoid duplicate articles
- Consolidate all of that in one single place
- Do that locally for now.
- Remove any duplicates
- Rename the files to the MD5 hash of their URL so that, by default, no duplicates can occur (sketched below)
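A small sketch of the renaming idea, assuming one JSON file per article; the function names are hypothetical:

```python
import hashlib
from pathlib import Path

def article_filename(url: str) -> str:
    """Deterministic name: the same URL always maps to the same file."""
    return hashlib.md5(url.encode("utf-8")).hexdigest() + ".json"

def rename_article(path: Path, url: str) -> Path:
    """Rename an article file to its URL-hash name; duplicates collapse."""
    target = path.with_name(article_filename(url))
    if target == path:
        return target                 # already renamed
    if target.exists():
        path.unlink()                 # a copy already lives under this hash
    else:
        path.rename(target)
    return target
```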
- A DB or no? What to do? Local or not?
- The quickest and cleanest option is Firebase Storage
- Very cheap
- This or a DB? Or both?
- I am very tempted to use Supabase. It will take a bit to set up on the big machine, but it should resolve the issues
- Is it document-based or SQL-based? (SQL: Supabase is built on Postgres)
- Let's try: use SQLite, put 50K articles in it, and see how it goes from there (sketched below)
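A sketch of that SQLite trial; the table uses the fields listed at the top of these notes, and the `body` column is an assumption:

```python
import sqlite3

def init_db(path="articles.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS articles (
            id       TEXT PRIMARY KEY,   -- e.g. the MD5 hash of the URL
            day      INTEGER,
            month    INTEGER,
            year     INTEGER,
            category TEXT,
            title    TEXT,
            body     TEXT
        )
    """)
    return conn

def insert_articles(conn, rows):
    """rows: iterable of (id, day, month, year, category, title, body)."""
    conn.executemany(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()

def by_category(conn, category, year):
    return conn.execute(
        "SELECT title FROM articles WHERE category = ? AND year = ?",
        (category, year)).fetchall()
```

`INSERT OR IGNORE` plus the hash-based primary key means re-inserting the same article is a no-op, which also covers the duplicate worry above.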
- To ease my queries, I need to build:
- Async data filtering --> the current data loading is sequential
- A random sampling mechanism: for testing and many other things, working with a small subset of the data is more than enough (both are sketched below)
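A sketch of both mechanisms, assuming one JSON file per article. It uses a thread pool rather than asyncio, since plain local file reads block either way; all names are hypothetical:

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def _load(path: Path) -> dict:
    return json.loads(path.read_text(encoding="utf-8"))

def load_articles(data_dir, predicate=None, workers=8):
    """Load article files concurrently instead of one by one,
    optionally keeping only those matching a predicate."""
    files = sorted(Path(data_dir).glob("*.json"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        articles = pool.map(_load, files)
        return [a for a in articles if predicate is None or predicate(a)]

def sample_files(data_dir, k=500, seed=42):
    """Reproducible random subset of article files, for quick tests."""
    files = sorted(Path(data_dir).glob("*.json"))
    return random.Random(seed).sample(files, min(k, len(files)))
```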
- Add the 'publication date' metadata to all the articles
- Make a general check that all articles share the same schema (sketched below)
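A sketch of that check; the expected key set is an assumption based on the fields listed earlier:

```python
import json
from pathlib import Path

# Assumed schema: the fields from the list at the top, plus the body text.
EXPECTED_KEYS = {"id", "day", "month", "year", "category", "title", "body"}

def check_schema(data_dir):
    """Return the files whose top-level keys differ from the expected schema."""
    bad = []
    for path in Path(data_dir).glob("*.json"):
        keys = set(json.loads(path.read_text(encoding="utf-8")))
        if keys != EXPECTED_KEYS:
            bad.append(path)
    return bad
```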