This project is a continuation of the stale bookmarks_clustering project. It has been updated to use newer embedding and generative models, mostly via the BERTopic library.
- This project uses Task to run and manage tasks, so you'll need to first install that on your machine.
- This project uses OpenAI's API. You'll need an API key from OpenAI; place it in a .env file within this project's root directory. The key should be OPENAI_KEY and the value is your API key. For example:
OPENAI_KEY=sk-proj-_mySuperSecretOpenAIkey
- Export your bookmarks to an HTML file. Note: this project used Google Chrome bookmarks.
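Before running the pipeline, it can help to confirm that the .env file is readable and contains the expected key. The following is a minimal sketch using only the standard library; the helper names `load_env` and `has_openai_key` are hypothetical and not part of this project, which may load the key differently.

```python
import os
from pathlib import Path
from typing import Dict


def load_env(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values: Dict[str, str] = {}
    env_file = Path(path)
    if not env_file.exists():
        return values
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openai_key(path: str = ".env") -> bool:
    """Return True if OPENAI_KEY is set in the environment or in the .env file."""
    return bool(os.environ.get("OPENAI_KEY") or load_env(path).get("OPENAI_KEY"))
```

Calling `has_openai_key()` from the project's root directory should return `True` once the file is in place.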
Clone this repo and install the project and dependencies:
git clone https://github.com/chris-santiago/bookmarks-topics.git
cd bookmarks-topics
conda env create -f environment.yaml
pip install .
Once you've completed the prerequisites and set up the project environment, you can run the entire pipeline using the command:
task cluster-bookmarks -- "bookmarks.input_path=your/path/to/bookmarks.html"
This will parse your bookmarks file and fetch content from all the bookmarked URLs before running the clustering algorithm. If you only want to organize a subset of your bookmarks, you can pass a comma-separated list of specific folders:
task cluster-bookmarks -- "bookmarks.input_path=your/path/to/bookmarks.html" "bookmarks.folders=[My first folder,My second folder]"
Once complete, your reorganized bookmarks are placed in a newly created outputs/topics/ directory within this project's root directory. That directory is organized by date and time; find the folder that corresponds to your most recent run and import the new_bookmarks.html file back into your browser. You can also view a breakdown of bookmarks and topics in the bookmarks_topics.json file in that same directory.
Note: If you haven't added task to your PATH, you can replace that command with ./bin/task
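Finding the most recent run by hand can be tedious when the topics directory holds many dated folders. The sketch below picks the newest new_bookmarks.html by file modification time; the function name `latest_bookmarks_file` is hypothetical, and the search is recursive because the exact date and time folder layout is an assumption here.

```python
from pathlib import Path
from typing import Optional


def latest_bookmarks_file(
    topics_dir: str, filename: str = "new_bookmarks.html"
) -> Optional[Path]:
    """Return the most recently modified `filename` found under topics_dir."""
    # Search recursively so the exact date/time folder nesting does not matter.
    candidates = list(Path(topics_dir).rglob(filename))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Pass the path of the topics output directory; the function returns `None` when no run has produced the file yet.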
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3>JavaScript D3.js</H3>
<DL><p>
<DT><A HREF="https://stackoverflow.com/questions/32205507/moving-the-axes-in-d3-js">javascript - Moving the axes in d3.js - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/25158688/d3-csv-accessor-function-for-loop">javascript - D3.csv accessor function for loop - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/33482812/javascript-take-every-nth-element-of-array">Javascript: take every nth Element of Array - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/23227991/how-to-add-in-zero-values-into-a-time-series-in-d3-js-javascript">How to add in zero values into a time series in d3.js / JavaScript - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/1187518/how-to-get-the-difference-between-two-arrays-in-javascript">How to get the difference between two arrays in JavaScript? - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/16179021/d3-js-specify-text-for-x-axis">javascript - d3.js Specify text for x-axis - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/43646573/d3-get-attributes-from-element/43646752">javascript - D3 get attributes from element - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/28572015/how-to-select-unique-values-in-d3-js-from-data/28572315">javascript - How to select unique values in d3.js from data - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/10644778/targeting-nested-elements-with-css">html - Targeting nested elements with CSS - Stack Overflow</A>
<DT><A HREF="https://math.meta.stackexchange.com/questions/5020/mathjax-basic-tutorial-and-quick-reference/5044#5044">MathJax basic tutorial and quick reference - Mathematics Meta Stack Exchange</A>
<DT><A HREF="https://stackoverflow.com/questions/46945784/how-to-debug-javascript-in-visual-studio-code-with-live-server-running">How to Debug JavaScript in Visual Studio Code with live-server Running - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/52788743/intellij-error-java-release-version-10-not-supported/54963753">jetbrains ide - IntelliJ: Error: java: release version 10 not supported - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/20197961/reversed-y-axis-d3">javascript - reversed Y-axis D3 - Stack Overflow</A>
<DT><A HREF="https://stackoverflow.com/questions/49281258/plot-multiple-lines-in-a-for-loop-in-d3">d3.js - Plot multiple lines in a for loop in d3 - Stack Overflow</A>
</DL><p>
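To make the input format above concrete, here is a minimal sketch that reads a Netscape-style bookmarks export and groups links by folder, using only Python's standard `html.parser`. It is an illustration of the file structure, not this project's parser; `BookmarkParser` and `parse_bookmarks` are hypothetical names.

```python
from collections import defaultdict
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class BookmarkParser(HTMLParser):
    """Collect (title, url) pairs per folder from a Netscape bookmarks file."""

    def __init__(self) -> None:
        super().__init__()
        self.folders: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
        self._stack: List[str] = []  # folder names for each open <DL>
        self._pending = ""           # folder name from <H3>, awaiting its <DL>
        self._capture = False        # True while inside <H3> or <A>
        self._text: List[str] = []
        self._href = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "dl":
            self._stack.append(self._pending)
            self._pending = ""
        elif tag in ("h3", "a"):
            self._capture, self._text = True, []
            if tag == "a":
                self._href = dict(attrs).get("href") or ""

    def handle_data(self, data):
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._text).strip()
        if tag == "h3":
            self._pending, self._capture = text, False
        elif tag == "a":
            folder = "/".join(name for name in self._stack if name) or "(root)"
            self.folders[folder].append((text, self._href))
            self._capture = False
        elif tag == "dl" and self._stack:
            self._stack.pop()


def parse_bookmarks(html: str) -> Dict[str, List[Tuple[str, str]]]:
    """Return a mapping of folder path to a list of (title, url) tuples."""
    parser = BookmarkParser()
    parser.feed(html)
    parser.close()
    return dict(parser.folders)
```

Nested folders are joined with `/`, and links outside any named folder are collected under `(root)`.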
[
  {
    "url": "https://appliedcausalinference.github.io/aci_book",
    "title": "Applied Causal Inference",
    "topic": "Bayesian Causal Inference"
  },
  {
    "url": "https://astral.sh/blog/u",
    "title": "uv: Python packaging in Rust",
    "topic": "Python Development Tools"
  },
  {
    "url": "https://bayesiancomputationbook.com/markdown/chp_01.htm",
    "title": "1. Bayesian Inference \u2014 Bayesian Modeling and Computation in Python",
    "topic": "Bayesian Causal Inference"
  }
]
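The JSON breakdown above is easy to summarize by topic. This sketch assumes the file is a list of objects with `url`, `title`, and `topic` keys, as in the sample; `group_by_topic` and `load_topics` are hypothetical helper names, not part of this project.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_topic(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Map each topic to the titles of the bookmarks assigned to it."""
    topics: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        topics[record["topic"]].append(record["title"])
    return dict(topics)


def load_topics(path: str) -> Dict[str, List[str]]:
    """Read a bookmarks_topics.json file and group its entries by topic."""
    with open(path, encoding="utf-8") as handle:
        return group_by_topic(json.load(handle))
```

For the three sample entries, this yields two topics: "Bayesian Causal Inference" with two titles and "Python Development Tools" with one.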
This project is configured using Hydra, and the current configs are found in the conf directory. You can modify behavior by editing these configs directly or by overriding them on the command line.
| Config | Use | Path |
|---|---|---|
| Main | Main configuration file. Use this to tune the topic model. | conf/config.yaml |
| Prompt | Configure LLM prompts. | conf/prompt/* |
| Paths | Configure your local paths. | conf/paths/default.yaml |
| Hydra | Configure Hydra. | conf/hydra/default.yaml |
You can override much of the configuration directly from the command line by passing Hydra overrides after -- in the command. For example:
task cluster-bookmarks -- "topics.topic_model.top_n_words=5"
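Conceptually, a dotted override such as the one above addresses a nested key in the configuration tree. The sketch below illustrates that idea on a plain dictionary; it is a simplification, not Hydra's actual override parser, and both the `apply_override` helper and the starting value of 10 are hypothetical.

```python
import ast


def apply_override(config: dict, override: str) -> dict:
    """Set a nested key in `config` from a 'dotted.key=value' string."""
    key_path, _, raw_value = override.partition("=")
    try:
        # Interpret numbers, booleans, and quoted strings as Python literals.
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        # Anything else (for example a file path) is kept as a plain string.
        value = raw_value
    *parents, leaf = key_path.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


# Hypothetical starting config; the real defaults live in conf/config.yaml.
config = {"topics": {"topic_model": {"top_n_words": 10}}}
apply_override(config, "topics.topic_model.top_n_words=5")
```

After the call, `config["topics"]["topic_model"]["top_n_words"]` holds the integer 5.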
You can also run individual tasks, each of which executes the corresponding Python module. This is useful when you are tuning the topic model (task: topics) and want to avoid fetching and parsing HTML from your bookmarked URLs again.
task: Available tasks for this project:
* bookmarks: Read bookmarks file
* check-config: Check Hydra configuration
* cluster-bookmarks: Run entire bookmarks clustering pipeline.
* fetch-html: Get bookmarks raw html
* lint: Check source code for errors (will run before tasks)
* parse-html: Parse bookmarks raw html
* topics: Get topics