Commit f90d7d3

Merge pull request #29 from mfarragher/dev
v0.10
2 parents 66b3646 + 87ba56e commit f90d7d3

25 files changed, +1459 -209 lines changed

.github/workflows/ci.yml

+8 -1
@@ -1,5 +1,11 @@
 name: codecov
-on: [push, pull_request]
+on:
+  pull_request:
+    branches-ignore:
+      - main
+  push:
+    branches:
+      - main
 jobs:
   run:
     runs-on: ${{ matrix.os }}
@@ -16,6 +22,7 @@ jobs:
       uses: actions/setup-python@v2
       with:
         python-version: ${{ matrix.python-version }}
+        node-version: 16
     - name: Install dependencies
       run: |
         pip install -r requirements.txt --use-pep517

README.md

+14 -12
@@ -4,50 +4,52 @@
 # obsidiantools 🪨⚒️
 **obsidiantools** is a Python package for getting structured metadata about your [Obsidian.md notes](https://obsidian.md/) and analysing your vault. Complement your Obsidian workflows by getting metrics and detail about all your notes in one place through the widely-used Python data stack.

-It's incredibly easy to explore structured data on your vault through this fluent interface. This is all the code you need to generate a `vault` object that stores the key data:
+It's incredibly easy to explore structured data on your vault through this fluent interface. This is all the code you need to generate a `vault` object that stores all the data:

 ```python
 import obsidiantools.api as otools

 vault = otools.Vault(<VAULT_DIRECTORY>).connect().gather()
 ```

-These are the basics of the function calls:
-- `connect()`: connect your notes together in a graph structure and get metadata on links (e.g. wikilinks, backlinks, etc.)
+These are the basics of the method calls:
+- `connect()`: connect your notes together in a graph structure and get metadata on links (e.g. wikilinks, backlinks, etc.) There is the option to support the inclusion of 'attachment' files in the graph.
 - `gather()`: gather the plaintext content from your notes in one place. This includes the 'source text' that represents how your notes are written. There are arguments to support what text you want to remove, e.g. remove code.

 See some of the **key features** below - all accessible from the `vault` object either through a method or an attribute.

-As this package relies upon note (file)names, it is only recommended for use on vaults where wikilinks are not formatted as paths and where note names are unique. This should cover the vast majority of vaults that people create.
+The package is built to support the 'shortest path when possible' option for links. This should cover the vast majority of vaults that people create. See the [wiki](https://github.com/mfarragher/obsidiantools/wiki) for more info on what sort of wikilink syntax is not well-supported and how the graph may be slightly different to what you see in the Obsidian app.

 ## 💡 Key features
 This is how **`obsidiantools`** can complement your workflows for note-taking:
 - **Access a `networkx` graph of your vault** (`vault.graph`)
     - NetworkX is the main Python library for network analysis, enabling sophisticated analyses of your vault.
     - NetworkX also supports the ability to export your graph to other data formats.
     - When instantiating a `vault`, the analysis can also be filtered on specific subdirectories.
-- **Get summary stats about your notes, e.g. number of backlinks and wikilinks, in a Pandas dataframe**
-    - Get the dataframe via `vault.get_note_metadata()`
+- **Get summary stats about your notes & files, e.g. number of backlinks and wikilinks, in a Pandas dataframe**
+    - Get the dataframe via `vault.get_note_metadata()` (notes / md files), `vault.get_media_file_metadata()` (media files that can be embedded in notes) and `vault.get_canvas_file_metadata()` (canvas files).
 - **Retrieve detail about your notes' links and metadata as built-in Python types**
+    - The main indices of files are `md_file_index`, `media_file_index` and `canvas_file_index` (canvas files).
+    - Check whether files included as links in the vault actually exist, via `vault` attributes like `nonexistent_notes`, `nonexistent_media_files` and `nonexistent_canvas_files`.
+    - Check whether actual files are isolated in the graph ('orphans'), via `vault` attributes like `isolated_notes`, `isolated_media_files` and `isolated_canvas_files`.
+    - You can access all the note & file links in one place, or you can load them for an individual note:
+        - e.g. `vault.backlinks_index` for all backlinks in the vault
+        - e.g. `vault.get_backlinks(<NOTE>)` for the backlinks of an individual note
     - **md note info:**
         - The various types of links:
             - Wikilinks (incl. header links, links with alt text)
             - Embedded files
             - Backlinks
             - Markdown links
-        - You can access all the links in one place, or you can load them for an individual note:
-            - e.g. `vault.backlinks_index` for all backlinks in the vault
-            - e.g. `vault.get_backlinks(<NOTE>)` for the backlinks of an individual note
         - Front matter via `vault.get_front_matter(<NOTE>)` or `vault.front_matter_index`
         - Tags via `vault.get_tags(<NOTE>)` or `vault.tags_index`. Nested tags are supported.
         - LaTeX math via `vault.get_math(<NOTE>)` or `vault.math_index`
-        - Check which notes are isolated (`vault.isolated_notes`)
-        - Check which notes do not exist as files yet (`vault.nonexistent_notes`)
         - As long as `gather()` is called:
             - Get source text of note (via `vault.get_source_text(<NOTE>)`). This tries to represent how a note's text appears in Obsidian's 'source mode'.
             - Get readable text of note (via `vault.get_readable_text(<NOTE>)`). This tries to reduce note text to minimal markdown formatting, e.g. preserving paragraphs, headers and punctuation. Only slight processing is needed for various forms of NLP analysis.
     - **canvas file info:**
         - The JSON content of each canvas file is stored as a Python dict in `vault.canvas_content_index`
+        - Data to recreate the layout of content in a canvas file via the `vault.canvas_graph_detail_index` dict

 Check out the functionality in the demo repo. Launch the '15 minutes' demo in a virtual machine via Binder:

@@ -58,7 +60,7 @@ There are other API features that try to mirror the Obsidian.md app, for your co
 The text from vault notes goes through this process: markdown → split out front matter from text → HTML → ASCII plaintext.

 ## ⏲️ Installation
-``pip install obsidiantools``
+`pip install obsidiantools`

 Requires Python 3.9 or higher.
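
Reviewer note: the README context line in this hunk describes the text pipeline as markdown → split out front matter from text → HTML → ASCII plaintext. As a rough illustration of the "split out front matter" step only, here is a minimal standard-library sketch. It assumes front matter is a leading block delimited by `---` lines. The function name `split_front_matter` is hypothetical and this is not the package's own implementation, which may handle more cases.

```python
def split_front_matter(text: str) -> tuple[str, str]:
    """Return (front_matter, body); front_matter is '' when none is found.

    Assumption: front matter is a block at the very top of the note,
    opened and closed by lines containing only '---'.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            # Everything between the delimiters is front matter,
            # everything after the closing delimiter is the note body.
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    # No closing delimiter: treat the whole text as body.
    return "", text


note = "---\ntitle: Demo\ntags: [a, b]\n---\n# Heading\n\nBody text.\n"
front, body = split_front_matter(note)
print(repr(front))
print(repr(body))
```

The later steps (HTML conversion and plaintext extraction) are left out here because they depend on third-party libraries that this hunk does not show.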

obsidiantools/__init__.py

+1
@@ -2,3 +2,4 @@
 from . import md_utils
 from . import html_processing
 from . import canvas_utils
+from . import media_utils

obsidiantools/_constants.py

+21
@@ -13,3 +13,24 @@
 # helpers:
 WIKILINK_AS_STRING_REGEX = r'\[[^\]]+\]\([^)]+\)'
 EMBEDDED_FILE_LINK_AS_STRING_REGEX = r'!?\[{2}([^\]\]]+)\]{2}'
+
+# Sets of extensions via https://help.obsidian.md/How+to/Embed+files :
+# NB: file.ext and file.EXT can exist in same folder
+IMG_EXT_SET = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.svg',
+               '.PNG', '.JPG', '.JPEG', '.GIF', '.BMP', '.SVG'}
+AUDIO_EXT_SET = {'.mp3', '.webm', '.wav', '.m4a', '.ogg', '.3gp', '.flac',
+                 '.MP3', '.WEBM', '.WAV', '.M4A', '.OGG', '.3GP', '.FLAC'}
+VIDEO_EXT_SET = {'.mp4', '.webm', '.ogv', '.mov', '.mkv',
+                 '.MP4', '.WEBM', '.OGV', '.MOV', '.MKV'}
+PDF_EXT_SET = {'.pdf',
+               '.PDF'}
+# canvas files:
+CANVAS_EXT_SET = {'.canvas',
+                  '.CANVAS'}
+
+# metadata df cols order:
+METADATA_DF_COLS_GENERIC_TYPE = [
+    'rel_filepath', 'abs_filepath',
+    'file_exists',
+    'n_backlinks', 'n_wikilinks', 'n_tags', 'n_embedded_files',
+    'modified_time']
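
Reviewer note: the extension sets above list both lower-case and upper-case spellings because matching is done by exact string membership. The sketch below shows what that implies for a suffix check, using only `pathlib`. The helper name `has_ext_in_set` is hypothetical and is not part of the package. `PurePosixPath` is used so the example behaves the same on any OS.

```python
from pathlib import PurePosixPath

# Copied from the diff above (image extensions only, for brevity).
IMG_EXT_SET = {'.png', '.jpg', '.jpeg', '.gif', '.bmp', '.svg',
               '.PNG', '.JPG', '.JPEG', '.GIF', '.BMP', '.SVG'}


def has_ext_in_set(path: str, exts: set[str]) -> bool:
    """Hypothetical helper: True if the path's suffix is in the set (exact match)."""
    return PurePosixPath(path).suffix in exts


candidates = ["img/pic.png", "img/pic.PNG", "img/pic.Png", "notes/readme.md"]
matches = [has_ext_in_set(p, IMG_EXT_SET) for p in candidates]
print(matches)
```

With exact membership, an all-lower or all-upper suffix matches, while a mixed-case suffix such as `.Png` does not. Whether that matters depends on which files Obsidian itself treats as embeddable, which this diff does not state.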

obsidiantools/_io.py

+29
@@ -1,5 +1,6 @@
 from pathlib import Path
 from glob import glob
+import numpy as np


 def get_relpaths_from_dir(dir_path: Path, *, extension: str) -> list[Path]:
@@ -81,3 +82,31 @@ def get_relpaths_matching_subdirs(dir_path: Path, *,
                                   extension=extension)
             if str(i.parent.as_posix())
             in include_subdirs_final]
+
+
+def _get_valid_filepaths_by_ext_set(dirpath: Path, *,
+                                    exts: set[str]):
+    all_files = [p.relative_to(dirpath)
+                 for p in Path(dirpath).glob("**/*")
+                 if p.suffix in exts]
+    return all_files
+
+
+def _get_shortest_path_by_filename(relpaths_list: list[Path]) -> dict[str, Path]:
+    # get filename w/ ext only:
+    all_file_names_list = [f.name for f in relpaths_list]
+
+    # get indices of dupe 'filename w/ ext':
+    _, inverse_ix, counts = np.unique(
+        np.array(all_file_names_list),
+        return_inverse=True,
+        return_counts=True,
+        axis=0)
+    dupe_names_ix = np.where(counts[inverse_ix] > 1)[0]
+
+    # get shortest paths via mask:
+    shortest_paths_arr = np.array(all_file_names_list, dtype=object)
+    shortest_paths_arr[dupe_names_ix] = np.array(
+        [str(fpath)
+         for fpath in relpaths_list])[dupe_names_ix]
+    return {fn: path for fn, path in zip(shortest_paths_arr, relpaths_list)}
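
Reviewer note: as I read `_get_shortest_path_by_filename`, a file whose name is unique in the vault is keyed by its bare filename, while files that share a filename are keyed by their full relative path. The sketch below reproduces that behaviour without NumPy, using `collections.Counter`, to make the intended mapping easier to check. The function name `shortest_path_by_filename` is hypothetical and this is a reading aid, not a proposed replacement. `PurePosixPath` is used so the string keys are the same on every OS, whereas the committed code calls `str()` on whatever `Path` objects it receives.

```python
from collections import Counter
from pathlib import PurePosixPath


def shortest_path_by_filename(relpaths):
    """Map each file to the shortest key that identifies it unambiguously.

    Unique filenames are keyed by name only; duplicated filenames are
    keyed by their full relative path.
    """
    name_counts = Counter(p.name for p in relpaths)
    return {
        (p.name if name_counts[p.name] == 1 else str(p)): p
        for p in relpaths
    }


relpaths = [
    PurePosixPath("a/img.png"),
    PurePosixPath("b/img.png"),   # same filename as above: needs full path
    PurePosixPath("c/doc.pdf"),   # unique filename: name alone is enough
]
index = shortest_path_by_filename(relpaths)
for key, path in index.items():
    print(key, "->", path)
```

This mirrors the 'shortest path when possible' option that the README hunk refers to: the key only grows to a full path when the filename alone would be ambiguous.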
