-
-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/sql store #125
Feature/sql store #125
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very fucking cool, this looks awesome!
nomenklatura/db.py
Outdated
with engine.begin() as conn: | ||
yield conn | ||
finally: | ||
if conn is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, are we committing after an exception here? That feels a bit unhealty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nomenklatura/store/sql.py
Outdated
data: dict[str, Any] = stmt.to_row() | ||
data["target"] = as_bool(data["target"]) | ||
data["external"] = as_bool(data["external"]) | ||
data["first_seen"] = data["first_seen"] or datetime.utcnow() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should make up dates somewhere in the middle of the pipeline. That would lead to pretty random outcomes. Maybe we make those columns nullable instead?
Been struggling with the same issue: https://github.com/opensanctions/opensanctions/blob/main/zavod/zavod/tools/load_db.py#L42-L45
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made it nullable: 2d0dd85
I don't know what side-effects this could have in other parts of your pipeline? Especially for existing statement tables... (migrations?)
table = self.store.table | ||
q = ( | ||
select(table) | ||
.where(table.c.prop_type == "entity", table.c.value == id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it needs to check self.store.resolver.connected(id)
not just id
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One extra thing we can consider is if any of the functions ( |
Found the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks amazing :)
nomenklatura/store/sql.py
Outdated
def __init__(self, store: SqlStore[DS, CE]): | ||
self.store: SqlStore[DS, CE] = store | ||
self.batch: Optional[Set[Statement]] = None | ||
self.batch_size = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need batch_size
if we have a batch
array? I assume len(batch)
is more precise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this originally came from my believe that int > int
is more performant than len(foo) > int
but gosh, this would never be the bottleneck in the overall sql store implementation) 😂 🙈
|
||
def get_entity(self, id: str) -> Optional[CE]: | ||
table = self.store.table | ||
ids = [str(i) for i in self.store.resolver.connected(id)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, interesting, so the idea here is that we're not believing the stmt.canonical_id
in the table? While that works, it's a a bit different from all the other store implementations we have now....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, no, i guess i just misunderstood your comment here: #125 (comment)
nomenklatura/store/sql.py
Outdated
q = ( | ||
select(table) | ||
.where(table.c.dataset.in_(self.dataset_names)) | ||
.order_by("entity_id") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be canonical_id
? Otherwise it'll return fragmented entities, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally right. fixed here: c1dc320
but, canonical_id
column was nullable in the sql table, which should not, right?
nomenklatura/store/sql.py
Outdated
|
||
def _iterate_stmts(self, q: Select) -> Generator[Statement, None, None]: | ||
with self.engine.connect() as conn: | ||
conn = conn.execution_options(stream_results=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has quite a bit of overhead. I wonder if we should pass an option into this func that can disable it for get_entity
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implements a SQLStore for statements following the general Store interface as seen in MemoryStore and LevelDBStore.
There might be a few things to discuss, that's why there are several commits.
33e5456
Added sql index for prop, prop_type and schema columns. As all these values have a relatively low cardinality, this should not be a massive overhead
84600b6
I am not sure if this is the right way, but during sqlite tests the transaction was not commited without the finally statement. But this breaks mypy -.-
c4f7503
A small fix, without it the LevelDBStore was not deleting all the statements it should
cebd4d7
A test that tests all the stores that they behave exactly the same ;) (they do.)