Better XML format [RFC] #279
Comments
I don't know if this is up to date or relevant, but just in case you hadn't seen it, I found this in the wiki today: https://wiki.gnome.org/Apps/GTG/DataModel |
So, my gut feeling (and I'm probably not the best person to comment on file format design; I believe @broussea, @ploum and @izidormatusov would be much more qualified than me to comment) is that your observations generally make sense but I have some reservations:
Anyhow, those are just my uneducated guesses. |
I think UUID instead of ID makes sense here; it would avoid any practical chance of clashes, making things like merging two different GTG XML files together much more straightforward. It can also act as the primary key in any potential DB-backed storage backends (instead of ID). Internally it can be handled as a proper uuid.UUID (128-bit number) too, not as inefficient strings. I'd be careful about merging everything into a single big file, if it isn't one already. A megabyte-sized XML file for "hardcore" users, especially if "completed" tasks are kept in there as well, doesn't sound very trouble-free, or performant for a simple edit. Maybe a big file that gets logically split once it grows, with everything merged together seamlessly in the background, but on edit only saving the file that contains the changed element? Then again, I'm not really sure of XML (lxml) performance here. Not knowing much about the context and bigger picture, those were my initial thoughts. |
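To make the uuid.UUID point concrete, here is a minimal sketch using only the standard library. The helper names (new_task_id, parse_task_id) are hypothetical and not part of GTG; the sample ID is just an illustrative UUID4 string.

```python
import uuid


def new_task_id() -> uuid.UUID:
    """Hypothetical helper: create a random (version 4) task ID."""
    return uuid.uuid4()


def parse_task_id(raw: str) -> uuid.UUID:
    """Hypothetical helper: turn the string stored in the XML into a UUID.

    uuid.UUID raises ValueError for malformed input, which gives some
    validation for free when loading a file.
    """
    return uuid.UUID(raw)


# In memory the ID is a 128-bit number that can be used as a dict key;
# it is only turned back into a string when writing the file.
task_id = parse_task_id("2fdcd50f-0106-48b2-9f16-db2f8dbbf044")
tasks = {task_id: "Learn How To Use Subtasks"}
serialized = str(task_id)
```

Because the keys are UUID objects rather than strings, two files can be merged by a plain dict update, and a clash would only happen if both files contain the very same task.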
Indeed, this is covered in the header of the proposed file. It would store both gtg's version and the xml. <gtg-data app-version="0.5" xml-version="2"> Benefits (off the top of my head):
That's interesting, though the tags are tightly related to the tasks. Most of the time if you are writing a tag, you are also doing something to a task. We would have to check this TBH.
Having everything in one file makes profiles even easier: just load a different file. Unless you want to share tags across profiles, I don't know how useful that would be tho. We could even have a command line parameter to pass gtg a path to any random xml file, and be able to load random profiles.
I have 367 active tasks, with another 236 done. I often paste a lot of text and urls into tasks (some are straight up notes lol). My numbers are:
So we are looking at 285kb in total. I would need about 4 times more tasks to reach 1MB. @nekohayo you are the ultimate GTG warrior, how big are your files? As for lxml, the website has benchmarks:
Parsing times shouldn't be a problem unless you are actually more busy than god, though building the Treemodel could take a while. But that would be the same with the current format. You do have a point with old closed tasks. Maybe we can detect if auto-purge is disabled and move closed tasks to a separate file. Though maybe the end result would be the same. We need to load everything, so whether it's in one file or two it's going to take a while.
Good points guys, I hadn't thought about conflicts and storing it as a uuid. I'll update the proposal |
For what it's worth, the biggest filesize I've had for my tasks xml file has been 930 kB (recently I stopped pruning closed/done tasks for about 6 months for some particular reason), though it would be infinitely bigger if I hadn't used the task reaper plugin (now part of core) for all these years. That said, if lxml is as fast as it sounds, the performance problem will be negligible. I agree that having the closed tasks be a separate XML probably doesn't change much. Though, now that I think of it, it probably could allow some mega-optimization hack when the "closed tasks remover" feature is called (simpler search domain), in theory... but it might not be needed, as that kind of optimization might be dwarfed by the performance gains of lxml. Again... I don't think I'm the right person to have an opinion on the "proper" way to structure data within the XML format ;) |
I don't worry about the performance of it too much, if you don't end up with 5+ MB things. I'd think more about the aspects of just having to write a 1MB-10MB all the time when one little change is made, and having those queued up to be done constantly during active GTG use. |
GTG had big plans to support many backends. You could theoretically keep your
AFAIK only the template start tags are using this. These style ids were used in the past and got replaced by uuids. Have not been removed completely.
go for uuid
There are "fuzzy" dates: now, soon, someday. People are using these quite a bit
Sounds good.
There is a reason for the nested XML. GTG supports basic formatting like
That said, the situation can be improved a lot. You can have XML nested inside other XML instead of storing a serialized version. There are many bugs where the subtasks are not deserialized properly and tasks end up with garbage like
GTG supports tasks represented as Directed Acyclic Graphs (i.e. a task can have two parents). To be honest, this adds a lot of complexity and is not very well supported in the UI. If you put subtasks under the main task, you remove this ability. Subtasks can also have their parents changed, which would mean more complex code when serialising the tasks.
Sounds good.
+1 for having proper XML inside of content.
|
Sounds like some of these things would be better handled at a backend level 🤔
True, I forgot to mention that
Right, this is what I wanted to do with the content tag. Separate the text from tags to make it easier to parse. Though it seems like mixing text and tags is legal and supported by lxml. Still on the fence on that, since it would make the file simpler but processing more complicated.
We talked about this with @ploum recently. I'm leaning towards removing this from GTG. There's only a couple of UI functions for this that don't currently work, and very few use cases for the amount of complexity. For the use case he mentioned (tasks being blocked by more than one task), I think it would be easier to have some kind of internal linking between tasks. Then you can write something like Thanks for weighing in! |
@diegogangl asked me for input because I'm an XML nerd and I have some suggestions of my own in #431 (sidenote, a schema would let you validate with xmllint as well as any XSLT/XML 1.0 parser that supports validation, not just LXML). Here goes!
id/uuid
I'd also strongly, strongly recommend using UUID4 ( It may be worth considering to use UUID5 instead, with the namespace being the "project" name if that's a feature that GTG decides going forward is something to keep (i.e. Fully agreed on one and only one The only difficulty is we can't specify the Worth noting that I happen by luck to already have a validation for use in schemas for UUID4, so that'd work fine.
Unified or split XML files?
Why not both? Now that you're switching to LXML (#401), you can use XInclude right from LXML. Let the engine reassemble the files for you when you parse. Split them into however many you want. It's a single function call, dirt easy. I'd avoid hyphens in tags, though. Instead of
ISO 8601 for dates/times
The XML itself should always include static dates, ideally in UTC/"Zulu" time, for data portability and validation purposes. I think it'd be okay (if there's a reasonable way to parse it) to let the user define a specific date/time in relative terms, but then convert it into a static date/time (and, thereafter, display as a static date/time). ISO 8601 is a good choice, since it has a native XML schema definition that can accept multiple formats. If you plan on implementing "expected durations of time", there's a type for that too (or it can be specified right in the timestamp). I do already have a type for accepting either an ISO 8601 or UNIX Epoch, though, too. My recommendation is the format
Version info in root element attributes
YES. A big yes. You're going to have to probably break backwards compat for this first release of new data since previous versions won't even have the data version attribute, but that should be fine because SO much of this is going to change that it wouldn't be worth keeping code around to parse previous versions. A converter should exist, but you probably don't want to keep conditionals around for that old code in the core. That said, again - I'd avoid hyphens in tags.
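For the XInclude suggestion above, here is a small sketch of what "let the engine reassemble the files" looks like. To keep it runnable without extra packages it uses the standard library's xml.etree.ElementInclude as a stand-in for lxml (lxml offers the same idea through an xinclude() call on a parsed tree). The file names, the in-memory FILES mapping and the memory_loader function are assumptions made for the demo, not part of the proposal.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical split files, kept in memory so the sketch is self-contained.
FILES = {
    "tags.xml": '<taglist><tag id="t1" name="GTG"/></taglist>',
    "tasks.xml": '<tasklist><task id="a1"><title>Demo</title></task></tasklist>',
}

# The "main" file only points at the other two.
MAIN = """<gtgData xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="tags.xml"/>
  <xi:include href="tasks.xml"/>
</gtgData>"""


def memory_loader(href, parse, encoding=None):
    """Loader used instead of reading from disk (demo assumption)."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


root = ET.fromstring(MAIN)
# One call replaces every xi:include element with the referenced content.
ElementInclude.include(root, loader=memory_loader)
child_tags = [child.tag for child in root]
```

After the include step the tree looks as if everything had been written in a single file, so the rest of the loading code would not need to know whether the data was split.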
Nest subtasks inside their parent tasks
You could, but you lose some uniqueness checking. Instead, I'd recommend keeping subtasks as actual tasks and making a i.e.
<!-- ... -->
<task ...>
<!-- ... -->
<subtasks>
<sub>SUB-ID-ATTR-HERE</sub>
<sub>ANOTHER-HERE</sub>
</subtasks>
</task>
<!-- ... -->
Remove
|
@johnnybubonic whoa, thank you so much for all the feedback! This helps a lot \o/
id/uuid
UUID4 looks fine. Tasks can be deleted, and old closed tasks get autopurged by default, so I wouldn't worry much about collisions.
This sounds useful, but I think tags and tasks should have "different sets" of IDs. We load the entire file into memory and then query those data structures, so there's no chance of collision between tags and tasks.
Unified or split XML files?
Didn't know about XInclude, that looks really useful but TBH there's just no good reason to split the files other than file size.
ISO 8601 for dates/times
The problem here is the fuzzy dates. We don't just have tomorrow, we also have: now, someday and soon. None of these match easily with an actual date. The Date class does match them to absolute dates but it's kind of hacky and would be really hard to parse back into something fuzzy. Maybe we can have separate
Version info in root element attributes
Thanks for the tip. Yeah, my idea is to have a separate module to host all the versioning code.
Nest subtasks inside their parent tasks
What do you mean by uniqueness checking?
We probably won't support multiple parents. It's a source of headaches both for the backend code and the UI, and the use cases are better served by just supporting internal linking between tasks.
Remove task-remote-ids
There was some code to read them but it was already commented out with a suggestion to remove them when I got here :)
Tags
I like your proposal better, there's no reason to keep ID as an attribute there.
Personal/Additional suggestions
These all sound great! The Pythonista in me hates not using hyphens, but if that's the standard way 👍 About constraints:
That's all I can think of 🤔 , everything else is optional. |
My pleasure!
Yep, but an
Yep, agreed, but it does let it be more modular. Granted, with modularity can come complexity, so YMMV.
Is it required to display them in a fuzzy manner, or just parse them as input and write to the data storage as a fixed time? I'd think the latter would probably be the way to go. (humanize WOULD let you display it as fuzzy pretty well, FWIW. It's best to store the dates in a format easily understood by the machine since, realistically, humans shouldn't be looking at the raw XML files.)
Task IDs, tasks content, anything. Since subtasks can have subtasks of their own, you start messing with recursion. While it's possible to support recursion in a schema from what I recall, it does lead to some potentially messy parsing. For those reasons I'd recommend treating subtasks as references to actual tasks rather than containing the entire subtask.
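As a rough illustration of "subtasks as references", the sketch below loads a flat task list where each parent only lists the IDs of its subtasks, then checks uniqueness and dangling references in one pass. It uses the standard library's ElementTree rather than lxml, and the sample data and function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample: every task is a top-level <task>; parents hold references.
SAMPLE = """<tasklist>
  <task id="parent">
    <title>Parent</title>
    <subtasks><sub>child-1</sub><sub>missing</sub></subtasks>
  </task>
  <task id="child-1"><title>Child</title></task>
</tasklist>"""


def load_tasks(xml_text):
    """Build {task_id: {"title": ..., "subtasks": [ids]}} from a flat list."""
    tasks = {}
    for task in ET.fromstring(xml_text).iter("task"):
        task_id = task.get("id")
        if task_id in tasks:
            # A flat list makes the uniqueness check a simple dict lookup.
            raise ValueError("duplicate task id: " + task_id)
        tasks[task_id] = {
            "title": task.findtext("title", default=""),
            "subtasks": [(sub.text or "").strip()
                         for sub in task.iterfind("subtasks/sub")],
        }
    return tasks


def dangling_references(tasks):
    """Return subtask IDs that do not match any loaded task."""
    return sorted({sub for task in tasks.values()
                   for sub in task["subtasks"] if sub not in tasks})


tasks = load_tasks(SAMPLE)
missing = dangling_references(tasks)
```

No recursion is needed while parsing: the parent/child structure is rebuilt afterwards from the ID lists, however deep the subtasks go.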
Yep. I'd hate to see/use camelCase in my actual code too (I tend to opt for underscores), but code and data are different! Hyphens can mess up some XML libraries. W3C occasionally uses hyphens for data in their examples but even then, only sometimes - they're pretty inconsistent about it (for instance, all of the standard type definitions are in camelCase i.e.
Thanks for the details about constraints! That helps a lot. |
Yes, it's required to store them. I should mention that "tomorrow" or "friday" aren't fuzzy dates. Those are converted to actual dates after the user selects/types them. Fuzzy dates are "someday", "soon" and "now". None of these can be converted to dates and we need to store them as fuzzy.
Shame, I thought XML was all about nesting. By the way, what do you think about mixing text and tags in
<content>
This is some text
<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>
Some more text, maybe <strong>bold too?</strong>
</content>
vs
<content>
<p>This is some text</p>
<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>
<p>Some more text, maybe</p> <strong>bold too?</strong>
</content>
Seems like it's somewhat valid, and lxml supports it. But I don't know if it's supported in schemas or could cause other kinds of trouble further along. |
Hrm, I see... I'd store them as a different element name, then. That way it can validate a fixed time string OR validate against a list of known-good "fuzzy" values. I could always Though I'd imagine "now" wouldn't be a fuzzy since it'd just be a
It absolutely is, yep! But this goes a bit beyond nesting; it's recursion. And while XML Schema can (again, if I recall) support validating recursive elements (
A Schema could validate mixed content like that just fine, but from the parsing end via LXML... I'd recommend against it. It's not without some "gotcha!" because unless you want that example to render in the GUI to the user as: This is some text Some more text, maybe bold too? you'd have to do some stripping of child elements while retaining the text component of them, which is not entirely reliable, even with the amazingness that LXML is. I'd recommend keeping them in separate elements and even perhaps displaying them to users differently, since they're their own thing. Entering them in the task is fine, but the input parser should store them separately and then they should be displayed separately in the GUI once processed in, IMHO. |
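To show where the "gotcha" with mixed content comes from, here is a short sketch of how such an element is represented after parsing. It uses the standard library's ElementTree; lxml exposes the same text/tail model. The visible_text helper is hypothetical and only illustrates the kind of stripping that would be needed.

```python
import xml.etree.ElementTree as ET

content = ET.fromstring(
    "<content>This is some text "
    "<subtask>6000caf7-6197-4d77-a50e-8bd8804c5694</subtask>"
    " Some more text, maybe <strong>bold too?</strong></content>"
)

# Text before the first child lives on the parent...
lead = content.text
# ...but text after a child lives on that child's .tail, not on the parent.
after_subtask = content.find("subtask").tail

# itertext() flattens everything, including the subtask ID.
flattened = "".join(content.itertext())


def visible_text(elem, skip=("subtask",)):
    """Hypothetical helper: text with skipped elements removed, tails kept."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in skip:
            parts.append(visible_text(child, skip))
        parts.append(child.tail or "")
    return "".join(parts)


shown = visible_text(content)
```

Every consumer of the content has to remember the tail handling, which is the extra processing cost weighed against a simpler file.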
Nope, this is how it's stored right now:
Setting a task to now gives it a "higher priority" when sorting by due dates. So it needs to be stored
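A minimal sketch of reading a due value that is either a fixed ISO 8601 timestamp or one of the known fuzzy values, as discussed above. The function name and the choice to return the fuzzy value as a plain string are assumptions for the example, not how GTG's Date class works.

```python
from datetime import datetime
from typing import Union

# The three fuzzy values named in this thread.
FUZZY_VALUES = ("now", "soon", "someday")


def parse_due(raw: str) -> Union[datetime, str]:
    """Return a datetime for fixed dates or the fuzzy keyword unchanged.

    Anything that is neither a known fuzzy value nor a valid ISO 8601
    timestamp raises ValueError (from datetime.fromisoformat).
    """
    value = raw.strip()
    if value.lower() in FUZZY_VALUES:
        return value.lower()
    return datetime.fromisoformat(value)


fixed = parse_due("2020-05-10T00:00:00")
fuzzy = parse_due("someday")
```

Keeping the two kinds of value apart at load time means sorting code can decide explicitly where "now", "soon" and "someday" go relative to real dates.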
Ah thanks, I figured there might be problems (some of which we already have). I've updated the proposal with all your suggestions |
Thanks! Will update #431 later today or tomorrow with the changes to match current proposal here! Might as well keep them in tandem. |
Slightly modified version of example follows. Namely, This means that using things like Instead of using CDATA containers, you could base64 encode/decode the See the inline comments below.
<?xml version="1.0" encoding="UTF-8"?>
<gtgData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="https://wiki.gnome.org/Apps/GTG"
appVersion="0.5"
xmlVersion="2"
xsi:schemaLocation="http://SOMEDOMAIN.TLD/SOME/PATH/TO/data.xsd">
<taglist>
<tag id="7171ff82-119a-4933-8277-a8ef5ce6a3e2" color="E9B96E" name="GTG"/>
<tag id="140f74ea-b2f1-4b0f-b72b-0e85f471bb98"
color="cdd3854e56d8"
icon="emblem-shared-symbolic.symbolic"
name="life"/>
<tag id="94669f60-2f8e-4b16-b87f-c1d46ade4536" color="c96a52131cd2" name="errands"/>
<tag id="46890bc2-c924-4146-8279-472099abc0b1" color="c96a52131cd2" name="other_errands"/>
<tag id="aeb6e795-cb65-4d89-bf80-c7ea524fcfa7" color="c96a52131cd2" name="home_renovation"/>
</taglist>
<tasklist>
<task id="2fdcd50f-0106-48b2-9f16-db2f8dbbf044" status="Active">
<title>Learn How To Use Subtasks</title>
<tags>
<tag>7171ff82-119a-4933-8277-a8ef5ce6a3e2</tag>
<tag>46890bc2-c924-4146-8279-472099abc0b1</tag>
<tag>94669f60-2f8e-4b16-b87f-c1d46ade4536</tag>
</tags>
<dates>
<addedDate>2020-04-10T20:48:11</addedDate>
<modifyDate>2020-04-10T20:37:02</modifyDate>
<startDate>2020-05-10T00:00:00</startDate>
</dates>
<!-- With the content element in a CDATA, you won't be able to detect subs automatically
if they use XML/HTML-like tagging.
Perhaps a different notation format inside content? e.g. "{! This is a subtask !}" -->
<subtasks>
<sub>bf33b248-ab96-4b99-9e40-8b60c1d7fe2e</sub>
<sub>a957c32a-6293-46f7-a305-1caccdfbe34c</sub>
</subtasks>
<content><![CDATA[<p>@GTG, @errands, @home_renovation
A "Subtask" is something that you need to do first before being able to accomplish your task. In GTG, the purpose of subtasks is to cut down a task in smaller subtasks that are easier to achieve and to track down.
To insert a subtask in the task description (this window, for instance), begin a line with "-", then write the subtask title and press Enter.
Try inserting one subtask below. Type "{! This is my first subtask! !}", for instance, and press Enter:</p>
<p>Alternatively, you can also use the "Insert Subtask" button.
Note that subtasks obey to some rules: first, a subtask's due date can never happen after its parent's due date and, second, when you mark a parent task as done, its subtasks will also be marked as done.
And if you are not happy with your current tasks/subtasks organization, you can always change it by drag-and-dropping tasks on each other in the tasks list.</p>]]></content>
</task>
<task id="bf33b248-ab96-4b99-9e40-8b60c1d7fe2e" status="Done">
<title>One subtask</title>
<!-- The following does not have a matching tag in taglist? -->
<content><![CDATA[<p>This is some test subtask with a @tag </p>]]></content>
</task>
<task id="a957c32a-6293-46f7-a305-1caccdfbe34c" status="Active">
<title>Another subtask</title>
<dates>
<addedDate>2020-04-10T20:48:11</addedDate>
<fuzzyDueDate>someday</fuzzyDueDate>
</dates>
<content/>
</task>
<!-- ... -->
</tasklist>
</gtgData>
The above validates against what I just pushed to (EDIT: I did a dumb so removed the |
Is that because of the Base64 is a no-go, since we want to keep it human friendly |
Any SGML-subset (XML, HTML, ...) syntax will trigger a validator error unless it's expected per the schema and the parent is a mixed-type, or it's in a CDATA. It's not the name of the tag so much as it being enclosed by
That is... a good question. I forgot You'll want to find some way around how you handle subtasks inline in In the proposed tagging syntax for inside CDATA'd (EDIT: better POC; it'll actually demonstrate the substitution.)
#!/usr/bin/env python3
import re
s = """This is example task text.
There's more text here.
...But suddenly, a wild {! new subtask !} appears! And {! another one !}!
And one with {! an exclamation point! !} And one {!without spaces!}! And even one {! with {} inside because why? !}
It starts with a { and ends with a }. But we only want the subtask text."""
r = re.findall(r'{!\s*(.+?)\s*!}', s)
print('ORIGINAL:')
print(s)
print()
print('FOUND:')
print(r)
for idx, subtask in enumerate(r):
    # Pretend that the list index is the new subtask's ID (a UUID4).
    # Also, I don't know how GTK renders/uses the link anchors. This should be enough to demonstrate though.
    # Raw string so that \s reaches the regex engine instead of being treated as a string escape.
    task_ptrn = r'{{!\s*{0}\s*!}}'.format(re.escape(subtask))
    task_link = '<a href="{0}">'.format(idx)
    task_html = '{0}{1}</a>'.format(task_link, subtask)
    s = re.sub(task_ptrn, task_html, s)
print('\nThis should now print the original string with links.\n')
print(s)
As shown if you run that, you can find subtasks defined in CDATA-stripped content (so it could still be rendered as HTML straight through, which might be nice from a GUI end). It'd also let users do their own formatting with HTML (I'd recommend implementing rendering limits, though. Probably don't need a
Yeah. It feels like a dirty hack and doesn't really fix the parseable-formatted-content problem anyways. |
currently matches getting-things-gnome#279. mostly (still in discussion re: CDATA vs. escaping in <content>). all uniqueness and associations applied, i think, as well.
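On the CDATA vs. escaping question, the following sketch shows that a parser hands back the same text either way. It uses the standard library's ElementTree and xml.sax.saxutils.escape; the sample body is taken from the example file earlier in this thread.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

body = '<p>This is some test subtask with a @tag </p>'

# Same content, once wrapped in CDATA and once with entities escaped.
cdata_doc = "<content><![CDATA[" + body + "]]></content>"
escaped_doc = "<content>" + escape(body) + "</content>"

cdata_text = ET.fromstring(cdata_doc).text
escaped_text = ET.fromstring(escaped_doc).text

# When ElementTree writes the element back out it uses escaping, so a
# CDATA section read this way does not survive a load/save round trip.
round_trip = ET.tostring(ET.fromstring(cdata_doc), encoding="unicode")
```

So the choice mostly affects how the file looks to a human reading it, not what the application sees after parsing.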
Just a random drive-by comment: recently with GTG 0.4's UI opening up some possibilities, I have found myself (as a user) sometimes wishing for the ability to parent more than one task to a child, I found the "single parent, many children"-only model to be a bit restrictive... so if somehow multi-parents could work, I'd love to see it happen. I just have no idea currently how that would be represented/managed in the UI, however. |
@leio brought up some interesting questions in IRC re: CDATA/escaping:
So in order:
|
At this point the file format change is basically done with only minor bugs left, so closing this. |
We can (and should) improve the file format of the local backend. A more strict structure would help us move everything into the treemodel faster, and avoid many bugs.
Current format
For reference this is what a task currently looks like:
Here's another example with the "alternative IDs":
There are several problems here:
There's a projects.xml file which has some metadata and connects the tags xml with the tasks file. Apparently the previous team had envisioned something like a projects system, where tasks were contained in a project, each project having its own backend and associated tags file.
Looks like this was never completed though 🤔
Proposed
gtg_data.xml to avoid clashing with the old files.
task-remote-ids. It's not being used at all, and some guy left a comment in the code saying he doesn't think we need them!
Tags
This is what that task would look like:
Versioning
We should always keep support for n-1 versions. This could go into a versioning module. Since we have different filenames we can try to read gtg_data.xml first; if it's not there we can try to detect projects.xml and go into the versioning code.
Feedback much appreciated!
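The detection order described in the Versioning section could look roughly like this. It is only a sketch: the function name and the returned labels are hypothetical, and the real versioning module may be organised differently.

```python
import tempfile
from pathlib import Path

NEW_FILE = "gtg_data.xml"   # file name from the proposal
OLD_FILE = "projects.xml"   # entry point of the old format


def detect_format(data_dir: Path) -> str:
    """Return which loader should run for the given data directory.

    "current" -> read gtg_data.xml directly
    "legacy"  -> hand over to the versioning/conversion code
    "empty"   -> nothing found, start with a fresh file
    """
    if (data_dir / NEW_FILE).is_file():
        return "current"
    if (data_dir / OLD_FILE).is_file():
        return "legacy"
    return "empty"


# Tiny demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    before = detect_format(folder)
    (folder / OLD_FILE).write_text("<config/>")
    with_old = detect_format(folder)
    (folder / NEW_FILE).write_text("<gtgData/>")
    with_new = detect_format(folder)
```

Checking the new file name first means an already converted profile never touches the legacy code path, even if the old files are still lying around.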