-
Thank you so much for starting this discussion @zaerl! I'll comment with specific thoughts as I read through it.
-
I'd say we can freely diverge from any choices the core importer makes. Let's aim for sensible outcomes, even if they differ from the original behavior. I'm worried that trying to satisfy all the constraints imposed by the legacy system would complicate the design.
-
Whenever the imported post has the same ID as an existing post, we can either decide:
The new importing system needs to support both use cases. It would be easy to get stuck in the weeds here, so I propose starting with a simple definition of "overwriting", e.g. upserting all the post meta defined in WXR. We could deal with complex deltas later on, once we build the site transfer protocol.
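A minimal sketch of what that simple "overwriting" could mean, assuming post meta is modeled as one value per key (WordPress allows several values per key, so this is a simplification). The function name `upsert_post_meta` is hypothetical and not part of any existing importer:

```python
def upsert_post_meta(existing, incoming):
    """Return the meta after an overwrite: keys present in the WXR win,
    keys that exist only on the target site are left untouched."""
    merged = dict(existing)   # copy, so the caller's data is not mutated
    merged.update(incoming)   # insert new keys, update keys that already exist
    return merged


existing = {"_thumbnail_id": "12", "views": "340"}
from_wxr = {"_thumbnail_id": "57", "subtitle": "Hello"}
result = upsert_post_meta(existing, from_wxr)
```

Under this definition nothing is ever deleted from the target, which is the part a later delta-based protocol would have to refine.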
-
+1 👍. While most imports will be relatively small, this assumption will enable importing 1TB export files and even continuous import of infinite streams of data. Live site-to-site sync would be one such stream.
-
What's the use case for deduplication? And do you mean pre-processing the data, or deduplicating at the DB level with upserts and such? My gut says what you say – let's avoid deduplication entirely. Deduplication is complex. I'd stick to garbage in, garbage out. If someone needs to deduplicate the imported records, they'll need to clean their data before bringing it in.
-
What would be an example of that? Do you mean a WXR file such as this one?
<!-- There is no wp:category with `scripting-languages` slug -->
<wp:category>
<wp:term_id>8</wp:term_id>
<wp:category_nicename>javascript</wp:category_nicename>
<wp:category_parent>scripting-languages</wp:category_parent>
<wp:cat_name><![CDATA[JavaScript]]></wp:cat_name>
</wp:category>
If yes, this scenario is similar to not being able to download an attachment. I wouldn't force any opinionated actions here – that's what the existing tools do. Instead, I'd expose this information to the API consumer. "Hey, the data is incomplete, what do you want to do?". We could have a few handlers, such as
Note we already do that for all the asset frontloading errors. The system won't just skip the download or pull in a placeholder image – it will leave that decision up to the runtime. In
-
Eventually we may need a decision point in the API such as "should the default category be renamed?" We should be fine, though, to just treat all the uncategorized posts as uncategorized regardless of the default category name. It should be easy enough to backtrack once this comes up.
-
I really like this choice. It turns a memory constraint into a disk space and CPU constraint, making reentrancy possible. Perhaps we could reuse that table as the vector clock eventually. By adding a
-
Note we need to store a string-based
-
Can you elaborate on this? I'm a bit confused. Inserting empty records seems to be complicating things instead of making them simpler:
-
That's only true when importing a WXR into a site that was wiped clean and has no content at all. There are a lot of WXRs out there with low IDs, e.g., post id=2, and they commonly conflict even with the default WordPress content.
-
General question: What parts of the process would be simplified if we had a globally unique ID/content hash for each entity?
-
I would love to ignore them! But the harsh reality is that we cannot take the easy way out. There are plenty of plugins storing IDs in JSON-encoded content, serialized PHP arrays, etc. in site options. We can't rely on a naive str_replace – we'd just break the data. What we can do, though, is:
VersionPress has some prior art on mapping database fields and microformats, and there are also some good specific examples of microformatted data in the URL rewriting discussion.
Yeah, remapping is an entire rabbit hole. Let's tackle the topo sort, pausing, resuming, etc. first, and once that works well, let's tag-team remapping. However, let's keep discussing and aligning here to make sure we account for the eventual remapping facilities in the overall design.
-
Would this be solved by choosing a sparse enough In general, this is similar to a big data pagination problem – I wonder if we can use similar techniques to deal with it. If not, then perhaps the placeholders approach is for the best, but I'd like to avoid it if we can.
-
For my own understanding, is there a reason why core devs would intentionally not import category hierarchy?
-
Naive question:
-
Would there be any benefit to a similar lookup table being included with the export itself so that preprocessing can just be paid once at export time?
-
Why does remapping only occur for well-formed WXR?
-
Is this a place where WP hooks could be offered so plugins can customize how these data structures are handled during export and import?
-
We can place the burden on plugins, but there will always be sites with not-as-well-written plugins. IMO, some of WP’s beauty is how unconstrained it is, but this poses problems when we want to constrain site state enough to understand and transfer it effectively. Maybe we need a way for plugins to export secondary entities for their custom data structures. For example, a site builder plugin might store different DB references and its data structure in post meta, and it might be helpful if we give the plugin an opportunity to say what an export and import of that structure should look like.
-
What is the
-
What is the new sort phase? After reading this post once, I would have guessed there is no sort phase.
-
I thought of a new strategy that works with big files (+1M) and does not have too much impact on preexisting architecture. I made a raw example in my local branch and would like to share it before adding the details. Key concepts:
In this way the :
The algorithm:
The entities are now sorted and can be accessed using Additional detail: the double
-
In a well-formed and not manually crafted WXR, this is the structure we expect to read:
metadata*
<wp:author></wp:author>* [1]
<wp:category>*
<wp:tag>*
<item>*
Root-level entities are not guaranteed to be in this strict order, and categories or terms are hierarchically in order with each other. [2]
The autoincrement ID generation on the source site (the site that exported data) is also not guaranteed to generate the same IDs as the target site. There are efforts in the WordPress core importer to maintain the same structure using the import_id field, which suggests that wp_insert_post prefer that ID. [3]
About data integrity, similarity and deduplication
In my PR #2030, I am investigating how to keep track of the new entities created in the target system (the site that imports data) to have a 1:1 structure between the two.
In the core importer, these are the phases:
[...] authors, posts, ... tags associative arrays, and processed_* arrays are created as well
[...] (term_exists())
[...] (category_exists())
[...] (post_exists()). Unless the wp_import_existing_post filter returns zero
[...] post_orphans array
This means the system will hold all the entities of the WXR in memory for a brief moment. A WXR with 1M+ posts and all their content will stress the RAM. After all the phases, the memory used by the imported entities is freed.
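For contrast, here is a rough sketch of a streaming pass that touches one entity at a time instead of collecting everything first. It is an illustration in Python, not the code from the PR; the sample document is made up, and the namespace URI is assumed to be the one used by WXR 1.2 exports:

```python
import io
import xml.etree.ElementTree as ET

WP_NS = "http://wordpress.org/export/1.2/"  # assumed WXR 1.2 namespace

SAMPLE_WXR = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
  <wp:category><wp:term_id>8</wp:term_id><wp:category_nicename>javascript</wp:category_nicename></wp:category>
  <item><title>First</title><wp:post_id>2</wp:post_id></item>
  <item><title>Second</title><wp:post_id>3</wp:post_id></item>
</channel>
</rss>"""


def stream_entities(source):
    """Yield (kind, source_id) pairs as each entity's closing tag is read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "item":
            yield ("post", int(elem.findtext(f"{{{WP_NS}}}post_id")))
            elem.clear()  # drop the entity's children and text once handled
        elif elem.tag == f"{{{WP_NS}}}category":
            yield ("category", int(elem.findtext(f"{{{WP_NS}}}term_id")))
            elem.clear()


# list() is used only to show the result; a real consumer would iterate.
entities = list(stream_entities(io.BytesIO(SAMPLE_WXR)))
```

Note that `clear()` empties each element but the parent still keeps the emptied shell, so a production parser for very large files would need a more careful cleanup strategy.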
Note
Design choice 1: do not use in-memory associative arrays
Running a WXR import multiple times will yield the same results. Once created, the entities are not modified. Both WP_Entity_Importer and the core importer will skip existing entities.
Ideally, the importer should be able to avoid deduplicating data. Categories, tags, and terms are not deduplicated. In Data Liberation trunk, the category hierarchy is not imported; in my new PR it is:
Note
Design choice 2: automatically create a category's parent if it does not exist when a category with that parent is about to be created. The parent is created using the parent slug as both its slug and its name. This does not happen with core-exported XMLs, but it can happen.
Note
Design choice 3: Category import does update existing categories. This backfills parents created on the fly and updates categories that already exist, such as uncategorized (Uncategorized), the default category, which is often translated on sites with a language other than English.
About how to keep track of the entities that have been created and potential remapping
The current implementation keeps track of the entities that have been created, and of potential remapping, using in-memory arrays such as WP_Entity_Importer::mapping, similar to the core importer's WP_Import::posts arrays. This is a perfect fit for small imports. Arrays of integers do not take much space in Zend.
Let's run a raw test to see how much memory an array of 1M integers takes on an M3 Max with PHP 8.2.25 and Zend v4.2.25:
Results:
These are raw numbers, but they give a good idea of the memory usage. The standard memory_limit is 128MB, so it is easy to exhaust the available RAM if you start saving in that array both a number and the full contents of a post, which can be of arbitrary length.
The entities are tracked for two reasons:
Note
Design choice 4: In my PR, I removed the in-memory arrays and replaced them with a database table. This is a more robust solution, and can be linked to the session ID of the import. Each row maps the minimum information.
The table has these columns:
id
session_id (see the sessions we use for saving preprocessed imports)
entity_type (comment, comment_meta, post, post_meta, term, term_meta)
entity_id (the ID of the entity in the source site)
mapped_id (the ID of the entity in the target site)
parent_id (the ID of the parent entity)
additional_id (the ID of the additional entity, if needed)
byte_offset (the byte offset of the entity in the WXR file)
sort_order (the sort order of the entity in the WXR file)
During the XML parsing, the entities are inserted into the table with the original ID and the byte offset inside the WXR file. Once the entities are imported, the mapped_id column is updated with the new ID of the entity in the target site.
Note
Design choice 5: have a pre-import step to fill the database with IDs. From my tests, this adds ~20% of computing time to all the phases except those that write to the database or download files (frontload_assets and import_entities). The difference can be noticed only with millions of rows; otherwise, it is a matter of seconds, and the running time is usually a couple of orders of magnitude below that of the download and wp_insert_* steps.
Parsing, but not importing, a file with a million entities is a quick operation. If you are curious, use this plugin I've created to generate an XML, 10k at a time: https://gist.github.com/zaerl/44dad0cd465751702d03eb58f01386e7
So, at the start of the import phase, all the original IDs are already saved in the database. When the importer is about to import an entity, it knows whether the parent, or any other related entity, has already been imported or just exists somewhere in the XML.
This also adds support for resuming the process. All rows have the session ID attached and can be read, modified, and deleted once done. In-memory arrays are lost when the process restarts and must be filled again.
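The idea can be sketched with SQLite as a stand-in for the WordPress database. Everything here is illustrative: the table name `import_map` and the helper functions are hypothetical, and only a subset of the columns listed above is modeled.

```python
import sqlite3

# Hypothetical table; additional_id and sort_order are left out for brevity.
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE import_map (
        id          INTEGER PRIMARY KEY,
        session_id  INTEGER NOT NULL,
        entity_type TEXT    NOT NULL,
        entity_id   INTEGER NOT NULL,  -- ID in the source site
        mapped_id   INTEGER,           -- ID in the target site, NULL until imported
        parent_id   INTEGER,
        byte_offset INTEGER NOT NULL,
        UNIQUE (session_id, entity_type, entity_id)
    )
    """
)


def index_entity(session_id, entity_type, entity_id, parent_id, byte_offset):
    """Pre-import pass: record that the entity exists somewhere in the file.
    OR IGNORE keeps the pass safe to re-run after an interruption."""
    db.execute(
        "INSERT OR IGNORE INTO import_map "
        "(session_id, entity_type, entity_id, parent_id, byte_offset) "
        "VALUES (?, ?, ?, ?, ?)",
        (session_id, entity_type, entity_id, parent_id, byte_offset),
    )


def mark_imported(session_id, entity_type, entity_id, mapped_id):
    """Import pass: store the ID the entity received on the target site."""
    db.execute(
        "UPDATE import_map SET mapped_id = ? "
        "WHERE session_id = ? AND entity_type = ? AND entity_id = ?",
        (mapped_id, session_id, entity_type, entity_id),
    )


def status(session_id, entity_type, entity_id):
    """Return 'imported', 'pending' (seen in the file only) or 'unknown'."""
    row = db.execute(
        "SELECT mapped_id FROM import_map "
        "WHERE session_id = ? AND entity_type = ? AND entity_id = ?",
        (session_id, entity_type, entity_id),
    ).fetchone()
    if row is None:
        return "unknown"
    return "imported" if row[0] is not None else "pending"


index_entity(1, "post", 5, 9, 1024)     # child appears first in the file
index_entity(1, "post", 9, None, 4096)  # its parent appears later
mark_imported(1, "post", 9, 120)
```

With this state, `status(1, "post", 9)` reports the parent as imported, `status(1, "post", 5)` as pending, and an ID never seen in the file as unknown. Because the rows survive a restart, a resumed session can pick up from the pending ones.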
About the remapping
Important rule: remapping IDs should never happen. But it can happen if the sites are entirely different, the target is not a brand-new one, one of the two sites deleted posts, etc. What is remapping? The target site's auto-increment ID generation does not match the source site's. So, an entity with parent X in the source site will have a parent with a different ID in the target site. The parent ID must be replaced with the new one; otherwise, the entity will be attached to a different parent.
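As a small illustration of the parent case, assuming entities are plain dictionaries and `id_map` holds source ID to target ID pairs (both are assumptions made for this sketch, as is the function name):

```python
def remap_parent(entity, id_map):
    """Return (copy_of_entity, resolved). The parent is rewritten only when
    the source parent already has a target ID; otherwise it is left unset."""
    remapped = dict(entity)
    source_parent = entity.get("post_parent", 0)
    if source_parent == 0:
        return remapped, True           # no parent, nothing to remap
    if source_parent in id_map:
        remapped["post_parent"] = id_map[source_parent]
        return remapped, True
    remapped["post_parent"] = 0         # unknown parent: do not guess
    return remapped, False


id_map = {10: 245}  # source post 10 became post 245 on the target
child, resolved = remap_parent({"ID": 11, "post_parent": 10}, id_map)
orphan, orphan_resolved = remap_parent({"ID": 12, "post_parent": 99}, id_map)
```

Keeping the original ID when the parent is not mapped would silently attach the entity to whatever happens to own that ID on the target, which is the failure described above.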
Where are the IDs saved? That is the problem. IDs in WordPress are saved in well-known places in the database. But they can also be in:
serialize()d data
How do I find the IDs? That is the problem. The importer needs to find out where the IDs are saved. We know where the standard ones are, but we can only guess for the others.
Example: the foo plugin saves an option with this content: array( 'post_id' => 10 ) (a:1:{s:7:"post_id";i:10;}). If the ID 10 is different and needs to be remapped, this data will become obsolete and the plugin will break.
Note
Design choice 6: ignore such structures. Well-written plugins should never reference entities directly by ID, but always by slug, to prevent this. We could fix well-known plugins, but it is not worth the effort.
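To show why a blind search-and-replace is not a safe shortcut for such data, here is a rough Python check on a made-up serialized payload. The validator only inspects the length prefixes of string tokens and is an assumption-laden sketch, not a full parser of the PHP serialization format:

```python
import re


def string_lengths_ok(payload):
    """Rough check: for every s:N:"..." token, the closing quote and
    semicolon must sit exactly N bytes after the opening quote."""
    data = payload.encode("utf-8")
    for match in re.finditer(rb's:(\d+):"', data):
        end = match.end() + int(match.group(1))
        if data[end:end + 2] != b'";':
            return False
    return True


# Hypothetical option value: an ID plus a title that happens to contain "10".
original = 'a:2:{s:7:"post_id";i:10;s:5:"title";s:7:"Top 10!";}'

# Naive remap of post 10 to post 1234 by plain string replacement.
naive = original.replace("10", "1234")
```

The replacement does update `i:10;` to `i:1234;`, but it also rewrites the title and leaves its `s:7` length prefix stale, so PHP would fail to unserialize the value.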
About not remapping
Not remapping is a good thing. It means the importer does not need to guess where the IDs are saved. It can directly use the IDs. This is doable with sites that are made to be imported, such as two sites that do not perform deletions or reset the autoincrement IDs, and where the two sets are overlapping or disjoint. In the overlap ($A \cap B$), our post_exists() will prevent multiple imports from continuing to add data; the importer just skips the overlapping items and adds the ones from the source data.
Note
Design choice 7: do not offer the possibility of not remapping now. But do make it the only way of importing in the near future.
À la git
Git asks you what you want to do with files changed in the upstream branch that are also changed in your local one, and we must do this as well. Imagine a brand-new site. If you don't do anything with that site, you will have a hello-world post. That post will likely be modified in the source site, with a new title, new slug, etc. We should ask the user: what do you want to do with this (post with the same ID)? Is it OK to overwrite it?
Hierarchy of entities
A category can be created before its parent, and so can a post. In the worst case, every post with ID x can refer to a post with ID N - x, where N is the number of posts in the source XML. During the investigation, I tried various things. I needed to change strategies once I approached numbers that are not "supported" by preexisting importers:
Note
Design choice 8: do not perform a DB-level sort. The sort_order field will be kept there but not used. We will import all the entities and add the parent_id only if it is already mapped, to avoid adding more complexity to the PR.
Summary:
source site user -> target site user
UI
In an ideal world, an import should add new stuff and update pre-existing entities, asking what to do when an entity has changed on both sides, with the possibility of saving the target version (as in git stash). It should also clean up all the categories/tags/meta/etc. that the target site has but the source one does not, if that is the user's will.
Addendum: what Unison does
Unison is pretty smart and has two rules:
Footnotes
[1] https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/export.php#L530
[2] https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/export.php#L216
[3] https://github.com/WordPress/wordpress-importer/blob/master/src/class-wp-import.php#L703