Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recommendation to always setDT() an object from disk is not well-documented #6888

Closed
MichaelChirico opened this issue Mar 26, 2025 · 7 comments · Fixed by #6894
Closed

Recommendation to always setDT() an object from disk is not well-documented #6888

MichaelChirico opened this issue Mar 26, 2025 · 7 comments · Fixed by #6894

Comments

@MichaelChirico
Copy link
Member

For #6886, I was trying to find where we document this recommendation and came up short. I checked in several places:

  • FAQ vignette
  • Intro vignette
  • Reference semantics vignette
  • ?setDT
  • ?truelength

This last one is the only place where one could reasonably infer the recommendation:

\code{truelength(x)} returns the length of the vector allocated in memory. \code{length(x)} of those items are in use. Currently, it is just the list vector of column pointers that is over-allocated (i.e. \code{truelength(DT)}), not the column vectors themselves, which would in future allow fast row \code{insert()}. For tables loaded from disk however, \code{truelength} is 0 in \R 2.14.0+ (and random in \R <= 2.13.2), which is perhaps unexpected. \code{data.table} detects this state and over-allocates the loaded \code{data.table} when the next column addition occurs. All other operations on \code{data.table} (such as fast grouping and joins) do not need \code{truelength}.

But even there it basically requires foreknowledge of the issue to have distilled it correctly.

I am sure it's documented somewhere, but anyway, it should also be linked in at least one vignette, one Rd page.

@Mukulyadav2004
Copy link
Contributor

Hi @MichaelChirico

"I’d like to propose improving documentation for #6886 by:
Adding setDT() recommendations in ?setDT, ?truelength, and the FAQ vignette to ensure proper initialization after load() and explaining encoding mismatches in factor handling in the FAQ vignettes.

Any suggestions from your side or may I proceed with drafting these changes?

@MichaelChirico
Copy link
Member Author

I'm fairly certain this is already documented somewhere, so please thoroughly review the Rd /vignette documentation and top StackOverflow answers in the data.table tag to see if I missed something. There is already a pretty extensive warning produced by assign.c, IIRC.

@Mukulyadav2004
Copy link
Contributor

Mukulyadav2004 commented Mar 30, 2025

After checking through official documentation ,I still unable to find anything that recommend to setDT() an object from disk.

But below this Stack Overflow answer aligns with the truelength behavior described in the documentation.
https://stackoverflow.com/questions/28078640/adding-new-columns-to-a-data-table-by-reference-within-a-function-not-always-wor

It provides a consequence of not handling truelength == 0 and suggests to use alloc.col() or setDT() before performing by-reference modifications.

@venom1204
Copy link
Contributor

venom1204 commented Mar 30, 2025

`*.RDS` and `*.RData` are file types which can store in-memory R objects on disk efficiently. However, storing data.table into the binary file loses its column over-allocation. This isn't a big deal -- your data.table will be copied in memory on the next _by reference_ operation and throw a warning. Therefore it is recommended to call `setalloccol()` on each data.table loaded with `readRDS()` or `load()` calls.

Hi @MichaelChirico ,

I think it is a bit related section in datatable-faq.rmd that discusses how *.RDS and *.RData files lose column over-allocation when storing data.tables. It recommends using setalloccol() after loading the data, but it does not mention setDT().

Since the issue is about explicitly recommending setDT() for objects loaded from disk, should we:

  • Update the FAQ to mention setDT() alongside setalloccol(), or
  • Add this clarification in ?setDT, ?truelength, or another vignette?

@ecoRoland2
Copy link

@MichaelChirico I don't think you can document it anywhere where users will reliably find it when they need to know that they need setDT. If you are not familiar with the over-allocation machinery, that is pretty counter-intuitive. Maybe data.table could just mask (at least) load and readRDS and automatically setDT?

@MichaelChirico
Copy link
Member Author

I think the FAQ is where I'd go. Generally one of these will hit you:

if (verbose) Rprintf(_("The data.table internal attributes of this table are invalid. This is expected and normal for a data.table loaded from disk. Please remember to always setDT() immediately after loading to prevent unexpected behavior. If this table was not loaded from disk or you've already run setDT(), please report to the data.table issue tracker.\n"));

if (oldtncol==0) error(_("This data.table has either been loaded from disk (e.g. using readRDS()/load()) or constructed manually (e.g. using structure()). Please run setDT() or setalloccol() on it first (to pre-allocate space for new columns) before assigning by reference to it.")); // #2996

Masking seems workable, but I'm not personally a fan of masking {base} functions, not sure how the other maintainers feel.

I think it is a bit related section in datatable-faq.rmd

Nice find! Yes, let's extend that section to be sure the part about setDT() is clear, and then link to that FAQ from ?setDT and ?truelength.

@venom1204
Copy link
Contributor

venom1204 commented Apr 1, 2025

Thanks for the clarification, @MichaelChirico

Updating the FAQ to explicitly mention setDT() alongside setalloccol() will help ensure users are aware of this recommendation when loading data.table objects from disk. I will proceed with updating that section to clearly state this and ensure that both ?setDT and ?truelength link to the FAQ for better visibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants