Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

max_num_of_rows not taking effect in html backend #217

Closed
juliohm opened this issue Sep 5, 2023 · 15 comments
Closed

max_num_of_rows not taking effect in html backend #217

juliohm opened this issue Sep 5, 2023 · 15 comments

Comments

@juliohm
Copy link

juliohm commented Sep 5, 2023

Hi @ronisbr , hope you are doing well,

We are facing an issue with the html backend because it is materializing entire columns even when we pass the option max_num_of_rows. Is there a quickfix or workaround to avoid building the entire columns in this backend?

The text backend is working just fine and we can create very large tables and display just a few rows quickly. The problem is in the creation of the objects, not the display itself.

@ronisbr
Copy link
Owner

ronisbr commented Sep 5, 2023

Hi @juliohm !

Everything if fine here, how about you?

Can you provide a MWE? Because this option is working as expected here and also in DataFrames.jl. Take a look at this example:

julia> a = rand(10_000, 10);

julia> @time pretty_table(IOBuffer(), a; tf = tf_html_default, max_num_of_rows = 10)
  0.078409 seconds (350.23 k allocations: 24.639 MiB, 99.56% compilation time)

julia> @time pretty_table(IOBuffer(), a; tf = tf_html_default, max_num_of_rows = -1)
  0.178180 seconds (4.59 M allocations: 228.235 MiB)

PrettyTables.jl uses the same internal handling (a type I call ProcessedTable) in all backends to limit the number of data that will be rendered.

Can it be something specific to your type that somehow triggers the entire rendering when a cell is called with MIME("text/html") ?

@ronisbr
Copy link
Owner

ronisbr commented Sep 5, 2023

The last time I saw something like this was caused due to the Tables.jl implementation of the type, that was leading to entire column rendering. Take a look here please:

#162

@juliohm
Copy link
Author

juliohm commented Sep 6, 2023

I am doing fine, thanks! :)

Here is a MWE that you can try on Jupyter (I tried it on VSCode directly because it supports notebooks):

using GeoTables

georef((a=rand(1000,1000),))

It takes 43s for me.

@juliohm
Copy link
Author

juliohm commented Sep 6, 2023

@juliohm
Copy link
Author

juliohm commented Sep 6, 2023

I think it is our fault! Splatting on the _common_kwargs is causing the problem 🤦🏽‍♂️

@juliohm
Copy link
Author

juliohm commented Sep 6, 2023

But I don't know why. The _common_kwargs function returns very small objects, it shouldn't be a bottleneck.

@juliohm
Copy link
Author

juliohm commented Sep 6, 2023

@ronisbr can you please confirm the semantics of the option max_num_of_rows? We understood that this option can be used to omit rows in very tall tables to save screen space and avoid materialization of all related objects.

Apparently, if we set the maximum number to 10 and the table has 1000000 rows, no information is shown to indicate the omitted rows:

image

@ronisbr
Copy link
Owner

ronisbr commented Sep 6, 2023

Hi @juliohm !

Yes, this is the expected behavior. There is some room for improvement when using max_num_of_* in the text backend. The biggest issue is that this backend has two ways of limiting the number of data to be printed: max_num_of_* and the display size. The internal algorithm should change significantly to solve those inconsistencies.

The advice is to use max_num_of_* only in the LaTeX and HTML, and let the text backend limit the data using the display size.

I will check why it is taking too much time in that function.

@ronisbr
Copy link
Owner

ronisbr commented Sep 6, 2023

Hi @juliohm !

I understood the problem!

When you use vcrop_mode = :middle, we render the initial and the last rows. To obtain the last rows using Tables.jl API, we need to iterate the rows in each column:

it, state = iterate(rtable.table, state)

This process is taking too long for GeoTables. If you change vcrop_mode = :bottom, everything is very fast.

The vast majority of the time, the algorithm is in this function here:

https://github.com/JuliaEarth/GeoTables.jl/blob/7bc56e5135e48900861d99f4b1a5599bed1eeeb6/src/abstractgeotable.jl#L129

@juliohm
Copy link
Author

juliohm commented Sep 6, 2023

Thank you @ronisbr ! Implementing the suggestions now. Will report back if the issue is still present.

Feel free to change the title of the issue to a more appropriate name :)

@ronisbr
Copy link
Owner

ronisbr commented Sep 6, 2023

Thanks for the report! I think I can greatly improve everything if I keep a cache of the state. I will try to do some modifications here and hope it will not introduce a lot of type instabilities.

My ideia is:

  1. Every time we access a row in Tables.jl (row access), I will store the states.
  2. If we have a state close to the row we want, we just use the cache instead of iterating everything.

This algorithm will greatly improve things for this case. However, my concern is that it will allocate a lot more for the other scenarios in which the iteration is much faster.

@ronisbr
Copy link
Owner

ronisbr commented Sep 6, 2023

Hi @juliohm!

I could reduce the time from several seconds to 0.001s considering that GeoTables state in the iteration is the row number. I think we should open a issue in Tables.jl to add a information that a RowTable state is the row number, which will drastically reduce the iterations, leading to a substancial gain.

@ronisbr
Copy link
Owner

ronisbr commented Sep 6, 2023

By the way, if Tables.jl did not add this hint, I will add a keyword probably called: row_tables_jl_use_row_id_as_state, which will solve the problems for GeoTables!

@juliohm
Copy link
Author

juliohm commented Sep 6, 2023 via email

ronisbr added a commit that referenced this issue Nov 3, 2023
@ronisbr ronisbr closed this as completed Nov 3, 2023
@ronisbr
Copy link
Owner

ronisbr commented Nov 3, 2023

I think this issue is solved with the new code that uses Table.subset, leading to a huge gain when printing with middle cropping.

@ronisbr ronisbr reopened this Nov 3, 2023
@ronisbr ronisbr closed this as completed in bed73b6 Nov 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants