Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML search: serialize search index as JSON #13097

Conversation

jayaddison
Copy link
Contributor

@jayaddison jayaddison commented Nov 2, 2024

Feature or Bugfix

  • Bugfix

Purpose

  • Prevent queries for the JavaScript class prototype reference propery name from causing errors in the HTML search JavaScript code provided in the basic theme.

Detail

  • Ensure that the user input query term is found as a named property of the relevant index objects before accessing them, using Object.hasOwnProperty.
  • Represent the HTML search indices as JavaScript Map instances instead of Object literals.
  • Serialize the search index content as JSON, so that terms named __proto__ can be serialized and loaded.

Relates

Edit: update description to reflect JavaScript Map-based approach.
Edit: update description to reflect JSON-serialization approach.

@jayaddison jayaddison added html search javascript Pull requests that update Javascript code labels Nov 2, 2024
Copy link
Contributor

@wlach wlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this would be a problem but I'm a little unclear on how the solution actually works? See comments

@@ -514,8 +514,8 @@ const Search = {
searchTerms.forEach((word) => {
const files = [];
const arr = [
{ files: terms[word], score: Scorer.term },
{ files: titleTerms[word], score: Scorer.title },
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually a little confused by how this might work, as it doesn't appear that JavaScript supports setting the __proto__ property in an object. Am I missing something? Should we be using a map to store these terms instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When parsing JSON, an object key named "__proto__" is accepted, and converted to the JavaScript object key ["__proto__"] -- distinct from the __proto__ property (a reference to a class prototype).

During read-access, introducing hasOwnProperty allows us to filter out the latter, so that we don't mistakenly retrieve the prototype class reference from terms.

I'm not certain whether the below example will help to clarify, but I find experimentation helpful to understand some of the behaviours:

> x = JSON.parse('{}')
{}
> y = JSON.parse('{"__proto__": "testing"}')
{ ['__proto__']: 'testing' }
> '__proto__' in x
true
> x.hasOwnProperty('__proto__')
false
> '__proto__' in y
true
> y.hasOwnProperty('__proto__')
true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot to respond to your Map question: that is certainly possible, although I feel that doing so might add some runtime overhead (since we'd need to load the native JS object contents from the index into respective Map objects).

As I understand it, Map objects are also more difficult to make immutable in JavaScript than Object and Array instances. Freezing a Map object doesn't prevent its private contents from being updated:

> x = {'foo': 'bar'}
{ foo: 'bar' }
> Object.freeze(x)
{ foo: 'bar' }
> x['foo'] = 'baz'
'baz'
> x['foo']
'bar'
> y = new Map([['foo', 'bar']])
Map(1) { 'foo' => 'bar' }
> Object.freeze(y)
Map(1) { 'foo' => 'bar' }
> y.set('foo', 'baz')
Map(1) { 'foo' => 'baz' }
> y.get('foo')
'baz'

So another factor is that, to the extent that I'd like to see the search index contents become immutable at load-time in future (#13098), I'd prefer not to migrate to the Map type unless there is a compelling reason to (maybe avoiding prototype pollution is that reason -- but I don't think we're exposed to that currently).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, one extra piece of optional feedback: since this is kind of counter-intuitive, you might want to add a comment on this line explaining what you're doing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, OK - I'll try to add some phrasing to explain this succinctly. Thanks for the code review!

tests/js/roots/ecmascript/index.rst Outdated Show resolved Hide resolved
The JS prototype property should _not_ be found in unrelated documents
@@ -514,8 +514,8 @@ const Search = {
searchTerms.forEach((word) => {
const files = [];
const arr = [
{ files: terms[word], score: Scorer.term },
{ files: titleTerms[word], score: Scorer.title },
{ files: terms.hasOwnProperty(word) ? terms[word] : [], score: Scorer.term },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok got it, this makes sense to me re: the property. TIL!

I'm a little bit less sure about whether the index should be immutable, but happy to defer that discussion to another day.

@AA-Turner
Copy link
Member

Would this work as an alternative?

    // in Search.setIndex()
    const frozen_index = JSON.parse(index, (key, value) => {
      typeof value === "object" && value !== null
        ? Object.freeze(value)
        : value;
    });

If we change Search.setIndex() to taking a JSON string, it seems we could use the 'reviver' function to freeze arrays and objects?

A

@jayaddison
Copy link
Contributor Author

@AA-Turner although I'd recommend some performance testing, that could be useul for #13098, yep.

@jayaddison
Copy link
Contributor Author

In fact, no, my mistake: JSON.parse may in fact be necessary to solve this robustly. The test appears only to be passing due to the fact that __proto__ is found in the objects part of the searchindex.js file. The terms entry for __proto__ is not being handled correctly.

@jayaddison
Copy link
Contributor Author

I've pushed the change to use JSON.parse and reviver -- a note of caution, though: when I was testing that against a JSON-ized version of the Python documentation searchindex.js file yesterday, I was encountering ~200ms duration within the setIndex function in my browser.

@wlach
Copy link
Contributor

wlach commented Nov 14, 2024

Personally, I'd be inclined to merge the previous version of this and address that (if at all) in a followup. Trying to "freeze" the index here feels like scope creep to me and it would be nice to get this fix in.

@jayaddison
Copy link
Contributor Author

That's reasonable, yep, and I've no strong opinions on whether we include or omit the Object.freeze aspect. However, without 462c859, I don't believe this branch would genuinely fix the reported bug.

The `JSON.dumps` call escapes single-quotes within strings; all JSON strings are enclosed by double-quotes
@jayaddison
Copy link
Contributor Author

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being in the order of ~60ms on my local machine. I'll confirm those stats in some more detail within the next day or so. I get the feeling that some of this will be coupled to the browser+implementation (Firefox) details, so I may try to compare to Chrom[e|ium] also.

Comment on lines 71 to 76
indices = ast.literal_eval(
searchindex[16:-1]
.replace('new Map', '')
.replace('null', 'None')
.replace('false', 'False')
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fragile and inelegant; it's a compromise to avoid introducing the complexity of a Python-implemented JavaScript parser dependency given the use of new Map(...) within searchindex.js files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative would be to dump the index as JSON, then reload it into the actual format we want. Would almost certainly be a little slower, but probably not disastrously so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain that in a little more detail? I think I understand what you mean, but I'd like to make sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I mean actually just writing out a JSON file and having the frontend code load it seperately via a fetch call. I'm not actually sure it would be slower, now that I think about it-- this post suggests this will be faster on the interpreter side: https://v8.dev/blog/cost-of-javascript-2019#json

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks! (I wasn't entirely sure whether you were suggesting JSON format output only for the unit test code - using it in both seems more robust).

Let's give searchindex.json a try, in that case. It'll make a few things simpler, although the searchtools.js code may lose some of the readability improvements (we'll want to begin/reintroduce using obj.hasOwnProperty again in a few places).

We may also want to be careful about where we place the fetch call so that the browser doesn't block for unnecessarily long before invoking that (perhaps especially relevant for the case of downstream themes that include their own scripting that may also run on each page load).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's give searchindex.json a try, in that case. It'll make a few things simpler, although the searchtools.js code may lose some of the readability improvements (we'll want to begin/reintroduce using obj.hasOwnProperty again in a few places).

Personally I would read the search index as JSON, then convert it into a Map (i.e. building on the work you already did).

I'd prefer to avoid multiple read passes over the search index data on the client-side, if possible. Similarly that makes me reluctant to use the reviver function in JSON.parse, and I considered it a downside of the freeze-and-deproto code I had implemented in #13098.

Aside from code clarity, do you anticipate many benefits from the transformation to Map datastructures?

Copy link
Contributor

@wlach wlach Nov 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from code clarity, do you anticipate many benefits from the transformation to Map datastructures?

Not really, my guess is that getting entries out of a map will be a little faster (since it's a datastructure purely oriented around storing and retrieving keys/values) but probably not noticeably so. I think it's at least worth exploring though. The conversion should just be a few extra lines of code after you've loaded the JSON structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mm, I agree it'd only be a few lines of code, but I don't think that it'd be worth the resulting compute resource expenditure to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally I feel inclined to disentangle the two aspects of this pull request: adding support for __proto__ as an indexed/queryable term, and preventing __proto__ queries from accessing JavaScript object prototypes.

The latter seems more important (it's a bug) - the former is a feature request that hasn't really been requested by anyone other than myself during development.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extracted the bugfix part of this changeset into #13153; I don't have any plans currently to add feature support for term/title-term matching of the query term __proto__ itself.

@jayaddison jayaddison changed the title HTML search: check for query terms presence in index term properties HTML search: represent indices using JavaScript Map instead of object literals Nov 16, 2024
@jayaddison
Copy link
Contributor Author

Without the reviver function, I recall the setIndex + JSON.parse performance overhead being in the order of ~60ms on my local machine. I'll confirm those stats in some more detail within the next day or so. I get the feeling that some of this will be coupled to the browser+implementation (Firefox) details, so I may try to compare to Chrom[e|ium] also.

I haven't gotten around to this comprehensive performance testing yet, but want to share some notes:

Initial results

Initially the native Map approach (up to fae81a0) seems similar in runtime performance to the existing/baseline approach (object literals), based on my testing using Sphinx's self-built documentation.

The page-load searchindex-evaluation time on my machine for a query for phin (~700 results) is ~30ms in Firefox, compared to a ~20ms baseline. Those are both relatively small compared to a ~0.5s duration for display/rendering of the search results, but arguably that is a pessimal query (a more-precise match would reduce the display duration).

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

Observations so far

There seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize.

I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the Map approach here (again, up to fae81a0) seems good on most criteria except for Python complexity (particularly the unit tests, but also the application serialization code is more complex than it was) and immutability (that it does not address, in my opinion). The tc39 Records-and-Tuples proposal, I think, would be near-optimal on all factors -- but we could be years away from that reaching widespread adoption.

@wlach wlach self-requested a review November 19, 2024 12:40
Copy link
Contributor

@wlach wlach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

My guess is that most of this difference (which is relatively small) would disappear with compression.

There seem to be multiple factors to balance: Python complexity, JavaScript complexity, code safety (both datastructure immutability, and also prototype access), runtime performance and index filesize.

Yep, these things are always about tradeoffs.

I'm not sure how much to explore and how to present my thoughts/findings. Generally my feeling is that the Map approach here (again, up to fae81a0) seems good on most criteria except for Python complexity (particularly the unit tests, but also the application serialization code is more complex than it was) and immutability (that it does not address, in my opinion). The tc39 Records-and-Tuples proposal, I think, would be near-optimal on all factors -- but we could be years away from that reaching widespread adoption.

I think it's a pretty good solution. I'll defer approving in case you want to experiment / document more, but I think this fixes the bug while still being reasonable to maintain.

for name, entries in sorted(data.items())
}
data_js = '{' + ','.join(f'{k}:{v}' for k, v in js_indices.items()) + '}'
return self.PREFIX + data_js + self.SUFFIX
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a jinja template for the search index? That might make this look a little better. Seems to be one of Sphinx's dependencies:

"Jinja2>=3.1",

Definitely worth adding a comment or two explaining what's going on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mulling over the suggestion; making the logic and output format of the method clearer would both be helpful, but I'd probably prefer not to include an additional technology in the mix (even though as you mention, it's already available as an application dependency).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use it if it's available and it makes things clearer. It's usage elsewhere in Sphinx is pretty extensive: https://github.com/search?q=repo%3Asphinx-doc%2Fsphinx%20jinja&type=code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(although dumping the index as JSON is perhaps a better alternative)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wlach I'm fairly sure Jinja was originally written for use in Sphinx / Flask -- they're all originally Pocco projects!

sphinx/themes/basic/static/searchtools.js Outdated Show resolved Hide resolved
Comment on lines 71 to 76
indices = ast.literal_eval(
searchindex[16:-1]
.replace('new Map', '')
.replace('null', 'None')
.replace('false', 'False')
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative would be to dump the index as JSON, then reload it into the actual format we want. Would almost certainly be a little slower, but probably not disastrously so.

@jayaddison
Copy link
Contributor Author

There is a small amount of index size bloat -- not so much from the new Map( and ) text overhead, but more from the additional Array literals (e.g. square brackets) used to format their entries instead of object literals. This is approximately 5% overhead on the test dataset; 541857 bytes compared to 516691 baseline.

My guess is that most of this difference (which is relatively small) would disappear with compression.

I should have thought to evaluate that at the same time - from a quick check with brotli - indeed, the difference becomes fairly insigificant (102833 vs 102137, less than 1%). Perhaps it's helpful for compression that the characterset becomes slightly more predictable.

@jayaddison jayaddison changed the title HTML search: represent indices using JavaScript Map instead of object literals HTML search: serialize search index as JSON Nov 24, 2024
@wlach wlach self-requested a review November 24, 2024 13:32
@jayaddison jayaddison closed this Nov 24, 2024
@jayaddison jayaddison deleted the issue-13096/html-search-handle-proto-query branch November 24, 2024 17:05
@jayaddison
Copy link
Contributor Author

That's reasonable, yep, and I've no strong opinions on whether we include or omit the Object.freeze aspect. However, without 462c859, I don't believe this branch would genuinely fix the reported bug.

This was an error in my reasoning, and it may have caused a lot of unnecessary confusion (I confused myself with it, too).

The important part of the bug to fix here, in my opinion, is that a specific query -- for the term __proto__ -- is not handled correctly and produces an error and search result entitled undefined.

My change in 462c859 introduced something tangentially related, but different -- it added support for search index representation of, and subsequent querying of, __proto__ -- and doing that unnecessarily complicates the fix, because it introduces the various other tradeoff considerations we've been discussing.

My apologies for the extra work and meandering explorations caused by this.

@wlach
Copy link
Contributor

wlach commented Nov 28, 2024

My apologies for the extra work and meandering explorations caused by this.

No worries! I think this was a useful exercise and I think we might still want some version of your later work on this. Agreed that getting a small fix in for the originally reported bug is a good incremental step worth doing on its own though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DO NOT MERGE html search javascript Pull requests that update Javascript code type:performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTML search: bug with JS-prototype-property query
3 participants