Skip to content

Avoid //expr XPaths #1358

@MichaelChirico

Description

@MichaelChirico

The following currently linters use //expr:

  • any_duplicated_linter.R
  • any_is_na_linter.R
  • brace_linter.R
  • class_equals_linter.R
  • condition_message_linter.R
  • conjunct_test_linter.R
  • consecutive_stopifnot_linter.R
  • duplicate_argument_linter.R
  • equals_na_linter.R
  • expect_comparison_linter.R
  • expect_identical_linter.R
  • expect_length_linter.R
  • expect_named_linter.R
  • expect_not_linter.R
  • expect_null_linter.R
  • expect_s3_class_linter.R
  • expect_true_false_linter.R
  • expect_type_linter.R
  • fixed_regex_linter.R
  • function_left_parentheses_linter.R
  • ifelse_censor_linter.R
  • inner_combine_linter.R
  • literal_coercion_linter.R
  • missing_package_linter.R
  • nested_ifelse_linter.R
  • object_usage_linter.R
  • outer_negation_linter.R
  • package_hooks_linter.R
  • paste_linter.R
  • pipe_call_linter.R
  • redundant_ifelse_linter.R
  • regex_subset_linter.R
  • seq_linter.R
  • sprintf_linter.R
  • string_boundary_linter.R
  • strings_as_factors_linter.R
  • system_file_linter.R
  • unneeded_concatenation_linter.R
  • unused_import_linter.R
  • yoda_test_linter.R

Original issue raised

As seen in #1353, #1340, #1310, there are subtle performance implications to the way we write our XPaths.

One thing that became clear is that writing //NODE1[NODE2] is slower than //NODE2[parent::NODE1] if NODE2 is far less frequent than NODE1.

And <expr> is by far the most common node; here's a guess at the general frequency by tabulating across r-devel and my local R packages/scripts:

library(data.table)
library(knitr)

get_r_files = function(dir) list.files(dir, pattern = "\\.R$", recursive = TRUE, full.names = TRUE)
get_freq = function(f) {
  pc = tryCatch(parse(f), error = identity, warning = identity)
  if (inherits(pc, "condition")) return(NULL)
  setDT(getParseData(pc))[, .N, by = token]
}

r_files = c(get_r_files("~/github"), get_r_files("~/svn"))
token_freq = rbindlist(lapply(r_files, get_freq))
token_freq = token_freq[, .(N = sum(N)), by = token][order(-N)]
token_freq[, pct := round(100 * N/sum(N), 1)]
knitr::kable(token_freq[pct > 1])
token N pct
expr 5713829 36.4
',' 1514487 9.6
SYMBOL 1357662 8.6
'(' 1119392 7.1
')' 1119392 7.1
NUM_CONST 1086271 6.9
SYMBOL_FUNCTION_CALL 917895 5.8
STR_CONST 375598 2.4
LEFT_ASSIGN 334988 2.1
COMMENT 307772 2.0
EQ_SUB 263926 1.7
SYMBOL_SUB 254903 1.6

i.e., //expr eliminates at most 2/3 of tokens, while other tokens typically eliminate >90% of the tree.

The trade-off here is for readability. XPaths with a lot of parent::/preceding-sibling::/following-sibling:: axes tend to be less readable -- our current XPaths are fairly readable IMO. Moreover, most of our linters are built around expression-level lints, and having a comparatively small tree is the norm in that case -- I guess the overhead of iterating over expressions is usually higher than the savings from fine-tuning XPaths, and that in the presence of cacheing, performance gains will be unnoticeable in all but edge cases.

So we should proceed gently on this issue. Some ideas:

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions