Search: Indexing Text for Both Effective Search and Accurate Analysis #433

amotl · 2025-10-21T22:05:32Z

About

The excellent article Indexing Text for Both Effective Search and Accurate Analysis by David Norton (Home, LinkedIn, Substack) should not be left behind.

Preview

https://cratedb-guide--433.org.readthedocs.build/feature/search/fts/effective-search.html

/cc @surister

coderabbitai · 2025-10-21T22:05:46Z

Walkthrough

Adds a new "effective-search" FTS guide and updates the FTS index page layout/navigation and cards; also makes small edits to the explain docs (rubric/tag additions and guidance on reporting flaws). All changes are documentation only.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| New FTS guide `docs/feature/search/fts/effective-search.md` | Adds a new documentation guide covering indexing text for effective search and accurate analysis: CrateDB analyzers (default/similar/exact), tokenizers, token/character filters, character folding, lemmatization, spelling-correction filters (Lucene SpellChecker), processing pipeline examples, tokenizer/filter behavior, and high-level explanations. No code changes. |
| FTS index & navigation `docs/feature/search/fts/index.md` | Restructures the FTS index page: renames rubric sections (Guides → Tutorials, Articles → Explanations), replaces grid/info-card entries with card-style links to the new guide, updates toctree entries and tag groupings, and adjusts wording around product and analyzer descriptions. |
| Explain docs tweaks `docs/explain/index.md` | Adds a rubric block labeled `2018`, adds a reference tag to `effective-fulltext-search`, and expands guidance on reporting flaws (instructions referencing the tool flyout and "Suggest improvement"). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pay attention to cross-links and toctree entries in docs/feature/search/fts/index.md and the new guide to ensure nav consistency.
Verify the new guide's terminology and examples align with existing FTS docs.

Possibly related PRs

Diátaxis: Index how-to guides and tutorials #364 — Overlapping modifications to docs/feature/search/fts/index.md (index/toctree and card/link content).

Suggested labels

guidance

Suggested reviewers

surister
kneth
bmunkholm

Poem

🐰 I hopped through pages, tidy and keen,
I planted a guide where search is seen,
Cards reshuffled, tags all in tune,
A nibble of clarity under the moon. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The pull request title "Search: Indexing Text for Both Effective Search and Accurate Analysis" directly aligns with the main changes in the changeset. The PR adds a new documentation file (`effective-search.md`) that details indexing text for effective search and accurate analysis using CrateDB, and updates related navigation and index files to incorporate this new content. The title is concise, clear, and fully captures the primary objective without being vague or misleading. |
| Description Check | ✅ Passed | The pull request description is directly related to the changeset and provides meaningful context about the additions. It correctly identifies that the PR adds content referencing the article "Indexing Text for Both Effective Search and Accurate Analysis" by David Norton, includes proper attribution with links to the author's profiles and an archived copy of the article, and provides a preview URL for validation. The description is not vague or off-topic; it clearly communicates the purpose and scope of the changes. |

✨ Finishing touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch explain-effective-search

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be80745 and 0370884.

📒 Files selected for processing (3)

docs/explain/index.md (1 hunks)
docs/feature/search/fts/effective-search.md (1 hunks)
docs/feature/search/fts/index.md (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

docs/explain/index.md

🧰 Additional context used

🪛 LanguageTool

docs/feature/search/fts/effective-search.md

[grammar] ~83-~83: Ensure spelling is correct
Context: ...Indexing If a client was to search for "wlking to work", they would probably hope to g...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[grammar] ~84-~84: Ensure spelling is correct
Context: ...back like: "I walked to work", "I enjoy walkng to work", and "I walk to work every day...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[style] ~88-~88: Often, this adverbial phrase is redundant. Consider using an alternative.
Context: ...ts without other negative consequences. First of all, “walking” is spelled wrong. Second, di...

(FIRST_OF_ALL)

[style] ~159-~159: To elevate your writing, try using an alternative expression here.
Context: ...nd that the actual content of the index does not matter as long as the search results are accur...

(MATTERS_RELEVANT)

[style] ~233-~233: Consider using a different adverb to strengthen your wording.
Context: ...ords (less than 4 characters) which are completely ignored by Lucene. Our spell correctio...

(COMPLETELY_ENTIRELY)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build docs

🔇 Additional comments (5)

docs/feature/search/fts/effective-search.md (2)

1-14: Article metadata and structure look good.

The article is properly attributed to David Norton from Qualtrics, includes publication date and read time, and presents a well-organized structure with clear sections. The archive link (line 266) ensures long-term accessibility of the original source.

83-91: Intentional misspellings in examples are correctly used.

Lines 83-84 contain "wlking" and "walkng" as examples of misspelled search terms—these are intentional and serve to illustrate the problem the article addresses. No correction needed.

docs/feature/search/fts/index.md (3)

277-277: Rubric naming improvements enhance documentation clarity.

The changes from "Guides" → "Tutorials" (line 277) and "Articles" → "Explanations" (line 301) follow standard documentation taxonomy and make the content organization more explicit and intuitive.

Also applies to: 301-301

341-360: New card section is well-structured and properly integrated.

The card block follows proper RST syntax with metadata, description, footer content (after +++), and tags. The link reference to effective-fulltext-search correctly points to the label defined in effective-search.md (line 1). Tags appropriately categorize the content as Introduction-level, covering Analyzer, Tokenizer, and Plugin topics.

Please verify that the cross-reference label effective-fulltext-search is correctly resolved by running a documentation build or link checker to ensure the hyperlink functions as intended.
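A quick pre-check before a full documentation build could look like the sketch below. It assumes the label is defined as a MyST-style target (`(effective-fulltext-search)=` on its own line) in a Markdown file; the helper name `find_label` is hypothetical, and the script does not replace a Sphinx build or link checker, which remain the authoritative verification.

```python
import re
from pathlib import Path


def find_label(docs_root: str, label: str) -> list[Path]:
    """Return Markdown files under docs_root that define a MyST target `(label)=`.

    Assumption: labels use the MyST target syntax on a line of their own.
    """
    pattern = re.compile(r"^\(" + re.escape(label) + r"\)=\s*$", re.MULTILINE)
    return sorted(
        path
        for path in Path(docs_root).rglob("*.md")
        if pattern.search(path.read_text(encoding="utf-8"))
    )
```

Calling `find_label("docs", "effective-fulltext-search")` from the repository root should list exactly one file if the label is defined once; an empty list or more than one hit would be worth investigating before merging.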

370-370: Navigation entry properly added.

The new effective-search entry in the toctree (line 370) ensures the new page is included in the FTS guide navigation structure and will be discoverable in the table of contents.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

docs/feature/search/fts/effective-search.md (1)

88-88: Optional: Minor style refinements.

Static analysis suggests a few stylistic improvements (lines 88, 159, 233), but these are preferences rather than issues. The current phrasing is natural and idiomatic. If you wish to polish: consider alternatives to "first of all" for variety, and review whether "completely" and "as long as" could be replaced with more concise alternatives. These are entirely optional in a chill review.

Also applies to: 159-159, 233-233

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ef86bd and be80745.

📒 Files selected for processing (3)

docs/explain/index.md (1 hunks)
docs/feature/search/fts/effective-search.md (1 hunks)
docs/feature/search/fts/index.md (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

docs/explain/index.md

🔇 Additional comments (4)

docs/feature/search/fts/effective-search.md (2)

1-14: Excellent article header, metadata, and archival reference.

The article-info frontmatter is properly structured and the archive link to the original Qualtrics engineering blog article is correctly formatted with appropriate versioning.

Also applies to: 262-267

33-113: Strong technical depth and clear pedagogical structure.

The content progresses logically from business rationale (Why CrateDB?) through analyzer fundamentals to implementation techniques (character folding, lemmatization, spelling correction). The lemmatization comparison table and spell correction pseudocode effectively communicate complex concepts with concrete examples (e.g., Unicode apostrophes, German character folding rules, Morphy vs. stemmer accuracy).

Also applies to: 150-248
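For readers of this thread who have not opened the guide, the kind of analysis chain described above (character folding, tokenization, spelling correction) can be sketched roughly as follows. This is not the article's implementation and not CrateDB or Lucene code: `VOCABULARY` is a hypothetical term list, Python's `difflib` stands in for a real spell checker, the skip for tokens shorter than 4 characters only loosely mirrors the short-word behaviour quoted from the guide, and lemmatization and language-specific folding rules (such as the German ones) are left out.

```python
import difflib
import re
import unicodedata

# Hypothetical vocabulary; a real system would derive this from its own index terms.
VOCABULARY = ["walk", "walked", "walking", "work"]


def fold(text: str) -> str:
    """Lowercase, normalize the typographic apostrophe, and strip diacritics."""
    text = text.replace("\u2019", "'").lower()
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks left behind by decomposition (e.g. the accent of "é").
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text: str) -> list[str]:
    """Split folded text into word tokens, keeping inner apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", fold(text))


def correct(token: str, vocabulary: list[str] = VOCABULARY) -> str:
    """Return the closest vocabulary term, or the token itself if none is close."""
    if len(token) < 4:  # assumption: leave short tokens untouched
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else token


def analyze(text: str) -> list[str]:
    """Run the full sketch pipeline: fold, tokenize, then spell-correct each token."""
    return [correct(token) for token in tokenize(text)]
```

With this vocabulary, `analyze("I enjoy walkng to work")` yields `["i", "enjoy", "walking", "to", "work"]`, which is the behaviour the misspelling examples in the guide are meant to motivate.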

docs/feature/search/fts/index.md (2)

277-277: Semantic section renaming improves taxonomy consistency.

The updates from "Guides" → "Tutorials" and "Articles" → "Explanations" align with the broader documentation structure (as referenced in the PR context for docs/explain/index.md). This creates clearer semantic distinction: Tutorials are procedural/hands-on, Explanations are conceptual/deep-dive.

Also applies to: 301-301

341-360: New card entry is well-integrated with correct cross-references.

The card title, description, and link target correctly reference the new effective-search.md article. The reference label "effective-fulltext-search" at line 342 matches the file header label (verified at effective-search.md:1), and the toctree entry at line 370 correctly resolves to docs/feature/search/fts/effective-search.md. Tag assignments (Introduction, Analyzer, Tokenizer, Plugin) accurately reflect article content.

matriv

thx, not much here, since it's from an external author and must be taken as is.

Add article "Indexing Text for Both Effective Search and Accurate Analysis" by David Norton to "Explanation" section. Original source: https://web.archive.org/web/20250210021928/https://www.qualtrics.com/eng/indexing-text-for-both-effective-search-and-accurate-analysis/

bmunkholm · 2025-10-27T12:24:11Z

@moll Is there anything technical in the content that isn't already mentioned in the docs?

amotl added the `pitch` label (A feature or request about anything, content and layout.) Oct 21, 2025

amotl changed the title ~~Explain effective search~~ Explain: Indexing Text for Both Effective Search and Accurate Analysis Oct 21, 2025

amotl changed the base branch from main to explain October 21, 2025 22:07

amotl force-pushed the explain-effective-search branch from cdf7c17 to 4e8a1e1 Compare October 21, 2025 22:11

amotl changed the title ~~Explain: Indexing Text for Both Effective Search and Accurate Analysis~~ Search: Indexing Text for Both Effective Search and Accurate Analysis Oct 21, 2025

amotl force-pushed the explain-effective-search branch 2 times, most recently from 204f4fb to 395b467 Compare October 21, 2025 23:02

amotl force-pushed the explain-effective-search branch from 395b467 to 7ef86bd Compare October 21, 2025 23:35

amotl added the `cross linking` (Linking to different locations of the documentation.) and `guidance` (Matters of layout, shape, and structure.) labels Oct 21, 2025

amotl added the `reorganize` label (Moving content around, inside and between other systems.) and removed the `pitch` (A feature or request about anything, content and layout.), `cross linking` (Linking to different locations of the documentation.), and `guidance` (Matters of layout, shape, and structure.) labels Oct 22, 2025

amotl requested review from matriv and seut October 22, 2025 01:36

amotl force-pushed the explain-effective-search branch from 7ef86bd to be80745 Compare October 22, 2025 01:39

coderabbitai bot reviewed Oct 22, 2025

matriv approved these changes Oct 23, 2025

seut approved these changes Oct 23, 2025

amotl force-pushed the explain branch from 34f3b88 to 882e9db Compare October 24, 2025 02:49

Base automatically changed from explain to main October 24, 2025 18:50

amotl added 3 commits October 27, 2025 10:55

Effective Search: Hyphenation, Linewrapping, Punctuation, Typos, Wording

bf22ef9

Effective Search: Improve cross linking

0370884

amotl force-pushed the explain-effective-search branch from be80745 to 0370884 Compare October 27, 2025 09:55

amotl merged commit 4f5f615 into main Oct 27, 2025
3 checks passed

amotl deleted the explain-effective-search branch October 27, 2025 10:54

coderabbitai bot mentioned this pull request Oct 28, 2025

Storage internals: Indexing and storage #434

Merged
