❓ Per-field tokenization (question for the community) ❓ #2006

sanikolaev · 2024-03-26T10:11:54Z

Since the beginning, Sphinx and Manticore have not offered per-field tokenization settings (except for morphology_skip_fields, infix_fields and prefix_fields), and it seems that there hasn't been much concern about this. On the other hand, if Manticore were to introduce this functionality, it would simplify certain use cases that require different tokenization, such as:

Storing titles/descriptions along with SKU numbers (e.g., ABC-12345-S-BL).
Managing titles/descriptions and email/IP addresses in the same table.

It would be interesting to know whether the community considers it important to implement per-field tokenization settings in Manticore, similar to how it works in Elasticsearch and Solr, where tokenization settings can be specified for each field.

Furthermore, I'm curious how those who have been using Manticore for years have addressed this issue. I'm going to personally ask some Manticore users to provide feedback.


nickchomey · 2024-03-27T23:06:11Z

Could you please elaborate on what sorts of tokenization settings might be available on a per-field basis and some of the use cases/advantages for it?

sanikolaev · 2024-03-28T07:51:01Z

I think all the available tokenization settings would become per-field in this case:

charset_table
morphology
blend_chars
ignore_chars
stopwords
exceptions
wordforms

etc.
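
To make the idea concrete, here is a minimal Python sketch of what "per-field" could mean conceptually. It is not Manticore code or syntax: the `FieldSettings` class, the `tokenize` helper, and the field names `title` and `sku` are all hypothetical, and only two kinds of setting (which characters belong to a token, and stopwords) are modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class FieldSettings:
    """Hypothetical per-field settings; only a tiny subset is modeled."""
    token_chars: str = r"a-z0-9"        # characters kept inside a token
    stopwords: frozenset = frozenset()  # tokens dropped after splitting


def tokenize(text: str, settings: FieldSettings) -> list[str]:
    """Lowercase, split on anything outside token_chars, drop stopwords."""
    pattern = f"[{settings.token_chars}]+"
    tokens = re.findall(pattern, text.lower())
    return [t for t in tokens if t not in settings.stopwords]


# Hypothetical table with two fields that need different treatment.
SETTINGS = {
    "title": FieldSettings(stopwords=frozenset({"the", "a", "of"})),
    "sku": FieldSettings(token_chars=r"a-z0-9\-"),  # keep '-' inside tokens
}

doc = {"title": "The Blue Shirt", "sku": "ABC-12345-S-BL"}
tokens = {name: tokenize(value, SETTINGS[name]) for name, value in doc.items()}
print(tokens)
```

With a single table-wide setting, one of the two fields would have to accept the other's rules: either the SKU is split on every hyphen, or hyphens become part of words in titles too.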

> some of the use cases/advantages for it?

Inviting @superkelvint as I know he knows a lot about it.

unterninja · 2024-03-28T09:50:03Z

Just to make sure: the mentioned performance reduction would only apply to tables where this feature is used, not on all tables regardless of tokenization model?

sanikolaev · 2024-03-28T10:27:50Z

> the mentioned performance reduction would only apply to tables where this feature is used

The performance reduction mentioned would likely apply only to tables that utilize this feature. We would do our best to maintain the current level of performance in other aspects.

superkelvint · 2024-03-28T16:05:35Z

Common fields which require non-fulltext treatment include:

Numeric Codes and Identifiers

ISBNs: Unique identifiers for books that should be searchable in their entirety.
SSNs (Social Security Numbers): For applications that require identity verification, SSNs need exact match searching without tokenization.
Vehicle Identification Numbers (VINs): Each VIN is unique to a specific vehicle and must be searched precisely.

IDs and Part numbers

Model Numbers: "Model XR-2000" should remain unaltered for exact model searches.
SKUs: e.g. "ELEC-12345-BLU", "SHOE-98765-M-8"
ASIN (Amazon Standard Identification Numbers): Unique blocks of letters and/or numbers for identifying items on Amazon. e.g. B0825K99RP
Parts Numbers: "6E5-45371-01"
Electronic Component Identifiers: Unique codes used for electronic components in manufacturing and assembly, like resistors, capacitors, and integrated circuits, e.g. "ATMEGA328P-PU"

Internet

IP addresses
URLs
email addresses
Twitter hashtags and @ mentions: "#ThrowbackThursday" needs to be indexed as a single token for hashtag-based searches, and "@username" should be searchable as a distinct token to find mentions of specific users.
File system paths: c:\Users\MyDocuments or /home/user/documents

Legal

Legal Terms: "Ex post facto" should not be stemmed to preserve its specific legal context.
Case Names: "Roe v. Wade, 410 U.S. 113" must be tokenized as a whole entity for precise legal reference searching.
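
As a rough illustration of why such fields suffer under full-text tokenization, the Python sketch below indexes two of the SKUs above in two ways. Both tokenizers and the `build_index` helper are hypothetical stand-ins, not Manticore or Lucene behavior: `word_tokens` splits on every non-alphanumeric character, while `whole_value_token` keeps the value as a single token.

```python
import re
from collections import defaultdict


def word_tokens(text: str) -> list[str]:
    """Split on every non-alphanumeric character, like a generic text tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def whole_value_token(text: str) -> list[str]:
    """Keep the trimmed, lowercased value as one token."""
    return [text.strip().lower()]


def build_index(values, tokenizer):
    """Tiny inverted index: token -> set of 1-based document ids."""
    index = defaultdict(set)
    for doc_id, value in enumerate(values, start=1):
        for token in tokenizer(value):
            index[token].add(doc_id)
    return index


skus = ["ELEC-12345-BLU", "SHOE-98765-M-8"]
word_index = build_index(skus, word_tokens)
whole_index = build_index(skus, whole_value_token)

# A fragment matches in the word index but not in the whole-value index.
print(sorted(word_index.get("12345", set())))
print(sorted(whole_index.get("12345", set())))
# Only the complete identifier matches in the whole-value index.
print(sorted(whole_index.get("elec-12345-blu", set())))
```

In the word index the SKU has become four unrelated tokens, so any document containing "12345" or "blu" anywhere in that field becomes a candidate match. The whole-value index answers only exact identifier lookups.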

superkelvint · 2024-03-28T16:12:33Z

Perhaps also important to mention that for users planning to migrate from Lucene/Solr/Elasticsearch (like myself), not being able to specify analyzers per-field makes migrating extremely difficult because we are used to having this flexibility in Lucene-based systems and have therefore used this feature extensively.

Granted, Manticore does provide some support for this in the form of numeric, boolean, and date field types. But that is very basic compared to Lucene, and applications would very likely have to lose functionality when migrating to Manticore, which is a difficult pill to swallow.

ChrisHSandN · 2024-05-22T16:25:19Z

I came here to open a feature request for this specific feature (but spotted this post).

In our use case for Manticore, we want only a subset of our fields expanded with infixes.
We have always used dict=crc (since the early days of Sphinx), but reading the Manticore docs recently made dict=keywords sound appealing (extra wildcard characters, smaller indexes, etc.).
It was therefore disappointing to find that enabling this option removes the ability to specify the infix_fields option.

sanikolaev · 2024-05-24T03:49:17Z

@ChrisHSandN

> It was therefore disappointing to find enabling this option disabled the ability to specify infix_fields option

Do you mean you used infix_fields not just as a resource/performance optimization with dict=crc, but also to keep queries against some fields from running in infix mode (probably with expand_keywords=1)? If so, adding support for it in dict=keywords mode shouldn't be a big deal (at least it seems so to me; I'd need to check with the devs).

ChrisHSandN · 2024-05-28T10:44:42Z

@sanikolaev

> Do you mean you used infix_fields, not just as a resource/performance optimization with dict=crc

We have a large amount of data indexed and only have the resources (and requirement) to infix certain (short) selected fields.

I tested swapping one of the indexes from dict=crc to dict=keywords, and total .sp* file space increased 40%, from 3.2GB to 4.5GB (.spa + .spi went from 0.26GB to 0.46GB; as we are memory limited, these are the main limitation).

I was presuming this was due to dict=keywords infixing all the fields?
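
For intuition on why infixing more fields than necessary can be expensive, here is a small Python sketch that naively enumerates every substring of a word that is at least `min_infix_len` characters long. The `infixes` function is hypothetical, and this is only a counting illustration: it makes no claim about how dict=crc or dict=keywords actually stores infixes, and it does not confirm the presumption above.

```python
def infixes(word: str, min_infix_len: int) -> set[str]:
    """Return every distinct substring of word with at least min_infix_len characters."""
    return {
        word[start:end]
        for start in range(len(word))
        for end in range(start + min_infix_len, len(word) + 1)
    }


word = "abcdef"
found = infixes(word, 2)
print(len(found), sorted(found, key=len)[:5])
```

Under this naive scheme a six-letter word with `min_infix_len=2` already yields 15 distinct substrings, and the count grows roughly quadratically with word length, which is one reason to restrict infixing to short fields.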

sanikolaev · 2024-05-30T04:41:49Z

@ChrisHSandN

> we want only a subset of our fields expanded with infixes.
> We have always used dict=crc

Please make sure it actually worked for you. Here's an example showing infix_fields doesn't take effect with dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='crc' infix_fields='f'
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.01 sec)

--------------
select * from t where match('@f abc*')
--------------

Empty set (0.00 sec)
--- 0 out of 0 results in 0ms ---

The same with dict=keywords and min_infix_len works fine:

mysql> drop table if exists t; create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'; insert into t(id, f) values(1, 'abcdef'); select * from t where match('@f abc*');
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
create table t(f text, f2 text) dict='keywords' min_infix_len='2' infix_fields='f'
--------------

Query OK, 0 rows affected, 1 warning (0.01 sec)

--------------
insert into t(id, f) values(1, 'abcdef')
--------------

Query OK, 1 row affected (0.00 sec)

--------------
select * from t where match('@f abc*')
--------------

+------+--------+------+
| id   | f      | f2   |
+------+--------+------+
|    1 | abcdef |      |
+------+--------+------+
1 row in set (0.00 sec)
--- 1 out of 1 results in 1ms ---

The point is that you can't enable min_infix_len for dict=crc:

mysql> drop table if exists t; create table t(f text, f2 text) dict='crc' min_infix_len='2';
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(f text, f2 text) dict='crc' min_infix_len='2'
--------------

ERROR 1064 (42000): error adding table 't': RT tables support prefixes and infixes with only dict=keywords

So could it be that you thought infix_fields was working for you when it actually wasn't, and that infix search was never in effect without you noticing?

sanikolaev pinned this issue Mar 26, 2024
