bug: Unable to ban numbers #23

jwoertink · 2023-05-30T18:19:44Z

Expected behavior

Using this pattern with numbers

pattern`|666|`

I expect that to be matched

Actual behavior

it's not matched

Minimal reproducible example

import {
  DataSet,
  RegExpMatcher,
  englishRecommendedTransformers,
  pattern,
} from "obscenity";

const customDataset = new DataSet();
customDataset.addPhrase((phrase) => phrase.setMetadata({ originalWord: "666" }).addPattern(pattern`|666|`));
const matcher = new RegExpMatcher({
        ...customDataset.build(),
        ...englishRecommendedTransformers
      });

matcher.getAllMatches("666").length //=> 0

Steps to reproduce

Use that code
See 0, but expected 1

Additional context

No response

Node.js version

v16.6.1

Obscenity version

0.1.1

Priority

Low
Medium
High

Terms

I agree to follow the project's Code of Conduct.
I have searched existing issues for similar reports.

The text was updated successfully, but these errors were encountered:

jo3-l · 2023-06-02T02:25:27Z

The issue you're seeing here is caused by the collapse duplicates transformer, which is included in the set of recommended transformers. After this transformer is applied, the original input 666 becomes 6 which no longer matches the pattern 666; there's a warning to this effect in the documentation of the transformer:

This transformer should be used with caution, as while it can make certain
patterns match text that wouldn't have been matched before, it can also go
the other way. For example, the pattern hello clearly matches hello, but
with this transformer, by default, hello would become helo which does
not match. In this cases, the customThresholds option can be used to
allow two ls in a row, making it leave hello unchanged.

Thus to avoid this behavior you can either do as is suggested above and add a custom threshold that avoids transforming 6, or disable the transformer entirely.

As the behavior is documented I'm inclined to mark this as not a bug, though I absolutely see how the library behaves unintuitively in this case and would be open to considering suggestions for improvement.

jwoertink · 2023-06-02T17:40:18Z

Thanks for responding! That makes sense for 666, but I think this also fails for any number. For example 8562.

To add a bit more context to this, I'm using this in a live chat filter for my site. The chat exists on a live-stream page where the streamer can set which words they'd like to ban. This came up because a streamer had set the different parts of their address as "banned words" to avoid being doxxed in real-time. So for example, "8562" is banned

const customDataset = new DataSet();
customDataset.addPhrase((phrase) => {
  return phrase.setMetadata({ originalWord: "8562" })
    .addPattern(parseRawPattern("8562"))
    .addPattern(pattern`|8562|`)
});
const matcher = new RegExpMatcher({
        ...customDataset.build(),
        ...englishRecommendedTransformers
      });

matcher.getAllMatches("8562").length //=> 0

Though, this may just be because I have my patterns wrong... You've done a really great job with adding docs, so I may just be missing something.

jo3-l · 2023-06-03T02:32:05Z

This one is really quite unfortunate. The issue is again due to one of the default transformers, which skips non-alphabetic characters; this allows patterns such as foo to match f_o_o. Hence 8562 is transformed to the empty string which matches nothing.

While this is not a bug per se I think this is extremely unintuitive behavior and I'm inclined to remove this transformer from the set of recommended ones for English text, or perhaps even removing "recommended" transformers in general. The original intent was to provide something that works nicely out of the box, but as your report shows once anything custom is added it's easy for it to become confusing. WDYT?

jo3-l · 2023-06-03T02:36:40Z

Also, on an unrelated note — would you be willing to comment on which parts of the Obscenity API you use in your project? In particular, I'm curious as to whether you use any of the more complex features (custom transformers, censors, etc.) The reason I'm asking is now that I have a bit more time to work on open source, I'm thinking about cleaning up the API a little and releasing a v1.0.0, but if users are tied to the existing API I'll hold off on breaking changes.

jwoertink · 2023-06-03T15:10:04Z

I think your idea makes sense.

As for the API, I don't get too crazy with it. There's probably a lot more I should be doing with it that I'm not currently.
For the most part, that code I pasted above is essentially it. We have 2 separate functions where the first one is our default banned words.

import {
  DataSet,
  RegExpMatcher,
  englishRecommendedTransformers,
  pattern,
} from "obscenity";

const joystickDataset = new DataSet()
  .addPhrase((phrase) => {
                // they generally follow this pattern
		return phrase
			.setMetadata({ originalWord: 'badword' })
			.addPattern(pattern`badword`)
			.addWhitelistedTerm('...') // some close related word
	})
  .setMetadata({ originalWord: 'otherword' })
			.addPattern(pattern`oth[e]rword`)
			.addPattern(pattern`|oth|`)
			.addPattern(pattern`|word|`)
			.addPattern(pattern`?therword`)
const matcher = new RegExpMatcher({
  ...joystickDataset.build(),
  ...englishRecommendedTransformers,
});
export default {
  methods: {
    isTextInViolation(text) {
      return matcher.getAllMatches(text).length > 0;
    },
  }
}

Then we have the second function where a streamer can specify which words or phrases they want banned

bannedChatWordsDataset() {
  const customDataset = new DataSet();
  this.streamer.bannedChatWords.forEach((item, idx) => {
    const word = item.toLowerCase();
    customDataset.addPhrase((phrase) => {
      return phrase
        .setMetadata({ originalWord: word })
        .addPattern(parseRawPattern(word))
        .addPattern(pattern`|${word}`)
        .addPattern(pattern`${word}|`)
    });
  });

  return customDataset;
},

bannedChatWordsMatcher() {
  return new RegExpMatcher({
    ...this.bannedChatWordsDataset.build(),
    ...englishRecommendedTransformers
  });
},
isMessageInViolationOfBannedChatWords(message) {
      return this.bannedChatWordsMatcher.getAllMatches(message).length > 0;
    },

We do still have a few bugs around this implementation. I believe there's some words where you can't say just the word, but if you use it in a sentence, then it goes through... I have to come back to that portion later.

Stoked to hear you'll have more time on this though! It's a great project ❤️

jo3-l · 2023-06-04T15:26:28Z

Thanks for the response, that's helpful!

Back to the issue: I anticipate that I'll have time to work on releasing a fix next weekend, but for now you should be able to work around it in your application by passing a custom set of transformers instead of englishRecommendedTransformers.

For #23, #46. BREAKING CHANGE: Using the default English preset, Obscenity will no longer strip non-alphabetic characters from the input text before matching. This addresses a class of egregious false negatives in previous versions (see #23), but introduces a regression where cases such as 'f u c k' (with the space) will no longer be detected by default. We expect to provide a more comprehensive fix in the next minor release. If desired, it remains possible to revert to the previous behavior by providing a custom set of transformers to the matcher.

jo3-l · 2024-08-02T22:57:04Z

Since v0.2.0 (released January), this should not be an issue as the problematic transformer is not included by default, so closing for housekeeping purposes.

jwoertink added the bug Something isn't working label May 30, 2023

jo3-l mentioned this issue Jan 5, 2024

reenable non-alphabetic stripping by default #46

Open

5 tasks

jo3-l closed this as completed Aug 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: Unable to ban numbers #23

bug: Unable to ban numbers #23

jwoertink commented May 30, 2023

jo3-l commented Jun 2, 2023

jwoertink commented Jun 2, 2023

jo3-l commented Jun 3, 2023

jo3-l commented Jun 3, 2023 •

edited

Loading

jwoertink commented Jun 3, 2023

jo3-l commented Jun 4, 2023

jo3-l commented Aug 2, 2024 •

edited

Loading

bug: Unable to ban numbers #23

bug: Unable to ban numbers #23

Comments

jwoertink commented May 30, 2023

Expected behavior

Actual behavior

Minimal reproducible example

Steps to reproduce

Additional context

Node.js version

Obscenity version

Priority

Terms

jo3-l commented Jun 2, 2023

jwoertink commented Jun 2, 2023

jo3-l commented Jun 3, 2023

jo3-l commented Jun 3, 2023 • edited Loading

jwoertink commented Jun 3, 2023

jo3-l commented Jun 4, 2023

jo3-l commented Aug 2, 2024 • edited Loading

jo3-l commented Jun 3, 2023 •

edited

Loading

jo3-l commented Aug 2, 2024 •

edited

Loading