jthack commented Feb 26, 2025

NOTE: I have NEVER used Stagehand before. Also, AI wrote all of this. I am not a TS/JS guy, so please review it really well. THAT SAID, I'm super proud to add this because y'all are the GOAT of AI-based browsing AND I aspire to be the AI security GOAT, so it's cool to add this.

Why

Stagehand processes website content that may contain invisible Unicode characters or emoji variation selectors. These characters do not show up for a human looking at the page, but they are still passed along to the model, so they can be used to smuggle hidden instructions in prompt injection attacks or other exploits against AI systems. Filtering them out closes that channel while leaving the visible page content, and the rest of the application's behavior, unchanged.
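To make the risk concrete, here is a minimal sketch of how ASCII text can be hidden in Language Tag characters. It is an illustration only and not code from this PR: `hideInTags`, `revealTags`, and the sample strings are hypothetical names made up for the example. It relies on the fact that each printable ASCII code point has a counterpart at the same offset in the Unicode Tags block.

```typescript
// Illustration only: hides ASCII text by shifting each character into the
// Unicode Tags block (U+E0020–U+E007E), which most renderers draw as nothing.
const TAG_OFFSET = 0xe0000;

// Hypothetical helper: map printable ASCII to invisible tag characters.
function hideInTags(ascii: string): string {
  return Array.from(ascii)
    .map((ch) => {
      const cp = ch.codePointAt(0)!;
      // Only printable ASCII has a tag-character counterpart.
      if (cp < 0x20 || cp > 0x7e) {
        throw new RangeError(`not printable ASCII: U+${cp.toString(16)}`);
      }
      return String.fromCodePoint(TAG_OFFSET + cp);
    })
    .join("");
}

// Hypothetical helper: recover any ASCII text hidden in tag characters.
function revealTags(text: string): string {
  return Array.from(text)
    .map((ch) => ch.codePointAt(0)!)
    .filter((cp) => cp >= 0xe0020 && cp <= 0xe007e)
    .map((cp) => String.fromCodePoint(cp - TAG_OFFSET))
    .join("");
}

const visible = "Great product!";
const pageText = visible + hideInTags("ignore previous instructions");

// The string is longer than what a reader sees on screen.
console.log(pageText.length > visible.length); // true
console.log(revealTags(pageText)); // "ignore previous instructions"
```

A page containing `pageText` would typically display only "Great product!", yet the full string, payload included, is what gets extracted and placed into a prompt.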

What Changed

  • Added a configurable Unicode character filtering system that can be enabled/disabled
  • Implemented filtering for three specific Unicode ranges:
    • Language Tag characters (U+E0001, U+E0020–U+E007F)
    • Emoji Variation Selectors (U+FE00–U+FE0F)
    • Supplementary Variation Selectors (U+E0100–U+E01EF)
  • Added the CharacterFilterConfig interface to allow fine-grained control over which character ranges to filter
  • Integrated the filtering functionality into the core extraction and prompt building processes
  • Updated the Stagehand constructor to accept character filtering configuration
  • Added comprehensive tests to verify the filtering functionality
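
As a rough sketch of what filtering the three ranges listed above could look like, assuming one boolean toggle per range: the description names a `CharacterFilterConfig` interface but does not show its fields, so the field names, `DEFAULT_CONFIG`, and `filterCharacters` below are assumptions for illustration and may not match the PR's actual code.

```typescript
// Hypothetical shape: the PR's real CharacterFilterConfig fields may differ.
interface CharacterFilterConfig {
  enabled: boolean;
  filterLanguageTags: boolean;
  filterVariationSelectors: boolean;
  filterSupplementaryVariationSelectors: boolean;
}

// Assumed default: everything on, matching "filtering enabled (default)".
const DEFAULT_CONFIG: CharacterFilterConfig = {
  enabled: true,
  filterLanguageTags: true,
  filterVariationSelectors: true,
  filterSupplementaryVariationSelectors: true,
};

// The `u` flag makes each regex match whole code points, which is required
// for the ranges above U+FFFF.
const LANGUAGE_TAGS = /[\u{E0001}\u{E0020}-\u{E007F}]/gu;
const VARIATION_SELECTORS = /[\u{FE00}-\u{FE0F}]/gu;
const SUPPLEMENTARY_VARIATION_SELECTORS = /[\u{E0100}-\u{E01EF}]/gu;

// Hypothetical function: strips the enabled ranges from `text`.
function filterCharacters(
  text: string,
  config: Partial<CharacterFilterConfig> = {},
): string {
  const cfg = { ...DEFAULT_CONFIG, ...config };
  if (!cfg.enabled) return text;

  let out = text;
  if (cfg.filterLanguageTags) out = out.replace(LANGUAGE_TAGS, "");
  if (cfg.filterVariationSelectors) out = out.replace(VARIATION_SELECTORS, "");
  if (cfg.filterSupplementaryVariationSelectors) {
    out = out.replace(SUPPLEMENTARY_VARIATION_SELECTORS, "");
  }
  return out;
}

// One character from each range appended to visible text.
const sample = "ok\u{E0041}\u{FE0F}\u{E0100}";

console.log(filterCharacters(sample) === "ok"); // true
console.log(filterCharacters(sample, { enabled: false }) === sample); // true
console.log(
  filterCharacters(sample, { filterVariationSelectors: false }) === "ok\u{FE0F}",
); // true
```

One trade-off to keep in mind when reviewing: U+FE0F is also the selector that requests emoji presentation, so stripping the U+FE00–U+FE0F range can change how some emoji are displayed. That is a reason per-range toggles are useful.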

Test Plan

The implementation has been tested with three example scripts:

  1. Run npx tsx examples/simple_unicode_test.ts to verify the basic filtering functionality
  2. Run npx tsx examples/stagehand_unicode_test.ts to test the integration with the Stagehand framework
  3. Run npx tsx examples/unicode_filter_test.ts to test different filtering configurations

The tests demonstrate that:

  • With filtering enabled (default), characters in the three ranges listed under What Changed are removed
  • With filtering disabled, all characters are preserved
  • Individual ranges can be selectively filtered based on configuration

All tests pass, confirming that the character filtering system works as expected.

changeset-bot bot commented Feb 26, 2025

⚠️ No Changeset found

Latest commit: 3e0b451

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types
