Add squared Euclidean distance (l2_squared) in java implementation#25409

Merged
feilong-liu merged 1 commit into prestodb:master from zhichenxu-meta:export-D77157252
Jun 26, 2025

Conversation

@zhichenxu-meta
Contributor

@zhichenxu-meta zhichenxu-meta commented Jun 23, 2025

Description

This PR introduces the squared Euclidean distance (l2_squared) function between identically sized vectors represented as array(real). The l2_squared distance is commonly used to measure similarity between embeddings of multimedia data.
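
For reference, the standard definition of squared Euclidean distance between two n-element vectors can be written as follows. The symbols x, y, and n are illustrative and are not taken from the PR:

```latex
\operatorname{l2\_squared}(x, y) \;=\; \lVert x - y \rVert_2^2 \;=\; \sum_{i=1}^{n} (x_i - y_i)^2
```

Because the square root is monotonic, dropping it preserves nearest-neighbor ordering while avoiding one operation per comparison.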

Differential Revision: D77157252

Motivation and Context

We are introducing vector search capabilities into Presto, and this PR takes the first step by adding common distance functions. This functionality will enable users to perform efficient similarity searches on multimedia data.

Impact

The addition of the l2_squared distance function will enhance Presto's capabilities in handling multimedia data and enable users to perform more complex analytics tasks.

Test Plan

  • Unit tests
  • Manual tests: on test servers with various input scenarios to ensure correctness and performance.

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==
General Changes

  • Add l2_squared function to calculate the squared Euclidean distance between two identically sized vectors represented as arrays. This function supports both array(real) and array(double) input types. For more information, refer to the Euclidean distance definition.

Example Usage:

-- Using real arrays (the unsuffixed literals are cast explicitly to array(real))
SELECT l2_squared(CAST(ARRAY[1.0, 2.0, 3.0] AS array(real)), CAST(ARRAY[4.0, 5.0, 6.0] AS array(real)));
-- Returns: 27.0
-- Using double arrays
SELECT l2_squared(ARRAY[1.0E0, 2.0E0, 3.0E0], ARRAY[4.0E0, 5.0E0, 6.0E0]);
-- Returns: 27.0
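
The arithmetic behind the 27.0 result can be sketched in plain Java. This is a minimal stand-alone illustration, not the code merged in this PR: the class name `L2SquaredSketch`, the use of primitive arrays in place of Presto's array types, and the choice to accumulate in the input precision are all assumptions.

```java
// Hypothetical stand-alone sketch; NOT the Presto implementation from this PR.
final class L2SquaredSketch {
    private L2SquaredSketch() {}

    // Single-precision variant, standing in for array(real) inputs.
    // Accumulating in float is an assumption made for this sketch.
    static float l2Squared(float[] left, float[] right) {
        checkSameLength(left.length, right.length);
        float sum = 0.0f;
        for (int i = 0; i < left.length; i++) {
            float diff = left[i] - right[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Double-precision variant, standing in for array(double) inputs.
    static double l2Squared(double[] left, double[] right) {
        checkSameLength(left.length, right.length);
        double sum = 0.0;
        for (int i = 0; i < left.length; i++) {
            double diff = left[i] - right[i];
            sum += diff * diff;
        }
        return sum;
    }

    // The distance is only defined for vectors of equal length.
    private static void checkSameLength(int leftLength, int rightLength) {
        if (leftLength != rightLength) {
            throw new IllegalArgumentException("Both arrays must have the same length");
        }
    }

    public static void main(String[] args) {
        // (1-4)^2 + (2-5)^2 + (3-6)^2 = 9 + 9 + 9 = 27
        System.out.println(l2Squared(new float[] {1f, 2f, 3f}, new float[] {4f, 5f, 6f}));
        System.out.println(l2Squared(new double[] {1, 2, 3}, new double[] {4, 5, 6}));
    }
}
```

Both calls in `main` print 27.0, matching the SQL examples above.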

@zhichenxu-meta zhichenxu-meta requested a review from a team as a code owner June 23, 2025 18:28
@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 23, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: zhichenxu-meta (cae80e0)

@facebook-github-bot
Collaborator

This pull request was exported from Phabricator. Differential Revision: D77157252

@zhichenxu-meta zhichenxu-meta marked this pull request as draft June 23, 2025 18:28
@zhichenxu-meta zhichenxu-meta changed the title from "Add l2_squared java implementation" to "Feat: Add l2_squared java implementation" Jun 23, 2025
zhichenxu-meta added a commit to zhichenxu-meta/presto that referenced this pull request Jun 23, 2025
Summary:

l2Squared is commonly used to compute similarity between image and video embeddings.

Differential Revision: D77157252

@zhichenxu-meta zhichenxu-meta changed the title from "Feat: Add l2_squared java implementation" to "Feat: Add squared Euclidean distance (l2_squared) in java implementation" Jun 23, 2025
@zhichenxu-meta zhichenxu-meta changed the title from "Feat: Add squared Euclidean distance (l2_squared) in java implementation" to "Add squared Euclidean distance (l2_squared) in java implementation" Jun 23, 2025

@zhichenxu-meta zhichenxu-meta marked this pull request as ready for review June 23, 2025 22:40
zhichenxu-meta added a commit to zhichenxu-meta/presto that referenced this pull request Jun 25, 2025

@steveburnett
Contributor

  • Please add a release note, even if it's only
== NO RELEASE NOTE ==
  • Is documentation needed for this new function? If yes, the release note entry can link to the new documentation: see Formatting in the Release Note Guidelines for examples.

@zhichenxu-meta
Contributor Author

zhichenxu-meta commented Jun 25, 2025

>   • Please add a release note, even if it's only
> == NO RELEASE NOTE ==
>   • Is documentation needed for this new function? If yes, the release note entry can link to the new documentation: see Formatting in the Release Note Guidelines for examples.

Thank you. I have put == NO RELEASE NOTE == for now.

Contributor

@feilong-liu feilong-liu left a comment

Overall looks good. I left some comments.
There is also a test failure from [Meta Internal-Only Changes Check](https://github.com/prestodb/presto/pull/25409/checks?check_run_id=44781422636).

@feilong-liu feilong-liu requested a review from rschlussel June 26, 2025 00:19
zhichenxu-meta added a commit to zhichenxu-meta/presto that referenced this pull request Jun 26, 2025

@zhichenxu-meta
Contributor Author

> Overall looks good. I left some comments. There is also a test failure from [Meta Internal-Only Changes Check](https://github.com/prestodb/presto/pull/25409/checks?check_run_id=44781422636).

Feilong, thank you for the review. I have added support for array(double) and followed the convention used for other math functions. I will look into the internal checks.

@feilong-liu
Contributor

By the way, can you also add the C++ version to maintain availability on the Prestissimo engine?

@zhichenxu-meta
Contributor Author

> By the way, can you also add the C++ version to maintain availability on the Prestissimo engine?

@feilong-liu I have a separate PR for C++ support. Thank you!

@feilong-liu feilong-liu merged commit 4af65e7 into prestodb:master Jun 26, 2025
108 checks passed
@steveburnett
Contributor

> >   • Please add a release note, even if it's only
> > == NO RELEASE NOTE ==
> >   • Is documentation needed for this new function? If yes, the release note entry can link to the new documentation: see Formatting in the Release Note Guidelines for examples.
>
> Thank you. I have put == NO RELEASE NOTE == for now.

Does documentation exist for this function?

@kaikalur
Contributor

This function is not handling null values?

@zhichenxu-meta
Contributor Author

> This function is not handling null values?

Hi @kaikalur
Dense vectors represented as array(real) or array(double) usually do not contain null elements. I will create a task to add some checks. Thank you!

@kaikalur
Contributor

kaikalur commented Jul 15, 2025

> > This function is not handling null values?
>
> Hi @kaikalur Dense vectors represented as array(real) or array(double) usually do not contain null elements. I will create a task to add some checks. Thank you!

Presto UDFs are general purpose, so someone could call this function on arrays that contain nulls, and we should always keep it general. The check is also simple in Java: look at the mayHaveNulls() method in Block and error out if it returns true for either of the arrays.
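
The suggested guard could look roughly like the sketch below. It is only an illustration under stated assumptions: boxed `Double[]` arrays stand in for Presto's Block so the example is self-contained, the class name `NullCheckedL2Squared` and the helper `containsNull` are hypothetical, and the Block method is quoted as named in the comment above without being verified here.

```java
// Hypothetical stand-alone sketch of a null guard; NOT the Presto implementation.
// Boxed arrays stand in for Presto's Block, which is an assumption of this example.
final class NullCheckedL2Squared {
    private NullCheckedL2Squared() {}

    static double l2Squared(Double[] left, Double[] right) {
        if (left.length != right.length) {
            throw new IllegalArgumentException("Both arrays must have the same length");
        }
        // Error out before any arithmetic if either input carries a null element.
        if (containsNull(left) || containsNull(right)) {
            throw new IllegalArgumentException("Both arrays must not contain nulls");
        }
        double sum = 0.0;
        for (int i = 0; i < left.length; i++) {
            double diff = left[i] - right[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Linear scan used here in place of a block-level null query.
    private static boolean containsNull(Double[] values) {
        for (Double value : values) {
            if (value == null) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(l2Squared(new Double[] {1.0, 2.0, 3.0}, new Double[] {4.0, 5.0, 6.0}));
        try {
            l2Squared(new Double[] {1.0, null}, new Double[] {1.0, 2.0});
        }
        catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing fast with an explicit error avoids silently treating a null as zero, which would produce a plausible but wrong distance.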

@zhichenxu-meta
Contributor Author

> > > This function is not handling null values?
> >
> > Hi @kaikalur Dense vectors represented as array(real) or array(double) usually do not contain null elements. I will create a task to add some checks. Thank you!
>
> Presto UDFs are general purpose, so someone could call this function on arrays that contain nulls, and we should always keep it general. The check is also simple in Java: look at the mayHaveNulls() method in Block and error out if it returns true for either of the arrays.

Sounds good and thanks for the code pointer.

@kaikalur
Contributor

Also, if we expect a lot of 0's (I guess for dense vectors that won't happen?), it may be good to short-circuit on either element being 0.

@zhichenxu-meta
Contributor Author

> Also, if we expect a lot of 0's (I guess for dense vectors that won't happen?), it may be good to short-circuit on either element being 0.

Good suggestions, and thank you!

Note that we also have C++ implementations of these functions, and the C++ version relies on the FAISS library.
