Changing serialization for knn vector from single array object to collection of floats #253
Adding results of performance tests: perf_test_results.zip
Codecov Report
@@ Coverage Diff @@
## main #253 +/- ##
============================================
+ Coverage 83.22% 83.32% +0.09%
- Complexity 865 883 +18
============================================
Files 123 127 +4
Lines 3780 3832 +52
Branches 359 361 +2
============================================
+ Hits 3146 3193 +47
- Misses 473 477 +4
- Partials 161 162 +1
| import java.util.Random; | ||
| import java.util.stream.IntStream; | ||
| public class VectorSerializerTests extends KNNTestCase { |
Adding unit tests to keep the code coverage percentage greater than or equal to the current one.
Force-pushed from e7d2593 to da398af
| * @param byteStream stream of bytes that will be used for deserialization to array of floats | ||
| * @return array of floats deserialized from the stream | ||
| * @throws IOException | ||
| * @throws ClassNotFoundException |
Do we need to have ClassNotFoundException here?
It comes from the serializer implementation, ObjectInputStream.readObject().
I dug deeper into the exceptions, and it seems we just catch them and re-throw a RuntimeException at a higher level, so I moved this logic inside the serializer and removed all exceptions from the interface method signature. I hope that makes sense.
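The approach described in this reply can be sketched as below. This is only an illustration, not the code from the PR: the interface name `FloatArrayReader`, the class name, and the helper `write` are hypothetical, and only `byteToFloatArray` mirrors a method name visible in the diff. The checked exceptions from `ObjectInputStream` are caught inside the implementation and re-thrown unchecked, so the interface signature stays free of `throws` clauses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class UncheckedSerializerSketch {

    // Hypothetical interface: no checked exceptions in the signature.
    interface FloatArrayReader {
        float[] byteToFloatArray(ByteArrayInputStream byteStream);
    }

    // Array-based reader: wraps IOException and ClassNotFoundException
    // from readObject() in an unchecked RuntimeException.
    static final FloatArrayReader ARRAY_READER = byteStream -> {
        try (ObjectInputStream in = new ObjectInputStream(byteStream)) {
            return (float[]) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    };

    // Helper for the demo: serialize a float[] as a single array object.
    static byte[] write(float[] vector) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(vector);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = write(new float[] {1.0f, 2.0f});
        float[] restored = ARRAY_READER.byteToFloatArray(new ByteArrayInputStream(data));
        System.out.println(restored.length);
    }
}
```

Callers then need no `try`/`catch` of their own, at the cost of losing the compiler's reminder that deserialization can fail.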
| * | ||
| * The OpenSearch Contributors require contributions made to | ||
| * this file be licensed under the Apache-2.0 license or a | ||
| * compatible open source license. | ||
| * | ||
| * Modifications Copyright OpenSearch Contributors. See | ||
| * GitHub history for details. |
We don't need to expand the license description.
Sure, let me revert to a shorter header. It seems this one was added automatically by the signoff command.
| /* | ||
| * SPDX-License-Identifier: Apache-2.0 | ||
| * | ||
| * The OpenSearch Contributors require contributions made to | ||
| * this file be licensed under the Apache-2.0 license or a | ||
| * compatible open source license. | ||
| * | ||
| * Modifications Copyright OpenSearch Contributors. See | ||
| * GitHub history for details. | ||
| */ |
Suggested change (replace the header above with):
| /* |
| * Copyright OpenSearch Contributors |
| * SPDX-License-Identifier: Apache-2.0 |
| */ |
| public float[] byteToFloatArray(ByteArrayInputStream byteStream) { | ||
| final byte[] vectorAsByteArray = new byte[byteStream.available()]; | ||
| byteStream.read(vectorAsByteArray, 0, byteStream.available()); | ||
| final float[] vector = new float[vectorAsByteArray.length / BYTES_IN_FLOAT]; |
Shall we move it to a variable?
Agreed, it will make the code more readable. Let me make the change.
| } | ||
| private static byte highByte(short shortValue) { | ||
| return (byte) (shortValue>>8); |
We need a constant here instead of a number.
Agreed, changing this in the next revision.
| public void testVectorSerializerFactory() { | ||
| final KNNVectorSerializer defaultSerializer = KNNVectorSerializerFactory.getDefaultSerializer(); | ||
| assertNotNull(defaultSerializer); |
Shall we also assert which default serializer type you are expecting?
I can check for the exact serializer based on the functionality it supports. The current default is for a collection of floats, so it should be able to deserialize a collection of floats without any exception and return the same original content. Is this what you're looking for in such a check?
5508fa8
Force-pushed from 186bf04 to 8e946f5
…lection of floats Signed-off-by: Martin Gaievski <gaievski@amazon.com>
- added getDefaultSerializer to Factory
- moved SerializationMode enum to a separate file
- added javadocs and comments
- adjust format, added missing endline characters
Signed-off-by: Martin Gaievski <gaievski@amazon.com>

- replace Vector by KNNVector in class names and variables
- fixed method names in Serializer interface
- replace number of bytes in float from number to constant
Signed-off-by: Martin Gaievski <gaievski@amazon.com>

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

- rework factory method getSerializerByStreamContent
- added test case for stream of unsupported content
- removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception
- simplify license header in new classes
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Force-pushed from 8e946f5 to 7a3a1a9
…lection of floats (opensearch-project#253)
* Changing serialization for knn vector from single array object to collection of floats
rev2:
* Addressing PR comments:
- added getDefaultSerializer to Factory
- moved SerializationMode enum to a separate file
- added javadocs and comments
- adjust format, added missing endline characters
rev3:
* Addressing multiple review comments:
- replace Vector by KNNVector in class names and variables
- fixed method names in Serializer interface
- replace number of bytes in float from number to constant
rev4:
* Moving new classes under index.codec.util
rev5:
* Addressing multiple review comments:
- rework factory method getSerializerByStreamContent
- added test case for stream of unsupported content
- removed exceptions from Serializer interface method's signatures, changed it to unchecked runtime exception
- simplify license header in new classes
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Previously, we serialized vectors via an array method. This was inefficient and was removed in the opensearch-project#253 PR, which launched in 1.3.0. With that, we no longer serialized new segment data via the array-based serializer. Now that it is 3.0, and because the oldest index that can be upgraded is from 1.3.0, we no longer need to handle array-based serialization, so we can remove all of it. Signed-off-by: John Mazanec <jmazane@amazon.com>
Description
Changes the logic of k-NN vector serialization/deserialization in order to decrease the memory taken by the index. The existing serialization stores a vector as a Java array, which has 27 bytes of overhead for each individual array (as per the stream protocol). In the newly implemented logic, a vector is serialized as a collection of individual numbers. During deserialization we can restore this collection to an array, inferring its length from the number of bytes in the byte stream.
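To make the size difference concrete, here is a minimal sketch comparing the two encodings. It is not the PR's implementation: the class and method names are invented for illustration, and `ByteBuffer` is just one possible way to write raw floats. With standard Java object serialization, a `float[]` costs 27 header bytes plus 4 bytes per element, while the raw encoding costs only 4 bytes per element.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

public class VectorSizeSketch {

    // Old approach: write the whole vector as one Java array object.
    static byte[] serializeAsArrayObject(float[] vector) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(vector);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // New approach: write only the raw floats, 4 bytes each, no header.
    static byte[] serializeAsFloats(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float value : vector) {
            buffer.putFloat(value);
        }
        return buffer.array();
    }

    // The vector length is not stored; it is inferred from the byte count.
    static float[] deserializeFloats(byte[] bytes) {
        float[] vector = new float[bytes.length / Float.BYTES];
        ByteBuffer.wrap(bytes).asFloatBuffer().get(vector);
        return vector;
    }

    public static void main(String[] args) {
        float[] vector = {1.5f, -2.0f, 3.25f};
        System.out.println("array object: " + serializeAsArrayObject(vector).length + " bytes");
        System.out.println("raw floats:   " + serializeAsFloats(vector).length + " bytes");
    }
}
```

For a 3-element vector the raw encoding takes 12 bytes, and the array-object encoding takes those 12 bytes plus the 27-byte header.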
The logic has been added in a backward-compatible manner for deserialization. We first read the first 27 bytes of the byte stream and identify whether this is a serialized array, based on the same stream protocol grammar. Depending on the result of the check, we apply either the old logic (deserialize as a single array object) or the new logic (deserialize as a collection of individual floats).
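One way such a check could look is sketched below. This is an assumption-laden illustration rather than the PR's code: all names are hypothetical, and instead of hand-parsing the stream grammar it compares the input against a reference header obtained by serializing an empty `float[]`. The last 4 of the 27 header bytes hold the array length, so only the first 23 bytes are compared.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class FormatDetectionSketch {

    static final int ARRAY_HEADER_LENGTH = 27;
    static final int ARRAY_LENGTH_FIELD = 4; // trailing int: element count

    // Header bytes shared by every serialized float[], excluding the length.
    static final byte[] FLOAT_ARRAY_PREFIX =
        Arrays.copyOf(writeArrayObject(new float[0]), ARRAY_HEADER_LENGTH - ARRAY_LENGTH_FIELD);

    static byte[] writeArrayObject(float[] vector) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(vector);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return bytes.toByteArray();
    }

    static boolean looksLikeSerializedArray(byte[] data) {
        if (data.length < ARRAY_HEADER_LENGTH) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(data, FLOAT_ARRAY_PREFIX.length);
        return Arrays.equals(prefix, FLOAT_ARRAY_PREFIX);
    }

    static float[] deserialize(byte[] data) {
        if (looksLikeSerializedArray(data)) {
            // Old format: a single serialized array object.
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
                return (float[]) in.readObject();
            } catch (IOException | ClassNotFoundException e) {
                throw new RuntimeException(e);
            }
        }
        // New format: raw floats, length inferred from the byte count.
        float[] vector = new float[data.length / Float.BYTES];
        ByteBuffer.wrap(data).asFloatBuffer().get(vector);
        return vector;
    }

    public static void main(String[] args) {
        float[] vector = {0.5f, 1.5f};
        byte[] oldFormat = writeArrayObject(vector);
        byte[] newFormat = ByteBuffer.allocate(8).putFloat(0.5f).putFloat(1.5f).array();
        System.out.println(Arrays.equals(deserialize(oldFormat), deserialize(newFormat)));
    }
}
```

A raw-float payload that happens to begin with the same 23 bytes would be misclassified. This sketch accepts that as very unlikely and does not guard against it.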
Performance stayed roughly the same time-wise, but we see an improvement in memory usage. Below are the results of two tests comparing the existing and new approaches. For testing we used the benchmark tool from the main k-NN branch with two data sets, Fashion-MNIST and SIFT, from here.
Diff (changed - original) for SIFT dataset:
Diff (changed - original) for Fashion-MNIST dataset:
Issues Resolved
knn_vector codec minor overhead
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.