RFC : Point In Time Search #1147

Closed
rajkthakur opened this issue Aug 24, 2021 · 37 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing & Search v2.4.0 'Issues and PRs related to version v2.4.0'

Comments

@rajkthakur

rajkthakur commented Aug 24, 2021

Is your feature request related to a problem? Please describe.

Today, in OpenSearch, if you run different queries against the same data set, you will likely get different results because the data is constantly changing. However, in real-world scenarios, when analyzing data or providing a consistent experience to end users, you may want query results to stay the same while the context remains the same, and to control when changes appear in the result set. You want to be able to query the same data set and paginate through it with consistent results. This is not possible with the options currently available in OpenSearch.

OpenSearch currently supports the following pagination options, each with its own limitation:

  1. Scroll API : The Scroll API cannot share its point in time context with other queries. Moreover, it only allows moving forward (to the next page) in the search: if the client sends a request for a page but fails to get a response, a subsequent retry skips the page that was retried and returns the next page in the scroll.
  2. Search After : The search_after mechanism doesn't preserve the state of the data as of when the search was issued. One can paginate using the key (search_after) and fetch subsequent pages, but results indexed after the search was issued can appear as the pagination progresses.
  3. From and size : This mechanism does not support deep pagination, since every page request requires the shard to process all previous results and then filter out the requested page, which becomes more taxing the deeper the pagination goes.

Describe the solution you'd like

Point in Time allows users to run different queries against the same data set, fixed in time. A point in time only takes into account data up until the moment it is created. None of the resources required to return the data from the initial request are modified or deleted: segments are retained even if they have already been merged away and are no longer needed for the live data set. In short, Point in Time Search lets users maintain a state that can be reused by different queries in order to achieve consistent results.
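
To make the consistency argument concrete, here is a toy model in plain Python. It does not touch OpenSearch at all: a list of integers stands in for the index, a copy of that list stands in for the point in time, and `page_after`, `paginate`, and `index_new_doc` are names invented for this illustration. It only shows why search_after over live data can surface new documents mid-pagination while a frozen view cannot.

```python
from typing import Callable, List, Optional


def page_after(docs: List[int], after: Optional[int], size: int) -> List[int]:
    """Return the next `size` values greater than `after`, in ascending order."""
    ordered = sorted(docs)
    if after is not None:
        ordered = [d for d in ordered if d > after]
    return ordered[:size]


def paginate(view: List[int], size: int,
             mutate: Optional[Callable[[], None]] = None) -> List[int]:
    """Collect every page; `mutate` runs between pages to simulate indexing."""
    seen: List[int] = []
    after: Optional[int] = None
    while True:
        page = page_after(view, after, size)
        if not page:
            return seen
        seen.extend(page)
        after = page[-1]
        if mutate is not None:
            mutate()


live = [10, 20, 30, 40]
frozen = list(live)  # stand-in for a point in time: a view fixed at creation


def index_new_doc() -> None:
    # Simulates a write that lands in the live index while we paginate.
    if 35 not in live:
        live.append(35)


live_result = paginate(live, 2, mutate=index_new_doc)
frozen_result = paginate(frozen, 2, mutate=index_new_doc)
```

Paginating the live list picks up the document indexed between pages, while paginating the frozen copy returns only what existed when the copy was taken.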

Key goals:

  1. Optimize resource consumption compared to scroll by providing a consistent, shareable view of the data set across queries. Otherwise, each individual query retains its own segments, which means more file handles, more disk, and more heap to hold segment metadata.
  2. Resilient to
    1. Network failures : allows searches to move forward with a search_after parameter
    2. Shard failures for read-only data : allows retries on other shard copies that share the same segments (Phase - II)
  3. Replaces scroll API, as a more comprehensive solution for deep pagination when used with search_after
  4. Point in Time will be supported by Asynchronous Search and Cross Cluster searches

APIs

Create Point In Time API

Unlike a Scroll, a dedicated Point in Time decouples the context from a single query and makes it reusable across arbitrary search requests by passing the Point in Time Id. A Point in Time is created with the Create Point in Time API.

POST <index>/_point_in_time?keep_alive=1m
{
   "id" : "s9O9QAIFaW5kZXgWOFVaMXFTc3pTV3lLMGE4VU42dmo4dwAWekthUVBmYnRUWk9XVzh4WW56TG5lZwAAAAAAAAAAARZQd3JkNlE4WlJicXRuS0M1VzNDaHV3BWluZGV4FjhVWjFxU3N6U1d5SzBhOFVONnZqOHcBFnpLYVFQZmJ0VFpPV1c4eFluekxuZWcAAAAAAAAAAAIWUHdyZDZROFpSYnF0bktDNVczQ2h1dwEWOFVaMXFTc3pTV3lLMGE4VU42dmo4dwAA",
   "created_time" : 1632727466283,
   "end_time" : 1632727526283
}
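
As a rough sketch of how a client might use this call, the Python below builds the request line as written in this RFC and derives the effective lifetime from the `created_time` and `end_time` fields of the sample response. The helper names `create_pit_request` and `pit_lifetime`, the index name, and the placeholder id are assumptions for illustration; no HTTP call is made, and the endpoint path may differ in a released version.

```python
import json
from datetime import timedelta
from typing import Tuple


def create_pit_request(index: str, keep_alive: str = "1m") -> Tuple[str, str]:
    """Return (method, path) for the Create Point in Time call as proposed here."""
    return "POST", f"/{index}/_point_in_time?keep_alive={keep_alive}"


def pit_lifetime(response_body: str) -> timedelta:
    """Compute how long the PIT stays alive from the epoch-millisecond fields."""
    body = json.loads(response_body)
    return timedelta(milliseconds=body["end_time"] - body["created_time"])


# Timestamps copied from the sample response above; the id is a placeholder.
sample_response = json.dumps({
    "id": "EXAMPLE_PIT_ID",
    "created_time": 1632727466283,
    "end_time": 1632727526283,
})

method, path = create_pit_request("my-index")
lifetime = pit_lifetime(sample_response)
```

With the sample timestamps, the difference between `end_time` and `created_time` is 60,000 ms, which matches the `keep_alive=1m` in the request.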

Delete Point In Time API

Points in time are automatically closed when the keep_alive elapses. However, keeping points in time open has a cost; hence, they should be closed as soon as they are no longer used in search requests. A Point in Time can also be deleted, freeing its resources before its keep alive expires, using the Delete Point in Time API.

DELETE /_point_in_time/<id>
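
One way a client could make sure a Point in Time is always released is to wrap it in a context manager. The sketch below assumes a hypothetical `send(method, path)` transport function supplied by the caller (here replaced by a fake that records calls); the paths are the ones proposed in this RFC, and `point_in_time` is a name invented for this example.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple

# Hypothetical transport: takes (method, path) and returns the parsed JSON body.
Send = Callable[[str, str], dict]


@contextmanager
def point_in_time(send: Send, index: str, keep_alive: str = "1m") -> Iterator[str]:
    """Create a PIT, hand its id to the caller, and delete it on exit."""
    created = send("POST", f"/{index}/_point_in_time?keep_alive={keep_alive}")
    pit_id = created["id"]
    try:
        yield pit_id
    finally:
        # Runs even if the body of the `with` block raises, so the PIT's
        # resources are freed without waiting for keep_alive to elapse.
        send("DELETE", f"/_point_in_time/{pit_id}")


calls: List[Tuple[str, str]] = []


def fake_send(method: str, path: str) -> dict:
    """Stand-in for a real HTTP client; records each call."""
    calls.append((method, path))
    return {"id": "pit-123"} if method == "POST" else {}


with point_in_time(fake_send, "logs") as pit_id:
    used_id = pit_id  # searches against the PIT would go here
```

The keep alive still acts as a safety net if the client crashes before the delete is sent.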

List All Active Point In Time API

A useful admin API to have is to list all active Points in Time and their keep-alives.

GET /_point_in_time

[
    {
        "id" : "point_in_time_id_1",
        "created_time" : 1632727466283,
        "end_time" : 1632727526283
    },
    {
        "id" : "point_in_time_id_2",
        "created_time" : 1674662833272,
        "end_time" : 1632727526283
    }
    ...
    ...
]

Using a Point in Time in a search request:

In the search request, we pass the Point in Time Id and, optionally, a keep alive to extend the Point In Time. (Passing a PIT id in a search request is supported in OpenSearch.)
A search request with a PIT id will not accept indices, preference, routing, or indices options, as these are already specified when the Point In Time is created.

GET /_search
{
  "pit": {
        "id":  "ID_RETURNED_FROM_CREATE_POINT_IN_TIME_REQUEST", 
        "keep_alive": "1m" //optional to extend a Point In Time
  },
  "sort": [
    {
      "name.keyword": {
        "order": "desc"
      }
    }
  ],
  "search_after" : ["Opensearch", 1] //optional to fetch further results 
}
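
A minimal sketch of a pagination loop built on the request above, in Python. The `search` argument is a hypothetical callable that takes a request body and returns a parsed search response; `build_body`, `iter_hits`, and `fake_search` are names made up for this example. It relies on each hit in a sorted search response carrying a `sort` array, whose values are passed back as `search_after` for the next page.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional

# Hypothetical search function: request body in, parsed response out.
SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def build_body(pit_id: str, search_after: Optional[List[Any]] = None,
               size: int = 2) -> Dict[str, Any]:
    """Build a search body that targets a PIT instead of an index."""
    body: Dict[str, Any] = {
        "size": size,
        "pit": {"id": pit_id, "keep_alive": "1m"},
        "sort": [{"name.keyword": {"order": "desc"}}],
    }
    if search_after is not None:
        body["search_after"] = search_after
    return body


def iter_hits(search: SearchFn, pit_id: str, size: int = 2) -> Iterator[Dict[str, Any]]:
    """Yield hits page by page until an empty page is returned."""
    after: Optional[List[Any]] = None
    while True:
        hits = search(build_body(pit_id, after, size))["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Only advance the cursor after a page has been received.
        after = hits[-1]["sort"]


NAMES = ["delta", "charlie", "bravo", "alpha"]  # already in descending order


def fake_search(body: Dict[str, Any]) -> Dict[str, Any]:
    """In-memory stand-in for the search endpoint, for demonstration only."""
    after = body.get("search_after")
    remaining = [n for n in NAMES if after is None or n < after[0]]
    page = remaining[: body["size"]]
    return {"hits": {"hits": [{"_source": {"name": n}, "sort": [n]} for n in page]}}


names = [hit["_source"]["name"] for hit in iter_hits(fake_search, "pit-123")]
```

Because the cursor moves only after a page arrives, a failed request can be retried with the same body and will return the same page, which is the retry behavior the Scroll API lacks.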
@rajkthakur rajkthakur added the enhancement Enhancement or improvement to existing feature or request label Aug 24, 2021
@stockholmux
Member

@rajkthakur Can you make this issue a little more... fleshed out? I'm just seeing the dummy text.

@anasalkouz
Member

Hi @rajkthakur, are you actively working on this issue? If yes, could you please assign it to yourself and add a comment?

@eirsep
Member

eirsep commented Oct 5, 2021

@anasalkouz I am actively working on this issue.

@stockholmux
Member

Can we mark this as a proposal?

@rajkthakur rajkthakur changed the title Point In Time Search RFC : Point In Time Search Oct 18, 2021
@eirsep
Member

eirsep commented Oct 19, 2021

I have done a small POC to check feasibility of the proposed APIs

We will not be able to provide a List All Points In Time API, as a point in time is not tied to any coordinator node.
Rather, the Point In Time ID is simply a Base64-encoded hash of the list of reader-context-to-node mappings and a UUID.
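
To illustrate why a self-describing ID removes the need for any coordinator-side registry, here is a deliberately simplified Python sketch. It is not the real wire format: the actual ID is produced by OpenSearch's own serialization, whereas this example uses JSON plus URL-safe Base64, and the names `encode_pit_id`, `decode_pit_id`, and the shard and node labels are invented.

```python
import base64
import json
import uuid
from typing import Dict


def encode_pit_id(contexts: Dict[str, str], pit_uuid: str) -> str:
    """Pack shard-to-node mappings and a UUID into one opaque string."""
    payload = json.dumps({"uuid": pit_uuid, "contexts": contexts}, sort_keys=True)
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_pit_id(pit_id: str) -> dict:
    """Recover the mappings; any node can do this without shared state."""
    return json.loads(base64.urlsafe_b64decode(pit_id.encode("ascii")))


contexts = {"shard-0": "node-A", "shard-1": "node-B"}  # illustrative values
pit_uuid = str(uuid.uuid4())
pit_id = encode_pit_id(contexts, pit_uuid)
decoded = decode_pit_id(pit_id)
```

Since everything needed to route a request is inside the ID, no single node holds a list of all IDs, which is why a list API cannot simply read them from a coordinator.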

@eirsep
Member

eirsep commented Oct 19, 2021

We will provide new information in the Nodes Stats API:
a nested object in the search section reporting the number of currently active Point In Time contexts and the total number of Point In Time contexts.

This will help keep track of point in time related statistics.

@eirsep
Member

eirsep commented Oct 19, 2021

We will provide settings to:

  1. restrict the max open point in time contexts on a node
  2. restrict the maximum keep alive allowed for points in time.
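
A small Python sketch of how these two limits might be checked when a create request arrives. The setting names, the `PitLimits` class, and the `rejection_reason` function are hypothetical and chosen only for illustration; the actual setting keys and enforcement point are not specified in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PitLimits:
    """Hypothetical per-node limits; the field names are illustrative only."""
    max_open_contexts: int
    max_keep_alive_seconds: int


def rejection_reason(limits: PitLimits, open_contexts: int,
                     keep_alive_seconds: int) -> Optional[str]:
    """Return why a create request would be rejected, or None if it is allowed."""
    if open_contexts >= limits.max_open_contexts:
        return f"too many open point in time contexts (limit {limits.max_open_contexts})"
    if keep_alive_seconds > limits.max_keep_alive_seconds:
        return f"keep alive exceeds the maximum of {limits.max_keep_alive_seconds}s"
    return None


limits = PitLimits(max_open_contexts=2, max_keep_alive_seconds=300)

allowed = rejection_reason(limits, open_contexts=1, keep_alive_seconds=60)
too_many = rejection_reason(limits, open_contexts=2, keep_alive_seconds=60)
too_long = rejection_reason(limits, open_contexts=0, keep_alive_seconds=301)
```

Together the two limits cap the worst-case number of retained segments on a node and how long any of them can be held.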

@eirsep
Member

eirsep commented Oct 19, 2021

Similar to scroll, we will support deleting all points in time by passing _all in the path of the Delete Point In Time API.

@eirsep
Member

eirsep commented Nov 2, 2021

Currently doing a POC to add an API for PIT disk utilization i.e. segments retained by PIT Ids. I am trying to put out stats similar to cat segments, but for Points In Time.

@eirsep eirsep mentioned this issue Dec 3, 2021
4 tasks
@nknize
Collaborator

nknize commented Dec 3, 2021

Segments are retained

This can get expensive. I see the objective is to optimize resource consumption ✔️ . I think this is a specialized use case for archival or time based analysis use cases and not the normal search use case so this should be configured through an Index Scoped setting.

Segment replication will heavily change this design as it will push a lot of the PIT burden on the storage layer.

@eirsep
Member

eirsep commented Dec 3, 2021

Segments are retained

This can get expensive. I see the objective is to optimize resource consumption ✔️ . I think this is a specialized use case for archival or time based analysis use cases and not the normal search use case so this should be configured through an Index Scoped setting.

@nknize
I agree that segment retention is expensive, but this is exactly what scrolls do today. The resource consumption is being optimised only in comparison to scrolls (i.e. a PIT Id can be shared across queries, while each scroll creates its own context).

(Among other things) PIT would be a replacement for scrolls and hence the pagination solution for OpenSearch, which is also an important use case to keep in mind. We would limit the number of PIT contexts that can be opened via a setting and would also provide an API to report PITs' disk consumption.

Will look into the how-to and benefits of configuring PIT through an Index Scoped setting.

Segment replication will heavily change this design as it will push a lot of the PIT burden on the storage layer.

Segments retained by PIT/Scrolls will not be replicated I think. Can you plz elaborate the caveats you see?

@nknize
Collaborator

nknize commented Dec 3, 2021

PIT would be a replacement for scrolls

👍

Can you plz elaborate the caveats you see?

A storage engine w/ versioned backups (e.g., S3 buckets w/ versioning) can be used to restore files from backups. I think this along w/ Lucene sequence IDs enables this feature without having to retain as many historic segments. A storage engine w/o versioning (e.g., local NFS, or smb) could possibly use the segment retention logic provided by this feature.

@eirsep
Member

eirsep commented Dec 6, 2021

Can you plz elaborate the caveats you see?

A storage engine w/ versioned backups (e.g., S3 buckets w/ versioning) can be used to restore files from backups. I think this along w/ Lucene sequence IDs enables this feature without having to retain as many historic segments. A storage engine w/o versioning (e.g., local NFS, or smb) could possibly use the segment retention logic provided by this feature.

Ack.
Thanks for elaborating! So, the segment replication feature would have to handle how Scrolls function, since they currently rely on segment retention (prolly by using versioned backups as you've mentioned). Hence, by extension, PITs will get handled too, as PITs simply reuse that idea.

@eirsep
Member

eirsep commented Feb 23, 2022

@CEHENKLE Can you please create a feature branch feature/point-in-time which will be used to run full test suite?

@CEHENKLE
Member

https://github.com/opensearch-project/OpenSearch/tree/feature/point-in-time created

@elfisher

elfisher commented Apr 5, 2022

@rramachand21 is this still aiming for 2.0?

@andrross andrross mentioned this issue Apr 7, 2022
5 tasks
@andrross
Member

andrross commented Apr 8, 2022

I think this is a specialized use case for archival or time based analysis use cases and not the normal search use case

I'm reading this feature as primarily a replacement/improvement over the scroll API. It seems like the primary use case is for pagination (which is currently solved by the scroll API but does have limitations). This essentially generalizes it a bit and gives semantics similar to snapshot isolation in a traditional database where you can do multiple queries within a transaction and observe a consistent view of the data. @nknize do you have any major concerns moving forward with this feature?

I do have a nitpick about the name, though, particularly the /_point_in_time API. "Point in time" is likely a phrase to be overloaded in the future, first and most obvious to me is something like "point in time restore". There also may in fact be other features to solve archival or time-based analysis requirements that would require the ability to do searches from points far in the past. I'd love to hear opinions from other folks, but given that this feature's scope is pretty narrowly focused on "establish a point in time now and allow me to do some searches against it for a very limited duration" it might be a good idea to give it a name that won't conflict/confuse with future point-in-time-related features.

@andrross
Member

andrross commented Apr 8, 2022

Scroll API cannot share point in time context with other queries

The inability to share point-in-time contexts is mentioned several times, but what is the use case for sharing these contexts? The pagination use case makes total sense but I don't think that generally requires sharing the context. The ability to restrict maximum keep-alive duration also makes a lot of sense to put a cap on worst-case resource consumption. However, that limit is likely to be in tension with the usefulness of sharing the contexts if they are short-lived, so I'm curious about the use cases that are motivating the share-ability requirement.

@Bukhtawar
Collaborator

Bukhtawar commented Apr 8, 2022

I think the major use case of share-ability is the ability to execute different types of queries and derive better insights from the same consistent view of the data. Also, some queries might fail or time out for various reasons; once a PIT is created, retries become simpler, as it allows the user to resume queries on the same view.

"Point in time" is likely a phrase to be overloaded in the future, first and most obvious to me is something like "point in time restore".

+1 on the thought, @andrross does /_search/pit or /_search/_point_in_time sound better?

@andrross
Member

andrross commented Apr 8, 2022

@Bukhtawar I definitely like /_search/_point_in_time better than just /_point_in_time.

One last naming nitpick, I do prefer spelling it out as opposed to using the acronym "pit", but either way we should be consistent. If we stick with "point_in_time" then that should be used in the search request as well.

@loretoparisi

loretoparisi commented Aug 30, 2022

@rajkthakur I'm getting this error { Message: "Your request: '/_pit' is not allowed." } when querying AWS OpenSearch, which should support releases 1.3, 1.2, 1.1, and 1.0. According to the latest AWS announcement they have deployed OS 1.3, but there aren't any details about PiT support.
Which OS version has this PR?

@dhruv16dhr

@rajkthakur I'm getting this error { Message: "Your request: '/_pit' is not allowed." } when querying AWS OpenSearch, which should support releases 1.3, 1.2, 1.1, and 1.0. According to the latest AWS announcement they have deployed OS 1.3, but there aren't any details about PiT support. Which OS version has this PR?

@loretoparisi AWS OpenSearch will support point in time search in an upcoming release. It is not supported in AWS OpenSearch 1.3. Point-in-time search will be supported in the open source OpenSearch 2.3.0 release.

@dreamer-89
Member

@rajkthakur: I see this issue is labeled for v2.3.0 release, which has code freeze today i.e. Sep 7. I see open backport PRs to 2.x. Can you please prioritize review/merge.

@dreamer-89 dreamer-89 added v2.4.0 'Issues and PRs related to version v2.4.0' and removed v2.3.0 'Issues and PRs related to version v2.3.0' labels Sep 8, 2022
@stephen-crawford
Contributor

Hi @rajkthakur, just checking in from the security team to see if there is anything you need finalized from us before the upcoming 2.4 freeze.

@anasalkouz
Member

anasalkouz commented Oct 28, 2022

@rajkthakur do you still track this for 2.4 release? code freeze on 11/3
Is there anything pending? otherwise, feel free to close it.

@dhruv16dhr

@anasalkouz Yes we are tracking this for 2.4 release. Documentation is pending, we will be closing it before 11/3.
@bharath-techie Please check with @rajkthakur and close this by next week

@bharath-techie
Contributor

Documentation PR - opensearch-project/documentation-website#1753

SingingTree added a commit to SingingTree/documentation-website that referenced this issue Apr 19, 2023
Opensearch does not currently appear to support `_shard_doc` as part of Point In Time search, so remove references to it from the documentation.

Further details:
- I don't see any reference to `_shard_doc` in the code on [Opensearch's](https://github.com/opensearch-project/OpenSearch) main branch at time of proposing the change.
- ElasticSearch added `_shard_doc` in [7.12](elastic/elasticsearch-net#5337) and it looks like it was not added as part of Opensearch's [Point In Time work](opensearch-project/OpenSearch#1147).
@loretoparisi

supported

@dhruv16dhr I've recently attempted again to use PIT using AWS OpenSearch / Kibana 2.5, but I'm getting

{ Message: "Your request: '/_pit' is not allowed." }

the currently installed version of AWS OS is

"version" : {
    "number" : "7.10.2",
    "build_snapshot" : false,
    "lucene_version" : "9.4.2",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  }
