How to Embed each dict in jsonline format #57

Open
abdul756 opened this issue May 27, 2024 · 6 comments
Labels
question Further information is requested

Comments

@abdul756

I am building a RAG app using llm-app that lists the flight offers available between a source and a destination. When a user asks, for example, "please suggest the cheapest flight between source and destination", it should show the fare and all the details of that flight.

I want to calculate the embedding vector of each dict in a JSON Lines file. How can I achieve this?

Sample format
`{"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T09:30:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T11:30:00", "carrierCode": "UK", "number": "822", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H", "segment_id": "1", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "2", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T20:30:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T22:30:00", "carrierCode": "UK", "number": "824", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H", "segment_id": "2", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "3", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T06:45:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T08:50:00", "carrierCode": "UK", "number": "828", "aircraft_code": "320", "operating_carrierCode": "UK", "segment_duration": "PT2H5M", "segment_id": "7", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "4", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T07:55:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T10:00:00", "carrierCode": "AI", "number": "571", "aircraft_code": "32N", "operating_carrierCode": "AI", "segment_duration": "PT2H5M", "segment_id": "8", "numberOfStops": 0, "blacklistedInEU": false}

{"flight_offer_id": "5", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_terminal": "4", "departure_at": "2024-06-01T15:50:00", "arrival_iataCode": "BOM", "arrival_terminal": "2", "arrival_at": "2024-06-01T17:55:00", "carrierCode": "AI", "number": "672", "aircraft_code": "32N", "operating_carrierCode": "AI", "segment_duration": "PT2H5M", "segment_id": "9", "numberOfStops": 0, "blacklistedInEU": false}
`

@abdul756 abdul756 added the question Further information is requested label May 27, 2024
@dxtrous
Member

dxtrous commented May 27, 2024

Hi @abdul756, I am not at all sure this is a case for vector search, but if you want to do that, you may want to pass "metadata_column" to your chosen indexing approach (https://pathway.com/developers/api-docs/indexing) and use "metadata_filter" in the query to be able to pass hard bounds on times, places, etc.

As for extracting data from JSON elements into columns, this very short guide explains some possible ways - UDF being the most general: https://pathway.com/developers/user-guide/types-in-pathway/json_type
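To illustrate the idea of separating what gets embedded from what gets filtered, here is a minimal plain-Python sketch. It does not use the Pathway API: `to_row` and `within_bounds` are hypothetical helpers, and the two records are trimmed copies of the sample from the question. Each JSON line is split into a text column (the candidate for embedding) and a metadata column that carries the fields suited to hard bounds.

```python
import json

# Two records copied from the sample in the question, trimmed to a few fields.
JSONL = """\
{"flight_offer_id": "1", "fare_details": 67.02, "departure_iataCode": "MAA", "departure_at": "2024-06-01T09:30:00", "arrival_iataCode": "BOM", "carrierCode": "UK", "number": "822"}
{"flight_offer_id": "4", "fare_details": 70.63, "departure_iataCode": "MAA", "departure_at": "2024-06-01T07:55:00", "arrival_iataCode": "BOM", "carrierCode": "AI", "number": "571"}
"""


def to_row(line):
    """Split one JSON line into a text column (to embed) and a metadata column (to filter on)."""
    offer = json.loads(line)
    return {
        "text": json.dumps(offer, sort_keys=True),
        "metadata": {
            "origin": offer["departure_iataCode"],
            "destination": offer["arrival_iataCode"],
            "departure_at": offer["departure_at"],
        },
    }


def within_bounds(meta, origin, destination, earliest, latest):
    """Hard filter on places and times (hypothetical stand-in for a metadata filter)."""
    # ISO 8601 timestamps in the same format compare correctly as plain strings.
    return (
        meta["origin"] == origin
        and meta["destination"] == destination
        and earliest <= meta["departure_at"] <= latest
    )


rows = [to_row(line) for line in JSONL.splitlines() if line.strip()]
morning = [
    r for r in rows
    if within_bounds(r["metadata"], "MAA", "BOM", "2024-06-01T06:00:00", "2024-06-01T08:00:00")
]
```

With these two records, only the 07:55 departure falls inside the 06:00 to 08:00 window, so the hard bounds narrow the candidates before any similarity search would run.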

@abdul756
Author

I will try and let you know. If I face any problem, please help me.

@abdul756
Author

abdul756 commented May 29, 2024

Hi @dxtrous, here is my table:
` | price | itinearies
^6A0QZMJ... | "104.10" | [{"duration": "PT10H", "segments": [{"aircraft": {"code": "321"}, "arrival": {"at": "2024-06-01T14:30:00", "iataCode": "CJB"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-01T13:20:00", "iataCode": "MAA", "terminal": "4"}, "duration": "PT1H10M", "id": "3", "number": "429", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}, {"aircraft": {"code": "32N"}, "arrival": {"at": "2024-06-01T23:20:00", "iataCode": "BOM", "terminal": "2"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-01T21:35:00", "iataCode": "CJB"}, "duration": "PT1H45M", "id": "4", "number": "662", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}]}]

^SN0FH7F... | "104.10" | [{"duration": "PT21H35M", "segments": [{"aircraft": {"code": "321"}, "arrival": {"at": "2024-06-01T14:30:00", "iataCode": "CJB"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-01T13:20:00", "iataCode": "MAA", "terminal": "4"}, "duration": "PT1H10M", "id": "88", "number": "429", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}, {"aircraft": {"code": "32N"}, "arrival": {"at": "2024-06-02T10:55:00", "iataCode": "BOM", "terminal": "2"}, "blacklistedInEU": false, "carrierCode": "AI", "departure": {"at": "2024-06-02T09:00:00", "iataCode": "CJB"}, "duration": "PT1H55M", "id": "89", "number": "608", "numberOfStops": 0, "operating": {"carrierCode": "AI"}}]}]

^9KM937R... | "125.11" | [{"duration": "PT1H50M", "segments": [{"aircraft": {"code": "737"}, "arrival": {"at": "2024-06-01T22:50:00", "iataCode": "BOM", "terminal": "1"}, "blacklistedInEU": false, "carrierCode": "SG", "departure": {"at": "2024-06-01T21:00:00", "iataCode": "MAA", "terminal": "1"}, "duration": "PT1H50M", "id": "100", "number": "681", "numberOfStops": 0, "operating": {"carrierCode": "SG"}}]}]`

Now, if a user asks any question related to flights, which index should I use? For example, if a user asks "please get me the details of the cheapest flight" or "the most expensive flight", it should display all the details from the itinearies column based on duration. There are so many indexing algorithms at https://pathway.com/developers/api-docs/indexing. Please help me choose the best one for my use case, and explain how the data column and the metadata column should be selected for the table I provided.

@zxqfd555-pw
Contributor

Hi Abdul,

You may start with the KNN LSH index for a first attempt at indexing. Once you have the whole process up and running, it may make sense to compare different indexes against each other to fine-tune the application.

In the scenario you describe, you will also need an embedder to embed these JSONs containing information about flights. Some embedders are already provided, but alternatively you can implement your own embedder as a UDF that takes a string or JSON and returns its embedding as a vector of floats. Please note that there is no native embedder here: this task requires you to use a third-party API, like the one from OpenAI.
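The contract of such an embedder (string or JSON in, vector of floats out) can be sketched in plain Python. This is only an illustration under loud assumptions: `toy_embed` and `embed_offer` are hypothetical names, the hashing trick below is a stand-in with no semantic quality, and a real application would call a third-party embedding model instead. Wrapping the function as a Pathway UDF is not shown.

```python
import hashlib
import json
import math

DIM = 16  # arbitrary dimension chosen for the illustration


def toy_embed(text: str) -> list[float]:
    """Map a string to a fixed-size, L2-normalised vector of floats.

    Stand-in only: each whitespace-separated token is hashed into one of
    DIM buckets. It is deterministic but carries no real semantics.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def embed_offer(offer: dict) -> list[float]:
    """Serialise the whole JSON payload with stable key order, then embed it."""
    return toy_embed(json.dumps(offer, sort_keys=True))


vector = embed_offer({"fare_details": 67.02, "departure_iataCode": "MAA", "arrival_iataCode": "BOM"})
```

Whatever model replaces `toy_embed`, the shape of the function stays the same: one payload in, one fixed-length list of floats out.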

Also, as Adrian mentioned above, this case may not fit vector search. Once you have the embeddings and an index that can be queried, there is no guarantee that the index will return the cheapest flight details for the given endpoints and date. While you could probably improve it with a RAG technique, it looks much more like a graph problem, where the combination of an airport and a timeslot (00:00-01:00, 01:00-02:00, etc.) is a node and a flight between two airports is an edge. Therefore, if the vector search results don't suit you, it makes sense to look at the problem from this angle.
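A rough sketch of that graph view, in plain Python and under stated assumptions: nodes are (airport, hourly timeslot) pairs, each flight is an edge weighted by its fare, and zero-cost "wait" edges link consecutive timeslots at the same airport so that connections can be chained. The three flights are taken from the sample in the question; the function names and the Dijkstra-style search are illustrative choices, not something prescribed in this thread.

```python
import heapq
from datetime import datetime

# (fare, origin, departure, destination, arrival, flight) from the sample offers.
FLIGHTS = [
    (67.02, "MAA", "2024-06-01T09:30:00", "BOM", "2024-06-01T11:30:00", "UK822"),
    (67.02, "MAA", "2024-06-01T20:30:00", "BOM", "2024-06-01T22:30:00", "UK824"),
    (70.63, "MAA", "2024-06-01T07:55:00", "BOM", "2024-06-01T10:00:00", "AI571"),
]


def slot(timestamp):
    """Round a timestamp down to its hourly timeslot, e.g. '2024-06-01T09'."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m-%dT%H")


def build_graph(flights):
    """Build adjacency lists: (airport, slot) -> [(next_node, cost, label)]."""
    graph = {}
    for fare, src, dep, dst, arr, name in flights:
        graph.setdefault((src, slot(dep)), []).append(((dst, slot(arr)), fare, name))
    # Collect every node, then add zero-cost waiting edges per airport.
    nodes = set(graph)
    for edges in list(graph.values()):
        nodes.update(target for target, _, _ in edges)
    by_airport = {}
    for airport, s in nodes:
        by_airport.setdefault(airport, []).append(s)
    for airport, slots in by_airport.items():
        slots.sort()  # slot strings sort chronologically
        for earlier, later in zip(slots, slots[1:]):
            graph.setdefault((airport, earlier), []).append(((airport, later), 0.0, "wait"))
    return graph


def cheapest(graph, origin, destination, not_before):
    """Dijkstra over the time-expanded graph; returns (total_fare, flights) or None."""
    heap = [(0.0, node, []) for node in graph if node[0] == origin and node[1] >= not_before]
    heapq.heapify(heap)
    seen = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node[0] == destination:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, fare, name in graph.get(node, []):
            if nxt not in seen:
                step = [] if name == "wait" else [name]
                heapq.heappush(heap, (cost + fare, nxt, path + step))
    return None


graph = build_graph(FLIGHTS)
result = cheapest(graph, "MAA", "BOM", "2024-06-01T00")
```

With only direct flights in the data, the search reduces to picking the minimum fare, but the same structure would handle connecting itineraries once per-leg fares are available.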

@abdul756
Author

@zxqfd555-pw I am using an embedder from OpenAI. For example, I am using [pw.indexing.DataIndex(data_table, inner_index, embedder=None)](https://pathway.com/developers/api-docs/indexing#pathway.stdlib.indexing.DataIndex). I just need to know how to pass the inner index: will it be just the price, or will it include itinearies? And how do I use metadata_filter in this case?

@zxqfd555-pw
Contributor

The metadata filter would be needed if you index a set of files and would like the index to perform requests only on a specific subset matching a certain pattern. I would say it's not needed for a first attempt at the app.

I would suggest passing the embeddings of the full JSON payload: if you pass only the price, that would clearly not be enough to answer the query.
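As a sketch of what "the full payload" could look like before it reaches the embedder, the snippet below flattens one row of the table above (price plus the nested itinearies column) into a single string. The `describe` helper and the output format are assumptions made for illustration; the row is a trimmed copy of the third row of the table.

```python
# Trimmed copy of the third row of the table (column names as in the table).
row = {
    "price": "125.11",
    "itinearies": [
        {
            "duration": "PT1H50M",
            "segments": [
                {
                    "carrierCode": "SG",
                    "number": "681",
                    "departure": {"at": "2024-06-01T21:00:00", "iataCode": "MAA"},
                    "arrival": {"at": "2024-06-01T22:50:00", "iataCode": "BOM"},
                }
            ],
        }
    ],
}


def describe(row):
    """Flatten price and every segment of every itinerary into one text string."""
    parts = [f"price {row['price']}"]
    for itinerary in row["itinearies"]:
        parts.append(f"total duration {itinerary['duration']}")
        for seg in itinerary["segments"]:
            dep, arr = seg["departure"], seg["arrival"]
            parts.append(
                f"{seg['carrierCode']}{seg['number']} "
                f"{dep['iataCode']} {dep['at']} -> {arr['iataCode']} {arr['at']}"
            )
    return "; ".join(parts)


text_to_embed = describe(row)
```

The resulting string keeps the price together with the route, times, and carrier, so a query that mentions any of them has something to match against.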

5 participants