Ingest rst does not follow included references for extract #107

Closed
mpgreg opened this issue Nov 13, 2023 · 6 comments

@mpgreg
Contributor

mpgreg commented Nov 13, 2023

extract_github_rst() does not follow includes or references to other rst docs. As a result, much of the Airflow docs content is either not ingested or cannot be linked to the correct page.

https://github.com/astronomer/ask-astro/blob/c45487c7f12a9424dbe885580c687e35e30b7de4/airflow/dags/ingestion/ask-astro-load-github.py#L46C10-L46C10
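To illustrate what gets missed, here is a minimal sketch of detecting include-style directives in an rst source. It assumes only the standard `include` and `literalinclude` directives; the `find_includes` helper and the sample text are hypothetical and are not part of ask-astro.

```python
import re

# Matches rst directives that pull in another file, for example
# ".. include:: ../shared/note.rst" or ".. literalinclude:: example.py".
INCLUDE_RE = re.compile(
    r"^[ \t]*\.\.[ \t]+(include|literalinclude)::[ \t]*(\S+)",
    re.MULTILINE,
)


def find_includes(rst_text: str) -> list[tuple[str, str]]:
    """Return (directive, target) pairs for each include-style directive."""
    return INCLUDE_RE.findall(rst_text)


# Hypothetical sample document, for illustration only.
sample = """\
Title
=====

.. include:: ../shared/prereqs.rst

Some text with a :doc:`cross reference <other-page>`.

.. literalinclude:: example_dag.py
"""

print(find_includes(sample))
# [('include', '../shared/prereqs.rst'), ('literalinclude', 'example_dag.py')]
```

An extractor that reads each rst file in isolation never sees the content behind these targets, and roles such as `:doc:` only resolve to real URLs once the docs are built.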

Need to ingest by scraping the Airflow docs HTML pages instead: https://airflow.apache.org/docs/

@mpgreg
Contributor Author

mpgreg commented Nov 13, 2023

Also need code to recursively walk the docs pages and extract sub-pages. Need HTML splitter code that splits on h2 headings.
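The two pieces described here could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the code that was committed: `crawl`, `split_on_h2`, `LinkParser` and `H2Splitter` are hypothetical names, the walk uses an explicit stack rather than literal recursion, and the page fetcher is passed in as a function so the sketch runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class H2Splitter(HTMLParser):
    """Group page text into [heading, text parts] sections, one per <h2>."""

    def __init__(self):
        super().__init__()
        self.sections = [["", []]]  # text before the first <h2> gets an empty heading
        self.in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.sections.append(["", []])

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.sections[-1][0] += data
        else:
            self.sections[-1][1].append(data)


def split_on_h2(html):
    """Return (heading, body) pairs with whitespace collapsed."""
    parser = H2Splitter()
    parser.feed(html)
    parser.close()
    result = []
    for heading, parts in parser.sections:
        heading = " ".join(heading.split())
        body = " ".join(" ".join(parts).split())
        if heading or body:
            result.append((heading, body))
    return result


def crawl(start_url, fetch, base_url=None):
    """Visit every page reachable from start_url that stays under base_url."""
    base_url = base_url or start_url
    seen, stack, pages = set(), [start_url], {}
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        pages[url] = fetch(url)
        links = LinkParser()
        links.feed(pages[url])
        for href in links.links:
            # Resolve relative links and drop "#fragment" so anchors do not
            # count as separate pages.
            target = urldefrag(urljoin(url, href)).url
            if target.startswith(base_url) and target not in seen:
                stack.append(target)
    return pages


# Hypothetical two-page site standing in for the real docs.
site = {
    "https://example.invalid/docs/": '<h1>Docs</h1><a href="a.html">A</a>'
    '<a href="https://elsewhere.invalid/">external</a>',
    "https://example.invalid/docs/a.html": '<p>Intro <a href="./#top">home</a></p>'
    "<h2>Setup</h2><p>Install it.</p><h2>Usage</h2><p>Run it.</p>",
}
pages = crawl("https://example.invalid/docs/", site.__getitem__)
for url in sorted(pages):
    print(url, split_on_h2(pages[url]))
```

Against a live site, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode()`. The sketch also keeps text from `<script>` and `<style>` elements and joins inline fragments with spaces, so a real ingest would likely need extra filtering.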

mpgreg added a commit to mpgreg/ask-astro-upstream that referenced this issue Nov 13, 2023
sunank200 pushed a commit that referenced this issue Nov 17, 2023
sunank200 pushed a commit that referenced this issue Nov 20, 2023
sunank200 pushed a commit that referenced this issue Nov 23, 2023
@sunank200
Collaborator

Note:

  • After the implementation is complete, the data should be ingested to the dev database and the dev slackbot should be deployed.
  • This change should be tested by @vatsrahul1001, and it should be merged once he has assessed the quality of the responses.
  • Create issue for testing if required

@pankajastro
Collaborator

@sunank200 @mpgreg AFAIK the HTML docs are generated from the rst docs. Since we are already ingesting the HTML docs, why do we need rst too? Or am I missing something here?

@mpgreg
Contributor Author

mpgreg commented Dec 14, 2023

Yes, this issue was meant to be closed if/when we change to html ingest.

@pankajastro
Collaborator

pankajastro commented Dec 15, 2023

Yes, this issue was meant to be closed if/when we change to html ingest.

cc: @sunank200 @phanikumv

@phanikumv
Collaborator

Closing as discussed with Pankaj and Ankit in the sprint planning call.
