
📄 Document Embedding Analysis

Harnessing embeddings for direct content comparison analysis.

🎤 Introduction

Given an outline of an article (all the headings and subheadings), how well can a Large Language Model (LLM) generate the content? That's the question my client wanted to investigate.

Taking the Wikipedia page for self-driving cars as an example, we see the following headings: Definitions, Automated driver assistance system, Autonomous vs. automated, Autonomous versus cooperative, etc. My client wanted to give these headings to ChatGPT, ask it to write the content, and compare that content with the ground truth.

My goal was to build a dataset for him. I used Wikipedia articles, patents, and arXiv papers. I extracted the headings and subheadings. Then, for any sections longer than 512 tokens, I split them up and gave each chunk a new unique title.
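The splitting step can be sketched roughly as follows. This is a minimal illustration, not the repo's actual implementation: `split_section` is a hypothetical name, and whitespace tokenisation stands in for whatever tokenizer the real code uses to count the 512 tokens.

```python
def split_section(title: str, content: str, max_tokens: int = 512) -> list[dict]:
    """Split `content` into chunks of at most `max_tokens` tokens,
    giving each chunk a new unique title."""
    # Whitespace split is a stand-in for a real model tokenizer.
    tokens = content.split()
    if len(tokens) <= max_tokens:
        return [{"title": title, "content": content}]
    chunks = []
    for i in range(0, len(tokens), max_tokens):
        chunk = " ".join(tokens[i : i + max_tokens])
        part_num = i // max_tokens + 1
        chunks.append({"title": f"{title} (part {part_num})", "content": chunk})
    return chunks
```

Sections under the limit pass through unchanged; longer ones come back as `(part 1)`, `(part 2)`, and so on.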

After that, I created embeddings for everything extracted (headings and content) and analysed them, e.g., by calculating ROUGE-L, MAUVE, and cosine similarity scores. Some of these only accept a maximum of 512 tokens, which is why the sections had to be split above.
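Of the three scores, cosine similarity is the simplest to show: given two embedding vectors, it measures how closely they point in the same direction, with 1.0 meaning identical direction. A minimal, dependency-free sketch (the repo's own code likely uses a library implementation instead):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Comparing the embedding of the LLM-generated section against the embedding of the ground-truth section gives one similarity number per section.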

⭐ Review

My client was delighted with the work and left the following review.

*(Screenshot of the client's review.)*

💻 How to Run the Code

1. Download code + create env

$ git clone https://github.com/codeananda/document_extraction.git
$ cd document_extraction
$ pip install -r requirements.txt

2. Set your OpenAI key

  1. Go to https://platform.openai.com/account/api-keys to create a new key (if you don't have one already).
  2. Rename .env_template to .env.
  3. Add your key next to OPENAI_API_KEY.
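After renaming, your .env file should look something like this (the key value below is a placeholder, not a real key):

```ini
OPENAI_API_KEY=sk-your-key-here
```

The code can then read the key from the environment at startup, which keeps it out of the repository.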

🤔 Questions

What do I need to modify if I want to change max_section_length to > 512?

You must remove or comment out the parts of the code that only accept a maximum of 512 tokens as input: e5-base-v2 and MAUVE.

  • Remove e5-base-v2 embeddings code

This is found in _gen_embed_section_content and generate_embeddings_plan_and_section_content, and is defined like so:

embed_e5 = HuggingFaceEmbeddings(
    model_name="intfloat/e5-base-v2", encode_kwargs={"normalize_embeddings": True}
)

Comment it out, along with all references to it.

  • Remove all references to _embedding_2
  • Remove the MAUVE calculations (Ctrl + F and search 'mauve' to find them)

How do I manually edit text extracted by pdfminer?

  • Run load_arxiv_paper(path_to_paper)
  • Write the output to disk using json.dump
  • Modify the Content key
  • Load the content from disk using json.load
  • See extract_plan_and_content for the functions to pass the result to next
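The steps above can be sketched as the JSON round trip below. To keep it runnable, a dummy dictionary stands in for the real output of load_arxiv_paper, and the "Content" edit is done programmatically rather than by hand; the key names ("Title", "Content") beyond the "Content" key named above are illustrative assumptions.

```python
import json

# Stand-in for the repo's real call: paper = load_arxiv_paper(path_to_paper)
paper = {"Title": "Example paper", "Content": "Raw text from pdfminer..."}

# 1. Write the output to disk.
with open("paper.json", "w") as f:
    json.dump(paper, f, indent=2)

# 2. Edit the "Content" key (by hand in an editor, or programmatically as here).
with open("paper.json") as f:
    paper = json.load(f)
paper["Content"] = paper["Content"].replace("Raw", "Cleaned")
with open("paper.json", "w") as f:
    json.dump(paper, f, indent=2)

# 3. Load the edited content back before passing it on
#    to the functions used by extract_plan_and_content.
with open("paper.json") as f:
    edited = json.load(f)
```

The round trip through json.dump / json.load means you can pause the pipeline, fix extraction mistakes in any text editor, and resume with the corrected content.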
