pablocael/gpt-pdf-organizer

GPT PDF Organizer

GPT PDF Organizer is an automatic PDF file organizer. It queries the ChatGPT API to retrieve information about the contents of each file, then uses that information to classify the files and organize them into folders.

The organization process can be customized using a config file in YAML format. The customization controls how the names of the organized PDF files and their subfolders are constructed from the extracted properties.

Usage

gpt-pdf-organizer --input-path INPUT_PATH \
 --output-folder OUTPUT_FOLDER \
 [--config-file CONFIG_FILE]

If no config file is passed as an argument, a local './config.yaml' file is used by default.

API Cost of Usage and Tokens

The ChatGPT API uses tokens to calculate the cost of each API request. Tokens are not always the same as words. This tool queries the ChatGPT API with questions about the content of a PDF file, and spends tokens in two ways:

  • The query prompt content, sent to GPT to ask for information about the PDF content.
  • The PDF content itself

The PDF content sent is not the full PDF text but rather the first config.maxNumTokens tokens that will be extracted from the PDF file.

The total number of tokens spent in a single request for each file is then config.maxNumTokens + promptTokenSize, where promptTokenSize is the size of the prompt itself, which spans 281 tokens.

For example, if one chooses to use config.maxNumTokens=1500 tokens to read the PDF content, the total amount of tokens spent in the request will be 1500+281=1781.
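
The arithmetic above can be written as a small helper. This is a minimal sketch: the function name is hypothetical and not part of the tool, and the 281-token prompt size is the figure quoted above for the default prompt.

```python
# Size of the default prompt in tokens, as stated in this README.
PROMPT_TOKEN_SIZE = 281


def tokens_per_request(max_num_tokens: int, prompt_tokens: int = PROMPT_TOKEN_SIZE) -> int:
    """Return the tokens sent for one file: the PDF excerpt plus the prompt."""
    if max_num_tokens < 0 or prompt_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return max_num_tokens + prompt_tokens


print(tokens_per_request(1500))  # 1500 + 281 = 1781
```

If the prompt is customized, pass its token count as prompt_tokens instead of relying on the default.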

The cost of usage also depends on the model used. One can configure the model by setting llmModelName in the config file.

In summary, the total cost for organizing your files will depend on 3 factors:

  • The total number of files to organize
  • The chosen model (e.g. gpt-4 or gpt-3.5-turbo). See OpenAI Models for all the models.
  • The number of tokens used for extracting PDF content (see maxNumTokens). The larger the maxNumTokens value, the more accurate the classification of the content, but the more costly the API call.
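
To see how these three factors combine, the sketch below estimates the cost of a batch. It is an illustration only: the function name is hypothetical, the price per 1,000 tokens is a made-up placeholder that stands in for the chosen model's real price, and only the tokens sent in each request are counted.

```python
def estimate_batch_cost(num_files: int, max_num_tokens: int,
                        price_per_1k_tokens: float, prompt_tokens: int = 281) -> float:
    """Estimate the cost of sending one request per file.

    price_per_1k_tokens depends on the chosen model and must be looked up
    on the provider's pricing page; it is not defined by this tool.
    """
    total_tokens = num_files * (max_num_tokens + prompt_tokens)
    return total_tokens / 1000 * price_per_1k_tokens


# Hypothetical price of 0.001 per 1,000 tokens, for illustration only.
cost = estimate_batch_cost(num_files=200, max_num_tokens=1500, price_per_1k_tokens=0.001)
print(f"{cost:.4f}")  # 200 * 1781 = 356200 tokens -> 0.3562
```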

Make sure to add enough credits to your ChatGPT account and to set up your apiKey before using the tool.

To reduce the token count further, one can also customize the prompt to make it smaller. See Prompt Customization.

Configuration File

The configuration file is a YAML file that supports the following properties:

| Property | Type | Description |
| --- | --- | --- |
| apiKey | string | The API key used for authentication. If the OPENAI_API_KEY environment variable is set, it takes precedence over this configuration. |
| maxNumTokens | integer | The maximum number of tokens to use for extracting content from the PDF file. Only the first maxNumTokens tokens are extracted from the PDF file. The larger the number of tokens, the more accurate the classification, but the more costly the request. |
| llmModelName | string | The name of the language model to use, e.g., "gpt-3.5-turbo". |
| logLevel | string | The level of logging to output, e.g., "info". |
| organizer.moveInsteadOfCopy | boolean | When set to true, files in the original folder are moved to the output destination instead of being copied. The default is false. Be aware that moving removes the original file once it has been placed in the classified folder structure. |
| organizer.subfoldersFromAttributes | List[string] | Determines the structure of subfolders in the output directory based on content attributes such as "content_type", "author", "topic", "sub_topic", and "year". The nesting order of subfolders is the same as the order of the attributes in the config file (e.g. attr1/attr2/...). If an attribute cannot be inferred from the LLM API, it is null and is replaced with "unknown_<attribute_name>" in the folder path. The default is "content_type" if this property is not specified. |
| organizer.filenameFromAttributes | List[string] | Configures the filename based on content attributes like "title", "content_type", "author", "topic", "sub_topic", and "year". The attribute order within the filename is the same as the order in which the attributes appear in the config file. The original title attribute is always used; if it is not set in this list, the other attributes are appended to the left of the title. If an attribute is null (e.g. it cannot be inferred from the LLM API), it is replaced with "unknown_<attribute_name>". The default is ["title"] if not specified. |
| organizer.filenameAttributeSeparator | string | Specifies the separator used to join attributes in the filename. The default is "-", but any string not containing invalid filename characters is supported. |
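
As an illustration of these properties, the sketch below holds an example configuration as a Python dictionary and resolves the API key with the precedence described above. It rests on assumptions: the dotted property names are taken to mean keys nested under "organizer", the values are placeholders, and resolve_api_key is a hypothetical helper, not a function of the tool.

```python
import os
from typing import Mapping, Optional

# Hypothetical configuration mirroring the properties listed above.
# Assumption: "organizer.x" means key "x" nested under "organizer".
example_config = {
    "apiKey": "<your-api-key>",
    "maxNumTokens": 1500,
    "llmModelName": "gpt-3.5-turbo",
    "logLevel": "info",
    "organizer": {
        "moveInsteadOfCopy": False,
        "subfoldersFromAttributes": ["content_type", "sub_topic"],
        "filenameFromAttributes": ["year", "title"],
        "filenameAttributeSeparator": "-",
    },
}


def resolve_api_key(config: Mapping, environ: Mapping = os.environ) -> Optional[str]:
    """Prefer OPENAI_API_KEY from the environment, else fall back to config["apiKey"]."""
    return environ.get("OPENAI_API_KEY") or config.get("apiKey")
```

With this helper, resolve_api_key(example_config, {"OPENAI_API_KEY": "from-env"}) returns "from-env", while an empty environment falls back to the apiKey stored in the config.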

The attributes used in config organizer.subfoldersFromAttributes and organizer.filenameFromAttributes are:

  • Title
  • Year of Publication
  • Author(s)
  • Main Topic
  • Content Type: Can be 'Book' or 'Article'
  • Sub Topic (or Sub Area within Main Topic).
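
The way these attributes turn into a folder path and a filename can be sketched as follows. This is not the tool's actual code: the function names are hypothetical, the lowercase-with-underscores naming is inferred from the example output in this README, and the handling of a missing "title" is one reading of the rule in the table above.

```python
import re
from pathlib import PurePosixPath


def slug(value) -> str:
    """Lowercase and replace runs of non-alphanumerics with '_' (assumed naming style)."""
    return re.sub(r"[^a-z0-9]+", "_", str(value).lower()).strip("_")


def attribute_value(attributes: dict, name: str) -> str:
    """Return the slugged attribute, or 'unknown_<name>' when it is missing or null."""
    value = attributes.get(name)
    return slug(value) if value is not None else f"unknown_{name}"


def build_relative_path(attributes: dict, subfolders_from=("content_type",),
                        filename_from=("title",), separator: str = "-") -> PurePosixPath:
    """Build '<subfolders>/<filename>.pdf' from the extracted attributes."""
    names = list(filename_from)
    if "title" not in names:
        names.append("title")  # title is always used; other attributes stay to its left
    folders = [attribute_value(attributes, n) for n in subfolders_from]
    filename = separator.join(attribute_value(attributes, n) for n in names) + ".pdf"
    return PurePosixPath(*folders, filename)


extracted = {"title": "The SIFT Algorithm", "content_type": "Article",
             "sub_topic": "Computer Science", "year": None}
print(build_relative_path(extracted, ["content_type", "sub_topic"], ["year", "title"]))
# article/computer_science/unknown_year-the_sift_algorithm.pdf
```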

Main Topic and Sub Topic are always classified according to the following list, following Topic/Sub Topic nesting:

  • Natural Sciences
    • Physics
    • Chemistry
    • Biology
    • Earth Sciences
  • Formal Sciences
    • Mathematics
    • Computer Science
    • Statistics
  • Applied Sciences
    • Engineering
    • Medicine
    • Agriculture
  • Social Sciences
    • Psychology
    • Economics
    • Sociology
  • Interdisciplinary Fields
    • Environmental Science
    • Biotechnology
    • Neuroscience
  • Humanities
    • Science and technology studies
    • History and philosophy of science
    • Digital humanities
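
For scripts that post-process the classified output, the list above can be held as a plain mapping. The sketch below is an assumption about how one might check a topic/sub-topic pair; TAXONOMY and is_valid_classification are hypothetical names, not part of the tool.

```python
# Topic -> sub-topics, copied from the list above.
TAXONOMY = {
    "Natural Sciences": ["Physics", "Chemistry", "Biology", "Earth Sciences"],
    "Formal Sciences": ["Mathematics", "Computer Science", "Statistics"],
    "Applied Sciences": ["Engineering", "Medicine", "Agriculture"],
    "Social Sciences": ["Psychology", "Economics", "Sociology"],
    "Interdisciplinary Fields": ["Environmental Science", "Biotechnology", "Neuroscience"],
    "Humanities": ["Science and technology studies",
                   "History and philosophy of science",
                   "Digital humanities"],
}


def is_valid_classification(topic: str, sub_topic: str) -> bool:
    """Return True when sub_topic is listed under topic in TAXONOMY."""
    return sub_topic in TAXONOMY.get(topic, [])


print(is_valid_classification("Formal Sciences", "Mathematics"))  # True
print(is_valid_classification("Humanities", "Physics"))           # False
```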

Example of Usage

Classifying a Single File

gpt-pdf-organizer --input-path="/home/me/Documents/afile.pdf"  \
 --output-folder="/home/me/Documents/classified/"

Classifying All Files in a Folder

To classify a whole folder, use the same command and pass a folder instead of a file. The application automatically detects whether the input is a folder or a file.

gpt-pdf-organizer --input-path="/home/me/Documents/pdfs/"  \
 --output-folder="/home/me/Documents/classified/"
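
The file-or-folder detection described above can be sketched with pathlib. This is an assumption about the behavior, not the tool's implementation: collect_pdfs is a hypothetical name, and whether the real tool also descends into nested folders is not stated here, so this sketch only looks at the top level.

```python
from pathlib import Path
from typing import List


def collect_pdfs(input_path: str) -> List[Path]:
    """Return the PDF files named by input_path, which may be a file or a folder."""
    path = Path(input_path)
    if not path.exists():
        raise FileNotFoundError(f"input path does not exist: {path}")
    if path.is_dir():
        # Top-level files only (assumption); match the extension case-insensitively.
        return sorted(p for p in path.iterdir()
                      if p.is_file() and p.suffix.lower() == ".pdf")
    if path.suffix.lower() == ".pdf":
        return [path]
    raise ValueError(f"not a PDF file: {path}")


# Example: collect_pdfs("/home/me/Documents/pdfs/") lists every PDF in that folder,
# while collect_pdfs("/home/me/Documents/afile.pdf") returns a one-element list.
```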

Example of Output File Structure

The example below is a hypothetical output for classified files that uses only year and title in the output filename, and content_type/sub_topic as subfolders (this can be configured in the config file):

  • Original Files:
pdfs/
├── file1_arxiv.pdf
├── article_202_arxiv.pdf
├── dimensionality.pdf
├── afile.pdf
...
  • Classified Output Files:
output_dir/
├── article/
│ ├── mathematics/
│ │     ├── minimum_curves_in_manifolds.pdf
│ │     ├── the_great_article_of_somethin.pdf
│ ├── computer_science/
│ │     ├── the_sift_algorithm.pdf
│ │     ├── viola_jones_algorithm.pdf
│ ...
├── book/
│ ├── mathematics/
│ │     ├── calculus_volume_19.pdf
│ │     ├── fourier_transforms_200.pdf
│ ...
├── unclassified/
│ ├── afile.pdf

Limitations

For some PDF files, the content cannot be extracted because of the quality of the scanned image or a lack of metadata. In those cases, the file is sent to an unclassified folder within the output directory.

Files may also remain unclassified because of an inaccurate prompt response: depending on the model used and the number of tokens used to extract the content (see maxNumTokens in the config), the API may not be able to classify the content.
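
Both cases can be pictured as a single check made before a file is placed. The sketch below is only an illustration of that decision under stated assumptions; the function name and the idea of passing the extracted text and the parsed attributes as arguments are hypothetical, not taken from the tool.

```python
from pathlib import Path
from typing import Optional

UNCLASSIFIED = "unclassified"


def needs_unclassified(extracted_text: str, attributes: Optional[dict]) -> bool:
    """Return True when a file should go to the 'unclassified' folder.

    Covers the two cases above: no text could be extracted from the PDF,
    or the model returned no usable attributes.
    """
    if not extracted_text.strip():
        return True
    return not attributes


output_dir = Path("output_dir")
if needs_unclassified("", None):
    print(output_dir / UNCLASSIFIED / "afile.pdf")
```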

Installation

pip install gpt-pdf-organizer

Development

Prompt Customization

This step is optional: a default prompt is already tuned to extract good results from the ChatGPT API. If, however, one wants to change the prompt, follow the steps below.

The prompt sent for each file can be customized in two ways:

  • By changing the prompt in the prompt_builder module: this keeps the current GPT prompter but allows changing the prompt sent to the GPT API.
  • By passing a different PromptQuerier class instance to Application when it starts in the main entrypoint: this allows integrating APIs other than ChatGPT, such as Bing or Bard.
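
The second option can be pictured with the sketch below. It does not use the project's real classes: PromptQuerierSketch is a local stand-in, and its query method, its signature, and the JSON reply format are all assumptions, since the actual PromptQuerier interface is not shown in this README. Check the source before adapting it.

```python
import json
from abc import ABC, abstractmethod


class PromptQuerierSketch(ABC):
    """Stand-in for the project's PromptQuerier; the real method names may differ."""

    @abstractmethod
    def query(self, prompt: str) -> str:
        """Send the prompt to a backend and return its raw text reply."""


class CannedQuerier(PromptQuerierSketch):
    """Offline backend that returns a fixed JSON reply, useful for dry runs."""

    def __init__(self, reply: dict):
        self._reply = json.dumps(reply)

    def query(self, prompt: str) -> str:
        # A real backend would call another API here instead.
        return self._reply


querier = CannedQuerier({"title": "The SIFT Algorithm", "content_type": "Article"})
attributes = json.loads(querier.query("Classify this PDF excerpt: ..."))
print(attributes["content_type"])  # Article
```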
