This script processes files recursively from a specified directory using an API, logs results in a local SQLite database, and provides options for retrying failed or pending files. It includes features for skipping specific files, generating reports, and running multiple API calls in parallel.
- Parallel Processing: Process files in parallel, with the number of parallel calls configurable.
- Status Tracking: Tracks the execution status, results, and time taken for each file in an SQLite database.
- Retry Logic: Options to retry failed or pending files, or to skip them.
- Detailed Reporting: Prints a summary of file processing and provides a detailed report.
- Polling: Polls the API until the result is complete, with customizable intervals.
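The polling behavior can be pictured with a minimal sketch. Note that `check_status` here is a hypothetical stand-in for the script's actual API status call; the real implementation may differ.

```python
import time

def poll_until_complete(check_status, poll_interval=5):
    """Poll until the API reports a terminal status.

    `check_status` is a hypothetical callable returning a status string
    such as 'PENDING', 'COMPLETED', or 'ERROR'.
    """
    while True:
        status = check_status()
        if status in ("COMPLETED", "ERROR"):
            return status
        # Wait before polling again, mirroring the -i/--poll_interval option.
        time.sleep(poll_interval)
```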
Ensure you have the required dependencies installed:

```
pip install -r requirements.txt
```
The script uses a local SQLite database (`file_processing.db`) with the following schema:
- `file_status`:
  - `id` (INTEGER): Primary key
  - `file_name` (TEXT): Unique name of the file
  - `execution_status` (TEXT): Status of the file (`STARTING`, `COMPLETED`, `ERROR`, etc.)
  - `result` (TEXT): API result in JSON format
  - `time_taken` (REAL): Time taken to process the file
  - `status_code` (INTEGER): API status code
  - `status_api_endpoint` (TEXT): API endpoint for checking status
  - `total_embedding_cost` (REAL): Total cost incurred for embeddings
  - `total_embedding_tokens` (INTEGER): Total tokens used for embeddings
  - `total_llm_cost` (REAL): Total cost incurred for LLM operations
  - `total_llm_tokens` (INTEGER): Total tokens used for LLM operations
  - `error_message` (TEXT): Details of errors if `execution_status` is `ERROR`; otherwise NULL
  - `updated_at` (TEXT): Last updated timestamp
  - `created_at` (TEXT): Creation timestamp
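The schema above can be expressed as a DDL sketch. This is an assumed reconstruction from the column list, not the script's actual DDL, which may differ in constraints or defaults:

```python
import sqlite3

# In-memory DB for illustration; the script uses file_processing.db.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS file_status (
        id INTEGER PRIMARY KEY,
        file_name TEXT UNIQUE,
        execution_status TEXT,
        result TEXT,
        time_taken REAL,
        status_code INTEGER,
        status_api_endpoint TEXT,
        total_embedding_cost REAL,
        total_embedding_tokens INTEGER,
        total_llm_cost REAL,
        total_llm_tokens INTEGER,
        error_message TEXT,
        updated_at TEXT,
        created_at TEXT
    )
""")

# Example: list failed files together with their error messages.
failed = conn.execute(
    "SELECT file_name, error_message FROM file_status "
    "WHERE execution_status = 'ERROR'"
).fetchall()
```

Inspecting the database directly like this can be handy for debugging between runs.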
Run the script with the following options:
```
python main.py -h
```
This will display detailed usage information.
- `-e`, `--api_endpoint`: API endpoint for processing files.
- `-k`, `--api_key`: API key for authenticating API calls.
- `-f`, `--input_folder_path`: Folder path containing the files to process.
- `-t`, `--api_timeout`: Timeout (in seconds) for API requests (default: 10).
- `-i`, `--poll_interval`: Interval (in seconds) between API status polls (default: 5).
- `-p`, `--parallel_call_count`: Number of parallel API calls (default: 10).
- `--csv_report`: Path to export the detailed report as a CSV file.
- `--db_path`: Path where the SQLite DB file is stored (default: `./file_processing.db`).
- `--recursive`: Recursively identify and process files from the input folder path (default: False).
- `--retry_failed`: Retry processing of failed files.
- `--retry_pending`: Retry processing of pending files by making new requests.
- `--skip_pending`: Skip processing of pending files.
- `--skip_unprocessed`: Skip unprocessed files when retrying failed files.
- `--log_level`: Log level (default: `INFO`).
- `--print_report`: Print a detailed report of all processed files at the end.
- `--exclude_metadata`: Exclude metadata on tokens consumed and the context passed to LLMs for prompt studio exported tools in the result for each file.
- `--no_verify`: Disable SSL certificate verification (enabled by default).
To process files in the directory `/path/to/files` using the provided API:

```
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files
```
To retry files that previously encountered errors:

```
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --retry_failed
```
To skip files that are still pending:

```
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --skip_pending
```
To process 20 files in parallel:

```
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files -p 20
```
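Under the hood, bounding concurrency to the `-p`/`--parallel_call_count` value can be sketched with a thread pool. The `process_file` function here is a hypothetical stand-in for the script's per-file API call:

```python
from concurrent.futures import ThreadPoolExecutor

def process_file(path):
    # Hypothetical stand-in: in the real script this issues the API
    # request for one file and records the outcome in SQLite.
    return path, "COMPLETED"

def process_all(paths, parallel_call_count=10):
    # Size the pool to the requested number of parallel calls so at most
    # that many API requests are in flight at once.
    with ThreadPoolExecutor(max_workers=parallel_call_count) as pool:
        return dict(pool.map(process_file, paths))
```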
To generate and display a detailed report at the end of the run:

```
python main.py -e https://api.example.com/process -k your_api_key -f /path/to/files --print_report
```
- Database: Results and statuses are stored in a local SQLite database (`file_processing.db`).
- Logging: Logs are printed to stdout with configurable log levels (e.g., `DEBUG`, `INFO`, `ERROR`).
```
Status 'COMPLETED': 50
Status 'ERROR': 10
Status 'PENDING': 5
```

For more detailed output, you can use the `--print_report` option to get a per-file breakdown.
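The summary counts above can also be reproduced directly from the database. This sketch assumes the `file_status` table described earlier:

```python
import sqlite3

def status_summary(conn):
    # Count files per execution_status, yielding the same numbers as the
    # end-of-run summary, e.g. {'COMPLETED': 50, 'ERROR': 10, 'PENDING': 5}.
    rows = conn.execute(
        "SELECT execution_status, COUNT(*) FROM file_status "
        "GROUP BY execution_status"
    ).fetchall()
    return dict(rows)
```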
The following statuses are tracked for each file during processing:
- STARTING: Initial state when processing begins.
- EXECUTING: File is currently being processed.
- PENDING: File processing is pending or waiting for external actions.
- ERROR: File processing encountered an error.
- COMPLETED: File was processed successfully and will not be processed again unless forced by rerun options.
For more about statuses, see the API Docs.
On Slack, join great conversations around LLMs, their ecosystem and leveraging them to automate the previously unautomatable!
Unstract client: Learn more about the Unstract API client.
Unstract Cloud: Sign up and try it!
Unstract developer documentation: Learn more about Unstract and its API.