A FastAPI service wrapping Microsoft's markitdown library. This service provides a robust API for converting various file formats and web content to clean, formatted Markdown with enterprise-grade features.
Current version: Refer to settings.VERSION in the code.
- Swagger UI: Available at /docs (configurable via settings.DOCS_URL)
- ReDoc: Available at /redoc (configurable via settings.REDOC_URL)
- OpenAPI Schema: Available at /openapi.json (configurable via settings.OPENAPI_URL)
On startup, the application:
- Initializes logging
- Sets up the log directory
- Initializes the database
- Checks log rotation configuration
- Logs startup status and configuration details
On shutdown, the application:
- Flushes all logs
- Logs the shutdown status
The application includes:
- A global exception handler for unhandled exceptions
- A request validation error handler for invalid request parameters
The application logs the following audit events:
- Service startup
- Service shutdown
- Health checks
These logs include detailed information about the application state and configuration.
The service supports various file formats for conversion. The exact list is defined in settings.SUPPORTED_EXTENSIONS.
Key dependencies include:
- FastAPI
- Uvicorn
- SQLModel
- Pydantic
- Typer
- Rich
- Microsoft's MarkItDown library
For a complete list, refer to requirements.txt in the project root.
I have a software engineering background but I am not a Python expert. I've been using AI to help build features into MarkItLikeItsHot. While AI is helpful, it doesn't write great code. I've found that adding tests that show exactly what I want helps get better results... most of the time. Some of these tests were also written with AI help, so they need work too.
The early tests take a black-box approach and call the API endpoints directly on the test container. This made sense at the time (and I think it still does), but it does make the tests run slowly.
Despite this, I've got some solid features working:
- A command line tool for managing users and API keys
- The option to turn API key requirements on/off
- Adjustable rate limits
- A working conversion API
- Admin tools
- Different levels of logging across dev, test and production
- Convert multiple file formats to Markdown (PDF, DOCX, PPTX, HTML, etc.)
- Process direct text/HTML input
- Convert web pages with special Wikipedia handling
- Clean and standardize Markdown output
- API Key Authentication
- Rate Limiting
- Audit Logging
- Health Monitoring
- Comprehensive Error Handling
- CORS Support
- Interactive CLI for user and API key management
- Database initialization and maintenance tools
- System health checks and diagnostics
- Log rotation and cleanup
On the first run of an environment the default admin API key will be generated and viewable in the logs.
# Start development environment with hot reload
./run.sh dev
# Start production environment
./run.sh prod
To run the full test suite, use the following command:
# Run test suite
./run.sh -r test
This command is consistent with how you start other environments. The -r flag stands for "run". You can refer to the run.sh script for more details on how environments are started and managed.
The test suite includes various tests to ensure the functionality and reliability of the application. These tests cover different aspects such as API endpoints, database operations, and core functionalities.
To view the test logs, you can use:
# View test container logs
./run.sh -l test
If you need to clean up the test environment, use:
# Wipe test container
./run.sh -c test
Note: Running tests may take some time, especially if they include integration or end-to-end tests. Ensure you have a stable environment before running the test suite.
A separate script is provided to test the rate limiting functionality: markitdown-service/tests/test_rate_limit.sh. This script sends multiple requests to the API to verify that rate limiting is working correctly.
To use this script:
- Open the script file and set your API key:
  API_KEY="YOUR-API-KEY-HERE"
- Run the script:
  bash markitdown-service/tests/test_rate_limit.sh
The script will send 70 requests to the API with a short delay between each request. It will display the status of each request, including:
- HTTP status code
- Remaining rate limit
- Total rate limit
- Reset time for the rate limit
At the end, it will provide a summary of successful requests, rate-limited requests, and any errors encountered.
This test is useful for verifying that the rate limiting feature is functioning as expected and for fine-tuning rate limit settings.
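The summary step at the end of the script can be sketched in Python. This is a minimal illustration, not the script itself: the function name summarize_statuses is hypothetical, and it assumes the service answers rate-limited requests with HTTP 429 (the standard "Too Many Requests" status), which this README does not confirm.

```python
from collections import Counter


def summarize_statuses(status_codes):
    """Tally a list of HTTP status codes into three buckets."""
    counts = Counter(status_codes)
    # Any 2xx response counts as a successful request.
    successful = sum(n for code, n in counts.items() if 200 <= code < 300)
    # Assumption: the service signals rate limiting with HTTP 429.
    rate_limited = counts.get(429, 0)
    # Everything else is treated as an error.
    errors = len(status_codes) - successful - rate_limited
    return {
        "successful": successful,
        "rate_limited": rate_limited,
        "errors": errors,
    }
```

Feeding it the status codes collected from a run of 70 requests gives the same three totals the script prints at the end.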
On the first run of the development and production environments, the logs also show the default API key generated for the admin user. Subsequent restarts do not generate a new default API key, so make a note of it, as it is not shown again.
# View production container logs
./run.sh -l prod
# View development container logs
./run.sh -l dev
# View test container logs
./run.sh -l test
# -C (big C) to wipe all environments
./run.sh -C
# Wipe development container
./run.sh -c dev
# Wipe production container
./run.sh -c prod
# Wipe test container
./run.sh -c test
You can use the administration CLI on the development and production environments to manage users and API keys.
# Use the CLI on production
./run.sh -i prod
# Use the CLI on development
./run.sh -i dev
All endpoints require an API key (when enabled):
-H "X-API-Key: your-api-key"
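The "when enabled" behaviour of the header check above can be sketched as follows. This is a simplified illustration under stated assumptions, not the service's actual code: the function check_api_key and its parameters are hypothetical, and the real service looks keys up in its database rather than in an in-memory collection.

```python
import hmac


def check_api_key(headers, valid_keys, auth_enabled=True, header_name="X-API-Key"):
    """Return True if the request should be let through.

    Hypothetical helper: headers is a plain dict of request headers and
    valid_keys is an iterable of accepted key strings.
    """
    if not auth_enabled:
        # With authentication switched off, every request is accepted.
        return True
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get(header_name.lower())
    if supplied is None:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return any(
        hmac.compare_digest(supplied.encode("utf-8"), key.encode("utf-8"))
        for key in valid_keys
    )
```

For example, check_api_key({"x-api-key": "abc"}, ["abc"]) accepts the request, while the same call with an empty header dict rejects it unless auth_enabled is False.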
curl -X POST "http://localhost:8000/api/v1/convert/file" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: multipart/form-data" \
-F "file=@path/to/document.docx"
curl -X POST "http://localhost:8000/api/v1/convert/text" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{"content": "<h1>Hello World</h1><p>This is a test</p>"}'
curl -X POST "http://localhost:8000/api/v1/convert/url" \
-H "X-API-Key: your-api-key" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'
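The same conversion requests can be made from Python. The sketch below mirrors the text conversion curl call using only the standard library. The helper names build_convert_text_request and send are hypothetical, and the shape of the response body is not documented here, so the example returns it as raw text rather than assuming any fields.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_convert_text_request(content, api_key, base_url=BASE_URL):
    """Build (but do not send) a POST request for /api/v1/convert/text."""
    body = json.dumps({"content": content}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/convert/text",
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def send(request, timeout=30):
    """Send a prepared request and return (status code, response text)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

With the service running, send(build_convert_text_request("<h1>Hello World</h1>", "your-api-key")) performs the same call as the curl command. A non-2xx response raises urllib.error.HTTPError, so wrap the call in try/except if you want to inspect error responses.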
curl -X GET "http://localhost:8000/health"
The health check endpoint provides detailed information about the system status, including:
- Overall health status
- Application version
- Current environment
- Authentication status
- Supported file formats
- Database connection status
- Logging status
- Rate limiting configuration
- API key authentication settings
API key authentication can be disabled for development or testing purposes:
- Environment Variable Method:
export API_KEY_AUTH_ENABLED=false
./run.sh dev
- Docker Compose Method: Add to your service in docker-compose.yml:
environment:
- API_KEY_AUTH_ENABLED=false
- .env File Method: Create or modify .env in the project root:
API_KEY_AUTH_ENABLED=false
Rate limiting is applied to all requests, regardless of whether API key authentication is enabled or disabled. The rate limiting is based on either the API key (if present) or the client's IP address. The current rate limiting settings are:
- Default rate limit: 30 requests per 60 seconds
- Endpoint-specific rate limits:
  - /api/v1/convert/url: 60 requests per 60 seconds
  - /api/v1/convert/file: 60 requests per 60 seconds
  - /api/v1/convert/text: 60 requests per 60 seconds
Rate limiting can be enabled or disabled using the RATE_LIMITING_ENABLED setting.
Certain endpoints can be excluded from rate limiting. Currently, all endpoints under /api/v1/admin/* are excluded.
For testing purposes, there are separate rate limit settings:
- Test rate limit: 5 requests per 5 seconds
To modify these settings, you can use environment variables:
environment:
- RATE_LIMITING_ENABLED=true
- RATE_LIMIT_DEFAULT_RATE=30
- RATE_LIMIT_DEFAULT_PERIOD=60
These settings define the number of requests allowed per time period for each unique API key or IP address. You can adjust these values to suit your needs.
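To make the rate and period settings concrete, here is a minimal fixed-window limiter. It is a sketch under stated assumptions, not the algorithm in app/core/rate_limiting: the class FixedWindowLimiter is hypothetical, and the real limiter may use a different windowing strategy. It does follow the rules described above: clients are keyed by API key when present and by IP address otherwise, and paths under /api/v1/admin/* are skipped.

```python
import time
from fnmatch import fnmatchcase


class FixedWindowLimiter:
    """Allow at most `rate` requests per `period` seconds per client."""

    def __init__(self, rate=30, period=60, excluded=("/api/v1/admin/*",),
                 clock=time.monotonic):
        self.rate = rate
        self.period = period
        self.excluded = excluded
        self.clock = clock  # injectable so the window can be tested
        self._windows = {}  # client key -> (window start, request count)

    @staticmethod
    def client_key(api_key, ip):
        # Prefer the API key; fall back to the client's IP address.
        return f"key:{api_key}" if api_key else f"ip:{ip}"

    def allow(self, path, api_key=None, ip="127.0.0.1"):
        # Excluded paths never consume quota.
        if any(fnmatchcase(path, pattern) for pattern in self.excluded):
            return True
        key = self.client_key(api_key, ip)
        now = self.clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.period:
            start, count = now, 0  # the previous window expired
        if count >= self.rate:
            self._windows[key] = (start, count)
            return False
        self._windows[key] = (start, count + 1)
        return True
```

With the defaults (30 requests per 60 seconds), the 31st call to allow() for the same key inside one window returns False, and the count resets once 60 seconds have passed since the window started.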
When API key authentication is disabled, you may want to restrict CORS:
environment:
- ALLOWED_ORIGINS=["http://localhost:3000"]
- ALLOWED_METHODS=["GET", "POST"]
- ALLOWED_HEADERS=["*"]
Audit logging tracks all API operations regardless of authentication status:
environment:
- AUDIT_LOG_ENABLED=true
- AUDIT_LOG_RETENTION_DAYS=90
Complete example for local development:
services:
  markitdown-dev:
    environment:
      - ENVIRONMENT=development
      - API_KEY_AUTH_ENABLED=false
      - LOG_LEVEL=DEBUG
      - RATE_LIMIT_REQUESTS=1000
      - RATE_LIMIT_PERIOD=hour
      - ALLOWED_ORIGINS=["*"]
      - AUDIT_LOG_ENABLED=true
      - DATABASE_URL=sqlite:///./data/dev_api_keys.db
Check current security settings:
# Using the CLI
./run.sh -i dev
# Select "Show Version Info" from the menu
# Or using curl
curl http://localhost:8000/health
The health endpoint response includes authentication status:
{
  "status": "healthy",
  "version": "1.0.0",
  "environment": "development",
  "auth_enabled": false,
  "database": "connected",
  "rate_limit": {
    "requests": 1000,
    "period": "hour"
  }
}
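A script can read these fields to confirm the security settings before sending traffic. The sketch below parses a health response like the one above. The function summarize_health is hypothetical, and it assumes only the field names shown in that sample response.

```python
import json


def summarize_health(payload):
    """Extract the security-relevant fields from a /health JSON body."""
    data = json.loads(payload)
    rate = data.get("rate_limit", {})
    return {
        "healthy": data.get("status") == "healthy",
        "auth_enabled": bool(data.get("auth_enabled")),
        "rate_limit": f'{rate.get("requests")} per {rate.get("period")}',
    }
```

Passing it the sample response above yields healthy=True, auth_enabled=False and a rate limit of "1000 per hour".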
The easiest way to access the interactive CLI is:
# Launch interactive CLI for development environment
./run.sh -i dev
# Or for production
./run.sh -i prod
If needed, you can still access the CLI directly via Docker:
sudo docker exec -it markitlikeitshot-markitdown-dev-1 python manage.py interactive
# Create new user
python manage.py users create --name "John Doe" --email "john.doe@example.com"
# List users
python manage.py users list
# View user details
python manage.py users info 1
# Create API key
python manage.py apikeys create --name "Test Key" --user-id 1
# List API keys
python manage.py apikeys list
# Deactivate API key
python manage.py apikeys deactivate 1
# Initialize database
python manage.py init
# Check system health
python manage.py check
# Clean old logs
python manage.py clean
- ENVIRONMENT: development/production/test
- API_KEY_AUTH_ENABLED: Enable/disable API key authentication
- LOG_LEVEL: DEBUG/INFO/WARNING/ERROR/CRITICAL
- RATE_LIMIT_REQUESTS: Number of requests allowed per period
- RATE_LIMIT_PERIOD: Time period for rate limiting
- LOG_DIR: Directory for storing log files
- ALLOWED_ORIGINS: List of allowed origins for CORS
- ALLOWED_METHODS: List of allowed HTTP methods for CORS
- ALLOWED_HEADERS: List of allowed headers for CORS
- API_KEY_HEADER_NAME: Name of the header used for API key authentication
The application uses Pydantic's BaseSettings for managing these environment variables. Refer to app/core/config/settings.py for a complete list of configurable settings.
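The sketch below shows roughly what loading these variables involves, using only the standard library as a stand-in for Pydantic's BaseSettings. It is not the project's settings.py: the class EnvSettings is hypothetical, the defaults are illustrative assumptions, and only a few of the variables are covered. It also shows why list values such as ALLOWED_ORIGINS are written as JSON arrays in the examples.

```python
import json
import os
from dataclasses import dataclass, field


def _as_bool(value):
    """Interpret common truthy strings such as "true" or "1"."""
    return str(value).strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class EnvSettings:
    # Defaults are illustrative assumptions, not the project's real defaults.
    environment: str = "development"
    api_key_auth_enabled: bool = True
    log_level: str = "INFO"
    allowed_origins: list = field(default_factory=lambda: ["*"])
    api_key_header_name: str = "X-API-Key"

    @classmethod
    def from_env(cls, env=None):
        """Build settings from a mapping (defaults to os.environ)."""
        env = os.environ if env is None else env
        base = cls()
        origins = (
            json.loads(env["ALLOWED_ORIGINS"])  # list values are JSON arrays
            if "ALLOWED_ORIGINS" in env
            else base.allowed_origins
        )
        return cls(
            environment=env.get("ENVIRONMENT", base.environment),
            api_key_auth_enabled=_as_bool(
                env.get("API_KEY_AUTH_ENABLED", base.api_key_auth_enabled)
            ),
            log_level=env.get("LOG_LEVEL", base.log_level).upper(),
            allowed_origins=origins,
            api_key_header_name=env.get(
                "API_KEY_HEADER_NAME", base.api_key_header_name
            ),
        )
```

Calling EnvSettings.from_env() with no arguments reads the process environment, and passing a dict makes the parsing easy to test.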
The application uses a logging system with different loggers for various components:
- Main application logger
- API logger
- Database logger
- Security logger
Logs are stored in the directory specified by LOG_DIR. Log rotation is configured for production and development environments.
- Documents: .pdf, .docx, .pptx, .xlsx
- Audio: .wav, .mp3
- Images: .jpg, .jpeg, .png
- Web: .html, .htm
- Data: .txt, .csv, .json, .xml
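A client can check a filename against this list before uploading. The sketch below copies the extensions listed above into a set. The authoritative list lives in settings.SUPPORTED_EXTENSIONS and may differ, and the helper is_supported is hypothetical.

```python
from pathlib import Path

# Copied from the list above; settings.SUPPORTED_EXTENSIONS is authoritative.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx",  # documents
    ".wav", ".mp3",                     # audio
    ".jpg", ".jpeg", ".png",            # images
    ".html", ".htm",                    # web
    ".txt", ".csv", ".json", ".xml",    # data
}


def is_supported(filename):
    """Return True if the file's extension is in the supported set."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

The comparison is case-insensitive, so report.PDF is accepted, while archive.zip and names without an extension are rejected.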
/markitlikeitshot
├── ai_dump.md
├── ai_dump.sh
├── docker-compose.yml
├── LICENSE
├── markitdown-service
│   ├── app
│   │   ├── api
│   │   │   └── v1
│   │   │       └── endpoints
│   │   │           ├── admin.py
│   │   │           └── conversion.py
│   │   ├── cli
│   │   │   ├── commands
│   │   │   │   ├── api_key.py
│   │   │   │   ├── logs.py
│   │   │   │   └── user.py
│   │   │   ├── interactive.py
│   │   │   ├── manage.py
│   │   │   └── utils
│   │   │       └── menu_utils.py
│   │   ├── core
│   │   │   ├── audit
│   │   │   │   ├── actions.py
│   │   │   │   ├── audit.py
│   │   │   │   └── __init__.py
│   │   │   ├── config
│   │   │   │   ├── __init__.py
│   │   │   │   └── settings.py
│   │   │   ├── errors
│   │   │   │   ├── base.py
│   │   │   │   ├── exceptions.py
│   │   │   │   ├── handlers.py
│   │   │   │   └── __init__.py
│   │   │   ├── __init__.py
│   │   │   ├── logging
│   │   │   │   ├── config.py
│   │   │   │   ├── formatters.py
│   │   │   │   ├── __init__.py
│   │   │   │   └── management.py
│   │   │   ├── rate_limiting
│   │   │   │   ├── limiter.py
│   │   │   │   └── middleware.py
│   │   │   ├── security
│   │   │   │   ├── api_key.py
│   │   │   │   └── user.py
│   │   │   └── validation
│   │   │       ├── __init__.py
│   │   │       └── validators.py
│   │   ├── db
│   │   │   ├── init_db.py
│   │   │   ├── __init__.py
│   │   │   └── session.py
│   │   ├── main.py
│   │   └── models
│   │       ├── auth
│   │       │   ├── api_key.py
│   │       │   └── user.py
│   │       └── __init__.py
│   ├── docker
│   │   ├── config
│   │   │   ├── log-maintenance
│   │   │   └── logrotate.conf
│   │   └── start.sh
│   ├── Dockerfile
│   ├── logs
│   │   ├── app_development.log
│   │   ├── audit_development.log
│   │   ├── audit_test.log
│   │   ├── audit_test.log.2024-12-27
│   │   ├── audit_test.log.2024-12-28
│   │   ├── cli_development.log
│   │   └── sql_development.log
│   ├── manage.py
│   ├── pytest.ini
│   ├── requirements.txt
│   └── tests
│       ├── api
│       ├── conftest.py
│       └── fixtures
├── README.md
└── run.sh
This project requires Python 3.8 or higher. You can check your Python version by running:
python --version
Contributions to MarkItLikeItsHot are welcome! Please follow these steps to contribute:
- Fork the repository
- Create a new branch for your feature or bug fix
- Make your changes and commit them with a clear commit message
- Push your changes to your fork
- Create a pull request to the main repository
Please ensure your code adheres to the project's coding standards and includes appropriate tests.
Here are some common issues and their solutions:
- API Key Authentication Issues
  - Ensure the API key is correctly set in the request header
  - Check if API key authentication is enabled in the configuration
- Rate Limiting Errors
  - Check the current rate limit settings in the configuration
  - If you're hitting limits too quickly, consider adjusting the settings or optimizing your requests
- File Conversion Errors
  - Ensure the file format is supported (check the list of supported file types)
  - Verify the file is not corrupted or empty
- Database Connection Issues
  - Check if the database URL is correctly set in the configuration
  - Ensure the database server is running and accessible
For more specific issues, please check the application logs or create an issue on the project's GitHub repository.
This project is licensed under the MIT License. See the LICENSE file for details.