A toolkit for automatically collecting captchas from Melon Ticket - Global and running simple recognition on them. It has two main modules: captcha collection and captcha recognition.
├── collect_captcha.py # Captcha collection script
├── quick_ocr.py # Captcha recognition script
├── requirements.txt # Project dependencies
├── saved_captcha/ # Directory for saving captcha images
└── processed_captcha/ # Directory for processed captcha images
pip install -r requirements.txt
- Visit the Tesseract-OCR download page
- Download and install Tesseract-OCR (Windows users should choose the Windows version)
- Default installation path:
C:\Program Files\Tesseract-OCR\
- Remember the installation path as it will be needed later
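As a quick check that the installation path is usable, here is a minimal sketch that looks for the Tesseract executable. The helper name `find_tesseract` and the `tesseract.exe` file name under the default folder are assumptions for illustration; they are not taken from quick_ocr.py.

```python
import os
import shutil

# Default Windows install folder from the steps above, plus the assumed
# executable name; adjust if Tesseract was installed elsewhere.
DEFAULT_TESSERACT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(candidates=(DEFAULT_TESSERACT,)):
    """Return the first usable Tesseract executable path, or None.

    Explicit candidate paths are checked first, then the system PATH.
    """
    for path in candidates:
        if os.path.isfile(path):
            return path
    # shutil.which returns None when nothing named "tesseract" is on PATH.
    return shutil.which("tesseract")


if __name__ == "__main__":
    found = find_tesseract()
    print(found or "Tesseract-OCR not found; check the installation path")
```

If the function returns None, the path noted during installation is the first thing to verify.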
Start Chrome with remote debugging enabled by running the following command in a terminal:
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="C:/Users/your_username/AppData/Local/Google/Chrome/User Data"
Note:
- Replace "your_username" with your actual Windows username
- If Chrome is installed in a different location, modify the path accordingly
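To confirm that Chrome is actually listening on port 9222 before running the scripts, a small check like the following may help. It queries the `/json/version` endpoint that Chrome exposes when remote debugging is enabled. The function name `chrome_debugger_info` is hypothetical, and this check is not claimed to be part of collect_captcha.py.

```python
import json
import urllib.error
import urllib.request


def chrome_debugger_info(port=9222, timeout=2.0):
    """Return Chrome's version info dict if the debug port answers, else None."""
    url = f"http://127.0.0.1:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: not a usable
        # Chrome debugging endpoint.
        return None


if __name__ == "__main__":
    info = chrome_debugger_info()
    if info is None:
        print("Cannot reach Chrome on port 9222; relaunch it with the command above")
    else:
        print("Connected to", info.get("Browser", "unknown browser"))
```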
In the Chrome window that opens, go to the Melon Ticket - Global website, log in, and navigate to the ticket purchase popup page. Make sure the captcha image is displayed correctly before continuing.
- Run the collection script:
python collect_captcha.py
- The script will automatically:
  - Connect to the opened Chrome browser
  - Create the saved_captcha folder (if it doesn't exist)
  - Wait for the user to press Enter to start collection
  - Collect 200 captcha images
  - Save the images in the saved_captcha folder
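The steps listed above can be sketched as a simple loop. This is an illustrative outline under stated assumptions, not the code in collect_captcha.py: `grab_image` and `click_refresh` are hypothetical callables that a browser automation layer would provide, and the `captcha_001.png` naming scheme is invented for the example.

```python
import os


def collect_captchas(grab_image, click_refresh, out_dir="saved_captcha", count=200):
    """Save `count` captcha images into `out_dir` and return their paths.

    grab_image and click_refresh are hypothetical callables supplied by the
    caller: the first returns the current captcha as PNG bytes, the second
    asks the page for a new captcha.
    """
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    saved = []
    for index in range(1, count + 1):
        # Assumed naming scheme: captcha_001.png, captcha_002.png, ...
        path = os.path.join(out_dir, f"captcha_{index:03d}.png")
        with open(path, "wb") as handle:
            handle.write(grab_image())
        saved.append(path)
        if index < count:
            click_refresh()  # load a fresh captcha before the next capture
    return saved
```

A real run would call something like `input("Press Enter to start collection...")` before the loop and pass callables backed by the Chrome session on port 9222. Keeping the browser calls behind two functions makes the loop easy to try out with stand-ins.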
- Run the recognition script:
python quick_ocr.py
- The script will automatically:
  - Check the necessary directories and the Tesseract installation
  - Read images from saved_captcha
  - Preprocess and perform OCR on each image
  - Save the original images to the processed_captcha folder with the recognition results as filenames
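The last step, saving each original image under its recognized text, could look roughly like this. It is a sketch with assumed details: the helper name `save_with_result`, the letters-and-digits filter, the `unrecognized` fallback, and the `_1`, `_2` collision suffixes are all illustrative choices, not behavior confirmed for quick_ocr.py.

```python
import os
import re
import shutil


def save_with_result(src_path, text, out_dir="processed_captcha"):
    """Copy src_path into out_dir, named after the recognized text."""
    os.makedirs(out_dir, exist_ok=True)
    # Keep only letters and digits so the OCR output is a safe filename.
    stem = re.sub(r"[^A-Za-z0-9]", "", text) or "unrecognized"
    ext = os.path.splitext(src_path)[1]
    target = os.path.join(out_dir, stem + ext)
    suffix = 1
    # Two captchas may be read as the same text; avoid overwriting.
    while os.path.exists(target):
        target = os.path.join(out_dir, f"{stem}_{suffix}{ext}")
        suffix += 1
    shutil.copy(src_path, target)
    return target
```

OCR output often carries stray whitespace or punctuation, so cleaning the text before using it as a filename avoids invalid paths on Windows.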
- Function (collect_captcha.py): Automatically collect captcha images
- Workflow:
  - Connect to the opened Chrome browser
  - Wait for user confirmation to start collection
  - Loop to collect captcha images
  - Click the refresh button to load a new captcha after each capture
  - Save the images to the saved_captcha folder
- Function (quick_ocr.py): Recognize captcha image content
- Workflow:
  - Read images from saved_captcha
  - Preprocess the images (add a white background, invert colors, etc.)
  - Use Tesseract-OCR for text recognition
  - Save the original images to the processed_captcha folder with the recognition results as filenames
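To make the two named preprocessing steps concrete, the following sketch shows the per-pixel arithmetic for placing a transparent image on a white background and then inverting it. It works on plain RGBA tuples for illustration only; the function names are hypothetical, and the actual script presumably applies equivalent operations to whole images through an imaging library, with parameters that may differ.

```python
def flatten_on_white(pixel):
    """Composite one RGBA pixel over a white background, returning RGB."""
    r, g, b, a = pixel
    # Standard alpha blend against white (255), rounded to nearest integer.
    return tuple((c * a + 255 * (255 - a) + 127) // 255 for c in (r, g, b))


def invert(rgb):
    """Invert an RGB pixel so dark becomes light and vice versa."""
    return tuple(255 - c for c in rgb)


def preprocess(pixels):
    """Apply both steps, in order, to a list of RGBA pixels."""
    return [invert(flatten_on_white(p)) for p in pixels]


if __name__ == "__main__":
    # Fully transparent, opaque black, and opaque white pixels.
    sample = [(0, 0, 0, 0), (0, 0, 0, 255), (255, 255, 255, 255)]
    print(preprocess(sample))
    # → [(0, 0, 0), (255, 255, 255), (0, 0, 0)]
```

Flattening first matters: without a background, transparent regions have no defined color for the inversion or for the OCR engine to read.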
Environment Requirements:
- Python 3.6 or higher
- Chrome browser
- Windows operating system (other systems need path modifications)
Common Issues:
- If "Tesseract-OCR not found" appears, check that the installation path is correct
- If the script cannot connect to Chrome, verify that the browser was launched with the remote debugging command above
- If the recognition rate is unsatisfactory, the image preprocessing parameters may need adjustment
Folder Description:
- saved_captcha: Stores original captcha images
- processed_captcha: Stores copies of the original images after recognition, named with the recognized text
- temp: Temporary folder (automatically created and deleted during execution)
This tool is for educational and research purposes only. Users should comply with the website's terms of service and relevant laws and regulations. The authors are not responsible for any misuse or potential consequences of using this tool.
For issues or suggestions, please submit an Issue.