Reynolds Journalism Institute Photo Archive Automated Assessment System
The Reynolds Journalism Institute has accumulated several terabytes of photographs over the years, covering topics ranging from sports to public debate to animals. The RJI archivist has asked for a tool that makes it easier to sort through these images and decide which are worth keeping and which should be discarded. As a follow-up, a program was requested that selects the better-quality photos from a set to show to editors, making their job easier.
Therefore, there are two key goals to this project:
- Remove easy to recognize bad quality images (blurry, black screens, etc.)
- Rank images in similar groupings
To accomplish this, it has been determined that the program be split into three separate components:
- Filter Out the Easy to Recognize "Bad" Images
- Apply histogram equalization
- Apply a Laplacian filter
- Take the variance of the resulting matrix versus a set threshold
- Cluster Remaining Images
- Dimension Reduction Method:
- Resize Image to 720x420
- Apply PCA to take the eigenvectors corresponding to 70% of the variance
- Cluster the transformed images using DBSCAN
- Deep Learning Method:
- Apply ResNet without the final layer
- Take the resulting feature map and cluster using DBSCAN
- Rank Images In Clusters:
- TBD
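The first component (histogram equalization, Laplacian filter, variance vs. a threshold) can be sketched in plain NumPy. The function names and the default threshold of 100 are illustrative assumptions, not the project's actual settings:

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]
    if cdf[-1] == cdf_min:
        return gray  # constant image (e.g. a black frame); nothing to equalize
    # Map intensities so the cumulative distribution becomes roughly uniform.
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255).astype(np.uint8)
    return lut[gray]

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the Laplacian response; low values suggest blur."""
    img = gray.astype(np.float64)
    # Discrete 4-neighbour Laplacian: f(x+1,y)+f(x-1,y)+f(x,y+1)+f(x,y-1)-4f(x,y)
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    return float(lap[1:-1, 1:-1].var())  # drop the wrap-around border rows/cols

def is_bad_image(gray: np.ndarray, threshold: float = 100.0) -> bool:
    """Flag an image as 'bad' when its sharpness score falls below threshold."""
    return laplacian_variance(equalize_histogram(gray)) < threshold
```

A black frame scores a variance of 0 and is filtered out; a sharp, high-texture image scores far above any reasonable threshold. The threshold itself would need tuning on the archive.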
Most of the project runs through Python scripts. The required libraries are listed in requirements.txt. A setup.py will be added in the future.
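The dimension-reduction branch of the clustering component (PCA keeping 70% of the variance, then DBSCAN) can be sketched with scikit-learn. The function name and the `eps`/`min_samples` defaults are illustrative assumptions; images are assumed to arrive already resized and flattened to row vectors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

def cluster_images(images: np.ndarray, eps: float = 5.0,
                   min_samples: int = 2) -> np.ndarray:
    """Cluster flattened images: PCA keeping 70% of variance, then DBSCAN.

    `images` is an (n_images, height * width) float array; eps and
    min_samples are illustrative defaults, not the project's settings.
    """
    # A float n_components in (0, 1) keeps just enough principal
    # components to explain that fraction of the total variance.
    pca = PCA(n_components=0.70, svd_solver="full")
    reduced = pca.fit_transform(images)
    # DBSCAN labels each image with a cluster id; -1 marks noise/outliers.
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reduced)
```

DBSCAN is a reasonable fit here because the number of image groups is not known in advance and it leaves unmatched images as noise (label -1) rather than forcing them into a cluster.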
Needs to be optimized!
EDA is located in these two Jupyter notebooks: 1 and 2. They reveal that there is not much difference between images prelabeled 1 and 7.
Results are stored both as CSVs and in TensorBoard logs.