- Speaker: Philipp Zumstein
- Venue: Social Science Data Lab, MZES, Mannheim
- Date: March 15th, 2017, at 12 noon
- Location: MZES, A-231
Most methods for data-driven research (including Big Data, Data Science, and Digital Humanities) work primarily on text or numerical data. However, a lot of information is available only in printed books or newspapers. Such material first has to be digitized and then processed further to extract the text or data. The talk focuses on optical character recognition (OCR): we look at the general OCR workflow, discuss some OCR software, and show how these tools can be used in practice. Building such an infrastructure, or even just performing these initial steps, can take considerable time and resources and may be a project in its own right. The Mannheim University Library runs several infrastructure projects in this area, which are briefly mentioned.
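As a rough illustration of the recognition step in such a workflow, the following Python sketch calls the open-source Tesseract command-line tool on a single scanned page. Tesseract is picked here only as one example of an OCR engine. The function names, the file paths, and the choice of German (`deu`) with hOCR output are assumptions for illustration and are not taken from the talk.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="deu", output_format="hocr"):
    """Return the argument list for one Tesseract run on a single page image.

    Hypothetical helper; it follows the CLI pattern
    `tesseract IMAGE OUTPUTBASE -l LANG CONFIGFILE`.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, output_format]


def ocr_page(image, output_dir, lang="deu"):
    """Run OCR on one scanned page.

    Returns the path of the hOCR file, or None if Tesseract is not installed.
    """
    if shutil.which("tesseract") is None:
        return None
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_base = output_dir / Path(image).stem
    # check=True raises CalledProcessError if Tesseract reports a failure.
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Tesseract appends the extension of the chosen output format itself.
    return Path(str(output_base) + ".hocr")
```

A call such as `ocr_page("scans/page_001.png", "ocr-output")` would cover only the recognition step. Scanning, image preprocessing, and correction of the recognized text are separate steps of the workflow and are not shown.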
- optical character recognition (OCR)
- infrastructure for research
- data-driven research
- View the HTML presentation online:
- on GitHub pages: https://socialsciencedatalab.github.io/building-infrastructure-for-data-driven-research/
- on slides.com: http://slides.com/zuphilip/ssdl-2017/#/
- View and download the PDF:
- on speakerdeck: https://speakerdeck.com/zuphilip/building-infrastructure-for-data-driven-research
- direct: /pdf/ssdl2017.pdf
- Source files:
- /docs/index.html content (CC-BY)
- /docs/css/theme/white-blue-montserrat.css layout info
- all other files in /docs from reveal.js (MIT)
- OCR Software
- OCR in general
- Links to awesome OCR projects: https://github.com/kba/awesome-ocr
- Collected list of publications on OCR by @OCR-D-project: https://www.zotero.org/groups/ocr-d/items/
- Some of our projects
Feel free to ask questions here as well by opening a new issue, and we can continue the discussion there.