First, watch the video animation Kanta-asiakkuuden jäljet (Finnish, roughly "Traces of a loyalty card membership")!
S-group: Fill in a form at a customer service desk.
K-group: Fill in a paper form and mail it to X (sorry, I forgot the details and am still searching for the link).
Both S and K will send you your data on paper by mail. This makes processing the data much harder, but not impossible. Hopefully the data will be provided in a convenient machine-readable format in the future.
Some news about the data here and here
In short, you need to:
- Scan your data
- Use an optical character recognition (OCR) tool to convert the scanned data into a machine-readable format
- Process and analyse the converted data
Here's an example workflow that worked for me:
- Scan your data into a PDF
- Use Tesseract for OCR
- Use R for processing and analysing the data
- R script with a lot of different processing stages
- More details at the end of this page!
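The OCR step of the workflow above can be scripted. The following is a minimal Python sketch, not the R script used here. It assumes the scanned PDF has already been split and converted into one image file per page (Tesseract does not read PDFs directly), that the `tesseract` binary is on the PATH, and that the Finnish language pack (`fin`) is installed. The function names `tesseract_command` and `ocr_pages` are hypothetical helpers made up for this illustration.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="fin"):
    """Build the Tesseract CLI call: tesseract <image> <output base> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pages(image_paths, out_dir, lang="fin"):
    """Run Tesseract on each page image and return the paths of the text files."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    text_files = []
    for image_path in image_paths:
        # Name each output after its page image, e.g. page1.tif -> page1.txt
        output_base = out_dir / Path(image_path).stem
        subprocess.run(tesseract_command(image_path, output_base, lang), check=True)
        # Tesseract appends ".txt" to the output base by itself.
        text_files.append(Path(str(output_base) + ".txt"))
    return text_files
```

Calling `ocr_pages(["page1.tif", "page2.tif"], "ocr_out")` would then leave one text file per page in `ocr_out`, ready to be read into R or any other tool for the processing stage.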
See the video animation Kanta-asiakkuuden jäljet!
Here are also some visualizations of the data:
Some tips and details on installing and using the tools on OS X 10.8.5.
Installation
- Useful instructions here
- Additionally, the Finnish language pack is needed
- PDFtk is also needed
- This blog post was helpful for getting started with Tesseract
Running OCR
- If the data is given as a table with borders, OCR will be in trouble. Tesseract might have an option for handling this, but I didn't find one. I ended up removing the horizontal lines in R, which was not trivial because the lines were not exactly horizontal but slightly tilted.
- It would also have been useful to add custom vocabulary such as "supermarket", but I did not get this to work with Tesseract (some hints here)
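One way to deal with the slightly tilted table borders mentioned above is to look at horizontal runs of dark pixels instead of whole rows. A tilted line shows up as a staircase of long runs on neighbouring rows, while text strokes produce only short runs. The following is a minimal Python sketch of that idea, not the R code used here. It assumes the page has already been binarised into a list of rows with 1 for dark and 0 for white; the name `remove_long_runs` and the threshold `min_run` are hypothetical and would need tuning for a real scan.

```python
def remove_long_runs(image, min_run):
    """Whiten every horizontal run of dark pixels at least min_run long.

    image is a list of rows; each pixel is 1 (dark) or 0 (white).
    Returns a new image and leaves the input untouched.
    """
    cleaned = [row[:] for row in image]
    for row in cleaned:
        start = None
        # Append a white sentinel so a run touching the right edge is closed.
        for x, pixel in enumerate(row + [0]):
            if pixel and start is None:
                start = x
            elif not pixel and start is not None:
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)
                start = None
    return cleaned


# Toy page: rows 1 and 2 together form one tilted border line,
# rows 0 and 3 contain short, text-like strokes.
image = [
    [1, 0, 1, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0, 1, 0],
]
cleaned = remove_long_runs(image, min_run=4)
# Rows 1 and 2 are now all white; rows 0 and 3 are unchanged.
```

The trade-off is that characters lose a few pixels wherever a border touches them, and very wide strokes in large fonts could be removed as well, so the threshold should be clearly longer than the widest character stroke.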
- Used R package playitbyr
- Needs Csound, installation instructions here
- Note! playitbyr does not work with Csound 6, so install version 5 instead!