This is a module that works with audiogrep to create timestamped transcripts of audio files. It is very experimental and in the early stages of development.
This script also assumes you're getting a transcript from audiogrep, so first you need to install that.
Audiogrep requires pip, ffmpeg, and pocketsphinx. DISCLAIMER: the audiogrep installation instructions generally assume you're using a Mac.
You'll need Python 3 and lxml to run this script (see lxml install instructions in the next paragraph). If you have Python 2.x installed on your machine, you can also install Python 3 to run alongside it; there are lots of instructions on the internet about how to do this. Because instructions vary depending on what OS you're using, use Google and your judgement to decide which instructions to use.
To use this script (assuming 2.x is your primary version of Python), use the command `python3` rather than `python`. Likewise, to install lxml, try typing `python3 -m pip install lxml`; this should install lxml so that it works with Python 3.
You should be all set to create transcripts at this point.
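As a quick sanity check on the prerequisites described above, something like the following sketch may help. It is not part of make_transcripts; the function name `check_prerequisites` is made up for illustration, and it only checks that the tools can be found, not that they work. It does not check pocketsphinx, since audiogrep may be installed under a different Python than the one running this check.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True (found) or False.

    Hypothetical helper for illustration; not part of make_transcripts.
    """
    return {
        # The script itself needs Python 3.
        "python3": sys.version_info >= (3, 0),
        # lxml must be importable by the interpreter running this check.
        "lxml": importlib.util.find_spec("lxml") is not None,
        # ffmpeg and audiogrep are looked up on the PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "audiogrep": shutil.which("audiogrep") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Run it with `python3` so that the lxml check applies to the same interpreter you will use for the script.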
Once you have audiogrep and its dependencies installed, you'll follow several steps to create plain text of the transcript, as well as some structured XML that the Islandora Oral Histories module can use.
- Assuming you don't have transcript files yet, use audiogrep to create the initial transcript files (per the audiogrep readme):
audiogrep --input path/to/*.mp3 --transcribe
This will transcribe all the mp3 files in the given directory. Budget some time for this; each transcription should take less time than the running time of the recording, but could take up to half of it. In general, if you have lots of audio files, expect this step to be lengthy.
- To transcribe one file at a time, use `cd` to get to the folder containing the audio files, and then run:
audiogrep --input filename.mp3 --transcribe
where "filename" is the name of the file you want to transcribe.
- Download make_transcripts and then copy the folder with your mp3s into `make_transcripts/transcription`.
- Open `transcript_parsing.py` in your preferred text editor and change the path variable (on line 13) that says "path/to/your/mp3s" to the actual filepath you'll be using.
- Navigate to `transcription` using the command line and then type `python3 transcript_parsing.py`.
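If you would rather transcribe files one at a time without typing each command by hand, the audiogrep step above could be scripted along these lines. This is only a sketch: `build_transcribe_commands` and `transcribe_folder` are hypothetical names, not part of audiogrep or make_transcripts, and the only audiogrep flags used are the `--input` and `--transcribe` flags shown above.

```python
import subprocess
from pathlib import Path


def build_transcribe_commands(folder):
    """Build one audiogrep command per .mp3 file in `folder`, sorted by name.

    Hypothetical helper; it only assembles the commands and runs nothing.
    """
    return [
        ["audiogrep", "--input", str(mp3), "--transcribe"]
        for mp3 in sorted(Path(folder).glob("*.mp3"))
    ]


def transcribe_folder(folder):
    """Run the commands one at a time, stopping at the first failure."""
    for cmd in build_transcribe_commands(folder):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if audiogrep exits with an error.
        subprocess.run(cmd, check=True)
```

Calling `transcribe_folder("path/to/your/mp3s")` would then process each recording in turn, which makes it easier to see which file a failure belongs to.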
You should get derivative folders for each audio recording, containing:
- a copy of the mp3 audio file
- a copy of the transcription.txt file generated by audiogrep
- a derivative .txt file with timestamp information (no chunks of text)
- a derivative .txt file with just text
- a derivative .xml file with timestamp information
- a structured XML document (filename_transcript.xml) for use with the Islandora Oral Histories module
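To confirm that each derivative folder contains the files listed above, a small check like this one could be used. It is a sketch under assumptions: `summarize_derivatives` is a hypothetical helper, and it assumes the derivative folders are subfolders of whatever directory you pass in. It only counts files by extension, because the exact derivative filenames are not spelled out here.

```python
from collections import Counter
from pathlib import Path


def summarize_derivatives(root):
    """Map each subfolder name under `root` to a Counter of file extensions.

    Hypothetical helper; assumes one subfolder per audio recording.
    """
    summary = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            summary[folder.name] = Counter(
                f.suffix.lower() for f in folder.iterdir() if f.is_file()
            )
    return summary


if __name__ == "__main__":
    # Placeholder path; point this at the folder holding the derivative folders.
    for name, counts in summarize_derivatives(".").items():
        print(name, dict(counts))
```

Going by the list above, a complete folder would be expected to show one .mp3 file, three .txt files, and two .xml files.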
The XML transcript output is intended to fit the Islandora Oral Histories module, which has the following dependencies (per its GitHub page):