
MSU-Libraries/transcriptinator

 
 


Introduction

This is a module that works with audiogrep to create timestamped transcripts of audio files. It is very experimental and in the early stages of development.

Installation

This script assumes you're getting a transcript from audiogrep, so you need to install that first.

Audiogrep requires pip, ffmpeg, and pocketsphinx. DISCLAIMER: the audiogrep installation instructions generally assume you're using a Mac.

You'll need Python 3 and lxml to run this script (lxml install instructions are in the next paragraph). If Python 2.x is installed on your machine, you can install Python 3 to run alongside it. The steps vary by operating system, so search for instructions that match yours and use your judgement to decide which to follow.

To use this script when 2.x is your primary version of Python, use the command python3 rather than python. Likewise, to install lxml, try typing python3 -m pip install lxml; this should install lxml so that it works with Python 3.
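As a quick sanity check before going further, a minimal sketch like the one below reports whether the interpreter is Python 3 and whether lxml can be imported. The function name check_environment is made up for illustration and is not part of this repository.

```python
import sys


def check_environment():
    """Return a list of problems found; an empty list means the basics are in place."""
    problems = []
    if sys.version_info[0] < 3:
        problems.append("Python 3 is required; run this with python3")
        return problems
    # importlib.util is imported here because it only exists on Python 3
    import importlib.util
    # find_spec returns None when the package is not installed for this interpreter
    if importlib.util.find_spec("lxml") is None:
        problems.append("lxml is missing; try: python3 -m pip install lxml")
    return problems
```

Calling check_environment() with python3 and printing the result should show an empty list once both requirements are met.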

You should be all set to create transcripts at this point.

Creating Transcripts

Once you have audiogrep and its dependencies installed, you'll follow several steps to create plain text of the transcript, as well as some structured XML that the Islandora Oral Histories module can use.

  • Assuming you don't have transcript files yet, use audiogrep to create the initial transcript files (per the audiogrep readme):
audiogrep --input path/to/*.mp3 --transcribe

This will transcribe all the mp3 files in a given directory. Budget some time for this: each transcript takes less time than the recording itself runs, but can take up to half as long. If you have lots of audio files, expect this step to be lengthy.

  • To transcribe one file at a time, use cd to get to the folder containing the audio files, and then:
audiogrep --input filename.mp3 --transcribe

where "filename" is the name of the file you want to transcribe.

  • Download make_transcripts and then copy the folder with your mp3s into make_transcripts/transcription
  • Open transcript_parsing.py in your preferred text editor and change the path variable (on line 13) that says "path/to/your/mp3s" to the actual filepath you'll be using
  • Navigate to transcription using the command line and then type `python3 transcript_parsing.py`
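If you would rather script the audiogrep step than type it per file, the sketch below loops over a folder and runs the same audiogrep command shown above, one mp3 at a time. It is only an illustration under assumptions: build_command and transcribe_folder are hypothetical names, not part of this repository or of audiogrep, and the sketch assumes audiogrep is on your PATH.

```python
import subprocess
from pathlib import Path


def build_command(mp3_path):
    """Build the audiogrep call for one file, matching the command shown above."""
    return ["audiogrep", "--input", str(mp3_path), "--transcribe"]


def transcribe_folder(folder, run=subprocess.run):
    """Run audiogrep on each .mp3 in folder, one at a time; return the files attempted."""
    attempted = []
    for mp3_path in sorted(Path(folder).glob("*.mp3")):
        # check=False: a failure on one file does not stop the rest of the batch
        result = run(build_command(mp3_path), check=False)
        if result.returncode != 0:
            print(f"audiogrep failed on {mp3_path.name}")
        attempted.append(mp3_path)
    return attempted
```

For example, transcribe_folder("path/to/your/mp3s") would work through that folder in alphabetical order. Running files one at a time makes it easier to see which recording caused a problem; the run parameter exists only so the loop can be tried out without audiogrep installed.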

You should get derivative folders for each audio recording, containing:

  • a copy of the mp3 audio file
  • a copy of the transcription.txt file generated by audiogrep
  • a derivative .txt file with timestamp information (no chunks of text)
  • a derivative .txt file with just text
  • a derivative .xml file with timestamp information
  • a structured XML document (filename_transcript.xml) for use with the Islandora Oral Histories module
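To confirm the run produced usable output, a rough sketch like this one looks for files ending in _transcript.xml under a folder and checks that each parses as XML. The function name check_transcripts is hypothetical, and the sketch uses the standard library parser rather than lxml so it has no extra dependencies. It only tests that the XML is well formed, not that it matches what the Islandora Oral Histories module expects.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_transcripts(folder):
    """Map each *_transcript.xml found under folder to True if it parses, else False."""
    results = {}
    # rglob searches the folder and all of its subfolders
    for xml_path in sorted(Path(folder).rglob("*_transcript.xml")):
        try:
            ET.parse(str(xml_path))
            results[xml_path] = True
        except ET.ParseError:
            results[xml_path] = False
    return results
```

Any path mapped to False is worth opening by hand before uploading it anywhere.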

Use with Islandora

The XML transcript output is intended to fit the Islandora Oral Histories module. The Oral Histories module has the following dependencies (per their GitHub page):
