Texygen Tutorial

How to run?

#usage:
python main.py -g <GAN type> -t <training method> -d <data location>
  -g <GAN type>  
    specify the GAN type in the experiment
    <GAN type> = seqgan | maligan | rankgan | leakgan | gsgan | textgan | mle
  -t <training method>
    specify the training method in the experiment
    <training method> = oracle | cfg | real
    default is oracle
  -d <data location>
    use user's own dataset
    only available with real data training
    default is 'data/image_coco.txt'
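
The option rules above can be captured in a small helper that assembles the argument list before launching a run. This is a minimal sketch and not part of Texygen: `build_command` is a hypothetical name, and it only encodes the constraints stated in the usage text (the listed values for `-g` and `-t`, and `-d` being valid only with `-t real`).

```python
# Values copied from the usage text above.
GAN_TYPES = {"seqgan", "maligan", "rankgan", "leakgan", "gsgan", "textgan", "mle"}
TRAINING_METHODS = {"oracle", "cfg", "real"}


def build_command(gan_type, training="oracle", data=None):
    """Return the argv list for one main.py run (hypothetical helper)."""
    if gan_type not in GAN_TYPES:
        raise ValueError(f"unknown GAN type: {gan_type!r}")
    if training not in TRAINING_METHODS:
        raise ValueError(f"unknown training method: {training!r}")
    # The usage text says -d is only available with real data training.
    if data is not None and training != "real":
        raise ValueError("-d is only available with -t real")
    cmd = ["python", "main.py", "-g", gan_type, "-t", training]
    if data is not None:
        cmd += ["-d", data]
    return cmd
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root, for example to run several GAN types in a loop.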

Quick tutorial

The basic usage is

python main.py -g <GAN type>

We currently support six kinds of GAN (seqgan, leakgan, maligan, rankgan, textgan and gsgan), along with pure MLE training (mle).

This command initializes an oracle LSTM and trains the model to fit the sequences generated by the oracle.

Use your own data

Besides the oracle LSTM, you can train the models on your own natural language dataset.

python main.py -g <GAN type> -t real -d <your dataset location>

Or just use the default dataset (we provide the image COCO language dataset):

python main.py -g <GAN type> -t real 
  • Note that to use Chinese text as training data, you need to segment it into characters first, as in data/shi.txt. The chinese_process function in utils/text_process is provided for this.
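
Character segmentation here means putting a space between every character so that each one becomes its own token. The sketch below illustrates the idea only; `segment_characters` is a hypothetical helper and is not the `chinese_process` function, whose exact behavior may differ.

```python
def segment_characters(line):
    """Separate every non-whitespace character with a single space."""
    return " ".join(ch for ch in line if not ch.isspace())


print(segment_characters("床前明月光"))  # 床 前 明 月 光
```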

The maximum length of generated samples is determined by the maximum sentence length in the training set.
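
To estimate that limit for a dataset before training, you can count tokens per line. This is a rough sketch that assumes one sentence per line and whitespace tokenization; Texygen's own tokenizer may split text differently, so treat the result as an approximation. `max_sentence_length` is a hypothetical name.

```python
def max_sentence_length(lines):
    """Return the largest whitespace-token count over an iterable of sentences."""
    return max((len(line.split()) for line in lines), default=0)


# Example with an in-memory sample; pass an open file object to scan a dataset.
sample = ["a man riding a wave", "two dogs play", ""]
print(max_sentence_length(sample))  # 5
```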

Context Free Grammar (CFG) training

Texygen also allows the model to learn a given context-free grammar.

python main.py -g <GAN type> -t cfg

Texygen currently uses the following CFG rules, with a maximum depth of 7.

cfg_grammar = """
 S -> S PLUS x | S SUB x |  S PROD x | S DIV x | x | '(' S ')'
 PLUS -> '+'
 SUB -> '-'
 PROD -> '*'
 DIV -> '/'
 x -> 'x' | 'y'
"""
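
To make the grammar concrete, here is a small recognizer that checks whether a space-separated token string can be derived from S. It is an illustrative sketch rather than Texygen's evaluation code: the function names are hypothetical, the left-recursive rule `S -> S OP x` is rewritten as a loop, and the depth limit of 7 is not enforced.

```python
OPERATORS = {"+", "-", "*", "/"}
VARIABLES = {"x", "y"}


def matches_grammar(sentence):
    """Check whether a space-separated token string is derivable from S."""
    tokens = sentence.split()
    return _parse_s(tokens, 0) == len(tokens)


def _parse_s(tokens, pos):
    """Parse S starting at pos; return the position after it, or -1 on failure."""
    # Base case: S -> x  or  S -> '(' S ')'
    if pos < len(tokens) and tokens[pos] in VARIABLES:
        pos += 1
    elif pos < len(tokens) and tokens[pos] == "(":
        pos = _parse_s(tokens, pos + 1)
        if pos == -1 or pos >= len(tokens) or tokens[pos] != ")":
            return -1
        pos += 1
    else:
        return -1
    # Left-recursive case: S -> S OP x, handled as repeated "OP x" pairs.
    while (pos + 1 < len(tokens)
           and tokens[pos] in OPERATORS
           and tokens[pos + 1] in VARIABLES):
        pos += 2
    return pos


print(matches_grammar("( x * y ) / x"))  # True
print(matches_grammar("x + ( y )"))      # False
```

The second example is rejected because the right operand of every operator must be a bare `x` or `y`; parentheses can only wrap the leftmost operand.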

Where to find the experiment results

The results are saved as experiment-XXgan.csv, a comma-separated values file whose first column is the training epoch and whose remaining columns are the metric scores at each epoch.

If you train the models on real-world data, the generated text is saved to save/test_file.txt

The log is also printed to the console.
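
A results file with this layout can be loaded with the standard `csv` module. The sketch below makes two assumptions that are not confirmed by this tutorial: a first row that is not numeric is treated as a header, and empty cells (for example from a trailing comma) are ignored. `read_scores` is a hypothetical name.

```python
import csv


def read_scores(lines):
    """Parse result rows into (header, rows); each row is a list of floats.

    The first value of each row is the epoch, the rest are metric scores.
    """
    header, rows = None, []
    for record in csv.reader(lines):
        cells = [c.strip() for c in record if c.strip()]  # drop empty cells
        if not cells:
            continue
        try:
            rows.append([float(c) for c in cells])
        except ValueError:
            # Only the very first non-empty row may be a non-numeric header.
            if header is None and not rows:
                header = cells
            else:
                raise
    return header, rows
```

`read_scores` accepts any iterable of lines, so an open file object (opened with `newline=""`) can be passed directly.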

Examples

# run MLE training on oracle LSTM
python main.py 
# run seqGAN training on oracle LSTM
python main.py -g seqgan
# run leakGAN training on image_coco
python main.py -g leakgan -t real
# run maliGAN training on a custom dataset
python main.py -g maligan -t real -d data/my_own_data
# run textGAN training on CFG
python main.py -g textgan -t cfg