# Targeted Sentiment Analysis Playground

A codebase to bring together different embeddings, datasets and models and efficiently carry out experiments with them.

## Getting Started

### Prerequisites

The project uses:

* Python 3.6
* Pipenv

### Installing

Setup is straightforward using `pipenv`; from the project directory, run the following command:

````Bash
pipenv install
````

### Running Experiments

Experiments are set up in `main.py`.

Each `Experiment` object takes the following:
* An `Embedding`
* A `Dataset`
* A `Model`
* A `RunConfig` (optional)

Experiments can then be run on an experiment instance using `experiment.run(job, steps)`; a full setup sketch follows the list of job options below.

The available job options are:
* `'train'`, steps **must** be provided.
* `'eval'`, the model **must** have been previously trained.
* `'train+eval'`, steps **must** be provided. **This is the easiest and most straightforward approach.**

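To make the pieces concrete, here is a rough sketch of what a `main.py` setup could look like. The import paths, the `LstmModel` class and the `semeval_parser` name are placeholders rather than the project's actual identifiers; only the `Embedding`/`Dataset`/`Model` arguments and the `experiment.run(job, steps)` call follow the description above.

````Python
# Hypothetical sketch of a main.py setup -- import paths, the model class
# and the parser are placeholders, not the repository's actual identifiers.
from embeddings import Embedding
from datasets import Dataset, semeval_parser
from models.Tang2016a.lstm import LstmModel
from experiments import Experiment

embedding = Embedding('embeddings/data/glove/glove.6B.100d.txt')
dataset = Dataset('datasets/data/semeval2014/', parser=semeval_parser)
model = LstmModel()

experiment = Experiment(embedding, dataset, model)  # a RunConfig can also be passed
experiment.run('train+eval', steps=1000)
````
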
## Experiment Process

### Embedding

All embedding files need to have a path as follows: `embeddings\data\<name>\<version>.txt`

The **path** as shown above is passed on to the `Embedding` constructor.

The **name** and **version** parts of the path are assigned to the `Embedding` object internally as identifiers.

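For example, with a hypothetical GloVe file (the file and import path below are illustrative assumptions):

````Python
# Hypothetical example -- the import path and embedding file are placeholders.
from embeddings import Embedding

# Path format: embeddings\data\<name>\<version>.txt,
# so this embedding would get name "glove" and version "glove.6B.100d".
embedding = Embedding('embeddings/data/glove/glove.6B.100d.txt')
````
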
### Dataset

Datasets are initialized with a path as follows: `datasets\data\<name>\`

The **path** and a **parser** are passed on to the `Dataset` constructor.

Internally, the system looks for the first files in that directory with the words `train` and `test` in their names.

The system then parses these files with the **specified parser** to generate all the required features and labels.

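As an illustration, a dataset directory might look like the sketch below. The dataset name, file names and parser are hypothetical; only the `path` + `parser` constructor arguments and the `train`/`test` file lookup come from the description above.

````Python
# Hypothetical layout for a dataset named "semeval2014":
#
#   datasets/data/semeval2014/
#       restaurants_train.xml   <- picked up as the train file
#       restaurants_test.xml    <- picked up as the test file
#
# The import path and parser name are placeholders.
from datasets import Dataset, semeval_parser

dataset = Dataset('datasets/data/semeval2014/', parser=semeval_parser)
````
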
### Model

All models should be defined under `models\<group>\<model>.py`

The `group` can be anything; the models implemented so far use a reference key to the original paper of the model, e.g. `Tang2016a`.

All models must inherit from the base `Model` class.

All models must contain implementations for the following functions:

* `_params`
* `_feature_columns`
* `_train_input_fn`
* `_eval_input_fn`
* `_model_fn`

Each of these functions must return the respective part of the TensorFlow model.

Everything else is taken care of internally by the system.

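A minimal skeleton of such a model file might look like the sketch below. The method bodies, signatures and the import path of the base `Model` class are assumptions; only the five required method names come from the list above.

````Python
# models/<group>/<model>.py -- hypothetical skeleton; the import path and
# method signatures are assumptions, only the method names are prescribed.
from models import Model


class MyModel(Model):
    def _params(self):
        """Return the hyperparameters (e.g. learning rate, hidden units)."""
        ...

    def _feature_columns(self):
        """Return the tf.feature_column definitions for the model inputs."""
        ...

    def _train_input_fn(self):
        """Return the input_fn that builds the training data pipeline."""
        ...

    def _eval_input_fn(self):
        """Return the input_fn that builds the evaluation data pipeline."""
        ...

    def _model_fn(self):
        """Return the model_fn defining the network, loss, and train/eval ops."""
        ...
````
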
### Experiment Results

After running an experiment, results are written to `experiments\data\<group>\<model>\`

The directory contains some extra statistics files, as well as markdown files of the functions used, for future reference.

The **TensorBoard logdir** is named `tb_summary\`

To launch TensorBoard after an experiment, run the following:

````Bash
tensorboard --logdir experiments\data\<group>\<model>\tb_summary
````

Alternatively, the `experiment.run()` method takes an optional `start_tb` boolean parameter.

If set to true, the process will open the TensorBoard page and start the TensorBoard process automatically.

The TensorBoard page will fail to load at first until the process starts; it should reload automatically within a few seconds.

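For example (the job and step count are arbitrary; only the `start_tb` flag comes from the description above):

````Python
# Starts TensorBoard alongside the run and opens its page automatically.
experiment.run('train+eval', steps=1000, start_tb=True)
````
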
## Performance

Extracting features and parsing the dataset files takes a while.

To help with this, the `Dataset` class internally generates a number of files during its initial execution and reuses them in future runs.

All of the generated files are stored under `datasets\data\<name>\_generated`

The files saved include:

* `corpus.csv`
    * A comma-separated file of all unique tokens appearing in the dataset, along with their word counts
* `train_dict.pkl` and `test_dict.pkl`
    * Pickle binary files of the parsed datasets in dictionary format (a quick inspection sketch follows this list)
    * These still need to be mapped to the indices of a specific embedding
* `<embedding_name>\partial_<embedding_version>.txt`
    * A filtered-down version of an embedding containing only the words that appear in the dataset corpus
    * This will be loaded instead of the full embedding in future runs
* `<embedding_name>\projection_meta.tsv`
    * Can be loaded into the TensorBoard projector tab as labels when viewing an embedding
* `<embedding_name>\train.pkl` and `<embedding_name>\test.pkl`
    * These are the actual embedding IDs for the dataset
    * This process takes hours the first time, but subsequently features are loaded almost instantly from these files
    * Naturally, these files remain valid only as long as the partial embedding file remains the same; otherwise the IDs will not reflect the correct words in the embedding
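If you want to sanity-check the cached dictionaries, they are plain pickle files and can be opened directly; the path below is illustrative (it assumes a dataset named `semeval2014`), not fixed by the project.

````Python
import pickle

# Hypothetical path -- substitute your own dataset name.
with open('datasets/data/semeval2014/_generated/train_dict.pkl', 'rb') as f:
    train_dict = pickle.load(f)

# Inspect which features/labels the parser produced.
print(train_dict.keys())
````
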

## Notes

I am working on generating the files mentioned above and making them available online, or in the repo itself, so you don't have to go through the parsing process the first time.