Commit cda69b0

trying to get partial embedding variables working on distributed architecture
1 parent 5dba6a3 commit cda69b0

File tree

8 files changed, +37 -162 lines

* README.md
* gcp/_cmd.py
* gcp/_config.json
* submit_job.py
* tsaplay/embeddings.py
* tsaplay/experiments.py
* tsaplay/utils/addons.py
* tsaplay/utils/decorators.py

README.md (-124)

@@ -1,127 +1,3 @@
  # Targeted Sentiment Analysis PLAYground
  
  A codebase to bring together different embeddings, datasets and models and efficiently carry out experiments with them.
- 
- ## Getting Started
- 
- ### Pre-requisites
- 
- The project uses:
- 
- * Python 3.6
- * Pipenv
- 
- ### Installing
- 
- Setup is straightforward using `pipenv`; from the project directory, run the following command:
- 
- ````Bash
- pipenv install
- ````
- 
- ### Running Experiments
- 
- Setting up experiments takes place in `main.py`.
- 
- Each `Experiment` object takes the following:
- * An `Embedding`
- * A `Dataset`
- * A `Model`
- * A `RunConfig` (optional)
- 
- Experiments can then be run on an experiment instance using `experiment.run(job, steps)`.
- 
- The available job options are:
- * `'train'`, steps **must** be provided.
- * `'eval'`, the model **must** have been previously trained.
- * `'train+eval'`, steps **must** be provided. **This is the easiest and most straightforward approach.**
- 
- 
- ## Experiment Process
- 
- ### Embedding
- 
- All embedding files need to have a path as follows: `embeddings\data\<name>\<version>.txt`
- 
- The **path** as shown above is passed to the `Embedding` constructor.
- 
- The **name** and **version** parts of the path are assigned to the `Embedding` object internally as identifiers.
- 
- ### Dataset
- 
- Datasets are initialized with a path as follows: `datasets\data\<name>\`
- 
- The **path** and a **parser** are passed to the `Dataset` constructor.
- 
- Internally, the system looks for the first files in that directory with the words `train` and `test` in their names.
- 
- The system then parses these files with the **specified parser** to generate all the required features and labels.
- 
- ### Model
- 
- All models should be defined under `models\<group>\<model>.py`
- 
- The `group` can be anything; for the models implemented thus far, a reference key to the model's original paper is used, e.g. `Tang2016a`.
- 
- All models must inherit from the base `Model` class.
- 
- All models must contain implementations of the following functions:
- 
- * `_params`
- * `_feature_columns`
- * `_train_input_fn`
- * `_eval_input_fn`
- * `_model_fn`
- 
- Each of these functions must return the respective part of the TensorFlow model.
- 
- Everything else is taken care of internally by the system.
- 
- ### Experiment Results
- 
- After running an experiment, results are written to `experiments\data\<group>\<model>\`
- 
- The directory contains some extra statistics files, and markdown of the functions used, for future reference.
- 
- The **Tensorboard logdir** is named `tb_summary\`
- 
- To launch TensorBoard after an experiment, run the following:
- 
- ````Bash
- tensorboard --logdir experiments\data\<group>\<model>\tb_summary
- ````
- 
- Alternatively, the `experiment.run()` method takes an optional `start_tb` boolean parameter.
- 
- If set to true, the process will open the TensorBoard page and start the TensorBoard process automatically.
- 
- The TensorBoard page will fail to load at first, until the process starts; it should reload automatically within a few seconds.
- 
- ## Performance
- 
- Extracting features and parsing the dataset files takes a while.
- 
- To help with this, the `Dataset` class internally generates a number of files during its initial execution and reuses them in future runs.
- 
- All of the generated files are stored under `datasets\data\<name>\_generated`
- 
- The files saved include:
- 
- * `corpus.csv`
-   * A comma-separated file of all unique tokens appearing in the dataset, and their word counts
- * `train_dict.pkl` and `test_dict.pkl`
-   * Pickle binary files of the parsed datasets in dictionary format
-   * These still need to be mapped to the indices of a specific embedding
- * `<embedding_name>\partial_<embedding_version>.txt`
-   * A filtered-down version of an embedding containing only the words that appear in the dataset corpus
-   * This will be loaded instead of the full embedding in future runs
- * `<embedding_name>\projection_meta.tsv`
-   * Can be loaded into the TensorBoard projector tab as labels when viewing an embedding
- * `<embedding_name>\train.pkl` and `<embedding_name>\test.pkl`
-   * These are the actual embedding IDs for the dataset
-   * This process takes hours the first time, but subsequently features are loaded instantly through these files
-   * Naturally, these files remain valid only as long as the partial file remains the same; otherwise the IDs will not reflect the correct words in the embedding
- 
- ## Notes
- 
- I am working on generating the files mentioned above and making them available online, or in the repo itself, so you don't have to go through the parsing process the first time.
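The removed "Running Experiments" section above describes how an `Experiment` is assembled from an `Embedding`, a `Dataset` and a `Model` and then run with `experiment.run(job, steps)`. Purely as an illustration, a hypothetical `main.py` could look like the sketch below; the import paths, constructor signatures, `some_parser` and `SomeModel` are assumptions for the example, not verified against the repo.

```python
# Hypothetical main.py based on the removed README text; import paths,
# constructor signatures, some_parser and SomeModel are illustrative only.
from tsaplay.embeddings import Embedding
from tsaplay.datasets import Dataset
from tsaplay.experiments import Experiment

embedding = Embedding("embeddings/data/<name>/<version>.txt")   # README path convention
dataset = Dataset("datasets/data/<name>/", parser=some_parser)  # a parser is required
model = SomeModel()  # any class inheriting from the base Model class

experiment = Experiment(embedding, dataset, model)

# "train+eval" requires steps; start_tb=True also launches TensorBoard automatically.
experiment.run(job="train+eval", steps=1000, start_tb=True)
```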

gcp/_cmd.py (+2 -2)

@@ -1,8 +1,8 @@
  from os import system
  
  
- system("""gcloud ml-engine jobs submit training testing_large_embeddings_6 \
-     --job-dir=gs://tsaplay-bucket/testing_large_embeddings_6 \
+ system("""gcloud ml-engine jobs submit training testing_2worker_6partial100bs4 \
+     --job-dir=gs://tsaplay-bucket/testing_2worker_6partial100bs4 \
      --module-name=tsaplay.task \
      --staging-bucket=gs://tsaplay-bucket/ \
      --packages=/Users/seanbugeja/Code/Msc/dist/tsaplay-0.1.dev0.tar.gz \

gcp/_config.json (+6 -6)

@@ -6,16 +6,16 @@
    },
    "trainingInput": {
      "scaleTier": "CUSTOM",
-     "masterType": "complex_model_m_gpu",
-     "workerType": "complex_model_m_gpu",
-     "parameterServerType": "complex_model_m_gpu",
-     "workerCount": 1,
-     "parameterServerCount": 1,
+     "masterType": "standard_gpu",
+     "workerType": "standard_gpu",
+     "parameterServerType": "standard_gpu",
+     "workerCount": 2,
+     "parameterServerCount": 6,
      "pythonVersion": "3.5",
      "runtimeVersion": "1.10",
      "region": "europe-west1",
      "args": [
-       "--embedding=wiki-50",
+       "--embedding=commoncrawl-840",
        "--datasets=dong",
        "--model=lcrrot",
        "--batch-size=25",

submit_job.py (+6 -6)

@@ -211,11 +211,11 @@ def write_gcloud_config(args):
          "labels": {"type": "dev", "owner": "sean"},
          "trainingInput": {
              "scaleTier": "CUSTOM",
-             "masterType": "complex_model_m_gpu",
-             "workerType": "complex_model_m_gpu",
-             "parameterServerType": "complex_model_m_gpu",
-             "workerCount": 1,
-             "parameterServerCount": 1,
+             "masterType": "standard_gpu",
+             "workerType": "standard_gpu",
+             "parameterServerType": "standard_gpu",
+             "workerCount": 2,
+             "parameterServerCount": 6,
              "pythonVersion": "3.5",
              "runtimeVersion": "1.10",
              "region": "europe-west1",

@@ -263,7 +263,7 @@ def write_gcloud_cmd_script(args):
  
      if platform() == "MacOS":
          system('echo "{}" | pbcopy'.format(submit_job_cmd))
-         cprnt(bow="Copied to clipboard!")
+         cprnt(bow="Copied to clipboard.")
  
  
  def main(args):

tsaplay/embeddings.py (+4 -8)

@@ -76,23 +76,19 @@ def num_shards(self):
      @property
      def initializer(self):
          shape = (self.vocab_size, self.dim_size)
+         partition_size = int(self.vocab_size / 6)
  
          def _init(shape=shape, dtype=tf.float32, partition_info=None):
              return self.vectors
  
-         # return _init
-         return lambda: self.vectors
- 
-     @property
-     def partitioned_initializer(self):
-         partition_size = int(self.vocab_size / self.num_shards)
-         shape = (self.vocab_size, self.dim_size)
- 
          def _init_part(shape=shape, dtype=tf.float32, partition_info=None):
              part_offset = partition_info.single_offset(shape)
              this_slice = part_offset + partition_size
              return self.vectors[part_offset:this_slice]
  
+         def _init_const():
+             return self.vectors
+ 
          return _init_part
  
      @source.setter
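The `_init_part` closure above relies on TensorFlow calling the initializer once per shard and passing a `partition_info` whose `single_offset` gives that shard's starting row in the full embedding matrix. Below is a self-contained sketch of the same mechanism, with toy sizes and deterministic vectors so the result is checkable; TensorFlow 1.x is assumed and none of the names come from the repo.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

# Toy stand-ins for the real vocabulary and pretrained vectors.
vocab_size, dim_size, num_shards = 12, 4, 3
vectors = np.arange(vocab_size * dim_size, dtype=np.float32).reshape(vocab_size, dim_size)


def _init_part(shape, dtype=tf.float32, partition_info=None):
    # Called once per shard; partition_info gives the shard's row offset
    # into the full (vocab_size, dim_size) matrix.
    offset = partition_info.single_offset(shape)
    return vectors[offset : offset + shape[0]]


embeddings = tf.get_variable(
    "embeddings",
    shape=[vocab_size, dim_size],
    initializer=_init_part,
    partitioner=tf.fixed_size_partitioner(num_shards=num_shards),
    dtype=tf.float32,
    trainable=False,
)

# "div" matches the contiguous row layout the partitioner and initializer produce.
lookup = tf.nn.embedding_lookup(
    params=embeddings, ids=tf.constant([0, 5, 11]), partition_strategy="div"
)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(lookup)[:, 0])  # [0. 20. 44.] -> rows 0, 5 and 11 of `vectors`
```

Slicing with `shape[0]` (the shard's own row count) rather than a precomputed `partition_size` also covers a vocabulary size that does not divide evenly into the shard count.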

tsaplay/experiments.py (+1 -2)

@@ -118,8 +118,7 @@ def _update_export_models_config(self):
  
      def _initialize_model_run_config(self, config_dict):
          default_config = {
-             "model_dir": join(self._experiment_dir),
-             # "model_dir": join(self._experiment_dir, "tb_summary"),
+             "model_dir": join(self._experiment_dir, "tb_summary"),
              # "save_checkpoints_steps": 100,
              "save_summary_steps": 25,
              # "log_step_count_steps": 25,

tsaplay/utils/addons.py (+3 -3)

@@ -168,11 +168,11 @@ def early_stopping(model, features, labels, spec, params):
  @only(["TRAIN"])
  def profiling(model, features, labels, spec, params):
      train_hooks = list(spec.training_hooks) or []
+     timeline_dir = join(model.run_config.model_dir)
+     makedirs(timeline_dir)
      train_hooks += [
          tf.train.ProfilerHook(
-             save_steps=100,
-             output_dir=model.run_config.model_dir,
-             show_memory=True,
+             save_steps=300, output_dir=timeline_dir, show_memory=True
          )
      ]
      return spec._replace(training_hooks=train_hooks)
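For reference, `tf.train.ProfilerHook` writes Chrome-trace timeline files (`timeline-<step>.json`) to its `output_dir` every `save_steps` steps; they can be opened in `chrome://tracing`. A standalone sketch of the same hook, which could equally be passed to `estimator.train(..., hooks=[...])` instead of being spliced into the `EstimatorSpec` as the addon above does (TensorFlow 1.x; the directory is a placeholder):

```python
import tensorflow as tf  # TensorFlow 1.x

# Emits timeline-<step>.json files, viewable in chrome://tracing, every 300 steps.
profiler_hook = tf.train.ProfilerHook(
    save_steps=300,
    output_dir="/tmp/profiling",  # placeholder directory
    show_memory=True,             # include memory allocation info in the trace
)

# e.g. estimator.train(input_fn=train_input_fn, steps=1000, hooks=[profiler_hook])
```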

tsaplay/utils/decorators.py (+15 -11)

@@ -41,30 +41,34 @@ def wrapper(self, features, labels, mode, params):
          embeddings = tf.get_variable(
              "embeddings",
              shape=[vocab_size, dim_size],
-             # initializer=embedding_init,
+             initializer=embedding_init,
+             partitioner=tf.fixed_size_partitioner(num_shards=6),
              trainable=trainable,
              dtype=tf.float32,
          )
  
-         def init_embeddings(sess):
-             value = embedding_init()
-             sess.run(embeddings.initializer, {embeddings.initial_value: value})
- 
          embedded_sequences = {}
          for key, value in features.items():
              if "_ids" in key:
                  component = key.replace("_ids", "")
                  embdd_key = component + "_emb"
-                 embedded_sequence = tf.contrib.layers.embed_sequence(
-                     ids=value,
-                     initializer=embeddings,
-                     scope="embedding_layer",
-                     reuse=True,
+                 # embedded_sequence = tf.contrib.layers.embed_sequence(
+                 #     ids=value,
+                 #     initializer=embeddings,
+                 #     scope="embedding_layer",
+                 #     reuse=True,
+                 # )
+                 embedded_sequence = tf.nn.embedding_lookup(
+                     params=embeddings, ids=value
                  )
                  embedded_sequences[embdd_key] = embedded_sequence
          features.update(embedded_sequences)
          spec = model_fn(self, features, labels, mode, params)
-         spec = scaffold_init_fn_on_spec(spec, init_embeddings)
+ 
+         # def init_embeddings(sess):
+         #     value = embedding_init()
+         #     sess.run(embeddings.initializer, {embeddings.initial_value: value})
+         # spec = scaffold_init_fn_on_spec(spec, init_embeddings)
          return spec
  
      return wrapper
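The wrapper above derives a `<name>_emb` tensor for every `<name>_ids` feature by looking the ids up in the shared, now partitioned, `embeddings` variable. A stripped-down sketch of that logic outside the decorator machinery (names are illustrative; note that when `params` is a partitioned variable, the `partition_strategy` of `tf.nn.embedding_lookup` has to match how the rows were laid out across the shards):

```python
import tensorflow as tf  # TensorFlow 1.x


def attach_embeddings(features, embeddings):
    """Toy version of the wrapper logic above: each '<name>_ids' feature gains
    a matching '<name>_emb' tensor looked up from `embeddings`."""
    embedded = {}
    for key, ids in features.items():
        if "_ids" in key:
            emb_key = key.replace("_ids", "") + "_emb"
            embedded[emb_key] = tf.nn.embedding_lookup(params=embeddings, ids=ids)
    features.update(embedded)
    return features
```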
