Commit cda69b0

trying to get partial embedding variables working on distributed architecture
1 parent 5dba6a3 commit cda69b0

File tree

8 files changed, +37 -162 lines

* README.md
* gcp/_cmd.py
* gcp/_config.json
* submit_job.py
* tsaplay/embeddings.py
* tsaplay/experiments.py
* tsaplay/utils/addons.py
* tsaplay/utils/decorators.py

README.md (-124)

@@ -1,127 +1,3 @@
  # Targeted Sentiment Analysis PLAYground
  
  A codebase to bring together different embeddings, datasets and models and efficiently carry out experiments with them.
- 
- ## Getting Started
- 
- ### Pre-requisites
- 
- The project uses:
- 
- * Python 3.6
- * Pipenv
- 
- ### Installing
- 
- Setup is straightforward using `pipenv`; from the project directory, run the following command:
- 
- ````Bash
- pipenv install
- ````
- 
- ### Running Experiments
- 
- Setting up experiments takes place in `main.py`.
- 
- Each `Experiment` object takes the following:
- * An `Embedding`
- * A `Dataset`
- * A `Model`
- * A `RunConfig` (optional)
- 
- Experiments can then be run on an experiment instance using `experiment.run(job, steps)`.
- 
- The available job options are:
- * `'train'`, steps **must** be provided.
- * `'eval'`, the model **must** have been previously trained.
- * `'train+eval'`, steps **must** be provided. **This is the easiest and most straightforward approach.**
- 
- 
- ## Experiment Process
- 
- ### Embedding
- 
- All embedding files need to have a path as follows: `embeddings\data\<name>\<version>.txt`
- 
- The **path** as shown above is passed to the `Embedding` constructor.
- 
- The **name** and **version** parts of the path are assigned to the `Embedding` object internally as identifiers.
- 
- ### Dataset
- 
- Datasets are initialized with a path as follows: `datasets\data\<name>\`
- 
- The **path** and a **parser** are passed to the `Dataset` constructor.
- 
- Internally, the system looks for the first files in that directory with the words `train` and `test` in their names.
- 
- The system then parses these files with the **specified parser** to generate all the required features and labels.
- 
- ### Model
- 
- All models should be defined under `models\<group>\<model>.py`
- 
- The `group` can be anything; for the models implemented thus far, a reference key to the model's original paper is used, e.g. `Tang2016a`.
- 
- All models must inherit from the base `Model` class.
- 
- All models must contain implementations of the following functions:
- 
- * `_params`
- * `_feature_columns`
- * `_train_input_fn`
- * `_eval_input_fn`
- * `_model_fn`
- 
- Each of these functions must return the respective part of the TensorFlow model.
- 
- Everything else is taken care of internally by the system.
- 
- ### Experiment Results
- 
- After running an experiment, results are written to `experiments\data\<group>\<model>\`
- 
- The directory contains some extra statistics files, and markdown of the functions used, for future reference.
- 
- The **Tensorboard logdir** is named `tb_summary\`
- 
- To launch TensorBoard after an experiment, run the following:
- 
- ````Bash
- tensorboard --logdir experiments\data\<group>\<model>\tb_summary
- ````
- 
- Alternatively, the `experiment.run()` method takes an optional `start_tb` boolean parameter.
- 
- If set to true, the process will open the TensorBoard page and start the TensorBoard process automatically.
- 
- The TensorBoard page will fail to load at first, until the process starts; it should reload automatically within a few seconds.
- 
- ## Performance
- 
- Extracting features and parsing the dataset files takes a while.
- 
- To help with this, the `Dataset` class internally generates a number of files during its initial execution and reuses them in future runs.
- 
- All of the generated files are stored under `datasets\data\<name>\_generated`
- 
- The files saved include:
- 
- * `corpus.csv`
-   * A comma-separated file of all unique tokens appearing in the dataset, and their word counts
- * `train_dict.pkl` and `test_dict.pkl`
-   * Pickle binary files of the parsed datasets in dictionary format
-   * These still need to be mapped to the indices of a specific embedding
- * `<embedding_name>\partial_<embedding_version>.txt`
-   * A filtered-down version of an embedding containing only the words that appear in the dataset corpus
-   * This will be loaded instead of the full embedding in future runs
- * `<embedding_name>\projection_meta.tsv`
-   * Can be loaded into the TensorBoard projector tab as labels when viewing an embedding
- * `<embedding_name>\train.pkl` and `<embedding_name>\test.pkl`
-   * These are the actual embedding IDs for the dataset
-   * This process takes hours the first time, but subsequently features are loaded instantly through these files
-   * Naturally, these files remain valid only as long as the partial file remains the same; otherwise the IDs will not reflect the correct words in the embedding
- 
- ## Notes
- 
- I am working on generating the files mentioned above and making them available online, or in the repo itself, so you don't have to go through the parsing process the first time.
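The removed "Running Experiments" section above describes how an `Experiment` is assembled from an `Embedding`, a `Dataset` and a `Model` and then run with `experiment.run(job, steps)`. Purely as an illustration, a hypothetical `main.py` could look like the sketch below; the import paths, constructor signatures, `some_parser` and `SomeModel` are assumptions for the example, not verified against the repo.

```python
# Hypothetical main.py based on the removed README text; import paths,
# constructor signatures, some_parser and SomeModel are illustrative only.
from tsaplay.embeddings import Embedding
from tsaplay.datasets import Dataset
from tsaplay.experiments import Experiment

embedding = Embedding("embeddings/data/<name>/<version>.txt")   # README path convention
dataset = Dataset("datasets/data/<name>/", parser=some_parser)  # a parser is required
model = SomeModel()  # any class inheriting from the base Model class

experiment = Experiment(embedding, dataset, model)

# "train+eval" requires steps; start_tb=True also launches TensorBoard automatically.
experiment.run(job="train+eval", steps=1000, start_tb=True)
```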

gcp/_cmd.py (+2 -2)

@@ -1,8 +1,8 @@
  from os import system
  
  
- system("""gcloud ml-engine jobs submit training testing_large_embeddings_6 \
-     --job-dir=gs://tsaplay-bucket/testing_large_embeddings_6 \
+ system("""gcloud ml-engine jobs submit training testing_2worker_6partial100bs4 \
+     --job-dir=gs://tsaplay-bucket/testing_2worker_6partial100bs4 \
      --module-name=tsaplay.task \
      --staging-bucket=gs://tsaplay-bucket/ \
      --packages=/Users/seanbugeja/Code/Msc/dist/tsaplay-0.1.dev0.tar.gz \

gcp/_config.json (+6 -6)

@@ -6,16 +6,16 @@
    },
    "trainingInput": {
      "scaleTier": "CUSTOM",
-     "masterType": "complex_model_m_gpu",
-     "workerType": "complex_model_m_gpu",
-     "parameterServerType": "complex_model_m_gpu",
-     "workerCount": 1,
-     "parameterServerCount": 1,
+     "masterType": "standard_gpu",
+     "workerType": "standard_gpu",
+     "parameterServerType": "standard_gpu",
+     "workerCount": 2,
+     "parameterServerCount": 6,
      "pythonVersion": "3.5",
      "runtimeVersion": "1.10",
      "region": "europe-west1",
      "args": [
-       "--embedding=wiki-50",
+       "--embedding=commoncrawl-840",
        "--datasets=dong",
        "--model=lcrrot",
        "--batch-size=25",

submit_job.py (+6 -6)

@@ -211,11 +211,11 @@ def write_gcloud_config(args):
          "labels": {"type": "dev", "owner": "sean"},
          "trainingInput": {
              "scaleTier": "CUSTOM",
-             "masterType": "complex_model_m_gpu",
-             "workerType": "complex_model_m_gpu",
-             "parameterServerType": "complex_model_m_gpu",
-             "workerCount": 1,
-             "parameterServerCount": 1,
+             "masterType": "standard_gpu",
+             "workerType": "standard_gpu",
+             "parameterServerType": "standard_gpu",
+             "workerCount": 2,
+             "parameterServerCount": 6,
              "pythonVersion": "3.5",
              "runtimeVersion": "1.10",
              "region": "europe-west1",

@@ -263,7 +263,7 @@ def write_gcloud_cmd_script(args):
  
      if platform() == "MacOS":
          system('echo "{}" | pbcopy'.format(submit_job_cmd))
-         cprnt(bow="Copied to clipboard!")
+         cprnt(bow="Copied to clipboard.")
  
  
  def main(args):

tsaplay/embeddings.py (+4 -8)

@@ -76,23 +76,19 @@ def num_shards(self):
      @property
      def initializer(self):
          shape = (self.vocab_size, self.dim_size)
+         partition_size = int(self.vocab_size / 6)
  
          def _init(shape=shape, dtype=tf.float32, partition_info=None):
              return self.vectors
  
-         # return _init
-         return lambda: self.vectors
- 
-     @property
-     def partitioned_initializer(self):
-         partition_size = int(self.vocab_size / self.num_shards)
-         shape = (self.vocab_size, self.dim_size)
- 
          def _init_part(shape=shape, dtype=tf.float32, partition_info=None):
              part_offset = partition_info.single_offset(shape)
              this_slice = part_offset + partition_size
              return self.vectors[part_offset:this_slice]
  
+         def _init_const():
+             return self.vectors
+ 
          return _init_part
  
      @source.setter
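The `_init_part` closure above relies on TensorFlow calling the initializer once per shard and passing a `partition_info` whose `single_offset` gives that shard's starting row in the full embedding matrix. Below is a self-contained sketch of the same mechanism, with toy sizes and deterministic vectors so the result is checkable; TensorFlow 1.x is assumed and none of the names come from the repo.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x

# Toy stand-ins for the real vocabulary and pretrained vectors.
vocab_size, dim_size, num_shards = 12, 4, 3
vectors = np.arange(vocab_size * dim_size, dtype=np.float32).reshape(vocab_size, dim_size)


def _init_part(shape, dtype=tf.float32, partition_info=None):
    # Called once per shard; partition_info gives the shard's row offset
    # into the full (vocab_size, dim_size) matrix.
    offset = partition_info.single_offset(shape)
    return vectors[offset : offset + shape[0]]


embeddings = tf.get_variable(
    "embeddings",
    shape=[vocab_size, dim_size],
    initializer=_init_part,
    partitioner=tf.fixed_size_partitioner(num_shards=num_shards),
    dtype=tf.float32,
    trainable=False,
)

# "div" matches the contiguous row layout the partitioner and initializer produce.
lookup = tf.nn.embedding_lookup(
    params=embeddings, ids=tf.constant([0, 5, 11]), partition_strategy="div"
)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(lookup)[:, 0])  # [0. 20. 44.] -> rows 0, 5 and 11 of `vectors`
```

Slicing with `shape[0]` (the shard's own row count) rather than a precomputed `partition_size` also covers a vocabulary size that does not divide evenly into the shard count.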

tsaplay/experiments.py (+1 -2)

@@ -118,8 +118,7 @@ def _update_export_models_config(self):
  
      def _initialize_model_run_config(self, config_dict):
          default_config = {
-             "model_dir": join(self._experiment_dir),
-             # "model_dir": join(self._experiment_dir, "tb_summary"),
+             "model_dir": join(self._experiment_dir, "tb_summary"),
              # "save_checkpoints_steps": 100,
              "save_summary_steps": 25,
              # "log_step_count_steps": 25,

tsaplay/utils/addons.py (+3 -3)

@@ -168,11 +168,11 @@ def early_stopping(model, features, labels, spec, params):
  @only(["TRAIN"])
  def profiling(model, features, labels, spec, params):
      train_hooks = list(spec.training_hooks) or []
+     timeline_dir = join(model.run_config.model_dir)
+     makedirs(timeline_dir)
      train_hooks += [
          tf.train.ProfilerHook(
-             save_steps=100,
-             output_dir=model.run_config.model_dir,
-             show_memory=True,
+             save_steps=300, output_dir=timeline_dir, show_memory=True
          )
      ]
      return spec._replace(training_hooks=train_hooks)
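For reference, `tf.train.ProfilerHook` writes Chrome-trace timeline files (`timeline-<step>.json`) to its `output_dir` every `save_steps` steps; they can be opened in `chrome://tracing`. A standalone sketch of the same hook, which could equally be passed to `estimator.train(..., hooks=[...])` instead of being spliced into the `EstimatorSpec` as the addon above does (TensorFlow 1.x; the directory is a placeholder):

```python
import tensorflow as tf  # TensorFlow 1.x

# Emits timeline-<step>.json files, viewable in chrome://tracing, every 300 steps.
profiler_hook = tf.train.ProfilerHook(
    save_steps=300,
    output_dir="/tmp/profiling",  # placeholder directory
    show_memory=True,             # include memory allocation info in the trace
)

# e.g. estimator.train(input_fn=train_input_fn, steps=1000, hooks=[profiler_hook])
```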

tsaplay/utils/decorators.py (+15 -11)

@@ -41,30 +41,34 @@ def wrapper(self, features, labels, mode, params):
          embeddings = tf.get_variable(
              "embeddings",
              shape=[vocab_size, dim_size],
-             # initializer=embedding_init,
+             initializer=embedding_init,
+             partitioner=tf.fixed_size_partitioner(num_shards=6),
              trainable=trainable,
              dtype=tf.float32,
          )
  
-         def init_embeddings(sess):
-             value = embedding_init()
-             sess.run(embeddings.initializer, {embeddings.initial_value: value})
- 
          embedded_sequences = {}
          for key, value in features.items():
              if "_ids" in key:
                  component = key.replace("_ids", "")
                  embdd_key = component + "_emb"
-                 embedded_sequence = tf.contrib.layers.embed_sequence(
-                     ids=value,
-                     initializer=embeddings,
-                     scope="embedding_layer",
-                     reuse=True,
+                 # embedded_sequence = tf.contrib.layers.embed_sequence(
+                 #     ids=value,
+                 #     initializer=embeddings,
+                 #     scope="embedding_layer",
+                 #     reuse=True,
+                 # )
+                 embedded_sequence = tf.nn.embedding_lookup(
+                     params=embeddings, ids=value
                  )
                  embedded_sequences[embdd_key] = embedded_sequence
          features.update(embedded_sequences)
          spec = model_fn(self, features, labels, mode, params)
-         spec = scaffold_init_fn_on_spec(spec, init_embeddings)
+ 
+         # def init_embeddings(sess):
+         #     value = embedding_init()
+         #     sess.run(embeddings.initializer, {embeddings.initial_value: value})
+         # spec = scaffold_init_fn_on_spec(spec, init_embeddings)
          return spec
  
      return wrapper
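The wrapper above derives a `<name>_emb` tensor for every `<name>_ids` feature by looking the ids up in the shared, now partitioned, `embeddings` variable. A stripped-down sketch of that logic outside the decorator machinery (names are illustrative; note that when `params` is a partitioned variable, the `partition_strategy` of `tf.nn.embedding_lookup` has to match how the rows were laid out across the shards):

```python
import tensorflow as tf  # TensorFlow 1.x


def attach_embeddings(features, embeddings):
    """Toy version of the wrapper logic above: each '<name>_ids' feature gains
    a matching '<name>_emb' tensor looked up from `embeddings`."""
    embedded = {}
    for key, ids in features.items():
        if "_ids" in key:
            emb_key = key.replace("_ids", "") + "_emb"
            embedded[emb_key] = tf.nn.embedding_lookup(params=embeddings, ids=ids)
    features.update(embedded)
    return features
```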
