This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Tango #5162

Merged
merged 166 commits into from
Aug 5, 2021

Conversation

dirkgr
Member

@dirkgr dirkgr commented Apr 28, 2021

1700 lines is pretty uncomfortable to review, so let me attempt a guide:

Steps

The most important thing is the Step class. It defines one step in a workflow. Users are expected to just write a run() method. The run() method must have parameters with type hints. from_params() reads those type hints to construct Steps. If the run() method takes a parameter of type T, then from_params() assumes the constructor of that step takes a Union[T, Step[T]]. In other words, you can provide the T directly, or you can put in a Step that outputs a T. Making Steps the input is how you define a DAG of tasks. The Step code makes sure to replace inputs of type Step with the Step's results before the run() method runs.

Hopefully most of this magic will remain hidden to users, but as a reviewer you have to understand it.
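
Here is a rough sketch of what this looks like for a user. The import path, the register() names, and constructing steps directly with keyword arguments are illustrative and may not match the code exactly; in a real workflow the steps would come out of a config via from_params().

from typing import List

from allennlp.tango.step import Step  # assumed import path


@Step.register("count_tokens")  # assuming Step is Registrable, like other AllenNLP components
class CountTokens(Step):
    def run(self, sentences: List[str]) -> int:
        return sum(len(s.split()) for s in sentences)


@Step.register("print_count")
class PrintCount(Step):
    def run(self, count: int) -> None:
        print(count)


# Because run() declares `count: int`, the constructor accepts either an int or a
# Step that produces an int. Passing the first step as the input wires up the DAG;
# its result is substituted for the Step object before run() executes.
count_step = CountTokens(sentences=["a b c", "d e"])
print_step = PrintCount(count=count_step)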

Steps also store some settings:

  • DETERMINISTIC describes whether this step can be relied upon to produce the same results every time when given the same inputs. If this is False, the step can't be cached, and neither can any step that depends on it.
  • CACHEABLE provides a direct way to turn off caching. For example, a step that reads a HuggingFace dataset doesn't need to be cached, because HuggingFace datasets already have their own caching mechanism. But it's still a deterministic step, and all following steps are allowed to cache.
  • VERSION is optional, but recommended. This gives the user a way to tell Tango that a step has changed during development, and should now be recomputed. This doesn't invalidate the old results, so when you revert your code, the old cache entries will stick around and be picked up.
  • FORMAT: See below

Those settings above are per Step class. Every instance of Step has some more settings. You can override the format and whether to cache the results, and there are two more:

  • step_name allows you to give your step a useful name that you can use to refer to it. This name will be used in many places, like names of directories for results, log messages, etc. If this is not given, the step's unique id stands in.
  • produce_results specifies whether this is a step whose results we actually care about. For example, you might build a long pipeline of steps, but you really only want to look at the evaluation at the end. In that case, the evaluation step is the only step where you would set produce_results to True. Tango only runs the steps that are necessary to produce the results you asked for, so if none of your steps have produce_results set to True, Tango does nothing.

Every step has a unique id of the form f"{self.__class__.__name__}-{self.VERSION}-{hash of input}".
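
A sketch pulling these settings together; class names, the registered name, and the exact keyword names for the per-instance settings are illustrative, taken from the description above rather than from the code.

from allennlp.tango.step import Step  # assumed import path


@Step.register("word_count")
class WordCount(Step):
    DETERMINISTIC = True  # the same input file always gives the same count
    VERSION = "002"       # bumped manually when the step's behavior changes during development
    # CACHEABLE and FORMAT could be overridden here as well (see Formats below).

    def run(self, path: str) -> int:
        with open(path) as f:
            return len(f.read().split())


# Per-instance settings are supplied when the step is constructed.
step = WordCount(path="data/things.txt", step_name="word-count", produce_results=True)
# Its unique id then looks something like "WordCount-002-<hash of inputs>".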

RefStep

_RefStep is a fake stand-in for a real step. This is used when parsing a DAG of steps. Every step gets parsed on its own, and references to other steps are parsed as a _RefStep. Then later, when all the steps are known, _RefSteps get resolved to real steps. A final DAG should never contain a _RefStep.
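
To make that concrete, here is a toy params structure; the reference syntax is illustrative and may not be exactly what the code uses.

params = {
    "raw_data": {"type": "hf_dataset", "dataset_name": "piqa"},
    "evaluation": {
        "type": "evaluation",
        # While "evaluation" is parsed on its own, this reference is parsed as a
        # _RefStep placeholder. Once all steps are known, the placeholder is
        # swapped for the real "raw_data" step, so the final DAG has no _RefSteps.
        "dataset": {"type": "ref", "ref": "raw_data"},
    },
}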

Formats

The Format classes are basically mappings from Python objects to disk and back. They have a read() and a write() method. They read and write directories, not files. By the end of the write() method, the object must be completely written, but read() may return an object that does the actual reading lazily. We might use this for reading datasets: read() can return immediately with an object that reads the actual instances later.

Every Step defines the Format it wants to use to serialize and deserialize its result.

DillFormat is the default format for all steps. It uses dill (a better version of pickle) to serialize and deserialize objects. It is surprisingly flexible and can handle almost all Python objects (including (some) functions).
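
A sketch of a custom format; the import path, the Registrable decorator, and the exact method signatures are assumptions based on the description above.

import json
import os
from os import PathLike

from allennlp.tango.format import Format  # assumed import path


@Format.register("json")
class JsonFormat(Format):
    """Hypothetical format that stores a dict as a single JSON file in the step's directory."""

    def write(self, artifact: dict, directory: PathLike) -> None:
        # By the time write() returns, the artifact must be completely on disk.
        with open(os.path.join(directory, "data.json"), "w") as f:
            json.dump(artifact, f)

    def read(self, directory: PathLike) -> dict:
        # read() may also return a lazy object; here we just load eagerly.
        with open(os.path.join(directory, "data.json")) as f:
            return json.load(f)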

StepCache

StepCache is a mapping from instances of Step to the results of that step. There are two implementations, MemoryStepCache and DirectoryStepCache. DirectoryStepCache is the component that uses the Format classes to cache outputs in a directory. When you run allennlp tango -s serialization_dir, the step cache ends up in serialization_dir/step_cache.
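
A sketch of the mapping behavior; the import path, the constructor argument, and the writable mapping interface are how I would describe it, not necessarily the exact API.

from allennlp.tango.step_cache import DirectoryStepCache  # assumed import path

cache = DirectoryStepCache("serialization_dir/step_cache")
step = CountTokens(sentences=["a b c", "d e"])  # the illustrative step from the sketch above

if step in cache:
    result = cache[step]  # read back from disk using the step's FORMAT
else:
    result = 5            # stand-in for actually running the step
    cache[step] = result  # written under serialization_dir/step_cache/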

Dataset

Tango datasets are a bit different, in that they contain all splits simultaneously, as well as a vocabulary. The definition of AllenNlpDataset is this:

@dataclass
class AllenNlpDataset:
    splits: Mapping[str, Sequence[Any]]
    vocab: Optional[Vocabulary] = None
    metadata: Mapping[str, Any] = field(default_factory=dict)

For bigger datasets, the idea is that we write something that returns an object which can read instances lazily. For example, you could imagine that the Sequence[Any] that contains the instances refers to a directory on disk somewhere and reads one file per instance. But that's all future stuff; in this PR, all the Sequences are Lists.

One important thing is this: In Tango, all datasets are map-style datasets (to use PyTorch terminology). Iterator-style datasets ("lazy datasets", as we call them in AllenNLP) are a pain, and are not necessary given the right tooling.
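
To make the shape concrete, a toy construction; the import path is assumed, and the instances are plain dicts for brevity, where they would typically be AllenNLP Instances.

from allennlp.tango.dataset import AllenNlpDataset  # assumed import path

dataset = AllenNlpDataset(
    splits={
        "train": [{"text": "a b c", "label": 1}, {"text": "d e", "label": 0}],
        "validation": [{"text": "f g", "label": 1}],
    },
    metadata={"source": "toy example"},
)

train = dataset.splits["train"]
print(len(train), train[0])  # map-style: it has a length and supports indexing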

Dataloader

TangoDataLoader is simpler than the original AllenNLP DataLoader. Original data loaders are responsible for a ton of stuff, and have a bit of a messy API owing to the history of their development. Tango data loaders can be simpler because they don't have to deal with lazy datasets or multiprocessing. All they do is make batches out of Sequence[Instance].

The old MultiprocessingDataLoader handles a lot of scenarios:

  • normal, equal-sized batches
  • shuffling
  • manually setting the number of batches per epoch
  • using a PyTorch sampler

I have split these out into separate data loader classes, which are composable (i.e., one data loader feeds into another); there is a toy sketch of the idea below. If you think composing data loaders is too complicated for researchers who'd rather not think about it, I'm open to that argument. I think I'd want to keep this API internally because it makes for small classes that are easy to understand, but maybe we write an interface that makes it easier to configure these.
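
Here is a toy, self-contained illustration of what "composable" means; these are not the classes from this PR, just the shape of the idea.

from typing import Iterator, List, Sequence


class BatchSizeDataLoader:
    """Toy loader: cuts a map-style sequence of instances into fixed-size batches."""

    def __init__(self, instances: Sequence, batch_size: int):
        self.instances = instances
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List]:
        for start in range(0, len(self.instances), self.batch_size):
            yield list(self.instances[start : start + self.batch_size])


class MaxBatchesDataLoader:
    """Toy loader: wraps another loader and stops after a fixed number of batches."""

    def __init__(self, inner, max_batches_per_epoch: int):
        self.inner = inner
        self.max_batches_per_epoch = max_batches_per_epoch

    def __iter__(self) -> Iterator[List]:
        for i, batch in enumerate(self.inner):
            if i >= self.max_batches_per_epoch:
                break
            yield batch


loader = MaxBatchesDataLoader(BatchSizeDataLoader(list(range(100)), batch_size=32), max_batches_per_epoch=2)
print([len(batch) for batch in loader])  # [32, 32]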

Example Steps

TrainingStep

This step takes basically all the inputs that GradientDescentTrainer needs, plus a dataset and a split name, and trains a model on it. It is an adapter around GradientDescentTrainer. In a later iteration, I want to get rid of the adapter and put the training code directly into the step, since I am pretty unhappy with the inconsistent trainer API we have.

EvaluationStep

This step takes a model, a dataset, and the name of a split, and produces an evaluation result. It's roughly equivalent to allennlp evaluate.

HuggingfaceDataset

This step loads a HuggingFace dataset and puts it into AllenNLP dataset format. It's a one-liner of a step.
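
Roughly what such a step could look like; import paths, the class name, and the registered name are a reconstruction, not the code from the PR.

import datasets

from allennlp.tango.dataset import AllenNlpDataset  # assumed import path
from allennlp.tango.step import Step                # assumed import path


@Step.register("hf_dataset")
class HuggingfaceDatasetStep(Step):
    DETERMINISTIC = True
    CACHEABLE = False  # HuggingFace datasets already cache downloads themselves

    def run(self, dataset_name: str) -> AllenNlpDataset:
        splits = datasets.load_dataset(dataset_name)
        return AllenNlpDataset(splits={name: splits[name] for name in splits})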

HuggingfaceTokenize

This step uses the HuggingFace tokenizers to turn every str in a given dataset into a TransformerTextField. That sounds like it would be super useful, but I actually ended up not using it for PIQA. So far I have found that I always need some more specialized treatment of the dataset before turning it into the input to a model.

@dirkgr dirkgr self-assigned this Apr 28, 2021
Member

@epwalsh epwalsh left a comment

Thanks for addressing all my comments! LGTM ✅

readonly=read_only,
lock=use_lock,
)
_active_tensor_caches[self.lmdb_env.path()] = self
Contributor

Where do we remove entries from _active_tensor_caches? I see it in an older commit, but not in the latest one.

Member Author

_active_tensor_caches is a WeakValueDictionary, which removes entries automatically when the values are garbage-collected.
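
A minimal demonstration of that behavior with the standard-library weakref.WeakValueDictionary (using a stand-in class, not the real cache):

import weakref


class TensorCacheLike:  # stand-in for the real cache class
    pass


_active_caches = weakref.WeakValueDictionary()

cache = TensorCacheLike()
_active_caches["/some/file.lmdb"] = cache
print(len(_active_caches))  # 1

del cache                   # once the last strong reference goes away...
print(len(_active_caches))  # ...the entry is dropped automatically (0 on CPython)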

def __new__(cls, filename: Union[str, PathLike], *, read_only: bool = False, **kwargs):
# This mechanism makes sure we re-use open lmdb file handles. Lmdb has a problem when the same file is
# opened by the same process multiple times. This is our workaround.
filename = str(filename)
Member

We should probably normalize filename to an absolute path here, right?

Member Author

I'll do you one better and do it by inode: 153bade

That way even symlinks and hard links work correctly.
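
Roughly the idea (a sketch, not the code from that commit):

import os


def file_identity(path: str) -> tuple:
    """Key a file by (device, inode) rather than by its path string, so symlinks
    and hard links to the same underlying file map to the same entry."""
    stat = os.stat(path)  # follows symlinks by default
    return (stat.st_dev, stat.st_ino)


# Two different-looking paths to the same file yield the same key, e.g.:
#   file_identity("cache.lmdb") == file_identity("/absolute/path/to/cache.lmdb")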

Member

💯

@dirkgr dirkgr enabled auto-merge (squash) August 5, 2021 18:02
@dirkgr dirkgr merged commit 311f110 into main Aug 5, 2021
@dirkgr dirkgr deleted the Tango branch August 5, 2021 18:11