
type object 'EnglishDefaults' has no attribute 'tag_map' #8

Closed
karthikeyansam opened this issue May 5, 2021 · 8 comments · Fixed by #10
@karthikeyansam

When using spacy 3.0.6 I am getting the error below:

type object 'EnglishDefaults' has no attribute 'tag_map'

import spacy
from spacy_conll import ConllFormatter


nlp = spacy.load('en_core_web_sm')
conllformatter = ConllFormatter(nlp,
                                ext_names={'conll_pd': 'pandas'},
                                conversion_maps={'lemma': {'-PRON-': 'PRON'}})
nlp.add_pipe(conllformatter, after='parser')
doc = nlp('I like cookies.')
print(doc._.pandas)

The only change is that instead of loading the linked en model I have loaded en_core_web_sm.
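
For context, the lookup fails because spaCy v3 removed tag_map from the language defaults (its role moved to the AttributeRuler component). An illustrative probe:

from spacy.lang.en import English

# On spaCy >= 3.0 the language defaults no longer carry a tag_map, which is
# exactly the attribute spacy_conll v2 tries to read, hence the error above.
print(hasattr(English.Defaults, "tag_map"))  # False on spaCy v3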

@BramVanroy BramVanroy self-assigned this May 5, 2021
@BramVanroy
Owner

Thanks for the report @karthikeyansam! I haven't had time to port this repo to spaCy v3, but it is on my to-do list. I'll probably have time to do so sometime in June.

@narayanacharya6

Hey @BramVanroy, can I pick this one up? My plan is to use the in-built default morphologizer and move away from the tagmap in ConllFormatter used to build the morphology.

@BramVanroy
Owner

BramVanroy commented May 7, 2021

> Hey @BramVanroy, can I pick this one up? My plan is to use the in-built default morphologizer and move away from the tagmap in ConllFormatter used to build the morphology.

Sounds good! Do make sure you build in a check that outputs _ if no morphologizer component is available in the pipeline. You can make a PR whenever you are ready, although I am not sure when I can review it due to time constraints.
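
A minimal sketch of such a check (the helper name and its placement are illustrative; note that some pipelines assign morphology via the attribute_ruler rather than a statistical morphologizer):

from spacy.language import Language
from spacy.tokens import Token


def get_feats(nlp: Language, token: Token) -> str:
    """Return the token's FEATS column, or the CoNLL-U placeholder '_'."""
    # spaCy v3 pipelines may set morphology with either the 'morphologizer'
    # or the 'attribute_ruler' component; with neither present, output '_'.
    if not {"morphologizer", "attribute_ruler"} & set(nlp.pipe_names):
        return "_"
    # str() of an empty MorphAnalysis is "", so fall back to "_" there too
    return str(token.morph) or "_"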

@narayanacharya6

narayanacharya6 commented May 7, 2021

OK, so I initially thought this was a quick fix and that I could contribute to the library, but I was wrong. The most minimal way to fix the issue and use the library with spaCy v3 (and how I ended up using it in my project) is to change the ConllFormatter to use the default morphologizer pipe for morphology.

A minimal example:

import spacy
# noinspection PyUnresolvedReferences
import ConllFormatter  # importing registers the 'conll' factory defined below


def get_nlp():
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe('conll', after='parser')
    return nlp

Updated ConllFormatter file below (saved as ConllFormatter.py so that the import above works):

from collections import OrderedDict
from typing import Dict, Optional, Union

from spacy.language import Language
from spacy.tokens import Doc, Span, Token

COMPONENT_NAME = "conll_formatter"
CONLL_FIELD_NAMES = [
    "id",
    "form",
    "lemma",
    "upostag",
    "xpostag",
    "feats",
    "head",
    "deprel",
    "deps",
    "misc",
]

try:
    import pandas as pd

    PD_AVAILABLE = True
except ImportError:
    PD_AVAILABLE = False

@Language.factory("conll", default_config={})
def create_srl_component(nlp: Language, name: str):
    return ConllFormatter(nlp)

class ConllFormatter:
    """Pipeline component for spaCy that adds CoNLL-U properties to a Doc, its sentence `Span`s, and Tokens.
       By default, the custom properties `conll` and `conll_str` are added. If `pandas` is installed,
       `conll_pd` is added as well.

       - `conll`: raw CoNLL format
           - in `Token`: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as
             values.
           - in sentence `Span`: a list of its tokens' `conll` dictionaries (list of dictionaries).
           - in a `Doc`: a list of its sentences' `conll` lists (list of list of dictionaries).
       - `conll_str`: string representation of the CoNLL format
           - in `Token`: tab-separated representation of the contents of the CoNLL fields ending with a newline.
           - in sentence `Span`: the expected CoNLL format where each row represents a token. When
             `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the
             `CoNLL format`_.
           - in `Doc`: all its sentences' `conll_str` combined and separated by new lines.
       - `conll_pd`: `pandas` representation of the CoNLL format
           - in `Token`: a `Series` representation of this token's CoNLL properties.
           - in sentence `Span`: a `DataFrame` representation of this sentence, with the CoNLL names as column
             headers.
           - in `Doc`: a concatenation of its sentences' `DataFrame`s, leading to a new `DataFrame` whose
             index is reset.
       """

    name = COMPONENT_NAME

    def __init__(
        self,
        nlp: Language,
        *,
        conversion_maps: Optional[Dict[str, Dict[str, str]]] = None,
        ext_names: Optional[Dict[str, str]] = None,
        include_headers: bool = False,
        disable_pandas: bool = False,
    ):
        """ ConllFormatter constructor.
        :param nlp: an initialized spaCy nlp object
        :param conversion_maps: two-level dictionary that contains a field_name (e.g. 'lemma', 'upostag')
               on the first level, and the conversion map on the second.
               E.g. {'lemma': {'-PRON-': 'PRON'}} will map the lemma '-PRON-' to 'PRON'
        :param ext_names: dictionary containing names for the custom spaCy extensions. You can rename the following
               extensions: conll, conll_pd, conll_str.
               E.g. {'conll': 'conll_dict', 'conll_pd': 'conll_pandas'} will rename the properties accordingly
        :param include_headers: whether to include the CoNLL headers in the conll_str string output. These consist
               of two lines containing the sentence id and the text as per the CoNLL format
               https://universaldependencies.org/format.html#sentence-boundaries-and-comments.
        :param disable_pandas: whether to disable pandas integration even if it is installed. This is particularly
               useful to avoid issues when using multiprocessing.
        """

        # Set custom attribute names
        self._ext_names = {"conll_str": "conll_str", "conll": "conll", "conll_pd": "conll_pd"}
        if ext_names:
            self._ext_names = self._merge_dicts_strict(self._ext_names, ext_names)

        self._conversion_maps = conversion_maps

        self.include_headers = include_headers
        self.disable_pandas = disable_pandas
        # Initialize extensions
        self._set_extensions()

    def __call__(self, doc: Doc):
        """Runs the pipeline component, adding the extensions to Underscore ._.. Adds a string representation,
           string representation containing a header, and a tuple representation of the CoNLL format to the
           given Doc and its sentences.
        :param doc: the input Doc
        :return: the modified Doc containing the newly added extensions
        """
        # We need to hook the extensions again when using
        # multiprocessing in Windows
        # see: https://github.com/explosion/spaCy/issues/4903
        # fixed in: https://github.com/explosion/spaCy/pull/5006
        # Leaving this here for now, for older versions of spaCy
        self._set_extensions()

        for sent_idx, sent in enumerate(doc.sents, 1):
            self._set_span_conll(sent, sent_idx)

        doc._.set(
            self._ext_names["conll"], [s._.get(self._ext_names["conll"]) for s in doc.sents]
        )
        doc._.set(
            self._ext_names["conll_str"],
            "\n".join([s._.get(self._ext_names["conll_str"]) for s in doc.sents]),
        )

        if PD_AVAILABLE and not self.disable_pandas:
            doc._.set(
                self._ext_names["conll_pd"],
                pd.concat(
                    [s._.get(self._ext_names["conll_pd"]) for s in doc.sents]
                ).reset_index(drop=True),
            )

        return doc

    def _map_conll(self, token_conll_d: Dict[str, Union[str, int]]):
        """Maps labels according to a given `self._conversion_maps`.
            This can be useful when users want to change the output labels of a
            model to their own tagset.

        :param token_conll_d: a token's conll representation as dict (field_name: value)
        :return: the modified dict where the labels have been replaced according to the conversion maps
        """
        for k, v in token_conll_d.items():
            try:
                token_conll_d[k] = self._conversion_maps[k][v]
            except KeyError:
                continue

        return token_conll_d

    def _set_extensions(self):
        """Sets the default extensions if they do not exist yet."""
        for obj in Doc, Span, Token:
            if not obj.has_extension(self._ext_names["conll_str"]):
                obj.set_extension(self._ext_names["conll_str"], default=None)
            if not obj.has_extension(self._ext_names["conll"]):
                obj.set_extension(self._ext_names["conll"], default=None)

            if PD_AVAILABLE and not self.disable_pandas:
                if not obj.has_extension(self._ext_names["conll_pd"]):
                    obj.set_extension(self._ext_names["conll_pd"], default=None)

    def _set_span_conll(self, span: Span, span_idx: int = 1):
        """Sets a span's properties according to the CoNLL-U format.
        :param span: a spaCy Span
        :param span_idx: optional index, corresponding to the n-th sentence
                         in the parent Doc
        """
        span_conll_str = ""
        if self.include_headers:
            span_conll_str += f"# sent_id = {span_idx}\n# text = {span.text}\n"

        for token_idx, token in enumerate(span, 1):
            self._set_token_conll(token, token_idx)

        span._.set(self._ext_names["conll"], [t._.get(self._ext_names["conll"]) for t in span])
        span_conll_str += "".join([t._.get(self._ext_names["conll_str"]) for t in span])
        span._.set(self._ext_names["conll_str"], span_conll_str)

        if PD_AVAILABLE and not self.disable_pandas:
            span._.set(
                self._ext_names["conll_pd"],
                pd.DataFrame([t._.get(self._ext_names["conll"]) for t in span]),
            )

    def _set_token_conll(self, token: Token, token_idx: int = 1):
        """Sets a token's properties according to the CoNLL-U format.
        :param token: a spaCy Token
        :param token_idx: optional index, corresponding to the n-th token in the sentence Span
        """
        if token.dep_.lower().strip() == "root":
            head_idx = 0
        else:
            head_idx = token.head.i + 1 - token.sent[0].i

        token_conll = (
            token_idx,
            token.text,
            token.lemma_,
            token.pos_,
            token.tag_,
            # FEATS string from the morphologizer/attribute ruler; use the
            # CoNLL-U placeholder "_" when no morphological features are set
            str(token.morph) or "_",
            head_idx,
            token.dep_,
            "_",
            "_" if token.whitespace_ else "SpaceAfter=No",
        )

        # turn field name values (keys) and token values (values) into dict
        token_conll_d = OrderedDict(zip(CONLL_FIELD_NAMES, token_conll))

        # convert properties if needed
        if self._conversion_maps:
            token_conll_d = self._map_conll(token_conll_d)

        token._.set(self._ext_names["conll"], token_conll_d)
        token_conll_str = "\t".join(map(str, token_conll_d.values())) + "\n"
        token._.set(self._ext_names["conll_str"], token_conll_str)

        if PD_AVAILABLE and not self.disable_pandas:
            token._.set(self._ext_names["conll_pd"], pd.Series(token_conll_d))

        return token

    @staticmethod
    def _is_number(s: str):
        """Checks whether a string is actually a number.
        :param s: string to test
        :return: whether or not 's' is a number
        """
        try:
            float(s)
            return True
        except ValueError:
            return False

    @staticmethod
    def _merge_dicts_strict(d1: Dict, d2: Dict):
        """Merge two dicts in a strict manner, i.e. the second dict overwrites keys
           of the first dict but all keys in the second dict have to be present in
           the first dict.
        :param d1: base dict which will be overwritten
        :param d2: dict with new values that will overwrite d1
        :return: the merged dict (but d1 will be modified in-place anyway!)
        """
        for k, v in d2.items():
            if k not in d1:
                raise KeyError(
                    f"This key does not exist in the original dict. Valid keys are {list(d1.keys())}"
                )
            d1[k] = v

        return d1
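
For completeness, a small usage sketch of the extensions the updated file registers, using the get_nlp() helper from the minimal example above (assumes pandas is installed so that conll_pd is set):

nlp = get_nlp()
doc = nlp('I like cookies.')

# One tab-separated row per token, in the CONLL_FIELD_NAMES order
print(doc._.conll_str)

# One DataFrame for the whole Doc, with CONLL_FIELD_NAMES as columns
print(doc._.conll_pd)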

So, what does not work for the upgrade to spaCy v3:

  1. The library has many moving pieces (several dependencies just for tokenization), with spacy-udpipe being one that does not support spaCy v3. These not being part of requirements.txt makes it more difficult to set up the repo locally for contributing and testing. I understand that there might be reasons not to package these dependencies with the build, but I would appreciate a contributing file with instructions for setting up the project and running the tests.
  2. Also, spaCy v3 does not allow spacy.load("en"), so you have to either use spacy.blank("en") and then add all the required pipes via add_pipe(), or use spacy.load("en_core_web_sm") to get one of the default pipelines (see the sketch after this list). This means a lot of changes rippling out of init_parser, which uses model_or_lang to load models across the different libraries (spacy, stanza, etc.).
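
A short sketch of the two options point 2 mentions:

import spacy

# Option 1: a blank pipeline to which you add the components you need
nlp_blank = spacy.blank("en")

# Option 2: a packaged pretrained pipeline
nlp_pretrained = spacy.load("en_core_web_sm")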

Thanks for the library! It saved me a good amount of time. I will look to address the above issues when I get some time, but they might take longer than I initially anticipated, so if anyone else wants to pick this up, I hope my comments prove helpful. Meanwhile, people can work around the issue with the changed ConllFormatter and the sample usage shown above.

@BramVanroy
Owner

Thanks for the comments nevertheless, @narayanacharya6. I have indeed been holding off on refactoring for v3 because the changes are subtle but relatively far-reaching, and I do not currently have the time for them.

My hope is that spacy-udpipe gets a v3 release soon-ish, so that I can push a v3 spacy_conll version for all the spacy-* libs, while dropping support for spacy-stanfordnlp (in favour of spacy-stanza, which already supports v3).

If you can wait until June, there is no need for you to take on this PR. I will work on it then. I will also create contributing guidelines. To be honest, I did not think this library would get a lot of usage but it seems I was wrong. I'll try to make it more open for contributions.

@BramVanroy
Owner

Hi @karthikeyansam @narayanacharya6

Can you try the current master branch or the pre-release on pip?

pip uninstall spacy_conll
pip install spacy_conll --pre

I'll let this pre-release out in the wild for a week or so before pushing it as the production version. If you experience any issues, please let me know!
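
For anyone verifying the pre-release, a usage sketch (assuming the released factory keeps the COMPONENT_NAME "conll_formatter" shown in the file above):

import spacy
# noinspection PyUnresolvedReferences
import spacy_conll  # import so the 'conll_formatter' factory is registered

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("conll_formatter", last=True)
doc = nlp("I like cookies.")
print(doc._.conll_str)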

@narayanacharya6

Hey @BramVanroy, I'll try it out over the weekend, thanks! :)

@narayanacharya6

Hey @BramVanroy I finally got a chance to update to the new version. Pretty straightforward to pull out my hacks and use the lib as intended. No issues so far, thanks!
