
type object 'EnglishDefaults' has no attribute 'tag_map' #8

Closed
karthikeyansam opened this issue May 5, 2021 · 8 comments · Fixed by #10
@karthikeyansam

When using spacy 3.0.6 I am getting the error below:

type object 'EnglishDefaults' has no attribute 'tag_map'

import spacy
from spacy_conll import ConllFormatter


nlp = spacy.load('en_core_web_sm')
conllformatter = ConllFormatter(nlp,
                                ext_names={'conll_pd': 'pandas'},
                                conversion_maps={'lemma': {'-PRON-': 'PRON'}})
nlp.add_pipe(conllformatter, after='parser')
doc = nlp('I like cookies.')
print(doc._.pandas)

The only change is that instead of loading the linked en model I have loaded en_core_web_sm.
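
For context, the lookup fails because spaCy v3 removed tag_map from the language defaults (its role moved to the AttributeRuler component). An illustrative probe:

from spacy.lang.en import English

# On spaCy >= 3.0 the language defaults no longer carry a tag_map, which is
# exactly the attribute spacy_conll v2 tries to read, hence the error above.
print(hasattr(English.Defaults, "tag_map"))  # False on spaCy v3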

@BramVanroy BramVanroy self-assigned this May 5, 2021
@BramVanroy
Owner

Thanks for the report @karthikeyansam! I haven't had time to port this repo to spaCy v3, but it is on my to-do list. I'll probably have time to do so sometime in June.

@narayanacharya6

Hey @BramVanroy, can I pick this one up? My plan is to use the in-built default morphologizer and move away from the tagmap in ConllFormatter used to build the morphology.

@BramVanroy
Owner

BramVanroy commented May 7, 2021

> Hey @BramVanroy, can I pick this one up? My plan is to use the in-built default morphologizer and move away from the tagmap in ConllFormatter used to build the morphology.

Sounds good! Do make sure you build in a check that outputs _ if no morphologizer component is available in the pipeline. You can make a PR whenever you are ready, although I am not sure when I can review it due to time constraints.
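
A minimal sketch of such a check (the helper name and its placement are illustrative; note that some pipelines assign morphology via the attribute_ruler rather than a statistical morphologizer):

from spacy.language import Language
from spacy.tokens import Token


def get_feats(nlp: Language, token: Token) -> str:
    """Return the token's FEATS column, or the CoNLL-U placeholder '_'."""
    # spaCy v3 pipelines may set morphology with either the 'morphologizer'
    # or the 'attribute_ruler' component; with neither present, output '_'.
    if not {"morphologizer", "attribute_ruler"} & set(nlp.pipe_names):
        return "_"
    # str() of an empty MorphAnalysis is "", so fall back to "_" there too
    return str(token.morph) or "_"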

@narayanacharya6

narayanacharya6 commented May 7, 2021

OK, so I initially thought this was a quick fix and that I could contribute to the library, but I was wrong. The most minimal way to fix the issue and use the library with spaCy v3 (and how I ended up using it in my project) is to change the ConllFormatter to use the default morphologizer pipe for morphology.

A minimal example:

import spacy
# noinspection PyUnresolvedReferences
import ConllFormatter  # importing registers the 'conll' factory defined below


def get_nlp():
    nlp = spacy.load('en_core_web_sm')
    nlp.add_pipe('conll', after='parser')
    return nlp

Updated ConllFormatter file below (saved as ConllFormatter.py so that the import above works):

from collections import OrderedDict
from typing import Dict, Optional, Union

from spacy.language import Language
from spacy.tokens import Doc, Span, Token

COMPONENT_NAME = "conll_formatter"
CONLL_FIELD_NAMES = [
    "id",
    "form",
    "lemma",
    "upostag",
    "xpostag",
    "feats",
    "head",
    "deprel",
    "deps",
    "misc",
]

try:
    import pandas as pd

    PD_AVAILABLE = True
except ImportError:
    PD_AVAILABLE = False

@Language.factory("conll", default_config={})
def create_srl_component(nlp: Language, name: str):
    return ConllFormatter(nlp)

class ConllFormatter:
    """Pipeline component for spaCy that adds CoNLL-U properties to a Doc, its sentence `Span`s, and Tokens.
       By default, the custom properties `conll` and `conll_str` are added. If `pandas` is installed,
       `conll_pd` is added as well.

       - `conll`: raw CoNLL format
           - in `Token`: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as
             values.
           - in sentence `Span`: a list of its tokens' `conll` dictionaries (list of dictionaries).
           - in a `Doc`: a list of its sentences' `conll` lists (list of list of dictionaries).
       - `conll_str`: string representation of the CoNLL format
           - in `Token`: tab-separated representation of the contents of the CoNLL fields ending with a newline.
           - in sentence `Span`: the expected CoNLL format where each row represents a token. When
             `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the
             `CoNLL format`_.
           - in `Doc`: all its sentences' `conll_str` combined and separated by new lines.
       - `conll_pd`: `pandas` representation of the CoNLL format
           - in `Token`: a `Series` representation of this token's CoNLL properties.
           - in sentence `Span`: a `DataFrame` representation of this sentence, with the CoNLL names as column
             headers.
           - in `Doc`: a concatenation of its sentences' `DataFrame`s, leading to a new `DataFrame` whose
             index is reset.
       """

    name = COMPONENT_NAME

    def __init__(
        self,
        nlp: Language,
        *,
        conversion_maps: Optional[Dict[str, Dict[str, str]]] = None,
        ext_names: Optional[Dict[str, str]] = None,
        include_headers: bool = False,
        disable_pandas: bool = False,
    ):
        """ ConllFormatter constructor.
        :param nlp: an initialized spaCy nlp object
        :param conversion_maps: two-level dictionary that contains a field_name (e.g. 'lemma', 'upostag')
               on the first level, and the conversion map on the second.
               E.g. {'lemma': {'-PRON-': 'PRON'}} will map the lemma '-PRON-' to 'PRON'
        :param ext_names: dictionary containing names for the custom spaCy extensions. You can rename the following
               extensions: conll, conll_pd, conll_str.
               E.g. {'conll': 'conll_dict', 'conll_pd': 'conll_pandas'} will rename the properties accordingly
        :param include_headers: whether to include the CoNLL headers in the conll_str string output. These consist
               of two lines containing the sentence id and the text as per the CoNLL format
               https://universaldependencies.org/format.html#sentence-boundaries-and-comments.
        :param disable_pandas: whether to disable pandas integration even if it is installed. This is particularly
               useful to avoid issues when using multiprocessing.
        """

        # Set custom attribute names
        self._ext_names = {"conll_str": "conll_str", "conll": "conll", "conll_pd": "conll_pd"}
        if ext_names:
            self._ext_names = self._merge_dicts_strict(self._ext_names, ext_names)

        self._conversion_maps = conversion_maps

        self.include_headers = include_headers
        self.disable_pandas = disable_pandas
        # Initialize extensions
        self._set_extensions()

    def __call__(self, doc: Doc):
        """Runs the pipeline component, adding the extensions to Underscore ._.. Adds a string representation,
           string representation containing a header, and a tuple representation of the CoNLL format to the
           given Doc and its sentences.
        :param doc: the input Doc
        :return: the modified Doc containing the newly added extensions
        """
        # We need to hook the extensions again when using
        # multiprocessing in Windows
        # see: https://github.com/explosion/spaCy/issues/4903
        # fixed in: https://github.com/explosion/spaCy/pull/5006
        # Leaving this here for now, for older versions of spaCy
        self._set_extensions()

        for sent_idx, sent in enumerate(doc.sents, 1):
            self._set_span_conll(sent, sent_idx)

        doc._.set(
            self._ext_names["conll"], [s._.get(self._ext_names["conll"]) for s in doc.sents]
        )
        doc._.set(
            self._ext_names["conll_str"],
            "\n".join([s._.get(self._ext_names["conll_str"]) for s in doc.sents]),
        )

        if PD_AVAILABLE and not self.disable_pandas:
            doc._.set(
                self._ext_names["conll_pd"],
                pd.concat(
                    [s._.get(self._ext_names["conll_pd"]) for s in doc.sents]
                ).reset_index(drop=True),
            )

        return doc

    def _map_conll(self, token_conll_d: Dict[str, Union[str, int]]):
        """Maps labels according to a given `self._conversion_maps`.
            This can be useful when users want to change the output labels of a
            model to their own tagset.

        :param token_conll_d: a token's conll representation as dict (field_name: value)
        :return: the modified dict where the labels have been replaced according to the conversion maps
        """
        for k, v in token_conll_d.items():
            try:
                token_conll_d[k] = self._conversion_maps[k][v]
            except KeyError:
                continue

        return token_conll_d

    def _set_extensions(self):
        """Sets the default extensions if they do not exist yet."""
        for obj in Doc, Span, Token:
            if not obj.has_extension(self._ext_names["conll_str"]):
                obj.set_extension(self._ext_names["conll_str"], default=None)
            if not obj.has_extension(self._ext_names["conll"]):
                obj.set_extension(self._ext_names["conll"], default=None)

            if PD_AVAILABLE and not self.disable_pandas:
                if not obj.has_extension(self._ext_names["conll_pd"]):
                    obj.set_extension(self._ext_names["conll_pd"], default=None)

    def _set_span_conll(self, span: Span, span_idx: int = 1):
        """Sets a span's properties according to the CoNLL-U format.
        :param span: a spaCy Span
        :param span_idx: optional index, corresponding to the n-th sentence
                         in the parent Doc
        """
        span_conll_str = ""
        if self.include_headers:
            span_conll_str += f"# sent_id = {span_idx}\n# text = {span.text}\n"

        for token_idx, token in enumerate(span, 1):
            self._set_token_conll(token, token_idx)

        span._.set(self._ext_names["conll"], [t._.get(self._ext_names["conll"]) for t in span])
        span_conll_str += "".join([t._.get(self._ext_names["conll_str"]) for t in span])
        span._.set(self._ext_names["conll_str"], span_conll_str)

        if PD_AVAILABLE and not self.disable_pandas:
            span._.set(
                self._ext_names["conll_pd"],
                pd.DataFrame([t._.get(self._ext_names["conll"]) for t in span]),
            )

    def _set_token_conll(self, token: Token, token_idx: int = 1):
        """Sets a token's properties according to the CoNLL-U format.
        :param token: a spaCy Token
        :param token_idx: optional index, corresponding to the n-th token in the sentence Span
        """
        if token.dep_.lower().strip() == "root":
            head_idx = 0
        else:
            head_idx = token.head.i + 1 - token.sent[0].i

        token_conll = (
            token_idx,
            token.text,
            token.lemma_,
            token.pos_,
            token.tag_,
            # FEATS string from the morphologizer/attribute ruler; use the
            # CoNLL-U placeholder "_" when no morphological features are set
            str(token.morph) or "_",
            head_idx,
            token.dep_,
            "_",
            "_" if token.whitespace_ else "SpaceAfter=No",
        )

        # turn field name values (keys) and token values (values) into dict
        token_conll_d = OrderedDict(zip(CONLL_FIELD_NAMES, token_conll))

        # convert properties if needed
        if self._conversion_maps:
            token_conll_d = self._map_conll(token_conll_d)

        token._.set(self._ext_names["conll"], token_conll_d)
        token_conll_str = "\t".join(map(str, token_conll_d.values())) + "\n"
        token._.set(self._ext_names["conll_str"], token_conll_str)

        if PD_AVAILABLE and not self.disable_pandas:
            token._.set(self._ext_names["conll_pd"], pd.Series(token_conll_d))

        return token

    @staticmethod
    def _is_number(s: str):
        """Checks whether a string is actually a number.
        :param s: string to test
        :return: whether or not 's' is a number
        """
        try:
            float(s)
            return True
        except ValueError:
            return False

    @staticmethod
    def _merge_dicts_strict(d1: Dict, d2: Dict):
        """Merge two dicts in a strict manner, i.e. the second dict overwrites keys
           of the first dict but all keys in the second dict have to be present in
           the first dict.
        :param d1: base dict which will be overwritten
        :param d2: dict with new values that will overwrite d1
        :return: the merged dict (but d1 will be modified in-place anyway!)
        """
        for k, v in d2.items():
            if k not in d1:
                raise KeyError(
                    f"This key does not exist in the original dict. Valid keys are {list(d1.keys())}"
                )
            d1[k] = v

        return d1
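
For completeness, a small usage sketch of the extensions the updated file registers, using the get_nlp() helper from the minimal example above (assumes pandas is installed so that conll_pd is set):

nlp = get_nlp()
doc = nlp('I like cookies.')

# One tab-separated row per token, in the CONLL_FIELD_NAMES order
print(doc._.conll_str)

# One DataFrame for the whole Doc, with CONLL_FIELD_NAMES as columns
print(doc._.conll_pd)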

So, what does not work for the upgrade to spaCy v3:

  1. The library has many moving pieces (several dependencies just for tokenization), with spacy-udpipe being one that does not support spaCy v3. These not being part of requirements.txt makes it more difficult to set up the repo locally for contributing and testing. I understand that there might be reasons not to package these dependencies with the build, but I would appreciate a contributing file with instructions for setting up the project and running the tests.
  2. Also, spaCy v3 does not allow spacy.load("en"), so you have to either use spacy.blank("en") and then add all the required pipes via add_pipe(), or use spacy.load("en_core_web_sm") to get one of the default pipelines (see the sketch after this list). This means a lot of changes rippling out of init_parser, which uses model_or_lang to load models across the different libraries (spacy, stanza, etc.).
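
A short sketch of the two options point 2 mentions:

import spacy

# Option 1: a blank pipeline to which you add the components you need
nlp_blank = spacy.blank("en")

# Option 2: a packaged pretrained pipeline
nlp_pretrained = spacy.load("en_core_web_sm")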

Thanks for the library! It saved me a good amount of time. I will look to address the above issues when I get some time, but they might take longer than I initially anticipated, so if anyone else wants to pick this up, I hope my comments prove helpful. Meanwhile, people can work around the issue with the changed ConllFormatter and the sample usage shown above.

@BramVanroy
Owner

Thanks for the comments nevertheless, @narayanacharya6. I have indeed been holding off on refactoring for v3 because the changes are subtle but relatively far-reaching, and I do not currently have the time for them.

My hope is that spacy-udpipe gets a v3 release soon-ish, so that I can push a v3 spacy_conll version for all the spacy-* libs, while dropping support for spacy-stanfordnlp (in favour of spacy-stanza, which already supports v3).

If you can wait until June, there is no need for you to take on this PR. I will work on it then. I will also create contributing guidelines. To be honest, I did not think this library would get a lot of usage but it seems I was wrong. I'll try to make it more open for contributions.

@BramVanroy
Owner

Hi @karthikeyansam @narayanacharya6

Can you try the current master branch or the pre-release on pip?

pip uninstall spacy_conll
pip install spacy_conll --pre

I'll let this pre-release out in the wild for a week or so before pushing it as the production version. If you experience any issues, please let me know!
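
For anyone verifying the pre-release, a usage sketch (assuming the released factory keeps the COMPONENT_NAME "conll_formatter" shown in the file above):

import spacy
# noinspection PyUnresolvedReferences
import spacy_conll  # import so the 'conll_formatter' factory is registered

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("conll_formatter", last=True)
doc = nlp("I like cookies.")
print(doc._.conll_str)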

@narayanacharya6

Hey @BramVanroy, I'll try it out over the weekend, thanks! :)

@narayanacharya6

Hey @BramVanroy I finally got a chance to update to the new version. Pretty straightforward to pull out my hacks and use the lib as intended. No issues so far, thanks!
