Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load schema files for pykwalify to avoid global yaml usage #7970

Merged
merged 2 commits into from
Feb 17, 2021

Conversation

wochinge
Copy link
Contributor

@wochinge wochinge commented Feb 16, 2021

Proposed changes:

  • pykwalify uses a global ruamel.yaml instance. This instance sets object attributes while reading files (e.g. the YAML schema files itself!) . This can lead to problem when we have multiple concurrent yaml validations. Fix: Either load the yaml files for pykwalify (what I did in my PR) or provide the schema files in json

Status (please check what you already did):

  • added some tests for the functionality
  • updated the documentation
  • updated the changelog (please check changelog for instructions)
  • reformat files using black (please check Readme for instructions)

# Load schema content using our YAML loader as `pykwalify` uses a global instance
# which can fail when used concurrently
schema_content = rasa.shared.utils.io.read_yaml_file(schema_file)
schema_utils_content = rasa.shared.utils.io.read_yaml_file(schema_utils_file)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternatively we can convert all schema files to json as this would achieve the same effect and might be better re performance

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even better: define the schemas as Python objects - this saves us reading things entirely. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this is related to my insights in #7892 (comment))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might make sense to define schemas as Python objects (don't see any need to go with JSON in this case) — but we can also create an issue and do it later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. I'll add a changelog, create a new issue + merge 👍🏻

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wochinge wochinge requested a review from alwx February 17, 2021 08:32
@wochinge
Copy link
Contributor Author

@alwx Not ready for review yet, I rather want your input before I start making bigger changes

@wochinge wochinge marked this pull request as ready for review February 17, 2021 08:40
@wochinge wochinge requested a review from a team February 17, 2021 08:40
@joejuzl
Copy link
Contributor

joejuzl commented Feb 17, 2021

I wonder if this is the best fix? Maybe it would be safer going forward to just stop using multithreading? We could easily run into more issues like this in the future unless we ensure the threadsafety of all the code we write, and all the libraries we use! We could use multiprocessing instead (should be a very easy change).

@wochinge
Copy link
Contributor Author

@joejuzl Great suggestion! Unfortunately this is not as easy expected.

  1. Pickle error for the local function run: That one was easy to fix
  2. Pickle error for the decorared function as it's a local function as well (all part of create_app)🤯
  3. Pickle error for the request object.

That's the code I used:

def _run_in_process(f, request: Request, *args: Any, **kwargs: Any) -> None:
    # This is a new thread, so we need to create and set a new event loop
    thread_loop = asyncio.new_event_loop()
    asyncio.set_event_loop(thread_loop)

    try:
        return thread_loop.run_until_complete(f(request, *args, **kwargs))
    finally:
        thread_loop.close()


def run_in_thread(f: Callable[..., Coroutine]) -> Callable:
    """Decorator which runs request on a separate thread.

    Some requests (e.g. training or cross-validation) are computional intense requests.
    This means that they will block the event loop and hence the processing of other
    requests. This decorator can be used to process these requests on a separate thread
    to avoid blocking the processing of incoming requests.

    Args:
        f: The request handler function which should be decorated.

    Returns:
        The decorated function.
    """

    @wraps(f)
    async def decorated_function(
        request: Request, *args: Any, **kwargs: Any
    ) -> HTTPResponse:
        with concurrent.futures.ProcessPoolExecutor() as pool:
            return await request.app.loop.run_in_executor(
                pool, functools.partial(_run_in_process, f, request, *args, **kwargs)
            )

    return decorated_function

Do you have an idea? Otherwise I'd suggest to go with the current solution for now. We can still iterate on the solution in case we see threading errors elsewhere

@joejuzl
Copy link
Contributor

joejuzl commented Feb 17, 2021

@wochinge ahaa - that's annoying 🤦
Then yeah, let's fix it like this and create an issue to followup.

@wochinge wochinge merged commit 1fe10db into 2.3.x Feb 17, 2021
@wochinge wochinge deleted the fast-schema-validation-fix branch February 17, 2021 10:34
@wochinge
Copy link
Contributor Author

it was a very nice bug 😄 And now I have more ideas / opinions about our YAML validation.

Follow up issue #7980

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants