Skip to content
This repository has been archived by the owner on Dec 10, 2018. It is now read-only.

PR to clean up global state in thriftpy.parser.parser? #268

Open
kurtbrose opened this issue Dec 6, 2016 · 11 comments
Open

PR to clean up global state in thriftpy.parser.parser? #268

kurtbrose opened this issue Dec 6, 2016 · 11 comments

Comments

@kurtbrose
Copy link

https://github.com/eleme/thriftpy/blob/develop/thriftpy/parser/parser.py#L479-L481

there is global state in thriftpy.parser.parser

Specifically, if an exception is raised by parser.parse(data), thrift_stack.pop() never happens and "dead" modules are left in the thrift_stack global.

I'd love to migrate our usage off of facebook thrift and onto thriftpy. However, the non-reentrancy of the parser is a problem. I am happy to fix it, but from a maintainability perspective I don't want to migrate us onto a fork.

Would you be open to a PR that makes the parser more object oriented and removes global state?

@hit9
Copy link
Contributor

hit9 commented Dec 6, 2016

Thanks, pull requests are welcomed!

@kurtbrose
Copy link
Author

Awesome, looking forward to working with you! :-)

@kurtbrose
Copy link
Author

kurtbrose commented Dec 7, 2016

This may be a bit extreme, but I got into the guts and PLY really punishes you for not having global state.

Here's what I have so far, migrating to Parsley:
https://gist.github.com/kurtbrose/1a99167917998647a4707b5d475edbbe

Most parsing tests are passing. Before proposing this as a real PR, I'll get 100% of tests passing, and also do a refactor of the code to get it as close as possible to the existing variable and function names. (My biggest concern is that such a major change would make the code-base unfamiliar for core developers.)

(Also, performance is similar between ply and parsley libraries; my test .thrift file is 25% faster in parsley but this isn't apples to apples until all of the correct in memory classes are set up.)

@lxyu
Copy link
Contributor

lxyu commented Dec 7, 2016

if an exception is raised by parser.parse(data), thrift_stack.pop() never happens and "dead" modules are left in the thrift_stack global.

@kurtbrose so would it make sense if we only fix this bug by wrap a try..catch on parser.parse?

@kurtbrose
Copy link
Author

kurtbrose commented Dec 7, 2016

@lxyu that's a good question. Here are some scenarios:

minimum change

A try..finally around the push / pop. Limitation: parsing is not thread safe.

reset global state + lock

At the beginning of every parse, reset the global variables in thriftpy.parser.parser and also put a lock around the parse function to ensure multiple threads can't modify the global state at the same time. Limitation: cache remains a global. Also, using locks opens the door for deadlocks when interacting with multiple threads and other libraries that use locks.

no global state parsing

(My proposal.) Parsing functions are all purely functional -- the inner parse() function takes a thrift file and produces a module object. The current global state lives on a Parser object. The lifetime and scope of Parser instances still needs to be managed -- but, a single global instance in thriftpy.parser.parser would work without impacting the architecture outside of the parser module.

My use case is I'm trying to migrate a large existing code-base from a very old version of facebook thrift (r821160) to thriftpy. There are ~300 .thrift files some of which have parse errors because the compiler has become more strict over time.

It's your call. I believe that switching ply for parsley to remove global state is the most direct and robust solution. I don't know how attached you are to ply -- it is possible to remove global state and still use the ply parser but would require ugly hacks, whereas the parsely API encourages good object oriented + functional management of state. If removing ply is a non-starter, my opinion would be to use try..finally + lock.

@lxyu
Copy link
Contributor

lxyu commented Dec 8, 2016

@kurtbrose

Basically, if one thrift file failed to parse, the best way to do is throw a exception and stop early. Only continue when the .thrift file fixed correctly. It doesn't make sense to me to continue a server or client when a thrift file failed in parsing stage.

For the thread-local part, yes currently the parser is not thread-safe, it needs some fix to archive it. And I have 2 thoughts on this problem.

  1. Even in multi thread case, you don't need to parse a thrift module twice. You can always parse the module once and use it in other threads. (also for multi thread / multi process use case, you may want to refer to eleme/gunicorn_thrift)
  2. If you really want to make the parser thread-safe, do you think by making all global state thread local can solve your problem?

@lxyu lxyu reopened this Dec 8, 2016
@kurtbrose
Copy link
Author

@lxyu

Those are good points. In a typical usage scenario (services communicating to each other) the protocol spec is fixed in advance and if it has any problems you want to abort. And if this was C++ or Java that would be the end of the story :-)

But, in Python we can do so much more. For example, in my early phases of development I'll typically use a REPL or Jupyter Notebook to iteratively play with code. In that environment, having isolated repeatable functions even if an exception is raised is very important. You'll execute a function, get an exception, try tweaking it and execute again.

For example, the way I first encountered these issues was trying to figure out the correct set of directories to pass to the parser so that the thrift include headers would work. It would have been very helpful in that case had there not been global state.

Regarding using thread local -- I think an approach with closures is the best way to avoid global state if sticking with PLY. (As mentioned in the PLY docs http://www.dabeaz.com/ply/ply.html#ply_nn18 .) It's basically an engineering trade-off -- what is the most elegant way to achieve the desired outcome?

Using parsley results in very simple and elegant usage patterns. For example:

   _GRAMMAR('struct Foo { 1: bar.Bar first 2: string second }').Struct()

This returns a struct -- subclass of thriftpy.thrift.TPayload, _ttype set to thriftpy.thrift.TType.STRUCT. Not that you'd want to parse fragments like this in isolation, but it makes testing and development much easier. The code also is only about 1/3rd the size.

I'm finishing wiring things up and getting all the tests passing, then I'll do a diff minimization in git to make sure there are no incidental changes -- I want parser.py to still "feel" like the same module as much as possible. At that point it's up to you.

@kurtbrose
Copy link
Author

@kurtbrose
Copy link
Author

Just as an FYI, the current parser fails on ThriftTest.thrift from the apache thrift IDL documentation.

I think that alone might make the change worth it :-)

Doing final integration now, will probably have PR tomorrow afternoon.

@kurtbrose
Copy link
Author

@wvonnegut-zz
Copy link

Have you considered defining the parsing grammar in a class and keeping its state local to the instance? You would use it via yacc.yacc(module=ThriftpyParser ())

http://www.dabeaz.com/ply/ply.html#ply_nn18

This would also greatly simplify #259

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants