Performance issue in DefaultFieldSet due to the usage of SimpleDateFormat #1694
Comments
Markus Bernhardt commented We used YourKit to profile a small batch job that reads and writes 1 million rows, and found that this initializer costs roughly 6% of the overall execution time of our batch. Couldn't it at least be replaced with something more performant, like:
That way only one SimpleDateFormat gets initialized per worker thread. BTW, the same problem exists for the NumberFormat instance in that class. Its initializer costs another 3% of the overall execution time of our batch. Perhaps you could change that accordingly:
And last but not least, one of the constructors does an unnecessary double initialization, which costs another 3% of the overall execution time in our batch.
What do you think about those changes?
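The per-thread idea described in this comment can be sketched as follows. This is only an illustration under assumptions, not the code from the original comment: the class name `PerThreadFormats` and its methods are hypothetical, and the pattern and locale are the defaults mentioned elsewhere in this thread (`yyyy-MM-dd` and `Locale.US`).

```java
import java.text.DateFormat;
import java.text.NumberFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper: each thread lazily creates its own formatters once
// and reuses them, instead of building new ones for every field set.
final class PerThreadFormats {

    // SimpleDateFormat is not thread safe, so every thread gets its own copy.
    private static final ThreadLocal<DateFormat> DATE_FORMAT = ThreadLocal.withInitial(() -> {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setLenient(false);
        return format;
    });

    // NumberFormat is not thread safe either; same treatment.
    private static final ThreadLocal<NumberFormat> NUMBER_FORMAT =
            ThreadLocal.withInitial(() -> NumberFormat.getInstance(Locale.US));

    private PerThreadFormats() {
    }

    static Date parseDate(String text) {
        try {
            return DATE_FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    static Number parseNumber(String text) {
        try {
            return NUMBER_FORMAT.get().parse(text);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable number: " + text, e);
        }
    }
}
```

With this shape the expensive construction happens once per thread rather than once per row. One trade-off to keep in mind: values held in a `ThreadLocal` live as long as the thread does, which matters when worker threads are pooled.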
Brian Tarbox commented I think those are fine changes and would likely address the problem at hand.
Michael Minella commented A couple thoughts:
|
Brian Tarbox commented I don't have the exact numbers handy anymore (I entered the bug a while ago), but I recall the system was spinning at 99% CPU, traceable directly to this. As for solutions: in my own local copy of DefaultFieldSet I went for simple and took advantage of the fact that the SimpleDateFormat in question always uses a hardcoded format of "yyyy-MM-dd", and so did the following for the standard case:
and then added a setter for DateFormat in case someone wanted a different format. This solved the problem for us.
Michael Minella commented I see. With regard to the solution you used, that wouldn't work in the general case since SimpleDateFormat isn't thread safe. As for the setter, that is already there per my previous comment. I definitely think we can update DefaultFieldSet to not create the formatter by default and have the factory inject it if there is no other formatter to be injected.
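For the fixed `yyyy-MM-dd` case discussed here, one way to share a single formatter across threads is `java.time.format.DateTimeFormatter`, which is immutable and thread safe. The sketch below assumes Java 8 or later and is not what DefaultFieldSet does or what either commenter proposed; the class name `SharedDateParsing` is hypothetical.

```java
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper: one formatter instance, safely shared by all threads.
final class SharedDateParsing {

    // DateTimeFormatter is immutable, so a static constant is safe here,
    // unlike a shared static SimpleDateFormat.
    private static final DateTimeFormatter ISO_DAY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    private SharedDateParsing() {
    }

    // Parses the text as a calendar day and converts it to a java.util.Date
    // at midnight in the JVM's default time zone.
    static Date parse(String text) {
        LocalDate day = LocalDate.parse(text, ISO_DAY);
        return Date.from(day.atStartOfDay(ZoneId.systemDefault()).toInstant());
    }
}
```

The conversion back to `java.util.Date` is only there because the existing API in this discussion works with `Date`; code that can use `LocalDate` directly would not need it.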
Markus Bernhardt commented Hi Michael, I found this ticket because we are currently evaluating whether to switch our batches to Spring Batch. I ported a small batch that reads roughly 1.5 million rows (350 MB) of fixed-length data in EBCDIC from the file system, does some not too complicated computations, and writes them back. I implemented one prototype based on the Spring Batch samples and one in plain old Java. So far so good. The problem is that the plain old Java solution does the job in under 6 seconds, while the first version of the Spring Batch based solution took 28 minutes. So we fired up YourKit and looked into it. We found lots of database transactions. After a little while we found that we were reading the data in chunks of 1 row, so the application context was stored for every row. Increasing the chunk size to 100,000 rows solved this problem, and the overall execution time sank to a little over 4 minutes. Still way too slow. So we looked into it again with YourKit and found that 12% of the overall runtime, that's 30 seconds, is spent initializing these two formatters. Only FormatterLineAggregator.doAggregate performs worse in our code. Regarding your points:
|
Brian Tarbox commented Michael, I'll mention that we also had to create our own DefaultFieldSetFactory that was called by the file tokenizer. We had to do this to have a place to do the injection of the NumberFormat object (i.e. the SimpleDateFormat). To Markus's point, we also had to create a new BufferedReaderFactory to give the FlatFileItemReader a large buffer with which to read files. We were reading files over the network, where the round-trip time dominated, and the default size of a buffered reader (1024 bytes I think...) was way too small. I'm hoping that the eventual resolution of this Jira issue will address all of these closely related issues. Thanks.
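The larger-buffer idea from this comment can be illustrated with plain JDK classes. This is a sketch under assumptions, not the commenter's factory: the class name `LargeBufferReaders` and the 1 MiB constant are made up for illustration, and wiring such a reader into Spring Batch's BufferedReaderFactory is not shown.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical helper that opens a reader with an explicit, larger buffer.
final class LargeBufferReaders {

    // Illustrative size only; the right value depends on the network and files.
    static final int LARGE_BUFFER_CHARS = 1 << 20;

    private LargeBufferReaders() {
    }

    // BufferedReader's two-argument constructor takes the buffer size in chars.
    // With the one-argument constructor the JDK falls back to its own default
    // size (8192 chars).
    static BufferedReader open(InputStream in, Charset charset, int bufferChars) {
        return new BufferedReader(new InputStreamReader(in, charset), bufferChars);
    }
}
```

A bigger buffer mainly helps when each underlying read is expensive, as with the network reads described above, because fewer round trips are needed to pull in the same amount of data.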
Jimmy Praet commented The way the framework currently allows you to override the default DateFormat and NumberFormat through DefaultFieldSetFactory.setDateFormat() and DefaultFieldSetFactory.setNumberFormat() is also quite risky. Since neither SimpleDateFormat nor NumberFormat is thread safe, you as a user have to take care to define these beans with scope="step". The javadoc of DefaultFieldSetFactory is also incorrect: it states that the NumberFormat defaults to the default locale and the DateFormat defaults to yyyy/MM/dd. In fact the NumberFormat defaults to Locale.US and the DateFormat defaults to yyyy-MM-dd.
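To see why the documented and actual NumberFormat defaults are not interchangeable, here is a small JDK-only illustration of how the locale changes what a number string means. The class name `NumberLocaleDemo` is hypothetical and the snippet does not use any Spring Batch classes.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical demo: the same digits need different separators per locale.
final class NumberLocaleDemo {

    private NumberLocaleDemo() {
    }

    // Parses the text with a fresh NumberFormat for the given locale.
    static double parse(String text, Locale locale) {
        try {
            return NumberFormat.getInstance(locale).parse(text).doubleValue();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable number: " + text, e);
        }
    }

    public static void main(String[] args) {
        // Locale.US: comma groups thousands, dot is the decimal separator.
        System.out.println(parse("1,234.5", Locale.US));
        // Locale.GERMANY: dot groups thousands, comma is the decimal separator.
        System.out.println(parse("1.234,5", Locale.GERMANY));
    }
}
```

Both calls yield the same value, 1234.5, but only because each input is written in its own locale's conventions. A reader who trusts the javadoc and feeds locale-specific input to a formatter that is actually fixed to Locale.US can get surprising results.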
Resolves spring-projects#1694 Signed-off-by: Fabrice Bibonne <[email protected]>
Brian Tarbox opened BATCH-1902 and commented
DefaultFieldSet creates a SimpleDateFormat (SDF) object as a member variable. This means it cannot be overridden. SDF eventually calls TimeZone, which calls getDefaultInAppContext. This is a static, synchronized, and very slow method, which results in extremely slow reads.
Fixing the problem in my own code means I basically have to make a copy of the entire DefaultFieldSet class. If the SDF were injected, I could change the slow behavior without having to copy the whole class.
I spoke with Gary Gregory (Spring Batch In Action) and he liked the idea of this change.
Affects: 2.1.8
Reference URL: http://stackoverflow.com/questions/12984345/java-7-calendar-getinstance-timezone-gettimezone-got-synchronized-and-slow-any
0 votes, 5 watchers